Authors: Ahmad Idrissi-Yaghir, Amin Dada, Henning Schäfer, Kamyar Arzideh, Giulia Baldini, Jan Trienes, Max Hasin, Jeanette Bewersdorff, Cynthia S. Schmidt, Marie Bauer, Kaleb E. Smith, Jiang Bian, Yonghui Wu, Jörg Schlötterer, Torsten Zesch, Peter A. Horn, Christin Seifert, Felix Nensa, Jens Kleesiek, Christoph M. Friedrich
Published on: April 08, 2024
Impact Score: 7.8
arXiv ID: 2404.05694
Summary
- What is new: The research introduces German medical language models built through continuous pre-training on a combination of translated public English medical data and German clinical data, a novel approach aimed at improving NLP performance in the medical domain.
- Why this is important: Existing general-domain pre-trained language models such as BERT and RoBERTa struggle in specialized fields like medicine because of domain-specific terminology and document structures.
- What the research proposes: Adapting these models through continuous pre-training on domain-specific data (2.4B tokens of translated medical data and 3B tokens of German clinical data).
- Results: Models continuously pre-trained on clinical and translation-based data outperformed general-domain models on German medical NLP tasks, matching or exceeding the performance of clinical models trained from scratch.
Technical Details
Techniques used: Continuous (domain-adaptive) pre-training (see the sketch below)
Models used: BERT, RoBERTa, German medical language models
Data used: 2.4B tokens of translated public English medical data, 3B tokens of German clinical data
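To make the approach concrete, here is a minimal sketch of continued (domain-adaptive) pre-training with Hugging Face Transformers, using a masked language modeling objective as is standard for BERT/RoBERTa-style models. The checkpoint name, file path, and hyperparameters are illustrative assumptions, not the authors' actual setup or corpus.

```python
# Hedged sketch: continue pre-training a general-domain German model on a
# domain-specific medical corpus via masked language modeling (MLM).
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Assumption: start from a general-domain German checkpoint; any BERT-style
# German model would fit this pattern.
checkpoint = "deepset/gbert-base"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

# Hypothetical corpus file: plain text, one document per line.
raw = load_dataset("text", data_files={"train": "german_medical_corpus.txt"})

def tokenize(batch):
    # Truncate to the model's maximum length; long documents could instead be
    # chunked so no text is discarded.
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])

# MLM objective with the usual 15% masking probability.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="german-medical-bert",
    per_device_train_batch_size=16,
    num_train_epochs=1,      # illustrative; real runs train far longer
    learning_rate=5e-5,
    save_steps=10_000,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()  # continues pre-training on the domain-specific corpus
```

The key design choice is reusing the general-domain weights and tokenizer rather than training from scratch, which is what lets a comparatively modest domain corpus (translated medical text plus clinical notes) close the gap to clinical models trained from the ground up.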
Potential Impact
Healthcare technology companies, medical documentation and health information systems, and AI-driven medical research firms
Want to implement this idea in a business?
We have generated a startup concept here: MediLinguaNLP.