| --- |
| license: gemma |
| datasets: |
| - FreedomIntelligence/ApolloMoEDataset |
| language: |
| - ar |
| - en |
| - zh |
| - ko |
| - ja |
| - mn |
| - th |
| - vi |
| - lo |
| - mg |
| - de |
| - pt |
| - es |
| - fr |
| - ru |
| - it |
| - hr |
| - gl |
| - cs |
| - co |
| - la |
| - uk |
| - bs |
| - bg |
| - eo |
| - sq |
| - da |
| - sa |
| - gn |
| - sr |
| - sk |
| - gd |
| - lb |
| - hi |
| - ku |
| - mt |
| - he |
| - ln |
| - bm |
| - sw |
| - ig |
| - rw |
| - ha |
| metrics: |
| - accuracy |
| base_model: |
| - google/gemma-2-2b |
| pipeline_tag: question-answering |
| tags: |
| - biology |
| - medical |
| --- |
# Democratizing Medical LLMs for Many More Languages
|
|
Covering 12 Major Languages (English, Chinese, French, Hindi, Spanish, Arabic, Russian, Japanese, Korean, German, Italian, and Portuguese) and 38 Minor Languages so far.
|
|
|
|
|
|
| <p align="center"> |
📃 <a href="https://arxiv.org/abs/2410.10626" target="_blank">Paper</a> • 🌐 <a href="" target="_blank">Demo</a> • 🤗 <a href="https://huggingface.co/datasets/FreedomIntelligence/ApolloMoEDataset" target="_blank">ApolloMoEDataset</a> • 🤗 <a href="https://huggingface.co/datasets/FreedomIntelligence/ApolloMoEBench" target="_blank">ApolloMoEBench</a> • 🤗 <a href="https://huggingface.co/collections/FreedomIntelligence/apollomoe-and-apollo2-670ddebe3bb1ba1aebabbf2c" target="_blank">Models</a> • 🌐 <a href="https://github.com/FreedomIntelligence/Apollo" target="_blank">Apollo</a> • 🌐 <a href="https://github.com/FreedomIntelligence/ApolloMoE" target="_blank">ApolloMoE</a>
| </p> |
|
|
|
|
|
|
|  |
|
|
|
|
| ## 🌈 Update |
|
|
* **[2024.10.15]** The ApolloMoE repo is published! 🎉
|
|
|
|
| ## Languages Coverage |
| 12 Major Languages and 38 Minor Languages |
|
|
| <details> |
| <summary>Click to view the Languages Coverage</summary> |
| |
|  |
|
|
| </details> |
|
|
|
|
| ## Architecture |
|
|
| <details> |
| <summary>Click to view the MoE routing image</summary> |
|
|
|  |
|
|
| </details> |
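
The model applies a mixture of language-family experts in its later layers (the Post-MoE design from the paper). The snippet below is a toy, hypothetical sketch of that routing idea, not the repo's actual code; all names and sizes are made up for illustration:

```python
import torch
import torch.nn as nn

class LanguageFamilyMoE(nn.Module):
    """Toy sketch of routing by language family; not the repo's actual implementation."""

    def __init__(self, hidden_size: int, num_families: int):
        super().__init__()
        # One feed-forward expert per language family.
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(hidden_size, 4 * hidden_size),
                nn.GELU(),
                nn.Linear(4 * hidden_size, hidden_size),
            )
            for _ in range(num_families)
        )

    def forward(self, hidden: torch.Tensor, family_id: int) -> torch.Tensor:
        # Route the whole sequence to the expert of its language family.
        return self.experts[family_id](hidden)

layer = LanguageFamilyMoE(hidden_size=64, num_families=4)
out = layer(torch.randn(1, 8, 64), family_id=2)  # family ids are hypothetical
print(out.shape)  # torch.Size([1, 8, 64])
```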
|
|
| ## Results |
|
|
| #### Dense |
| 🤗 <a href="https://huggingface.co/FreedomIntelligence/Apollo2-0.5B" target="_blank">Apollo2-0.5B</a> • 🤗 <a href="https://huggingface.co/FreedomIntelligence/Apollo2-1.5B" target="_blank">Apollo2-1.5B</a> • 🤗 <a href="https://huggingface.co/FreedomIntelligence/Apollo2-2B" target="_blank">Apollo2-2B</a> |
| |
| 🤗 <a href="https://huggingface.co/FreedomIntelligence/Apollo2-3.8B" target="_blank">Apollo2-3.8B</a> • 🤗 <a href="https://huggingface.co/FreedomIntelligence/Apollo2-7B" target="_blank">Apollo2-7B</a> • 🤗 <a href="https://huggingface.co/FreedomIntelligence/Apollo2-9B" target="_blank">Apollo2-9B</a> |
|
|
| <details> |
| <summary>Click to view the Dense Models Results</summary> |
|
|
|  |
|
|
| </details> |
|
|
|
|
| #### Post-MoE |
| 🤗 <a href="https://huggingface.co/FreedomIntelligence/Apollo-MoE-0.5B" target="_blank">Apollo-MoE-0.5B</a> • 🤗 <a href="https://huggingface.co/FreedomIntelligence/Apollo-MoE-1.5B" target="_blank">Apollo-MoE-1.5B</a> • 🤗 <a href="https://huggingface.co/FreedomIntelligence/Apollo-MoE-7B" target="_blank">Apollo-MoE-7B</a> |
|
|
| <details> |
| <summary>Click to view the Post-MoE Models Results</summary> |
|
|
|  |
|
|
| </details> |
|
|
| |
| |
|
|
| ## Usage Format |
| ##### Apollo2 |
- 0.5B, 1.5B, 7B: `User:{query}\nAssistant:{response}<|endoftext|>`
- 2B, 9B: `User:{query}\nAssistant:{response}<eos>`
- 3.8B: `<|user|>\n{query}<|end|><|assistant|>\n{response}<|end|>`
|
|
| ##### Apollo-MoE |
- 0.5B, 1.5B, 7B: `User:{query}\nAssistant:{response}<|endoftext|>`
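
For example, a minimal sketch of formatting a query for the 0.5B/1.5B/7B checkpoints (the query string is illustrative; the end-of-text token is generated by the model rather than included in the prompt):

```python
# Illustrative only: build a prompt following the 0.5B/1.5B/7B template above.
query = "What are the first-line treatments for type 2 diabetes?"
prompt = f"User:{query}\nAssistant:"
# The model is expected to complete with `{response}<|endoftext|>`.
```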
| |
| ## Dataset & Evaluation |
|
|
| - Dataset |
| 🤗 <a href="https://huggingface.co/datasets/FreedomIntelligence/ApolloMoEDataset" target="_blank">ApolloMoEDataset</a> |
|
|
| <details><summary>Click to expand</summary> |
|
|
|  |
| |
| - [Data category](https://huggingface.co/datasets/FreedomIntelligence/ApolloCorpus/tree/main/train) |
|
|
|
|
| </details> |
|
|
| - Evaluation |
| 🤗 <a href="https://huggingface.co/datasets/FreedomIntelligence/ApolloMoEBench" target="_blank">ApolloMoEBench</a> |
|
|
| <details><summary>Click to expand</summary> |
| |
| - EN: |
| - [MedQA-USMLE](https://huggingface.co/datasets/GBaker/MedQA-USMLE-4-options) |
| - [MedMCQA](https://huggingface.co/datasets/medmcqa/viewer/default/test) |
  - [PubMedQA](https://huggingface.co/datasets/pubmed_qa): Not used in the paper because its results fluctuated too much.
| - [MMLU-Medical](https://huggingface.co/datasets/cais/mmlu) |
| - Clinical knowledge, Medical genetics, Anatomy, Professional medicine, College biology, College medicine |
| - ZH: |
| - [MedQA-MCMLE](https://huggingface.co/datasets/bigbio/med_qa/viewer/med_qa_zh_4options_bigbio_qa/test) |
| - [CMB-single](https://huggingface.co/datasets/FreedomIntelligence/CMB): Not used in the paper |
    - 2,000 randomly sampled single-answer multiple-choice questions.
| - [CMMLU-Medical](https://huggingface.co/datasets/haonan-li/cmmlu) |
| - Anatomy, Clinical_knowledge, College_medicine, Genetics, Nutrition, Traditional_chinese_medicine, Virology |
  - [CMExam](https://github.com/williamliujl/CMExam): Not used in the paper
    - 2,000 randomly sampled multiple-choice questions.
|
|
|
|
- ES: [HEAD-QA](https://huggingface.co/datasets/head_qa)
| - FR: |
  - [FrenchMedMCQA](https://github.com/qanastek/FrenchMedMCQA)
| - [MMLU_FR] |
| - Clinical knowledge, Medical genetics, Anatomy, Professional medicine, College biology, College medicine |
| - HI: [MMLU_HI](https://huggingface.co/datasets/FreedomIntelligence/MMLU_Hindi) |
| - Clinical knowledge, Medical genetics, Anatomy, Professional medicine, College biology, College medicine |
| - AR: [MMLU_AR](https://huggingface.co/datasets/FreedomIntelligence/MMLU_Arabic) |
| - Clinical knowledge, Medical genetics, Anatomy, Professional medicine, College biology, College medicine |
| - JA: [IgakuQA](https://github.com/jungokasai/IgakuQA) |
| - KO: [KorMedMCQA](https://huggingface.co/datasets/sean0042/KorMedMCQA) |
| - IT: |
| - [MedExpQA](https://huggingface.co/datasets/HiTZ/MedExpQA) |
| - [MMLU_IT] |
| - Clinical knowledge, Medical genetics, Anatomy, Professional medicine, College biology, College medicine |
| - DE: [BioInstructQA](https://huggingface.co/datasets/BioMistral/BioInstructQA): German part |
| - PT: [BioInstructQA](https://huggingface.co/datasets/BioMistral/BioInstructQA): Portuguese part |
| - RU: [RuMedBench](https://github.com/sb-ai-lab/MedBench) |
|
|
| |
|
|
|
|
| </details> |
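
All of the benchmarks above are scored with accuracy on multiple-choice questions. The sketch below shows the metric in its simplest form; the item fields `question`, `options`, and `answer` are assumptions for illustration, so check the ApolloMoEBench dataset card for the actual schema:

```python
# Minimal accuracy sketch for multiple-choice QA (assumed item schema, see above).
def accuracy(items, predict):
    """`predict` maps a formatted question to an option letter such as "A"."""
    correct = 0
    for item in items:
        letters = [chr(ord("A") + i) for i in range(len(item["options"]))]
        prompt = item["question"] + "\n" + "\n".join(
            f"{letter}. {option}" for letter, option in zip(letters, item["options"]))
        correct += predict(prompt) == item["answer"]
    return correct / len(items)
```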

## Model Download and Inference

We take Apollo-MoE-0.5B as an example.

1. Log in to Hugging Face
| |
| ``` |
| huggingface-cli login --token $HUGGINGFACE_TOKEN |
| ``` |
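
Alternatively, you can authenticate from Python; a minimal sketch assuming the `HUGGINGFACE_TOKEN` environment variable is set, as in the CLI command above:

```python
import os

from huggingface_hub import login

# Uses the same token as the CLI command above.
login(token=os.environ["HUGGINGFACE_TOKEN"])
```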
| |
2. Download the model to a local directory
| |
| ```python |
import os

from huggingface_hub import snapshot_download

# Destination directory for the model files; adjust the base path to your setup.
local_model_dir = os.path.join('/path/to/models/dir', 'Apollo-MoE-0.5B')
snapshot_download(repo_id="FreedomIntelligence/Apollo-MoE-0.5B", local_dir=local_model_dir)
| ``` |
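
Note that `snapshot_download` also returns the path of the downloaded snapshot, so you can capture its return value instead of rebuilding the path in later steps.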
| |
| 3. Inference Example |
| |
| ```python |
import os

from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig

local_model_dir = os.path.join('/path/to/models/dir', 'Apollo-MoE-0.5B')

# trust_remote_code is needed because the model repo ships custom modeling code.
model = AutoModelForCausalLM.from_pretrained(local_model_dir, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(local_model_dir, trust_remote_code=True)

# Greedy decoding (do_sample=False) with a short completion window.
generation_config = GenerationConfig.from_pretrained(
    local_model_dir, pad_token_id=tokenizer.pad_token_id, num_return_sequences=1,
    max_new_tokens=7, min_new_tokens=2, do_sample=False, temperature=1.0, top_k=50, top_p=1.0)

inputs = tokenizer('Answer directly.\nThe capital of Mongolia is Ulaanbaatar.\n'
                   'The capital of Iceland is Reykjavik.\nThe capital of Australia is',
                   return_tensors='pt')
inputs = inputs.to(model.device)
pred = model.generate(**inputs, generation_config=generation_config)
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
| ``` |
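
With greedy decoding, the completion should continue the few-shot pattern with the capital of Australia (Canberra), though the exact text may vary between checkpoints.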
| |
## Results Reproduction
| <details><summary>Click to expand</summary> |
|
|
|
|
We take Apollo2-7B or Apollo-MoE-0.5B as an example.
1. Download the dataset for the project:
|
|
| ``` |
| bash 0.download_data.sh |
| ``` |
| |
2. Prepare test and dev data for the specific model:
| |
| |
- Create test data with the model's special tokens
| |
| ``` |
# Quote the filename so the shell does not treat `&` as a control operator.
bash "1.data_process_test&dev.sh"
| ``` |
| |
3. Prepare training data for the specific model (tokenize the data in advance):
| |
| |
- You can adjust the training data order and the number of training epochs in this step
| |
| ``` |
| bash 2.data_process_train.sh |
| ``` |
| |
| 4. Train the model |
| |
| |
- If you want to train on multiple nodes, refer to `./src/sft/training_config/zero_multi.yaml`
| |
|
|
| ``` |
| bash 3.single_node_train.sh |
| ``` |
| |
|
|
5. Evaluate your model: generate scores on the benchmarks
| |
| ``` |
| bash 4.eval.sh |
| ``` |
| |
| </details> |
|
|
|
|
|
|
| ## Citation |
| Please use the following citation if you intend to use our dataset for training or evaluation: |
|
|
| ``` |
| @misc{zheng2024efficientlydemocratizingmedicalllms, |
| title={Efficiently Democratizing Medical LLMs for 50 Languages via a Mixture of Language Family Experts}, |
| author={Guorui Zheng and Xidong Wang and Juhao Liang and Nuo Chen and Yuping Zheng and Benyou Wang}, |
| year={2024}, |
| eprint={2410.10626}, |
| archivePrefix={arXiv}, |
| primaryClass={cs.CL}, |
| url={https://arxiv.org/abs/2410.10626}, |
| } |
| ``` |
|
|
|
|