# LLM Evaluation Framework

This directory contains tools for evaluating large language models on various benchmarks.

## Overview

The evaluation framework supports multiple benchmark datasets across different domains:

- **Math**: AIME24, AIME25 (evaluation scripts provided)
- **Coding**: LiveCodeBench v5, LiveCodeBench v6 (evaluation scripts provided)
- **Multiple Choice**: MMLU, MMLU Pro, GPQA (MMLU evaluation script provided)
- **Instruction Following**: IFEval, IFBench (refer to official evaluation toolkits)
- **General Helpfulness**: Arena-Hard (refer to official evaluation toolkit)

## Installation

Install required dependencies:

```bash
pip install transformers vllm torch tqdm pandas
```

## Directory Structure

```
evaluation/
├── inference.py                    # Main inference script
├── arguments.py                    # Command-line argument definitions
│
├── data/                           # Benchmark datasets and preprocessing
│   ├── benchmark.py                # Dataset preprocessing functions
│   ├── aime24/, aime25/            # AIME competition problems
│   ├── gpqa/                       # GPQA dataset
│   ├── livecodebench/              # LiveCodeBench v5 and v6
│   ├── mmlu/, mmlu_pro/            # MMLU variants
│   ├── arena-hard-v0.1/, arena-hard-v2.0/  # Arena-Hard benchmarks
│   ├── ifeval/, IFBench/           # Instruction following benchmarks
│   └── mt_bench/                   # MT-Bench data
│
├── eval/                           # Evaluation scripts
│   ├── get_scores_math.py          # Math benchmarks (AIME24, AIME25)
│   ├── get_scores_mmlu_batch.py    # MMLU, MMLU-Pro evaluation
│   ├── get_scores_gpqa.py          # GPQA evaluation
│   ├── get_scores_code.py          # Code benchmarks (LiveCodeBench)
│   └── tools/                      # Evaluation utilities
│       ├── grader.py               # Math answer grading
│       ├── code_verifier_utils.py  # Code execution and verification
│       └── latex2sympy/            # LaTeX to SymPy conversion
│
├── run.sh                          # Example single-benchmark run
├── run_local.sh                    # Local evaluation script
├── run_all.sh                      # Run multiple benchmarks in parallel
└── README.md                       # This file
```

## Usage

### Quick Start

1. Edit `run.sh` to configure your model and data paths
2. Run the evaluation:

```bash
bash run.sh
```

### Advanced Usage

Run inference directly with custom parameters:

```bash
python inference.py \
    --model-folder /path/to/models \
    --model-name your-model \
    --tokenizer-folder /path/to/tokenizers \
    --tokenizer-name your-tokenizer \
    --benchmark-folder /path/to/benchmarks \
    --eval-dataset aime24 \
    --temperature 0.6 \
    --topp 0.95 \
    --batch-size 2048
```

We recommend following the configuration reported in the paper and running each benchmark with k different random seeds (via `--seed`), averaging the resulting scores.
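
As a rough sketch, such a multi-seed sweep could look like the following; the seed values and sampling settings are illustrative, and `echo` is included so the loop prints the commands rather than launching them (remove it to actually run):

```shell
# Illustrative multi-seed sweep; remove `echo` to actually launch the runs.
for seed in 42 43 44; do
  echo python inference.py \
    --benchmark-folder /path/to/benchmarks --eval-dataset aime24 \
    --temperature 0.6 --topp 0.95 --seed "$seed"
done
```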

### Key Arguments

#### Model Configuration (Required)
- `--model-folder`: Directory containing model weights
- `--model-name`: Name of the model subdirectory
- `--tokenizer-folder`: Directory containing tokenizer files
- `--tokenizer-name`: Name of the tokenizer subdirectory

#### Dataset Selection (Required for evaluation)
- `--benchmark-folder`: Root directory containing all benchmark datasets
- `--eval-dataset`: Name of the evaluation dataset (see supported datasets below)

#### Inference Parameters (Optional)
- `--temperature`: Sampling temperature (default: 0 for greedy decoding)
- `--topp`: Top-p (nucleus) sampling threshold (default: 1.0)
- `--topk`: Top-k sampling threshold (default: 1)
- `--max-output-len`: Maximum output length in tokens (default: 2048)
- `--batch-size`: Batch size for inference (default: 16)
- `--tensor-parallel-size`: Number of GPUs for tensor parallelism (default: 1)

#### Dataset Subsetting (Optional)
- `--start-idx`: Starting index for dataset subsetting (default: -1, disabled)
- `--end-idx`: Ending index for dataset subsetting (default: -1, disabled)
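
Subsetting makes it easy to shard a benchmark across processes or GPUs. The sketch below assumes, for illustration only, a dataset of 14,000 examples; the index bounds and device IDs are placeholders, and `echo` makes the commands print rather than run:

```shell
# Illustrative two-shard split; remove `echo` to actually launch the runs.
echo python inference.py --eval-dataset mmlu --start-idx 0    --end-idx 7000  --device-id 0
echo python inference.py --eval-dataset mmlu --start-idx 7000 --end-idx 14000 --device-id 1
```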

#### Other Options
- `--seed`: Random seed for reproducibility (default: 42)
- `--no-think`: Disable thinking mode (flag; thinking is enabled by default)
- `--yarn-factor`: Scaling factor for YaRN RoPE extension (default: 1)
- `--device-id`: Comma-separated GPU device IDs (optional)
- `--model-output-path`: Path to the first-turn output (required only for `mtbench_secondturn`)

## Supported Datasets

- `aime24` / `aime25`: AIME competition problems
- `lcb5` / `lcb6`: LiveCodeBench (versions 5 and 6)
- `mmlu`: MMLU 5-shot evaluation
- `mmlu_pro`: MMLU Pro dataset
- `gpqa_diamond`: GPQA Diamond subset
- `ifeval`: IFEval instruction following
- `ifbench`: IFBench instruction following
- `arena_hard`: Arena-Hard v0.1

## Running Evaluation Scripts

After generating model outputs with `inference.py`, you can compute metrics using the evaluation scripts in the `eval/` directory.
For reproducibility, we also provide our cached generation files in the corresponding model repository.

### Math Benchmarks (AIME24, AIME25)

Evaluate math problem-solving performance:

```bash
cd eval
python get_scores_math.py \
    --modelfolder /path/to/model/outputs \
    --testfolder /path/to/test_benchmarks
```

This script:
- Evaluates AIME24 and AIME25 benchmarks
- Extracts answers from `\boxed{}` and other formats
- Computes accuracy with mathematical equivalence checking
- Reports mean accuracy and standard deviation across multiple runs
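
The answer-extraction step can be sketched roughly as follows. This is a simplified illustration, not the actual logic in `grader.py`; the real grader also handles non-boxed formats and mathematical equivalence:

```python
def extract_boxed(text: str):
    """Return the content of the last \\boxed{...} span, handling nested braces."""
    idx = text.rfind("\\boxed{")
    if idx == -1:
        return None  # the real grader falls back to other answer formats
    depth = 0
    start = idx + len("\\boxed{")
    for i in range(start - 1, len(text)):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                return text[start:i]
    return None  # unbalanced braces

print(extract_boxed(r"The answer is \boxed{\frac{1}{2}}."))  # prints \frac{1}{2}
```

A simple regex would truncate nested braces such as `\frac{1}{2}`, which is why the sketch tracks brace depth explicitly.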

### Multiple Choice (MMLU, MMLU-Pro, GPQA)

Evaluate MMLU and variants:

```bash
cd eval
python get_scores_mmlu_batch.py \
    --modelfolder /path/to/model/outputs \
    --testfolder /path/to/test_benchmarks \
    --verbose  # Optional: print per-category accuracy
```

This script evaluates:
- **MMLU**: Standard MMLU with 4 choices (A-D)
- **MMLU-Pro**: Extended version with up to 16 choices (A-P)

Features:
- Supports boxed answer format (e.g., `\boxed{A}`)
- Extracts letter choices from various formats (parentheses, text, etc.)
- Handles batch-split output files automatically
- Computes accuracy across all MMLU variants
- Optional per-category breakdown with `--verbose` flag
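
The letter-extraction logic can be sketched as below; this is an illustrative simplification, not the script's actual implementation:

```python
import re

def extract_choice(text: str, num_choices: int = 4):
    """Pull a letter answer (A-D for MMLU, up to A-P for MMLU-Pro) from a response."""
    letters = "ABCDEFGHIJKLMNOP"[:num_choices]
    patterns = [
        r"\\boxed\{([%s])\}" % letters,                     # \boxed{B}
        r"\(([%s])\)" % letters,                            # (B)
        r"\b([%s])\b(?!.*\b[%s]\b)" % (letters, letters),   # last bare letter
    ]
    for pattern in patterns:
        match = re.search(pattern, text)
        if match:
            return match.group(1)
    return None

print(extract_choice("The correct answer is (C)."))  # prints C
```

Ordering the patterns from most to least explicit avoids picking up stray capital letters when a clearly marked answer is present.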

Evaluate GPQA (Graduate-Level Google-Proof Q&A) performance:

```bash
cd eval
python get_scores_gpqa.py \
    --modelfolder /path/to/model/outputs \
    --testfolder /path/to/test_benchmarks
```

This script:
- Evaluates the GPQA Diamond subset
- Extracts answers from boxed and text formats
- Uses mathematical equivalence checking for complex answers
- Reports accuracy with standard deviation

### Code Generation (LiveCodeBench)

Evaluate code generation performance:

```bash
cd eval
python get_scores_code.py \
    --modelfolder /path/to/model/outputs \
    --testfolder /path/to/test_benchmarks
```

This script:
- Evaluates LiveCodeBench v5 and v6
- Executes generated code against test cases
- Computes pass rate (percentage of problems solved correctly)
- Reports finish rate (percentage of valid code generations)

**Note**: Code execution requires:
```bash
pip install numpy tqdm
```
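
At its core, verification runs each generated program against the problem's test cases and counts the fraction solved. The sketch below is a simplified stand-in for the logic in `tools/code_verifier_utils.py`, which additionally handles sandboxing and more output formats:

```python
import subprocess
import sys
import tempfile

def passes_tests(code: str, tests: list, timeout: float = 5.0) -> bool:
    """Return True if `code` prints the expected stdout for every (stdin, stdout) pair."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    for stdin_text, expected in tests:
        try:
            proc = subprocess.run(
                [sys.executable, path],
                input=stdin_text, capture_output=True, text=True, timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return False  # treat timeouts as failures
        if proc.returncode != 0 or proc.stdout.strip() != expected.strip():
            return False
    return True

# Pass rate over a toy problem set (solutions and tests here are made up):
solutions = {"p1": "print(int(input()) * 2)"}
tests = {"p1": [("3", "6"), ("10", "20")]}
passed = sum(passes_tests(solutions[p], tests[p]) for p in solutions)
print(f"pass rate: {passed / len(solutions):.0%}")  # prints pass rate: 100%
```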

### Other Benchmarks

For the following benchmarks, please refer to their official evaluation repositories:

- **Arena-Hard**: Use the [official Arena-Hard evaluation toolkit](https://github.com/lmarena/arena-hard-auto)
- **IFEval**: Use the [official IFEval evaluation script](https://github.com/google-research/google-research/tree/master/instruction_following_eval)
- **IFBench**: Use the [official IFBench evaluation toolkit](https://github.com/instruction-following/IFBench)

These benchmarks require specific evaluation logic and may have licensing terms that restrict redistribution of evaluation code.

## Output Format

Results are saved as JSONL files in:
```
{model_folder}/{model_name}/outputs_vllm073[_topp{topp}_seed{seed}]/{eval_dataset}.jsonl
```

Each line contains:
- `task_id` or `question_id`: Unique identifier for the question
- `output`: Model's generated response
- `reason`: Whether reasoning was used (boolean)
- `reason_text`: The reasoning/thinking content (if applicable)
- Additional dataset-specific fields
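
For quick inspection, the output files can be loaded with a few lines of Python (the commented path below is illustrative):

```python
import json

def load_results(path: str) -> list:
    """Read one JSON record per line from an inference output file."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

# Illustrative usage; substitute your actual output path:
# records = load_results(".../outputs_vllm073_topp0.95_seed42/aime24.jsonl")
# n_reasoned = sum(bool(r.get("reason")) for r in records)
```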

## Adding New Datasets

To add a new dataset:

1. Add a preprocessing function in `data/benchmark.py`:
```python
def preprocess_your_dataset(data_file):
    """Preprocess your dataset.

    Args:
        data_file: Path to dataset file

    Returns:
        tuple: (prompt_list, qid_list) or just prompt_list
    """
    # Your preprocessing logic
    pass
```

2. Add the dataset path argument in `arguments.py`:
```python
group.add_argument('--your-dataset-path', type=str, default='path/to/dataset')
```

3. Add the dataset case in `inference.py` in the `get_prompt_list()` function:
```python
elif args.eval_dataset == "your_dataset":
    from data.benchmark import preprocess_your_dataset
    input_datapath = os.path.join(args.benchmark_folder, args.your_dataset_path)
    prompt_list, qid_list = preprocess_your_dataset(input_datapath)
```
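
As a concrete illustration of step 1, a preprocessing function for a hypothetical JSONL dataset (one object per line with `question` and `id` fields; adapt the field names to your data) might look like:

```python
import json

def preprocess_your_dataset(data_file):
    """Hypothetical example: read prompts and question IDs from a JSONL file."""
    prompt_list, qid_list = [], []
    with open(data_file) as f:
        for line in f:
            item = json.loads(line)
            prompt_list.append(item["question"])
            qid_list.append(item["id"])
    return prompt_list, qid_list
```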

## Notes

- The framework uses vLLM for efficient inference with batching and tensor parallelism support
- Special handling is provided for models like DeepSeek-R1 that require eager mode
- Thinking mode (`<think>` tags) is supported for models trained with reasoning capabilities
- YaRN RoPE scaling is supported for extended context lengths

## License

See the main repository LICENSE file for licensing information.