Built with Axolotl

See axolotl config

axolotl version: 0.9.2

# uv run python csub.py --name mistral -g 8 --node_type h200 -t 1d --large_shm -c "conda deactivate && cd /mloscratch/homes/vignoud/novartis-oncology/training/axolotl && uv run axolotl train mistral-small-3-24B-cpt.yml 2>&1 | tee logs_mistral.txt" --train
base_model: mistralai/Mistral-Small-24B-Instruct-2501
dataset_prepared_path: ./prepared_data/2025-07-22-16-28_Mistral-Small-24B-Instruct-2501-mixture-sampled_0.1
output_dir: /mloscratch/homes/vignoud/novartis-oncology/models/2025-07-22-16-28_Mistral-Small-24B-Instruct-2501-mixture-sampled_0.1
wandb_name: 2025-07-22-16-28_Mistral-Small-24B-Instruct-2501-mixture-sampled_0.1

### TOKENIZING ###
chat_template: mistral_v7_tekken
### DATASET ###
datasets: 
  - path: /mloscratch/homes/vignoud/novartis-oncology/data/training/oncology-cpt-mixture-sampled-0.1-mistral
    split: train
    type: completion
    text_column: text

test_datasets:
  - path: /mloscratch/homes/vignoud/novartis-oncology/data/training/oncology-cpt-mixture-sampled-0.1-mistral
    split: validation
    type: completion
    text_column: text

sequence_len: 8192
pretraining_sample_concatenation: false
pad_to_sequence_len: true
train_on_inputs: false
sample_packing: true
eval_sample_packing: false
special_tokens:

### MULTI_GPU ###
deepspeed: deepspeed_configs/zero3_bf16.json

### TRAINING ###

learning_rate: 0.000001
optimizer: adamw_torch
lr_scheduler: cosine
flash_attention: true
warmup_ratio: 0.1
max_grad_norm: 1.0
weight_decay: 0.0

### EPOCHS  ###
# max_steps: 10
# save_steps: 100
num_epochs: 1
evals_per_epoch: 10
saves_per_epoch: 10
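# With roughly 590 optimizer steps in the single epoch, 10 evals/saves per epoch
# works out to about one eval and one checkpoint every ~59 steps (matching the
# step spacing in the results table below).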

### BATCH SIZE ###
gradient_checkpointing: true
gradient_accumulation_steps: 4
micro_batch_size: 2
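# With the 8 GPUs requested in the launch command above: 2 (micro) x 4 (accumulation) x 8 (GPUs)
# = 64 sequences of 8192 tokens per optimizer step.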

### PRECISION ###
# Use CUDA bf16: a bool, 'full' (i.e. `bf16_full_eval`, running evals in bf16 without AMP), or 'auto' for automatic detection.
# Requires >= Ampere.
bf16: auto
fp16: false # Use CUDA fp16
fp8: false
bfloat16: false # No AMP (automatic mixed precision) - requires >= Ampere
float16: false # No AMP (automatic mixed precision)
tf32: false # Use CUDA tf32 - requires >= Ampere
float32: false

### LOGGING ###
logging_steps: 1
wandb_project: axolotl
wandb_entity:
wandb_watch:
wandb_log_model:

seed: 42
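
For context on the `type: completion` entries above: they expect a corpus exposed through a single `text` column with `train` and `validation` splits at the given local path. A minimal sketch of producing such a dataset with the `datasets` library (placeholder documents and a hypothetical local path, not the actual oncology mixture, and assuming axolotl resolves a local directory via `load_from_disk`):

```python
from datasets import Dataset, DatasetDict

# Placeholder documents standing in for the sampled oncology CPT mixture.
train = Dataset.from_dict({"text": ["First oncology document ...", "Second oncology document ..."]})
validation = Dataset.from_dict({"text": ["A held-out oncology document ..."]})

# Written with save_to_disk so the `path` + `split` entries above can point at the directory.
DatasetDict({"train": train, "validation": validation}).save_to_disk(
    "oncology-cpt-mixture-sampled-0.1-mistral"  # hypothetical local path
)
```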


mloscratch/homes/vignoud/novartis-oncology/models/2025-07-22-16-28_Mistral-Small-24B-Instruct-2501-mixture-sampled_0.1

This model is a fine-tuned version of mistralai/Mistral-Small-24B-Instruct-2501, continued-pretrained on the oncology-cpt-mixture-sampled-0.1-mistral dataset from the axolotl config above. It achieves the following results on the evaluation set:

  • Loss: 1.5015

Model description

More information needed

Intended uses & limitations

More information needed
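
In the absence of more detail, the checkpoint can presumably be loaded like any other Mistral-family causal LM. A minimal sketch with transformers (Hub ID from this repository, bf16 as in the training config, a short generation only for illustration):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "JulienVig/2025-07-22-16-28_Mistral-Small-24B-Instruct-2501-mixture-sampled_0.1"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Training used plain-text completion (continued pretraining), so a raw text
# prompt continued by the model is the most direct use.
inputs = tokenizer("Recent developments in oncology treatment include", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```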

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 1e-06
  • train_batch_size: 2
  • eval_batch_size: 2
  • seed: 42
  • distributed_type: multi-GPU
  • num_devices: 8
  • gradient_accumulation_steps: 4
  • total_train_batch_size: 64 (see the sanity check after this list)
  • total_eval_batch_size: 16
  • optimizer: AdamW (adamw_torch) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_steps: 58
  • num_epochs: 1.0
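
As a quick sanity check (the 8 devices come from the `csub ... -g 8` launch command in the config above), the derived totals follow from the per-device settings:

```python
micro_batch_size = 2
gradient_accumulation_steps = 4
num_devices = 8  # from the 8-GPU launch command

# Sequences consumed per optimizer step during training.
print(micro_batch_size * gradient_accumulation_steps * num_devices)  # 64

# Evaluation runs without gradient accumulation, hence the smaller total.
print(micro_batch_size * num_devices)  # 16

# warmup_ratio = 0.1 over the ~580-590 optimizer steps of the single epoch
# yields the reported 58 warmup steps.
```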

Training results

| Training Loss | Epoch  | Step | Validation Loss |
|---------------|--------|------|-----------------|
| 1.5243        | 0.0017 | 1    | 1.6344          |
| 1.4572        | 0.1006 | 59   | 1.5336          |
| 1.5065        | 0.2011 | 118  | 1.5206          |
| 1.4894        | 0.3017 | 177  | 1.5131          |
| 1.4846        | 0.4022 | 236  | 1.5082          |
| 1.4366        | 0.5028 | 295  | 1.5053          |
| 1.5287        | 0.6033 | 354  | 1.5036          |
| 1.445         | 0.7039 | 413  | 1.5024          |
| 1.4656        | 0.8044 | 472  | 1.5016          |
| 1.4702        | 0.9050 | 531  | 1.5015          |

Framework versions

  • Transformers 4.51.3
  • Pytorch 2.6.0+cu124
  • Datasets 3.5.1
  • Tokenizers 0.21.1