---
license: apache-2.0
language:
  - en
base_model:
  - google/gemma-4-31B-it
datasets:
  - EVA-UNIT-01/Lilith-v0.3
  - zerofata/Gemini-3.1-Pro-GLM5-Characters
  - zerofata/Instruct-Anime
  - zerofata/Anime-AMA-Prose
  - allura-forge/mimo-v2-pro-claude-distill-hs3
  - allura-forge/doubao-seed2.0-distill-multiturn-expr-rp
  - Delta-Vector/Orion-Deepseek-V3-RP-Filtered
  - Delta-Vector/Orion-Deepseek-R1-RP-Filtered
  - Gryphe/ChatGPT-4o-Writing-Prompts
  - Gryphe/Sonnet3.5-Charcard-Roleplay
  - ToastyPigeon/kimi-stories-instruct
  - ToastyPigeon/kimi-rp-v3
  - ToastyPigeon/fujin-filtered-instruct
  - Dxniz/Novelist-CoT
pipeline_tag: image-text-to-text
---

# Gemma-4-31B Musica v1

RP/storygen/conversational tune of Gemma-4-31B-it, the second model in the Musica series, following TQ3.5-27B-Musica-v1. It feels like a decent overall upgrade over the Qwen version, and honestly I've liked it way more than stock Gemma in its domains.

Both reasoning and non-reasoning work, though it sometimes (rarely, thankfully) skips reasoning even when it's enabled; just regen in that case. Reasoning style transfer also seems to work, so prefilling with `<|channel>thought\n Okay, let's see` will make it use DeepSeek-esque reasoning most of the time (see the prefill sketch below). I really liked the instruction following on this one; it's very steerable, the same as or better than base. Refusals are non-existent. Swipe diversity seems quite a bit better than base.

This training run was sponsored by [ArliAI](https://www.arliai.com/).

**Training Notes**

Gemma is a MAJOR pain to train. We had to track down a working Axolotl commit (thanks to ConicCat for suggesting one); it didn't have the hybrid FA-SDPA support, so training ran on pure SDPA, which is le S L O W: ~35 hours for one epoch, compared to 17 hours for two epochs on Qwen. But it seemed to converge to around the same loss earlier than Qwen did, so it probably didn't need more than one epoch.

I used [fizzAI/Kaitan-Pretokenization](https://github.com/fizzAI/Kaitan-Pretokenization/) to pretokenize my dataset at 8192 seqlen (lower than my usual 16384, because Gemma is slow and memory-hungry to train as is), with last-turn-only training to bypass a bouquet of Gemma-specific problems with training reasoning. It seems to have worked (a sketch of the masking idea is below).

r64a64 LoRA, 1e-5, 1 epoch, constant w/ warmup. 35 hours on 2xRTX Pro 6000 Blackwell.

- [allura-forge/musica-sft-v1-gemma4-pretok](https://huggingface.co/datasets/allura-forge/musica-sft-v1-gemma4-pretok) - pretokenized dataset
- [CometML project](https://www.comet.com/aetherwiing/musica-31b/view/osc7iSFULAJH5XIliUhYynkIE/panels) - training graphs and stats
- [AuriAetherwiing/G4-31B-Musica-v1-lora](https://huggingface.co/AuriAetherwiing/G4-31B-Musica-v1-lora) - LoRA adapter

**Recommended Samplers**

- Temperature: 1
- Min-P: 0.02
- NSigma: 2

Don't use repetition penalties of any kind; they do more harm than good.
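For reference, here's roughly what those settings look like as a raw request to a llama.cpp-style completion endpoint. This is a sketch, not a definitive recipe: field names vary across backends, and `top_n_sigma` in particular is my assumption for how NSigma is spelled, so check your backend's docs.

```python
import requests

# Hypothetical local server; adjust host/port/endpoint for your backend.
url = "http://127.0.0.1:8080/completion"

payload = {
    "prompt": "<your formatted chat prompt here>",
    "temperature": 1.0,     # neutral temperature
    "min_p": 0.02,          # Min-P as recommended above
    "top_n_sigma": 2.0,     # NSigma; exact field name is backend-dependent
    "repeat_penalty": 1.0,  # repetition penalty disabled, per the note above
    "n_predict": 512,
}

print(requests.post(url, json=payload).json()["content"])
```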
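And the reasoning prefill mentioned at the top, as a transformers sketch: render the chat template normally, then append the prefill string so the model continues from it. The repo id here is a placeholder, and the prefill text is taken verbatim from the note above.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "<this-model-repo>"  # placeholder; substitute the actual repo id
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", torch_dtype="bfloat16"
)

messages = [{"role": "user", "content": "Write a short scene in a rainy city."}]

# Render the template with the generation prompt, then tack the
# reasoning prefill onto the end so the model continues from it.
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
prompt += "<|channel>thought\n Okay, let's see"

inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512)
print(tok.decode(out[0][inputs["input_ids"].shape[-1]:]))
```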
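If you'd rather run the LoRA adapter linked above on top of the stock instruct model, a minimal PEFT sketch:

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "google/gemma-4-31B-it"
base = AutoModelForCausalLM.from_pretrained(
    base_id, device_map="auto", torch_dtype="bfloat16"
)
# Attach the adapter from the link in the training notes.
model = PeftModel.from_pretrained(base, "AuriAetherwiing/G4-31B-Musica-v1-lora")
tok = AutoTokenizer.from_pretrained(base_id)
```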
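On last-turn-only training: I won't reproduce Kaitan's actual interface here, but the underlying idea is just masking labels for everything except the final assistant turn. A library-agnostic sketch (assumes a chat-template-style tokenizer where the prompt render is a strict prefix of the full render):

```python
def build_last_turn_example(tok, messages):
    """Tokenize a conversation, training only on the final assistant turn."""
    # Everything up to (but not including) the last assistant message,
    # with the generation prompt appended.
    prompt_ids = tok.apply_chat_template(
        messages[:-1], tokenize=True, add_generation_prompt=True
    )
    full_ids = tok.apply_chat_template(messages, tokenize=True)

    # -100 is the ignore index for cross-entropy, so the loss only
    # covers the tokens of the final turn.
    labels = [-100] * len(prompt_ids) + full_ids[len(prompt_ids):]
    return {"input_ids": full_ids, "labels": labels}
```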
**Axolotl config**

<details><summary>See Axolotl config</summary>

```yaml
# =============================================================================
# BASE MODEL
# =============================================================================
base_model: /home/arli/models/gemma-4-31B-it

# =============================================================================
# PLUGINS & KERNEL OPTIMIZATIONS
# =============================================================================
plugins:
  - axolotl.integrations.liger.LigerPlugin # not sure if it works with Gemma 4 but it doesn't crash at least
  - axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin # must have! KV cache is too expensive otherwise

cut_cross_entropy: true
liger_rope: true
liger_rms_norm: true
liger_layer_norm: true
liger_glu_activation: true
liger_rms_norm_gated: true

# =============================================================================
# QUANTIZATION
# =============================================================================
load_in_8bit: false
load_in_4bit: false

# =============================================================================
# DATASET
# =============================================================================
shuffle_merged_datasets: true
datasets:
  - path: allura-forge/musica-sft-v1-gemma4-pretok # finally, pretokenized datasets
    ds_type: parquet
    type:

dataset_prepared_path: ./last_run_prepared
val_set_size: 0

# =============================================================================
# OUTPUT & ADAPTER
# =============================================================================
output_dir: ./outputs/v1
adapter: lora
save_safetensors: true

# =============================================================================
# SEQUENCE & SAMPLE PACKING
# =============================================================================
sequence_len: 8192 # ideally 16384 but Gemma 4 31B has too expensive KV cache
sample_packing: true # DOES in fact work with SDPA
pad_to_sequence_len: false

# =============================================================================
# LORA
# =============================================================================
lora_r: 64
lora_alpha: 64
lora_dropout: 0.0
lora_target_modules: 'model.language_model.layers.[\d]+.(_checkpoint_wrapped_module.)?(mlp|self_attn).(up|down|gate|q|k|v|o)_proj' # ... lists were too easy? We have regex now
lora_mlp_kernel: false
lora_qkv_kernel: false
lora_o_kernel: false

# =============================================================================
# TRAINING HYPERPARAMETERS
# =============================================================================
gradient_accumulation_steps: 8
micro_batch_size: 1
num_epochs: 2
optimizer: adamw_torch_fused
lr_scheduler: constant_with_warmup
learning_rate: 1e-5
warmup_ratio: 0.05
max_grad_norm: 0.5
weight_decay: 0.05

# =============================================================================
# PRECISION
# =============================================================================
bf16: auto

# =============================================================================
# ATTENTION
# =============================================================================
sdp_attention: true
#flash_attention: true # Doesn't work on Gemma 4 currently
#flex_attention: true # up to 40% less memory use with compile, but slower than SDPA
#torch_compile: true # speed up, but unreliable and breaks often
#gemma4_hybrid_attn_impl: true

# =============================================================================
# LOGGING & MONITORING
# =============================================================================
use_comet: true # install comet-ml with pip and do comet login before starting
comet_project_name: musica-31b
logging_steps: 1

# =============================================================================
# CHECKPOINTING & SAVING
# =============================================================================
auto_resume_from_checkpoints: false
evals_per_epoch: 0
saves_per_epoch: 4
save_total_limit: 4
gradient_checkpointing: false
gradient_checkpointing_kwargs:
  use_reentrant: false

# =============================================================================
# FSDP
# =============================================================================
fsdp_config:
  fsdp_version: 2
  offload_params: false
  cpu_ram_efficient_loading: false
  auto_wrap_policy: TRANSFORMER_BASED_WRAP
  transformer_layer_cls_to_wrap: Gemma4TextDecoderLayer
  state_dict_type: FULL_STATE_DICT
  sharding_strategy: FULL_SHARD
  reshard_after_forward: true
  activation_checkpointing: true
```

</details>
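Since `lora_target_modules` is a regex here rather than the usual list, it's worth sanity-checking it against your model's module names before committing 35 GPU-hours. A quick offline check (the sample names are illustrative; dump your model's real ones with `model.named_modules()`):

```python
import re

# The same pattern as in the config above.
pattern = re.compile(
    r"model.language_model.layers.[\d]+."
    r"(_checkpoint_wrapped_module.)?(mlp|self_attn).(up|down|gate|q|k|v|o)_proj"
)

samples = [
    "model.language_model.layers.0.self_attn.q_proj",
    "model.language_model.layers.17._checkpoint_wrapped_module.mlp.up_proj",
    "model.language_model.embed_tokens",  # should NOT match
]

for name in samples:
    print(f"{name}: {'match' if pattern.fullmatch(name) else 'no match'}")
```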