Gemma-4-31B Musica v1

RP/storygen/conversational tune of Gemma-4-31B-it, the second model in the Musica series, following TQ3.5-27B-Musica-v1. Feels like a decent overall upgrade over the Qwen version, and honestly I've liked it way more than stock Gemma in its domains.

Both reasoning and non-reasoning work, though it sometimes (rarely, thankfully) might skip reasoning even if it's enabled; just regen in that case. Steering the reasoning style also seems to work, so prefilling with <|channel>thought\n Okay, let's see will make it use DeepSeek-esque reasoning most of the time.
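For reference, here's a minimal sketch of applying that prefill through a raw text-completions endpoint. The localhost URL, the choice of a completions route, and min_p support are assumptions about your backend; the prefill string is the one quoted above.

# Sketch: prefill the model's turn to nudge it toward DeepSeek-esque reasoning.
# Assumes a local OpenAI-compatible /v1/completions server (URL is a placeholder).
import requests
from transformers import AutoTokenizer

MODEL = "AuriAetherwiing/G4-31B-Musica-v1"
PREFILL = "<|channel>thought\n Okay, let's see"  # prefill string from the note above

tok = AutoTokenizer.from_pretrained(MODEL)
prompt = tok.apply_chat_template(
    [{"role": "user", "content": "Write the opening scene of a heist story."}],
    tokenize=False,
    add_generation_prompt=True,
) + PREFILL  # append the prefill so the model continues from inside its "thought"

resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": MODEL,
        "prompt": prompt,
        "max_tokens": 512,
        "temperature": 1.0,
        "min_p": 0.02,  # only if your backend accepts min_p here
    },
    timeout=300,
)
print(PREFILL + resp.json()["choices"][0]["text"])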

Really liked the instruction following on this one; it's very steerable, the same as or better than base. Refusals are non-existent. Swipe diversity seems quite a bit better than base.

This training run was sponsored by ArliAI.

Training Notes

Gemma is a MAJOR pain to train. We had to track down a working Axolotl commit (thanks to ConicCat for suggesting one), and that commit didn't have the hybrid FA-SDPA support, so training ran on pure SDPA, which is le S L O W: one epoch took ~35 hours, compared to 17 hours for two epochs on Qwen. But it seemed to converge to around the same loss earlier than Qwen did, so it probably didn't need more than 1 epoch.

I used fizzAI/Kaitan-Pretokenization to pretokenize my dataset with an 8192 seqlen (lower than my usual 16384, because Gemma is slow and memory-hungry to train as is) and last-turn-only training, to bypass a bouquet of Gemma-specific problems with training reasoning. It seems to have worked.
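For context, last-turn-only training just means the loss is masked everywhere except the final assistant reply. A minimal illustrative sketch of that masking (a hypothetical helper, not the actual Kaitan-Pretokenization code):

# Sketch of last-turn-only label masking: only tokens of the final assistant turn
# keep real labels; everything before them gets the ignore index (-100) so it
# contributes no loss. Illustrative only, not the fizzAI/Kaitan-Pretokenization code.
from transformers import AutoTokenizer

IGNORE_INDEX = -100

def build_last_turn_example(tokenizer, messages):
    # Tokenize the whole conversation, and the same conversation minus the last
    # (assistant) turn with the generation prompt appended, which should be a
    # prefix of the full tokenization for chat templates like Gemma's.
    full = tokenizer.apply_chat_template(messages, tokenize=True)
    context = tokenizer.apply_chat_template(
        messages[:-1], tokenize=True, add_generation_prompt=True
    )
    labels = [IGNORE_INDEX] * len(context) + full[len(context):]
    return {"input_ids": full, "labels": labels}

tok = AutoTokenizer.from_pretrained("AuriAetherwiing/G4-31B-Musica-v1")
example = build_last_turn_example(tok, [
    {"role": "user", "content": "Hi there!"},
    {"role": "assistant", "content": "Hello! How can I help?"},
])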

r=64, alpha=64 LoRA, 1e-5 LR, 1 epoch, constant schedule w/ warmup. 35 hours on 2x RTX Pro 6000 Blackwell.

allura-forge/musica-sft-v1-gemma4-pretok - pretokenized dataset.

CometML project - training graphs and stats.

AuriAetherwiing/G4-31B-Musica-v1-lora - LoRA adapter.

Recommended Samplers

  • Temperature: 1

  • Min-P: 0.02

  • NSigma: 2

Don't use repetition penalties of any kind; they do more harm than good.
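A minimal sketch of passing these samplers through an OpenAI-compatible client. The localhost URL is a placeholder, and the top-nsigma key in extra_body is an assumption; its exact name (and whether it exists at all) depends on the backend.

# Sketch: recommended samplers via an OpenAI-compatible chat endpoint.
# min_p and top-nsigma are backend-specific extras, so they go through extra_body;
# the "top_n_sigma" key name is an assumption to check against your server's docs.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")  # placeholder server

resp = client.chat.completions.create(
    model="AuriAetherwiing/G4-31B-Musica-v1",
    messages=[{"role": "user", "content": "Continue the scene."}],
    temperature=1.0,
    max_tokens=512,
    extra_body={
        "min_p": 0.02,
        "top_n_sigma": 2,           # assumed key name for NSigma
        "repetition_penalty": 1.0,  # i.e. off, per the note above
    },
)
print(resp.choices[0].message.content)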

Axolotl config

# =============================================================================
# BASE MODEL
# =============================================================================
base_model: /home/arli/models/gemma-4-31B-it


# =============================================================================
# PLUGINS & KERNEL OPTIMIZATIONS
# =============================================================================
plugins:
  - axolotl.integrations.liger.LigerPlugin # not sure if it works with Gemma 4 but it doesn't crash at least
  - axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin # must have! KV cache is too expensive otherwise
cut_cross_entropy: true
liger_rope: true
liger_rms_norm: true
liger_layer_norm: true
liger_glu_activation: true
liger_rms_norm_gated: true


# =============================================================================
# QUANTIZATION
# =============================================================================
load_in_8bit: false
load_in_4bit: false


# =============================================================================
# DATASET
# =============================================================================
shuffle_merged_datasets: true
datasets:
  - path: allura-forge/musica-sft-v1-gemma4-pretok # finally, pretokenized datasets
    ds_type: parquet
    type:

dataset_prepared_path: ./last_run_prepared
val_set_size: 0


# =============================================================================
# OUTPUT & ADAPTER
# =============================================================================
output_dir: ./outputs/v1
adapter: lora
save_safetensors: true


# =============================================================================
# SEQUENCE & SAMPLE PACKING
# =============================================================================
sequence_len: 8192 # ideally 16384 but Gemma 4 31B has too expensive KV cache
sample_packing: true # DOES in fact work with SDPA
pad_to_sequence_len: false


# =============================================================================
# LORA
# =============================================================================
lora_r: 64
lora_alpha: 64
lora_dropout: 0.0
lora_target_modules: 'model.language_model.layers.[\d]+.(_checkpoint_wrapped_module.)?(mlp|self_attn).(up|down|gate|q|k|v|o)_proj' # ... lists were too easy? We have regex now
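# The regex above targets the q/k/v/o attention projections and the up/down/gate
# MLP projections in every decoder layer; the optional _checkpoint_wrapped_module
# segment keeps it matching when layers are wrapped by activation checkpointing.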

lora_mlp_kernel: false
lora_qkv_kernel: false
lora_o_kernel: false


# =============================================================================
# TRAINING HYPERPARAMETERS
# =============================================================================
gradient_accumulation_steps: 8
micro_batch_size: 1
num_epochs: 2
optimizer: adamw_torch_fused
lr_scheduler: constant_with_warmup
learning_rate: 1e-5
warmup_ratio: 0.05
max_grad_norm: 0.5
weight_decay: 0.05

# =============================================================================
# PRECISION
# =============================================================================
bf16: auto


# =============================================================================
# ATTENTION
# =============================================================================
sdp_attention: true
#flash_attention: true # Doesn't work on Gemma 4 currently
#flex_attention: true # up to 40% less memory use with compile, but slower than SDPA
#torch_compile: true # speed up, but unreliable and breaks often
#gemma4_hybrid_attn_impl: true

# =============================================================================
# LOGGING & MONITORING
# =============================================================================
use_comet: true # install comet-ml with pip and do comet login before starting
comet_project_name: musica-31b
logging_steps: 1


# =============================================================================
# CHECKPOINTING & SAVING
# =============================================================================
auto_resume_from_checkpoints: false
evals_per_epoch: 0
saves_per_epoch: 4
save_total_limit: 4

gradient_checkpointing: false
gradient_checkpointing_kwargs:
  use_reentrant: false


# =============================================================================
# FSDP
# =============================================================================
fsdp_config:
  fsdp_version: 2
  offload_params: false
  cpu_ram_efficient_loading: false
  auto_wrap_policy: TRANSFORMER_BASED_WRAP
  transformer_layer_cls_to_wrap: Gemma4TextDecoderLayer
  state_dict_type: FULL_STATE_DICT
  sharding_strategy: FULL_SHARD
  reshard_after_forward: true
  activation_checkpointing: true