With QLoRA, even a 4GB GeForce 30x0-series card can often fine-tune LLMs of up to roughly 3B parameters. Training from scratch, however, would likely be quite challenging.
For fine-tuning very small models like GPT-2, plain LoRA should suffice:
What you can realistically do on an RTX 3050 (4GB)
“Train GPT-2” usually means fine-tune a pretrained checkpoint
GPT-2 is a causal language model: it learns to predict the next token from previous tokens. (Hugging Face)
- Fine-tuning / continued pretraining on your own text is feasible on 4GB (especially for distilgpt2 and gpt2).
- Training from scratch is technically possible but usually not practical for a beginner on limited compute because it needs much more data + compute than fine-tuning. (Hugging Face)
Pick the right GPT-2 size for your GPU
- Start with distilgpt2 (smallest) to validate your pipeline.
- Then try openai-community/gpt2 (124M). The official Transformers examples use that checkpoint. (GitHub)
Avoid gpt2-medium/large/xl on 4GB unless you use adapter methods and aggressive settings.
The core problem on 4GB: VRAM is mostly eaten by activations
Training memory is not just the weights; it’s also intermediate activations and optimizer states. The most reliable levers you have are exactly what the Transformers GPU efficiency guide highlights:
- smaller batch size
- gradient accumulation
- gradient checkpointing
- mixed precision
- optimizer choice
- (optionally) SDPA / torch.compile / torch_empty_cache_steps (Hugging Face)
You’ll use a combination of these.
Recommended path (beginner-friendly, stable): use run_clm.py
Hugging Face maintains a reference script for causal LM training: run_clm.py. It supports:
- training on Hub datasets or your own text files
- Trainer-based training
- optional streaming for huge datasets (GitHub)
Step 0 — Install essentials
The causal language modeling task guide uses:
transformers, datasets, and evaluate. (Hugging Face)
(You’ll also want accelerate for cleaner mixed precision / device handling in many setups.)
Step 1 — Confirm your pipeline works (tiny dataset, tiny settings)
The causal LM task guide explicitly recommends starting with a small slice so you can confirm everything runs before spending hours training. (Hugging Face)
Option A: use a Hub dataset (quick test)
The Transformers examples show how to fine-tune GPT-2 on WikiText-2 with run_clm.py. (GitHub)
On 4GB, modify the example to be memory-safe:
python run_clm.py \
--model_name_or_path openai-community/gpt2 \
--dataset_name wikitext \
--dataset_config_name wikitext-2-raw-v1 \
--do_train --do_eval \
--output_dir ./out_gpt2_wikitext \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 32 \
--block_size 128 \
--fp16 \
--gradient_checkpointing \
--eval_steps 500 --logging_steps 50 --save_steps 500
What each “efficiency” flag does
per_device_train_batch_size=1: smallest micro-batch to fit VRAM.
gradient_accumulation_steps=32: simulates an “effective batch size” of 32 without needing it in VRAM. Gradient accumulation is explicitly for training with larger effective batches than memory allows. (Hugging Face)
block_size=128: shorter sequences drastically reduce activation memory.
--fp16: mixed precision; saves activation memory and can speed up training. (Hugging Face)
--gradient_checkpointing: trades ~20% speed for lower activation memory. (Hugging Face)
Important: if you enable gradient checkpointing, disable KV-cache during training (it wastes memory and is incompatible in many setups). The common workaround is use_cache=False during training. (GitHub)
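If you later write your own training script instead of using run_clm.py, the usual pattern is a minimal sketch like this:

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")

# The KV-cache only helps at generation time; during training it wastes memory
# and conflicts with gradient checkpointing in many setups.
model.config.use_cache = False
model.gradient_checkpointing_enable()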
Step 2 — Train on your own text files (most common beginner setup)
The same examples README shows the canonical way to train on your own files: pass --train_file and --validation_file. (GitHub)
Prepare your data
For a first working run, plain text files are enough: put your training text in train.txt and a small held-out portion in valid.txt (run_clm.py also accepts CSV/JSON files).
If your dataset is huge or disk is limited, streaming is available in Datasets and supported in run_clm.py via --streaming. (GitHub)
Run training (4GB-safe baseline)
python run_clm.py \
--model_name_or_path distilgpt2 \
--train_file ./train.txt \
--validation_file ./valid.txt \
--do_train --do_eval \
--output_dir ./out_distilgpt2_custom \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 32 \
--block_size 128 \
--fp16 \
--gradient_checkpointing \
--save_total_limit 2 \
--logging_steps 50
Once this works reliably, switch distilgpt2 → openai-community/gpt2.
Settings that matter most on 4GB (practical defaults)
1) Sequence length (block_size)
- Start: 128
- If stable: 256
- 512+ is usually painful on 4GB unless everything else is extremely optimized.
2) Batch size and accumulation
- Micro-batch: 1
- Accumulation: 16–64 (start 32)
3) Mixed precision
- Use fp16 first. (Hugging Face)
(BF16 requires Ampere or newer; an RTX 30-series card does support it, but fp16 is the usual baseline.)
4) Gradient checkpointing
- Turn it on when you need memory headroom; expect slower training. (Hugging Face)
5) Optimizer memory (optional but helpful)
If you still struggle, consider 8-bit optimizers to reduce optimizer-state memory. bitsandbytes documents that 8-bit optimizer states can cut memory substantially and are most useful on memory-constrained GPUs. (Hugging Face)
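With the Trainer (and thus run_clm.py), this is typically just an optimizer name; a sketch, assuming bitsandbytes is installed and a reasonably recent Transformers (optimizer names can vary by version):

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./out_gpt2_8bit_opt",
    optim="adamw_bnb_8bit",   # stores optimizer states in 8-bit via bitsandbytes
    per_device_train_batch_size=1,
    gradient_accumulation_steps=32,
    fp16=True,
    gradient_checkpointing=True,
)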
GPT-2-specific gotchas (common beginner pitfalls)
1) Padding direction
GPT-2 uses absolute positional embeddings, and the docs recommend right padding. (Hugging Face)
(Using run_clm.py with concatenation into fixed blocks often reduces the need for padding altogether.)
2) “pad_token = eos_token” can break EOS behavior
A common trick is to set pad_token equal to eos_token, but multiple reports show a real failure mode: the data collator masks pad tokens in labels, so EOS may never be learned properly if EOS is treated as PAD. (Hugging Face Forums)
Practical fix: avoid padding (block packing), or use a real pad token if your tokenizer/model setup supports it.
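A minimal sketch of the "real pad token" option (the new token's embedding starts untrained, which is usually fine since pad positions are masked out of the loss):

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")

# Add a dedicated [PAD] token instead of reusing EOS,
# then resize the embedding matrix to match the new vocabulary size.
tokenizer.add_special_tokens({"pad_token": "[PAD]"})
model.resize_token_embeddings(len(tokenizer))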
3) Label shifting confusion (don’t double-shift)
For causal LM in Transformers, it’s normal that “labels look like input_ids” because the shift happens inside the model, not in the collator. (Hugging Face Forums)
If you manually shift labels and the model also shifts, training quality can collapse.
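A quick way to see this (sketch with distilgpt2): pass labels that are just a copy of input_ids and let the model do the shift:

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilgpt2")

# Labels are a copy of input_ids; the model shifts them internally when
# computing the next-token loss, so do not shift them yourself.
enc = tokenizer("The quick brown fox jumps over the lazy dog.", return_tensors="pt")
out = model(input_ids=enc["input_ids"], labels=enc["input_ids"])
print(out.loss)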
4) Saving/loading correctly
For PEFT/LoRA training especially, the recommended approach is save_pretrained / from_pretrained, not raw torch.save. (Hugging Face Forums)
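A minimal save/reload sketch (directory names are placeholders; for a PEFT/LoRA model, save_pretrained writes only the small adapter weights):

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("distilgpt2")
tokenizer = AutoTokenizer.from_pretrained("distilgpt2")

# ... training happens here ...

# Save with the library API rather than raw torch.save.
model.save_pretrained("./my_finetuned_gpt2")
tokenizer.save_pretrained("./my_finetuned_gpt2")

# Reload later the same way.
model = AutoModelForCausalLM.from_pretrained("./my_finetuned_gpt2")
tokenizer = AutoTokenizer.from_pretrained("./my_finetuned_gpt2")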
“Efficient” training checklist (use this when you hit OOM)
- Reduce block_size (sequence length)
- Set per_device_train_batch_size=1
- Increase gradient_accumulation_steps
- Enable --fp16 (Hugging Face)
- Enable --gradient_checkpointing and set use_cache=False (Hugging Face)
- Disable evaluation until training is stable (--do_eval off, or eval less frequently)
- Consider an 8-bit optimizer if optimizer states are the problem (Hugging Face)
If you want “even more efficient”: LoRA / QLoRA (when it makes sense)
When LoRA helps
LoRA trains small adapter matrices instead of all weights, so optimizer/gradient memory drops. Transformers has a PEFT intro, and the HF course also covers LoRA concepts. (Hugging Face)
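A minimal LoRA sketch for GPT-2 with PEFT (the hyperparameters here are illustrative, not tuned):

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")

lora_config = LoraConfig(
    task_type="CAUSAL_LM",
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["c_attn"],  # GPT-2's fused QKV projection layer
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable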
When QLoRA helps
QLoRA quantizes the base model to 4-bit and trains LoRA adapters on top. PEFT explicitly explains that quantized models aren’t typically trained directly, but PEFT adapters make it viable. (Hugging Face)
Hugging Face’s 4-bit/QLoRA blog gives practical rules of thumb like NF4 and double quant for memory. (Hugging Face)
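A QLoRA-style loading sketch (assumes bitsandbytes and a CUDA GPU; combine with the LoRA config above):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import prepare_model_for_kbit_training

# NF4 + double quantization, as suggested in the 4-bit blog post.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    "openai-community/gpt2", quantization_config=bnb_config
)
model = prepare_model_for_kbit_training(model)
# ...then wrap it with get_peft_model(...) exactly as in the LoRA snippet above.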
For GPT-2 specifically
On 4GB, QLoRA is usually more useful if you’re trying to push beyond GPT-2 small. For GPT-2 small / distilgpt2, the “Trainer + fp16 + checkpointing + accumulation” route is often the simpler and more stable place to start.
High-quality learning resources (papers, guides, code)
Background (why GPT-2 works)
- GPT-2 paper (OpenAI): Language Models are Unsupervised Multitask Learners (OpenAI)
- OpenAI blog post: Better language models and their implications (OpenAI)
Best “do it yourself” training references
- Transformers Causal language modeling task guide (Hugging Face)
- Transformers examples README + run_clm.py commands (GitHub)
- Transformers GPU efficiency guide (single GPU) (Hugging Face)
- HF course: Train a causal LM from scratch (good to understand the workflow, even if you won’t do full pretraining locally) (Hugging Face)
LoRA / QLoRA and quantization
- PEFT docs / Transformers PEFT integration (Hugging Face)
- Hugging Face 4-bit / QLoRA blog post (Hugging Face)
- bitsandbytes 8-bit optimizer docs (Hugging Face)
“See real training code”
- nanoGPT (educational GPT training repo; shows the full minimal stack, but it targets bigger hardware for full reproduction) (GitHub)
A concrete starter recommendation for your GPU
If you want one single “likely-to-work” configuration to start:
- Model: distilgpt2 → then openai-community/gpt2 (Hugging Face)
- block_size=128
- batch_size=1
- grad_accum=32
- fp16=True (Hugging Face)
- gradient_checkpointing=True + use_cache=False (Hugging Face)
- First run: a few thousand examples (prove it works), then scale up (Hugging Face)
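If you later move from run_clm.py to your own Trainer script, the same settings map onto TrainingArguments roughly like this (a sketch; block_size is a run_clm.py data argument handled at tokenization time, not here):

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./out_distilgpt2_custom",
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=32,
    fp16=True,
    gradient_checkpointing=True,
    logging_steps=50,
    save_total_limit=2,
)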