Fine-tuning LLaMA 3 used to require expensive compute and ML infrastructure expertise. In 2025, a full QLoRA fine-tune of LLaMA 3 8B on a custom dataset costs $3–$15 in cloud GPU time. This guide walks through exactly how to do it — GPU selection, dataset prep, training config, and what to watch for — so you get a usable fine-tuned model without burning money.
What You'll Build
By the end of this guide, you'll have a fine-tuned LLaMA 3 model adapted to your specific task — customer support, code generation, domain Q&A, classification, or instruction-following in a custom style. Cost: $5–$30 depending on dataset size and GPU. Time: 2–6 hours total (mostly waiting).
The Budget Breakdown
| Task | GPU | Provider | Est. Time | Est. Cost |
|---|---|---|---|---|
| LLaMA 3 8B QLoRA, 1K examples, 3 epochs | RTX 4090 (24GB) | Vast.ai spot | 1.5–2 hrs | $0.70–$1.50 |
| LLaMA 3 8B QLoRA, 10K examples, 3 epochs | RTX 4090 (24GB) | RunPod spot | 4–6 hrs | $2–$4 |
| LLaMA 3 8B QLoRA, 50K examples, 3 epochs | A100 80GB | RunPod spot | 8–12 hrs | $12–$20 |
| LLaMA 3 70B QLoRA, 10K examples, 3 epochs | 2× A100 80GB | Lambda Labs | 10–16 hrs | $35–$55 |
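The figures above are just hourly spot rate times wall-clock hours. A quick sanity-check sketch in Python (the $0.45/hr RTX 4090 rate is an illustrative assumption; actual spot prices fluctuate by the hour):

```python
# Cost = spot rate ($/hr) x training time (hrs). The rate below is an
# illustrative assumption, not a live provider quote.
def estimate_cost(rate_per_hour, hours_lo, hours_hi):
    """Return a (low, high) dollar range for a training run."""
    return rate_per_hour * hours_lo, rate_per_hour * hours_hi

# Assumed RTX 4090 spot rate of ~$0.45/hr for the 1K-example, 1.5-2 hr row
low, high = estimate_cost(0.45, 1.5, 2.0)
print(f"${low:.2f}-${high:.2f}")
```

Re-run the arithmetic with the rates you actually see on the provider dashboard before committing to a long run.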
Step 1: Prepare Your Dataset
QLoRA fine-tuning works best with instruction-response pairs in a consistent format. The minimum viable dataset is 500–1,000 high-quality examples; data quality matters more than volume. Format your data as JSONL with 'instruction', 'input', and 'output' fields (Alpaca format), or as 'messages' arrays in ChatML format for chat models.
- Alpaca format: {"instruction": "...", "input": "...", "output": "..."} — works with Axolotl, LLaMA-Factory
- ChatML format: {"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
- Minimum 500 examples for style transfer or simple task adaptation
- 1,000–5,000 examples for domain-specific knowledge
- 10K+ examples for complex instruction following or significant capability addition
- Filter for quality over quantity — one bad example can hurt more than ten good ones help
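To get data into the Alpaca shape, here is a minimal writer/validator sketch. The field names follow the Alpaca convention described above; the example record and the train.jsonl filename are placeholders:

```python
import json

REQUIRED_FIELDS = ("instruction", "input", "output")

def write_alpaca_jsonl(examples, path):
    """Validate that each example has the Alpaca fields, then write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for i, ex in enumerate(examples):
            missing = [k for k in REQUIRED_FIELDS if k not in ex]
            if missing:
                raise ValueError(f"example {i} missing fields: {missing}")
            f.write(json.dumps(ex, ensure_ascii=False) + "\n")

examples = [
    {"instruction": "Summarize the ticket.",
     "input": "Customer cannot log in after password reset.",
     "output": "Login failure following password reset."},
]
write_alpaca_jsonl(examples, "train.jsonl")
```

Running a validator like this before you rent a GPU is cheap insurance: a malformed record discovered mid-training wastes paid compute.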
Step 2: Pick the Right GPU and Cloud Provider
For LLaMA 3 8B QLoRA, an RTX 4090 (24 GB VRAM) is the sweet spot. It's 3× cheaper than an A100 and has enough VRAM for 8B QLoRA with batch size 4–8. For LLaMA 3 70B, you need 2× A100 80GB (tensor parallel) or a single H100 80GB with aggressive INT4 quantization.
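Why 24 GB is enough for 8B: in 4-bit quantization the base weights take roughly half a byte per parameter, and the LoRA adapters, optimizer state, activations, and CUDA context add a few GB on top. A back-of-envelope sketch (all figures are loose assumptions, not measurements):

```python
# Rough QLoRA VRAM estimate. bytes_per_weight=0.5 approximates 4-bit (NF4)
# quantization; overhead_gb is an assumed flat allowance for LoRA params,
# optimizer state, activations, and CUDA context.
def qlora_vram_gb(n_params_billion, bytes_per_weight=0.5, overhead_gb=6.0):
    return n_params_billion * bytes_per_weight + overhead_gb

print(qlora_vram_gb(8))   # ~10 GB: fits a 24 GB RTX 4090 with batch-size headroom
print(qlora_vram_gb(70))  # ~41 GB: needs an 80 GB-class GPU (or two)
```

The headroom above the estimate is what lets you raise batch size or sequence length; when the estimate approaches the card's capacity, expect OOMs at anything beyond batch size 1.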
Use spot instances from RunPod or Vast.ai — they're 40–70% cheaper than on-demand, and Axolotl supports automatic checkpointing every N steps. If your spot instance is interrupted, you resume from the last checkpoint and lose at most 30 minutes of compute.
Step 3: Set Up Your Training Environment
Axolotl is the easiest framework for LLaMA 3 fine-tuning — it handles data formatting, QLoRA config, checkpointing, and logging with a simple YAML config. Start your RunPod or Vast.ai instance with the PyTorch 2.2 + CUDA 12.1 template, then:
1. pip install axolotl[flash-attn,deepspeed] — installs everything, including the QLoRA dependencies
2. Upload your dataset JSONL file (or reference a HuggingFace dataset ID)
3. Create your Axolotl config YAML (model name, dataset path, LoRA rank, learning rate, epochs)
4. Request a HuggingFace access token for LLaMA 3 (meta-llama/Meta-Llama-3-8B-Instruct requires approval)
5. Run: accelerate launch -m axolotl.cli.train your_config.yml
6. Monitor GPU utilization with nvidia-smi — it should stay at 90%+ during training
Step 4: Key QLoRA Configuration Settings
- LoRA rank (r): 16–64. Higher rank = more capacity but more VRAM. Start with r=32.
- LoRA alpha: usually 2× the rank (e.g., alpha=64 with r=32). Controls the scaling of LoRA updates.
- Target modules: q_proj, v_proj, k_proj, o_proj, gate_proj, up_proj, down_proj — target all attention and MLP layers for best results.
- Learning rate: 2e-4 for small datasets, 1e-4 for larger ones. Use a cosine schedule with warmup.
- Batch size: as large as fits in VRAM. Use gradient accumulation to simulate larger batches (e.g., batch_size=2 with gradient_accumulation=8 gives an effective batch of 16).
- Epochs: 3–5 for most fine-tuning tasks. Watch validation loss — stop early if it plateaus.
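Putting the settings above together, a sample Axolotl config might look like the following. Key names follow Axolotl's YAML schema, but verify them against the version you install; the dataset path and output directory are placeholders:

```yaml
base_model: meta-llama/Meta-Llama-3-8B-Instruct
load_in_4bit: true
adapter: qlora

lora_r: 32
lora_alpha: 64
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - k_proj
  - v_proj
  - o_proj
  - gate_proj
  - up_proj
  - down_proj

datasets:
  - path: train.jsonl
    type: alpaca

num_epochs: 3
learning_rate: 2e-4
lr_scheduler: cosine
micro_batch_size: 2
gradient_accumulation_steps: 8

flash_attention: true
output_dir: ./outputs
```

Start from one of the example configs shipped in the Axolotl repo and change only the fields above; the defaults are sane for most runs.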
Step 5: Merge and Export Your Model
After training, you have LoRA adapter weights — a small set of diff weights rather than the full model. You can use these directly with PEFT, or merge them into the base model for easier deployment. Merging produces a standard model checkpoint that works with Ollama, vLLM, or any Transformers-compatible inference stack.
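A merge sketch using PEFT's merge_and_unload. The paths are placeholders, and this assumes the transformers and peft packages, approved access to the gated LLaMA 3 repo, and enough memory to hold the full bf16 model:

```python
# Merge LoRA adapter weights into the base model (sketch; paths are placeholders).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct", torch_dtype=torch.bfloat16
)
model = PeftModel.from_pretrained(base, "./outputs")  # adapter dir from training
merged = model.merge_and_unload()                     # folds LoRA deltas into the base weights
merged.save_pretrained("./merged-model")
AutoTokenizer.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct"
).save_pretrained("./merged-model")
```

The ./merged-model directory is then a standard checkpoint you can point vLLM at, or convert to GGUF for Ollama.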
What to Watch Out For
- Overfitting: If train loss keeps dropping but val loss stops improving, stop early. Common with small datasets.
- Catastrophic forgetting: If the model gets worse at general tasks, reduce training epochs or use a smaller LoRA rank.
- OOM during training: Lower batch size first, then reduce LoRA rank, then switch to an INT4 base model (load_in_4bit: true).
- Slow training: Flash Attention 2 is ~2× faster than standard attention — make sure flash_attention: true is set.
- Spot interruption: Axolotl saves checkpoints to output_dir automatically. Restart with resume_from_checkpoint: true.