Running open-source LLMs like LLaMA 3, Mistral, Qwen, or Gemma is now mainstream — but GPU cloud costs can spiral quickly if you haven't matched your workload to the right GPU and provider. This guide covers which GPU you actually need for common open-source models, and where to find the cheapest options.
GPU VRAM Requirements by Model Size
| Model | VRAM (FP16) | VRAM (INT8) | VRAM (INT4/GGUF) | Minimum GPU |
|---|---|---|---|---|
| LLaMA 3 8B | 16 GB | 9 GB | 5 GB | RTX 4090 or L4 |
| LLaMA 3 70B | 140 GB | 70 GB | 35 GB | 2× A100 80GB (INT8) |
| Mistral 7B | 14 GB | 8 GB | 5 GB | RTX 4090 or L4 |
| Mistral Large (123B) | 246 GB | 123 GB | 62 GB | 4× A100 80GB |
| Qwen 2.5 72B | 144 GB | 72 GB | 36 GB | 2× A100 (INT8) |
| Gemma 2 9B | 18 GB | 10 GB | 6 GB | L4 or RTX 4090 |
| DeepSeek R1 7B (distill) | 14 GB | 8 GB | 5 GB | RTX 4090 or L4 |
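The table's figures follow from a simple rule of thumb: parameter count times bytes per weight. A minimal sketch (the optional overhead factor for KV cache and activations is an assumption for illustration, not a measured figure):

```python
def estimate_vram_gb(params_billion: float, bits_per_weight: int,
                     overhead: float = 0.0) -> float:
    """Rough VRAM needed to hold a model's weights.

    params_billion: parameter count in billions (e.g. 8 for LLaMA 3 8B)
    bits_per_weight: 16 for FP16, 8 for INT8, 4 for INT4/GGUF
    overhead: extra fraction for KV cache / activations (assumed, workload-dependent)
    """
    weight_gb = params_billion * bits_per_weight / 8  # 1B params at 8 bits = 1 GB
    return weight_gb * (1 + overhead)

# Matches the table: LLaMA 3 8B in FP16 needs ~16 GB for weights alone,
# and a 70B model in INT4 fits in ~35 GB.
print(estimate_vram_gb(8, 16))   # 16.0
print(estimate_vram_gb(70, 4))   # 35.0
```

In practice, leave several GB of headroom beyond the weight footprint for the KV cache, which grows with context length and batch size.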
Cheapest GPUs for Inference by Use Case
7B–13B Models (Development & Low-Volume Production)
For models in the 7B–13B range, an RTX 4090 (24 GB VRAM) or an NVIDIA L4 (24 GB) is the sweet spot. These GPUs cost $0.39–$0.89/hr on marketplace providers like RunPod and Vast.ai — often 3–5× cheaper than renting an A100. You can run LLaMA 3 8B in INT8 on a single RTX 4090 with room to spare.
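To see where the 3–5× figure comes from, compare monthly spend at the rates quoted in this guide (illustrative numbers; marketplace prices change often, so check current listings):

```python
# Illustrative hourly rates from the text; verify current provider pricing.
rtx_4090_hr = 0.39   # low end of the $0.39-$0.89 marketplace range
a100_hr = 1.59       # RunPod A100 on-demand rate cited later in this guide

dev_hours_per_month = 160  # assumed full-time development workload

monthly_4090 = rtx_4090_hr * dev_hours_per_month
monthly_a100 = a100_hr * dev_hours_per_month
print(f"RTX 4090: ${monthly_4090:.2f}/mo, A100: ${monthly_a100:.2f}/mo "
      f"({a100_hr / rtx_4090_hr:.1f}x)")
```

At these rates the A100 costs roughly 4× more for a workload the 4090 handles comfortably.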
70B Models (Production Serving)
70B models require either a single H100/A100 with INT4 quantization, or two A100 80GB GPUs in tensor-parallel mode. The cheapest production-grade option is typically two A100 80GB PCIe instances on Lambda Labs or RunPod, which cost roughly $2.60–$3.50/hr total — versus a single H100 at $2.49–$4.00/hr. For throughput-critical serving, the H100 wins; for cost-sensitive background tasks, dual A100 is competitive.
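The H100-vs-dual-A100 tradeoff ultimately comes down to cost per token, not cost per hour. A sketch of the comparison (the tokens/sec values are assumptions for illustration, not benchmarks; measure your own stack):

```python
def cost_per_million_tokens(hourly_usd: float, tokens_per_sec: float) -> float:
    """Dollars per 1M generated tokens at sustained throughput."""
    tokens_per_hour = tokens_per_sec * 3600
    return hourly_usd / tokens_per_hour * 1_000_000

# Hypothetical sustained throughputs for a 70B model -- benchmark before deciding.
h100 = cost_per_million_tokens(3.20, 1400)       # single H100, INT4
dual_a100 = cost_per_million_tokens(3.00, 900)   # 2x A100 80GB, tensor-parallel
print(f"H100: ${h100:.3f}/M tok, dual A100: ${dual_a100:.3f}/M tok")
```

With these assumed numbers, the pricier-per-hour H100 is actually cheaper per token, which is exactly why efficiency, not hourly price, should drive the decision for sustained serving.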
Cheapest Providers Ranked (April 2025)
1. Vast.ai — Marketplace model with spot-like pricing. RTX 4090 from $0.35/hr. Best for flexible, interruptible workloads.
2. Salad Cloud — Consumer GPU network. Cheapest $/TFLOP available. Suitable for batch inference with retry logic.
3. RunPod — Reliable marketplace with on-demand and spot pricing. L4 from $0.44/hr, A100 from $1.59/hr.
4. Lambda Labs — Reserved and on-demand. H100 from $2.49/hr. Excellent uptime and developer experience.
5. Hyperstack — European-focused. H100 NVL from $2.29/hr. Strong for EU data residency needs.
6. TensorDock — Budget-focused, smaller GPUs. RTX 3090 from $0.25/hr.
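The rates above can be dropped into a small lookup to automate provider selection (prices copied from the ranking as of April 2025; verify current rates before relying on them):

```python
# (provider, GPU) -> hourly rate in USD, taken from the ranking above.
PRICES = {
    ("Vast.ai", "RTX 4090"): 0.35,
    ("RunPod", "L4"): 0.44,
    ("RunPod", "A100"): 1.59,
    ("Lambda Labs", "H100"): 2.49,
    ("Hyperstack", "H100 NVL"): 2.29,
    ("TensorDock", "RTX 3090"): 0.25,
}

def cheapest(gpu: str):
    """Return the (provider, rate) pair with the lowest rate for a GPU, or None."""
    offers = {p: rate for (p, g), rate in PRICES.items() if g == gpu}
    if not offers:
        return None
    return min(offers.items(), key=lambda kv: kv[1])

print(cheapest("H100"))  # ('Lambda Labs', 2.49)
```

A table like this is easy to refresh from provider APIs or a weekly scrape, so the "cheapest" pick tracks the market rather than a blog post's snapshot.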
Tips to Minimize Cost
1. Use INT4/GGUF quantization for inference — it cuts VRAM requirements by ~75% versus FP16 with minimal quality loss on modern models.
2. For fine-tuning, use QLoRA (bitsandbytes) — fine-tune a 70B model on a single A100 for ~$5–15.
3. Batch your inference requests — higher batch sizes improve GPU utilization and reduce cost per token.
4. Use spot/interruptible instances for training — save 40–70% with automatic checkpointing (Axolotl and the HuggingFace Trainer handle this).
5. Compare GPU efficiency, not just hourly price — an L40S at $1.89/hr may serve 2× more requests per second than an A100 at $1.99/hr for inference.
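Tips 2 and 4 combine naturally. A back-of-envelope estimator (the run duration and restart-overhead fraction are assumptions, not measurements) shows how a QLoRA run lands in the quoted $5–15 range:

```python
def training_cost(hours: float, on_demand_hr: float,
                  spot_discount: float = 0.0,
                  restart_overhead: float = 0.0) -> float:
    """Estimated spend for a training run.

    spot_discount: fraction saved on spot/interruptible capacity (0.4-0.7 per tip 4)
    restart_overhead: extra fraction of wall-clock time lost to preemptions and
        checkpoint reloads (an assumption; depends on checkpoint frequency)
    """
    rate = on_demand_hr * (1 - spot_discount)
    return hours * (1 + restart_overhead) * rate

# QLoRA fine-tune of a 70B model on one A100 at $1.59/hr, assuming ~6 hours:
print(round(training_cost(6, 1.59), 2))               # on-demand
print(round(training_cost(6, 1.59,
                          spot_discount=0.5,
                          restart_overhead=0.1), 2))  # spot, with preemption tax
```

Even after padding the spot run by 10% for preemption restarts, the discounted rate keeps the total well under the on-demand cost, consistent with the $5–15 figure in tip 2.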