You're mid-experiment. Your model is loading. Then: 'RuntimeError: CUDA out of memory. Tried to allocate 2.00 GiB (GPU 0; 23.69 GiB total capacity; 20.45 GiB already allocated).' You've hit the wall that every AI developer hits eventually — your local GPU doesn't have enough VRAM. Here's how to move to a cloud GPU in 10 minutes, without losing your work or paying more than you need to.
Why You're Running Out of VRAM (and What It Takes to Fix It)
GPU memory usage in AI is dominated by model weights, activations, and optimizer state. A 7B-parameter model in FP16 takes ~14 GB of VRAM just for weights — before you load a single batch. Add optimizer state (AdamW's moment buffers at least double that, pushing a training run past ~28 GB) and activations, and an RTX 3090 or 4090 with 24 GB runs out fast. The fix isn't a bigger local GPU — it's renting exactly the GPU you need, only for the hours you use it.
| What You're Trying to Do | Minimum VRAM Needed | Cheapest Cloud GPU That Fits |
|---|---|---|
| Inference: 7B model (FP16) | 14 GB | RTX 4090 (~$0.44/hr on RunPod) |
| Inference: 7B model (INT8) | 8 GB | RTX 3090 or L4 (~$0.35/hr) |
| Fine-tune 7B with QLoRA | 16 GB | RTX 4090 (~$0.44/hr) |
| Fine-tune 13B with QLoRA | 20 GB | RTX 4090 (~$0.44/hr) |
| Fine-tune 70B with QLoRA | 48 GB | A100 80GB (~$1.59/hr) |
| Inference: 70B model (INT4) | 40 GB | 2× RTX 4090 (~$0.90/hr) |
| Training: 7B full fine-tune | 80 GB | A100 80GB (~$1.59/hr) |
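The weight-memory arithmetic behind these numbers is simple enough to sketch. This is a rough estimate only: real usage adds activations, KV cache, and framework overhead on top of the weights.

```python
def weight_vram_gb(params_billions, bytes_per_param):
    """Rough VRAM needed for model weights alone, in GB."""
    return params_billions * 1e9 * bytes_per_param / 1e9

# FP16 = 2 bytes/param, INT8 = 1, INT4 = 0.5
print(weight_vram_gb(7, 2))     # 7B in FP16  -> 14.0 GB
print(weight_vram_gb(7, 1))     # 7B in INT8  -> 7.0 GB
print(weight_vram_gb(70, 0.5))  # 70B in INT4 -> 35.0 GB
```

The gap between these raw numbers and the table's "minimum VRAM" column is exactly the overhead margin: 7 GB of INT8 weights needs an 8 GB card, 35 GB of INT4 weights needs ~40 GB.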
Step 1: Figure Out How Much VRAM You Actually Need
Before picking a cloud GPU, measure what you're actually using. Run these checks locally, just before the point where you hit the OOM:
- import torch; print(torch.cuda.memory_summary()) — shows allocated vs reserved VRAM
- nvidia-smi — shows current GPU memory usage across all processes
- torch.cuda.max_memory_allocated() — peak VRAM during your last run
- For transformers: model.get_memory_footprint() — model weights only, before training overhead
Add 20–30% headroom to whatever peak you measured — activations and optimizer state scale with batch size. That's your minimum cloud GPU VRAM target.
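That headroom rule is just multiplication. A minimal sketch (the 10 GiB peak below is a made-up example; on your machine, feed in the real number from torch.cuda.max_memory_allocated()):

```python
def vram_target_gib(peak_bytes, headroom=0.3):
    """Measured peak VRAM in bytes -> minimum cloud-GPU VRAM target in GiB."""
    return peak_bytes / 2**30 * (1 + headroom)

# Hypothetical 10 GiB measured peak, with a 30% cushion:
print(round(vram_target_gib(10 * 2**30), 1))  # -> 13.0
```

If the target lands right at a card's capacity, size up: activations grow with batch size and sequence length, so the cushion disappears faster than you'd expect.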
Step 2: Pick the Cheapest Cloud GPU That Fits
Don't default to renting an H100 or A100 just because they're familiar names. For most workloads hitting local OOM errors, an RTX 4090 (24 GB, ~$0.44–0.74/hr) or an A100 80GB (~$1.59–1.99/hr) is the right choice. The RTX 4090 is often 3–4× cheaper than an A100 and has the same 24 GB as your local card — but in the cloud it runs without the thermal limits and power constraints of a desktop machine.
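One way to make the choice mechanical is a toy selector like the one below. The prices are the snapshot figures quoted in the table above and drift constantly, so treat them as an illustration and check live pricing before renting.

```python
# (vram_gb, usd_per_hr) snapshots from the table above; prices drift constantly.
GPUS = {
    "RTX 3090": (24, 0.35),
    "RTX 4090": (24, 0.44),
    "A100 80GB": (80, 1.59),
}

def cheapest_fit(vram_needed_gb):
    """Cheapest listed GPU with enough VRAM, or None if nothing fits."""
    fits = [(price, name) for name, (vram, price) in GPUS.items()
            if vram >= vram_needed_gb]
    return min(fits)[1] if fits else None

print(cheapest_fit(16))  # -> RTX 3090
print(cheapest_fit(48))  # -> A100 80GB
```

The same pattern extends to multi-GPU options (e.g. 2× RTX 4090 as a 48 GB entry) if you want it to reproduce the full table.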
Step 3: Get Running in Under 10 Minutes
The fastest path to a cloud GPU for most developers is RunPod or Vast.ai — both have instances ready in under 2 minutes. Here's the exact workflow:
1. Sign up at RunPod.io or Vast.ai (takes 2 minutes; credit card required)
2. Choose a GPU template — PyTorch with CUDA pre-installed, matching your local environment
3. SSH in or open Jupyter: your instance has the same Python/CUDA stack you're used to
4. Upload your code: git clone your repo, or use scp / rsync for local files
5. Install dependencies: pip install -r requirements.txt (same as local)
6. Run your training script or inference code — exactly as you would locally
Most developers are running their first cloud GPU job within 10–15 minutes of signing up. The instance feels identical to SSH-ing into a powerful local machine — the only difference is the GPU has enough VRAM.
Step 4: Transfer Your Model Weights and Data
If your model is from HuggingFace, this is trivial — just call from_pretrained() with the model name and it downloads automatically. For local datasets or custom checkpoints:
- HuggingFace models: from_pretrained('meta-llama/Meta-Llama-3-8B') — downloads automatically (gated models like Llama 3 also need a HuggingFace access token)
- Local datasets under 1 GB: scp or rsync over SSH
- Large datasets (10 GB+): upload to S3 or the HuggingFace Hub first, pull from the cloud instance
- Custom checkpoints: upload to cloud storage, or keep them in a persistent volume if you'll iterate multiple times
How Much Will It Cost?
The arithmetic is usually surprising. A 4-hour fine-tuning run on a single A100 80GB costs $6.40–$8.00 on RunPod spot pricing. A full training run that would take 2 days on your local RTX 4090 might finish in 4 hours on an 8× A100 cluster for $50–70 total. Cloud GPU time is cheap when you're only paying for exactly the hours you use.
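The back-of-envelope math is worth sketching once (the rates below are the spot figures quoted above; substitute your provider's live price):

```python
def run_cost(num_gpus, usd_per_gpu_hr, hours):
    """Total cost of a run: GPUs x hourly rate x wall-clock hours."""
    return num_gpus * usd_per_gpu_hr * hours

print(round(run_cost(1, 1.60, 4), 2))  # single A100, 4-hour fine-tune -> 6.4
print(round(run_cost(8, 1.59, 4), 2))  # 8x A100 cluster, 4 hours     -> 50.88
```

Note the cluster math only works out if your job actually parallelizes across 8 GPUs; a run that doesn't scale pays 8× the rate for 1× the speed.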
| Workload | Cloud GPU | Estimated Time | Estimated Cost |
|---|---|---|---|
| Fine-tune LLaMA 3 8B (QLoRA, 1K examples) | 1× A100 80GB | 2–3 hours | $3–$6 |
| Fine-tune LLaMA 3 8B (full, 50K examples) | 1× A100 80GB | 8–12 hours | $13–$24 |
| Inference API: 7B model, low traffic | 1× RTX 4090 | Ongoing | $0.44/hr |
| Stable Diffusion batch generation (1000 images) | 1× RTX 4090 | 1–2 hours | $0.50–$1.50 |
| Fine-tune 70B (QLoRA) | 2× A100 80GB | 6–10 hours | $19–$40 |
Tips to Avoid Surprises
- Set a billing alert — both RunPod and Vast.ai let you cap spending
- Stop your instance when you're not using it — most platforms bill per minute
- Use spot instances for training — 40–70% cheaper; just enable checkpointing every 30 minutes
- Persistent storage is separate from compute — save your checkpoints to a volume, not the pod's disk
- Test with a 5-minute run at a low batch size before committing to a multi-hour job
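The checkpointing tip for spot instances can be a simple timer around whatever save call you already have. A minimal sketch, where save_fn and the 30-minute default are placeholders (in a real run, save_fn would be something like a torch.save of your model and optimizer state to the persistent volume):

```python
import time

class PeriodicCheckpointer:
    """Call maybe_save() every training step; it only saves when the interval elapses."""

    def __init__(self, save_fn, interval_s=30 * 60):
        self.save_fn = save_fn        # e.g. lambda: torch.save(state, "/workspace/ckpt.pt")
        self.interval_s = interval_s
        self.last_save = time.monotonic()

    def maybe_save(self):
        if time.monotonic() - self.last_save >= self.interval_s:
            self.save_fn()            # write to the persistent volume, not the pod disk
            self.last_save = time.monotonic()
            return True
        return False
```

With this in the training loop, a spot preemption costs you at most one interval of work: on restart, load the latest checkpoint and resume.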