DeepSeek R1 and V3 have become the most talked-about open-source models of 2025 — matching or beating GPT-4o on reasoning benchmarks while being freely downloadable. The catch: running them requires serious GPU memory. Here's exactly what you need and how to run them for as little as possible on cloud GPUs.
DeepSeek Model Sizes and VRAM Requirements
| Model | Parameters | VRAM (FP16) | VRAM (Q4) | Minimum GPU |
|---|---|---|---|---|
| DeepSeek R1 7B | 7B | ~14 GB | ~5 GB | RTX 4090 or L4 (24 GB) |
| DeepSeek R1 14B | 14B | ~28 GB | ~10 GB | A40 or L40S (48 GB) / RTX 4090 for Q4 |
| DeepSeek R1 32B | 32B | ~64 GB | ~22 GB | RTX 4090 (Q4, tight) / A100 80GB or 2× A40 (FP16) |
| DeepSeek R1 70B | 70B | ~140 GB | ~40–70 GB | 1× A100 80GB or H100 (Q4) / 2× A100 80GB (FP16) |
| DeepSeek V3 671B | 671B | ~1.3 TB | ~400 GB | 8× H100 80GB minimum for Q4 |
Quantization: Your Most Important Cost Lever
Before choosing a GPU, choose your quantization level. FP16 gives maximum quality but maximum VRAM usage. INT8 cuts VRAM in half with negligible quality loss for most tasks. Q4_K_M (GGUF format via llama.cpp or Ollama) cuts VRAM by ~60–65% with small quality degradation. For DeepSeek R1's reasoning tasks, INT8 is recommended — Q4 can degrade chain-of-thought coherence on complex problems.
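A quick way to sanity-check the table above: the FP16 column is simply parameters × 2 bytes. A minimal sketch, using rule-of-thumb bytes-per-parameter values (the Q4 figure folds in typical GGUF metadata overhead; these are approximations, and real deployments also need roughly 20% extra for KV cache):

```python
# Rough VRAM estimator: weights-only footprint at a given quantization.
# Bytes-per-parameter values are rule-of-thumb approximations, not exact:
# FP16 = 2 bytes, INT8 = 1 byte, Q4_K_M ~= 0.7 bytes once quantization
# metadata and mixed-precision layers are included.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "q4": 0.7}

def weights_vram_gb(params_billions: float, quant: str) -> float:
    """Weights-only VRAM in GB; budget ~20% on top for KV cache."""
    return round(params_billions * BYTES_PER_PARAM[quant], 1)

for size in (7, 14, 32, 70):
    print(f"R1 {size}B -> FP16 {weights_vram_gb(size, 'fp16')} GB, "
          f"Q4 {weights_vram_gb(size, 'q4')} GB")
```

Running this reproduces the table's FP16 column exactly (14, 28, 64, 140 GB) and lands within its Q4 ranges.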
Cheapest Cloud GPU Options by Model Size
DeepSeek R1 7B and 14B: Consumer GPU Territory
The 7B model runs in FP16 on a single RTX 4090 (24 GB). The 14B model fits in Q4 on an RTX 4090 or comfortably in FP16 on an A40/L40S. For the cheapest option, Vast.ai RTX 4090 instances start at $0.35/hr — enough to run R1 7B at full quality or R1 14B in Q4.
DeepSeek R1 32B: A100 Territory
The 32B model in Q4 (~22 GB) fits on an RTX 4090, but for INT8 quality you need 48+ GB — an A40 or L40S. In FP16, you need an A100 80GB. RunPod and Vast.ai both offer A100 80GB from $1.49/hr spot, making a full R1 32B inference setup roughly $35–50/day.
DeepSeek R1 70B: A100 80GB with Quantization
R1 70B in Q4 quantization (~40–70 GB) fits on a single A100 80GB. This is the most popular production setup — a single A100 80GB at $1.49–2.20/hr gives you a reasoning model that rivals GPT-4 for about $36–53/day. Use vLLM or llama.cpp for serving.
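The day-rate figures above follow directly from the hourly price. A small sketch (the rates are the article's quoted figures, not live pricing):

```python
# Quick cost sketch: convert an hourly GPU rate into daily/monthly spend
# for an always-on inference server.
def serving_cost(hourly_rate: float, hours_per_day: float = 24.0) -> dict:
    daily = hourly_rate * hours_per_day
    return {"daily": round(daily, 2), "monthly": round(daily * 30, 2)}

# Single A100 80GB for R1 70B in Q4, at the quoted price range:
print(serving_cost(1.49))  # {'daily': 35.76, 'monthly': 1072.8}
print(serving_cost(2.20))  # {'daily': 52.8, 'monthly': 1584.0}
```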
Provider Pricing for DeepSeek Workloads
| Provider | RTX 4090 Price | A100 80GB Price | H100 Price | Best For |
|---|---|---|---|---|
| Vast.ai | $0.35–0.65/hr | $1.10–1.80/hr | $1.80–2.80/hr | R1 7B–32B, lowest cost |
| RunPod (spot) | $0.35–0.55/hr | $0.90–1.49/hr | $1.20–2.10/hr | R1 7B–70B with checkpointing |
| RunPod (on-demand) | $0.74/hr | $1.89/hr | $2.79/hr | Production inference |
| Lambda Labs | Not available | $1.99/hr | $2.49/hr | R1 70B and V3, reliable |
| Hyperstack | Not available | Not available | $2.29/hr | V3 multi-GPU, EU data |
Running DeepSeek V3 (671B): Multi-GPU Only
DeepSeek V3's 671B parameters require approximately 1.3 TB of VRAM in FP16. Even with aggressive Q4 quantization, the weights alone occupy around 400 GB: that is 5× H100 80GB on paper, but 8× in practice once KV cache and activation memory are accounted for. A practical setup: 8× H100 SXM5 on Lambda Labs or CoreWeave at $19.92–$22.32/hr, or roughly $478–$536/day.
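The 5-versus-8 GPU arithmetic can be made explicit. A sketch, assuming the ~400 GB Q4 figure above and an illustrative ~35% per-GPU reservation for KV cache and activations (the headroom fraction is an assumption, not a measured value):

```python
import math

# How many GPUs does a quantized V3 deployment need? Compare the
# weights-only count against a budget that reserves headroom per GPU
# for KV cache and activations.
def gpus_needed(model_vram_gb: float, gpu_vram_gb: float, headroom: float = 0.0) -> int:
    """Round up to whole GPUs; `headroom` is the fraction of each GPU kept free."""
    usable = gpu_vram_gb * (1.0 - headroom)
    return math.ceil(model_vram_gb / usable)

print(gpus_needed(400, 80))                 # 5 -> weights alone, zero headroom
print(gpus_needed(400, 80, headroom=0.35))  # 8 -> with cache/activation headroom
```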
For most use cases, the DeepSeek R1 32B or 70B models offer 80–90% of V3's capability at 10–20× lower cost. Unless you specifically need V3's full 671B capability, the distilled R1 models are the practical choice.
Recommended Inference Stack
- llama.cpp / Ollama: Best for Q4/Q8 quantized models on consumer GPUs (RTX 4090). Easy setup, low overhead.
- vLLM: Best for production serving on A100/H100. Continuous batching, OpenAI-compatible API, highest throughput.
- Transformers (HuggingFace): Best for fine-tuning or custom inference pipelines. FP16/INT8 via bitsandbytes.
- TensorRT-LLM: Best for maximum H100 throughput. Complex setup, but 2–3× faster than vLLM for high-concurrency serving.
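To show what the vLLM option looks like from the client side, here is a sketch of the JSON body its OpenAI-compatible endpoint (POST `/v1/chat/completions`) accepts. The default localhost URL, the model identifier, and the sampling settings are illustrative assumptions for a local deployment:

```python
import json

# Build a request body for a vLLM server's OpenAI-compatible endpoint
# (by default POST http://localhost:8000/v1/chat/completions).
# The model name below is the Hugging Face identifier for the distilled
# 32B model; substitute whichever checkpoint you actually serve.
def chat_request(prompt: str,
                 model: str = "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B") -> str:
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 1024,
        "temperature": 0.6,  # the 0.5-0.7 range is commonly advised for R1-style reasoning
    }
    return json.dumps(payload)

print(chat_request("Why is the sky blue?"))
```

Send the resulting string with any HTTP client; the response follows the standard OpenAI chat-completion schema.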