DeepSeek R1 and V3 have become the most talked-about open-source models of 2025 — matching or beating GPT-4o on reasoning benchmarks while being freely downloadable. The catch: running them requires serious GPU memory. Here's exactly what you need and how to run them for as little as possible on cloud GPUs.
DeepSeek Model Sizes and VRAM Requirements
| Model | Parameters | VRAM (FP16) | VRAM (Q4) | Minimum GPU |
|---|---|---|---|---|
| DeepSeek R1 7B | 7B | ~14 GB | ~5 GB | RTX 4090 or L4 (24 GB) |
| DeepSeek R1 14B | 14B | ~28 GB | ~10 GB | A40 or L40S (48 GB) / RTX 4090 for Q4 |
| DeepSeek R1 32B | 32B | ~64 GB | ~22 GB | RTX 4090 (Q4, tight) / A100 80GB or 2× A40 (FP16) |
| DeepSeek R1 70B | 70B | ~140 GB | ~40–70 GB | 1× A100 80GB or H100 (Q4) / 2× A100 80GB (FP16) |
| DeepSeek V3 671B | 671B | ~1.3 TB | ~400 GB | 8× H100 80GB minimum for Q4 |
Quantization: Your Most Important Cost Lever
Before choosing a GPU, choose your quantization level. FP16 gives maximum quality but maximum VRAM usage. INT8 cuts VRAM in half with negligible quality loss for most tasks. Q4_K_M (GGUF format via llama.cpp or Ollama) cuts VRAM by ~60–65% with small quality degradation. For DeepSeek R1's reasoning tasks, INT8 is recommended — Q4 can degrade chain-of-thought coherence on complex problems.
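A quick way to sanity-check the table above: the FP16 column is simply parameters × 2 bytes. A minimal sketch, using rule-of-thumb bytes-per-parameter values (the Q4 figure folds in typical GGUF metadata overhead; these are approximations, and real deployments also need roughly 20% extra for KV cache):

```python
# Rough VRAM estimator: weights-only footprint at a given quantization.
# Bytes-per-parameter values are rule-of-thumb approximations, not exact:
# FP16 = 2 bytes, INT8 = 1 byte, Q4_K_M ~= 0.7 bytes once quantization
# metadata and mixed-precision layers are included.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "q4": 0.7}

def weights_vram_gb(params_billions: float, quant: str) -> float:
    """Weights-only VRAM in GB; budget ~20% on top for KV cache."""
    return round(params_billions * BYTES_PER_PARAM[quant], 1)

for size in (7, 14, 32, 70):
    print(f"R1 {size}B -> FP16 {weights_vram_gb(size, 'fp16')} GB, "
          f"Q4 {weights_vram_gb(size, 'q4')} GB")
```

Running this reproduces the table's FP16 column exactly (14, 28, 64, 140 GB) and lands within its Q4 ranges.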
Cheapest Cloud GPU Options by Model Size
DeepSeek R1 7B and 14B: Consumer GPU Territory
The 7B model runs in FP16 on a single RTX 4090 (24 GB). The 14B model fits in Q4 on an RTX 4090 or comfortably in FP16 on an A40/L40S. For the cheapest option, Vast.ai RTX 4090 instances start at $0.35/hr — enough to run R1 7B at full quality or R1 14B in Q4.
DeepSeek R1 32B: A100 Territory
The 32B model in Q4 (~22 GB) fits on an RTX 4090, but for INT8 quality you need 48+ GB — an A40 or L40S. In FP16, you need an A100 80GB. RunPod and Vast.ai both offer A100 80GB from $1.49/hr spot, making a full R1 32B inference setup roughly $35–50/day.
DeepSeek R1 70B: A100 80GB with Quantization
R1 70B in Q4 quantization (~40–70 GB) fits on a single A100 80GB. This is the most popular production setup — a single A100 80GB at $1.49–2.20/hr gives you a reasoning model that rivals GPT-4 for about $36–53/day. Use vLLM or llama.cpp for serving.
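The day-rate figures above follow directly from the hourly price. A small sketch (the rates are the article's quoted figures, not live pricing):

```python
# Quick cost sketch: convert an hourly GPU rate into daily/monthly spend
# for an always-on inference server.
def serving_cost(hourly_rate: float, hours_per_day: float = 24.0) -> dict:
    daily = hourly_rate * hours_per_day
    return {"daily": round(daily, 2), "monthly": round(daily * 30, 2)}

# Single A100 80GB for R1 70B in Q4, at the quoted price range:
print(serving_cost(1.49))  # {'daily': 35.76, 'monthly': 1072.8}
print(serving_cost(2.20))  # {'daily': 52.8, 'monthly': 1584.0}
```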
Provider Pricing for DeepSeek Workloads
| Provider | RTX 4090 Price | A100 80GB Price | H100 Price | Best For |
|---|---|---|---|---|
| Vast.ai | $0.35–0.65/hr | $1.10–1.80/hr | $1.80–2.80/hr | R1 7B–32B, lowest cost |
| RunPod (spot) | $0.35–0.55/hr | $0.90–1.49/hr | $1.20–2.10/hr | R1 7B–70B with checkpointing |
| RunPod (on-demand) | $0.74/hr | $1.89/hr | $2.79/hr | Production inference |
| Lambda Labs | Not available | $1.99/hr | $2.49/hr | R1 70B and V3, reliable |
| Hyperstack | Not available | Not available | $2.29/hr | V3 multi-GPU, EU data |
Running DeepSeek V3 (671B): Multi-GPU Only
DeepSeek V3's 671B parameters require approximately 1.3 TB of VRAM in FP16. Even with aggressive Q4 quantization, the weights alone occupy around 400 GB: that is 5× H100 80GB on paper, but 8× in practice once KV cache and activation memory are accounted for. A practical setup: 8× H100 SXM5 on Lambda Labs or CoreWeave at $19.92–$22.32/hr, or roughly $478–$536/day.
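The 5-versus-8 GPU arithmetic can be made explicit. A sketch, assuming the ~400 GB Q4 figure above and an illustrative ~35% per-GPU reservation for KV cache and activations (the headroom fraction is an assumption, not a measured value):

```python
import math

# How many GPUs does a quantized V3 deployment need? Compare the
# weights-only count against a budget that reserves headroom per GPU
# for KV cache and activations.
def gpus_needed(model_vram_gb: float, gpu_vram_gb: float, headroom: float = 0.0) -> int:
    """Round up to whole GPUs; `headroom` is the fraction of each GPU kept free."""
    usable = gpu_vram_gb * (1.0 - headroom)
    return math.ceil(model_vram_gb / usable)

print(gpus_needed(400, 80))                 # 5 -> weights alone, zero headroom
print(gpus_needed(400, 80, headroom=0.35))  # 8 -> with cache/activation headroom
```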
For most use cases, the DeepSeek R1 32B or 70B models offer 80–90% of V3's capability at 10–20× lower cost. Unless you specifically need V3's full 671B capability, the distilled R1 models are the practical choice.
Recommended Inference Stack
- llama.cpp / Ollama: Best for Q4/Q8 quantized models on consumer GPUs (RTX 4090). Easy setup, low overhead.
- vLLM: Best for production serving on A100/H100. Continuous batching, OpenAI-compatible API, highest throughput.
- Transformers (HuggingFace): Best for fine-tuning or custom inference pipelines. FP16/INT8 via bitsandbytes.
- TensorRT-LLM: Best for maximum H100 throughput. Complex setup, but 2–3× faster than vLLM for high-concurrency serving.
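To show what the vLLM option looks like from the client side, here is a sketch of the JSON body its OpenAI-compatible endpoint (POST `/v1/chat/completions`) accepts. The default localhost URL, the model identifier, and the sampling settings are illustrative assumptions for a local deployment:

```python
import json

# Build a request body for a vLLM server's OpenAI-compatible endpoint
# (by default POST http://localhost:8000/v1/chat/completions).
# The model name below is the Hugging Face identifier for the distilled
# 32B model; substitute whichever checkpoint you actually serve.
def chat_request(prompt: str,
                 model: str = "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B") -> str:
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 1024,
        "temperature": 0.6,  # the 0.5-0.7 range is commonly advised for R1-style reasoning
    }
    return json.dumps(payload)

print(chat_request("Why is the sky blue?"))
```

Send the resulting string with any HTTP client; the response follows the standard OpenAI chat-completion schema.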