Estimate VRAM for any model · any quantization · any context length
Total VRAM Needed
4.7 GB
Memory breakdown
GPU Compatibility
4 GB
GTX 1050 Ti, GTX 1650
6 GB
RTX 3060 6GB, GTX 1660
8 GB
RTX 3070, RTX 4060
10 GB
RTX 3080 10GB
12 GB
RTX 3060, RTX 4070
16 GB
RTX 4060 Ti 16GB, RTX 4080, M2 Pro
20 GB
RTX 4000 Ada
24 GB
RTX 3090, RTX 3090 Ti, RTX 4090
32 GB
RTX 5090, M3 Max 36GB, M4 Max 36GB
48 GB
RTX 6000 Ada
64 GB
M1/M2 Ultra 64GB
80 GB
A100 80GB, H100
96 GB
M3 Ultra, GH200
Tip: Q2/Q3 quantization saves VRAM but noticeably reduces output quality. Q4_K_M is the recommended minimum for most use cases.
Want to see how every model scores on your specific GPU — speed, fit, context?
Try Model Radar →
How It Works
No GPU required to use this calculator — it runs entirely in your browser using validated VRAM formulae.
Pick your model
Choose from 22 popular LLMs or enter any custom parameter count in billions. MoE models like DeepSeek R1 671B are supported.
Set quantization & context
Select from Q2_K through F16 (we recommend Q4_K_M). Drag the context slider from 1K to 128K tokens to see KV cache impact.
See VRAM & GPU fit
Instantly see total VRAM needed, a memory breakdown bar, and a 13-tier GPU compatibility grid — Perfect, Good, Marginal, or Too Tight.
What It Calculates
Model weights
The core cost. Calculated as (params × bits-per-weight) / 8 bytes.
KV cache
Grows with context window size. Often overlooked — can exceed weight cost at 32K+.
7 quant levels
Q2_K to F16. Each level's bits-per-weight precisely matched to real GGUF sizes.
13 GPU tiers
From 4 GB GTX 1650 to 96 GB GH200 — real hardware examples for each tier.
TurboQuant KV
4× KV cache compression. See exactly how many GB you save at any context length.
MoE support
Mixture-of-Experts models only load active experts. Enter expert params for accurate VRAM.
Runtime overhead
0.5 GB added for allocator, CUDA context, and inference buffers.
Memory breakdown
Visual bar showing exact split between weights, KV cache, and overhead.
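The weight and overhead arithmetic above can be sketched in a few lines of JavaScript (the calculator's own language). The function name is illustrative, and apart from Q4_K_M's 4.5 bits/weight (the figure this page uses), the bpw values are rough assumptions rather than the calculator's actual table:

```javascript
// Approximate bits-per-weight per GGUF quant level.
// Q4_K_M's 4.5 matches the figure used on this page; the
// others are rough assumptions, not the calculator's exact table.
const BPW = { Q2_K: 2.6, Q3_K_M: 3.9, Q4_K_M: 4.5, Q5_K_M: 5.7, Q8_0: 8.5, F16: 16 };

const OVERHEAD_GB = 0.5; // allocator + CUDA context + inference buffers

// Weight VRAM in GB: (params × bits-per-weight) / 8 bytes.
function weightGB(paramsBillions, quant) {
  return (paramsBillions * 1e9 * BPW[quant]) / 8 / 1e9;
}

// A nominal 7B model at Q4_K_M: ~3.9 GB of weights + 0.5 GB overhead.
console.log(weightGB(7, "Q4_K_M") + OVERHEAD_GB);
```

With a real-world parameter count slightly above 7B (most "7B" models are) plus the KV cache, this lands near the ~4.7 GB headline figure.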
At 32K+ context, the KV cache often costs more VRAM than the model weights. TurboQuant compresses the KV cache 4× using quantized Johnson–Lindenstrauss (QJL) encoding — without retraining the model.
Llama 3.1 8B · Q4 · 32K ctx
Qwen 2.5 14B · Q4 · 64K ctx
Qwen 2.5 72B · Q4 · 32K ctx
Full KV cache at 64K context — 12.4 GB for a 14B model
4× compressed — same context at 6.9 GB, fits an 8 GB GPU
Toggle TurboQuant in the calculator above to see live VRAM savings for your exact model and context length.
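The KV cache growth behind these numbers can be sketched with the standard per-token K/V accounting. The layer and head counts below are hypothetical, not taken from a specific model:

```javascript
// KV cache size in GB for an FP16 cache:
// 2 tensors (K and V) × layers × KV heads × head_dim × tokens × 2 bytes.
function kvCacheGB(layers, kvHeads, headDim, contextTokens, bytesPerElem = 2) {
  return (2 * layers * kvHeads * headDim * contextTokens * bytesPerElem) / 1e9;
}

// Hypothetical 14B-class model with grouped-query attention:
// 48 layers, 8 KV heads, head_dim 128, at a 64K context.
const full = kvCacheGB(48, 8, 128, 65536); // ≈ 12.9 GB uncompressed
const compressed = full / 4;               // ≈ 3.2 GB after 4× compression
```

Exact figures depend on the model's attention layout (grouped-query attention shrinks the cache) and on the cache's own precision, which is why the calculator asks for the specific model.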
Reading the Results
Each GPU tier shows how comfortably the model fits. The ratio is (VRAM needed) ÷ (GPU VRAM available).
Perfect: model loads fast, with room for longer context and batch inference. Ideal for production use.
e.g. 5 GB model on 8 GB GPU
Good: comfortable. Works well for most context lengths; may need to reduce context at 64K+.
e.g. 7 GB model on 8 GB GPU
Marginal: fits, but barely. Expect slower load times and limit context to 4K–8K.
e.g. 7.8 GB model on 8 GB GPU
Too Tight: won't load — model weights alone exceed available VRAM. Use a smaller model or a more aggressive (lower-bit) quantization.
e.g. 9 GB model on 8 GB GPU
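Putting the ratio rule into code: the thresholds below are assumptions chosen to be consistent with the four examples above, not the calculator's exact cutoffs:

```javascript
// Classify GPU fit from the ratio (VRAM needed) ÷ (GPU VRAM available).
// Threshold values are illustrative assumptions.
function fitLabel(neededGB, gpuGB) {
  const r = neededGB / gpuGB;
  if (r > 1.0)  return "Too Tight"; // weights alone won't fit
  if (r > 0.95) return "Marginal";  // fits, but barely
  if (r > 0.75) return "Good";      // comfortable for most contexts
  return "Perfect";                 // plenty of headroom
}

fitLabel(5, 8);   // "Perfect"   (ratio 0.625)
fitLabel(7, 8);   // "Good"      (ratio 0.875)
fitLabel(7.8, 8); // "Marginal"  (ratio 0.975)
fitLabel(9, 8);   // "Too Tight" (ratio 1.125)
```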
Quick Reference
These are estimates using the calculator above. Enter any model to get your exact figure.
| Model | Params | Q4_K_M VRAM | Minimum GPU | With TurboQuant 64K |
|---|---|---|---|---|
| TinyLlama / Phi-3 Mini | 1–4B | 1–3 GB | 4 GB | 1–3 GB |
| Mistral 7B / Llama 3.1 8B | 7–8B | 4.7–5.3 GB | 6 GB | 3.8–4.5 GB |
| Qwen 2.5 14B / Phi-4 14B | 14B | 8.4 GB | 10 GB | 6.9 GB |
| Gemma 3 27B / Qwen 2.5 32B | 27–32B | 16–19 GB | 20 GB | 13–16 GB |
| Llama 3.1 70B / Qwen 2.5 72B | 70–72B | 40–43 GB | 48 GB | 35–38 GB |
| DeepSeek R1 671B (MoE) | 671B (37B active) | 22.5 GB | 24 GB | 18 GB |
| Llama 3.1 405B | 405B | 229 GB | Multi-GPU cluster | 216 GB |
* MoE models use only active expert params for weight VRAM. TurboQuant figures at 64K context.
Why Use This Calculator
Runs in your browser
No server, no account, no tracking. The entire calculation is client-side JavaScript — works offline once loaded.
TurboQuant-aware
The only public calculator that accounts for 4× KV cache compression from TurboQuant — see real savings at 32K+ context.
MoE models supported
DeepSeek R1, Mixtral, and Qwen3 MoE models are pre-loaded. Only active expert params are counted for weight VRAM.
FAQ
How much VRAM do I need to run an LLM?
It depends on model size and quantization. A 7B model at Q4_K_M needs ~4.7 GB. A 14B model needs ~8.4 GB. A 70B model needs ~40 GB. Use the calculator above with your exact model and context window to get a precise figure.
How much VRAM does a 7B model need?
A 7B model at Q4_K_M (4.5 bits/weight) requires approximately 4.2 GB for model weights + 0.5 GB overhead = ~4.7 GB at an 8K context window. This fits a 6 GB GPU with TurboQuant, or comfortably on an 8 GB GPU.
How much VRAM does a 14B model need?
A 14B model at Q4_K_M requires approximately 8.4 GB total at 8K context. It fits a 10 GB GPU (like an RTX 3080), or an 8 GB GPU if you use a slightly lower context window. At 64K context with TurboQuant, it fits an 8 GB GPU.
Can I run a 405B model locally?
Yes, but you need a lot of VRAM. Llama 3.1 405B at Q4_K_M requires approximately 229 GB — far beyond a single consumer GPU. You need multiple A100 80GB or H100 GPUs. For most users, 70B models are the practical maximum.
How does VRAM scale from 7B to 13B to 70B?
At Q4_K_M quantization with 8K context: 7B ≈ 4.7 GB, 13B ≈ 8.0 GB, 70B ≈ 40 GB. Each step roughly doubles VRAM. The jump from 13B to 70B is steep — you need a 48 GB GPU or multi-GPU setup for 70B.
How much VRAM does TurboQuant save?
TurboQuant compresses the KV cache — the memory that stores conversation context — by 4×. It does not reduce model weight size. At 8K context the savings are small (~0.3 GB), but at 64K context the savings are 2–8+ GB depending on model size. Toggle it in the calculator to see exact figures.
Which quantization level should I choose?
Q4_K_M is the recommended starting point — it gives a good balance of quality and VRAM savings, with minimal perceptible quality loss. Use Q5_K_M if you have headroom. Only use Q2_K or Q3_K_M if you are severely VRAM-constrained and accept noticeable quality reduction.
How do MoE models change VRAM requirements?
MoE models like DeepSeek R1 671B or Mixtral 8x7B have a large total parameter count, but only a fraction are "active" at inference time. The calculator uses active expert parameters for weight VRAM. DeepSeek R1 671B (37B active) at Q4_K_M needs only ~22 GB — comparable to a 32B dense model.
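The MoE accounting reduces to the same weight formula applied to the active parameter count. A minimal sketch, using this page's 4.5 bpw figure for Q4_K_M (the function name is illustrative):

```javascript
// MoE weight VRAM counts only the active expert parameters,
// not the total parameter count.
function moeWeightGB(activeParamsBillions, bpw = 4.5) {
  return (activeParamsBillions * 1e9 * bpw) / 8 / 1e9;
}

// DeepSeek R1: 671B total, but only 37B active → ~20.8 GB of weights.
// Adding 0.5 GB overhead plus KV cache lands near the ~22 GB figure above.
console.log(moeWeightGB(37));
```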