
VRAM Calculator for LLMs

Estimate VRAM for any model · any quantization · any context length


Total VRAM Needed: 4.7 GB

Model weights: 4.69 GB · KV cache (8K ctx): 0.04 GB · Runtime overhead: 0.50 GB


GPU Compatibility

4 GB (GTX 1050 Ti, GTX 1650): Too Tight
6 GB (RTX 3060 6GB, GTX 1660): Good
8 GB (RTX 3070, RTX 4060): Perfect
10 GB (RTX 3080 10GB): Perfect
12 GB (RTX 3060 12GB, RTX 4070): Perfect
16 GB (RTX 4060 Ti 16GB, RTX 4080, M2 Pro): Perfect
20 GB (RTX 4000 Ada, RX 7900 XT): Perfect
24 GB (RTX 3090, RTX 4090): Perfect
32 GB (RTX 5090, M3 Max 36GB, M4 Max 36GB): Perfect
48 GB (RTX 6000 Ada): Perfect
64 GB (M1/M2 Ultra, A100 40GB×2): Perfect
80 GB (A100 80GB, H100): Perfect
96 GB (M3 Ultra, GH200): Perfect

Tip: Q2/Q3 quantization saves VRAM but noticeably reduces output quality. Q4_K_M is the recommended minimum for most use cases.

Want to see how every model scores on your specific GPU — speed, fit, context?

Try Model Radar →

How It Works

Three inputs. Instant results.

No GPU required to use this calculator — it runs entirely in your browser using validated VRAM formulae.

01

Pick your model

Choose from 22 popular LLMs or enter any custom parameter count in billions. MoE models like DeepSeek R1 671B are supported.

02

Set quantization & context

Select from Q2_K through F16 (we recommend Q4_K_M). Drag the context slider from 1K to 128K tokens to see KV cache impact.

03

See VRAM & GPU fit

Instantly see total VRAM needed, a memory breakdown bar, and a 13-tier GPU compatibility grid — Perfect, Good, Marginal, or Too Tight.

What It Calculates

Everything that consumes VRAM at inference.

⚖️

Model weights

The core cost. Calculated from (params × bpw) / 8 bytes.

🗄️

KV cache

Grows with context window size. Often overlooked — can exceed weight cost at 32K+.

🎚️

7 quant levels

Q2_K to F16. Each level's bits-per-weight is matched to real GGUF file sizes.

🖥️

13 GPU tiers

From 4 GB GTX 1650 to 96 GB GH200 — real hardware examples for each tier.

TurboQuant KV

4× KV cache compression. See exactly how many GB you save at any context length.

🧠

MoE support

Mixture-of-Experts models only load active experts. Enter expert params for accurate VRAM.

🔧

Runtime overhead

0.5 GB added for allocator, CUDA context, and inference buffers.

📊

Memory breakdown

Visual bar showing exact split between weights, KV cache, and overhead.
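The cards above boil down to a few lines of arithmetic. Here is a minimal sketch in plain JavaScript; the layer count, KV-head count, and head dimension are per-model architecture details (Llama 3.1 8B's published config is used as an assumed example below), and the calculator's internal constants may differ slightly.

```javascript
// Model weights: (params × bits-per-weight) / 8 bytes.
// With params expressed in billions, this conveniently yields gigabytes.
function weightsGB(paramsBillions, bitsPerWeight) {
  return (paramsBillions * bitsPerWeight) / 8;
}

// KV cache: 2 tensors (K and V) × layers × KV heads × head dim
// × context tokens × bytes per element (2 for FP16), converted to GB.
function kvCacheGB(layers, kvHeads, headDim, ctxTokens, bytesPerElem = 2) {
  return (2 * layers * kvHeads * headDim * ctxTokens * bytesPerElem) / 1e9;
}

const RUNTIME_OVERHEAD_GB = 0.5; // allocator, CUDA context, inference buffers

function totalVramGB(paramsBillions, bitsPerWeight, layers, kvHeads, headDim, ctx) {
  return weightsGB(paramsBillions, bitsPerWeight)
       + kvCacheGB(layers, kvHeads, headDim, ctx)
       + RUNTIME_OVERHEAD_GB;
}

// Example: an 8B model at ~4.7 bpw (assumed effective Q4_K_M rate),
// 8K context, 32 layers, 8 KV heads, head dim 128.
console.log(totalVramGB(8.0, 4.7, 32, 8, 128, 8192).toFixed(2)); // "6.27"
```

Note that the FP16 KV-cache term dominates quickly: the same formula at 64K context for a 48-layer 14B model gives roughly the 12.4 GB figure quoted in the TurboQuant section.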

TurboQuant

The context tax, solved.

At 32K+ context, the KV cache often costs more VRAM than the model weights. TurboQuant compresses the KV cache 4× using quantized Johnson–Lindenstrauss (QJL) encoding — without retraining the model.

Llama 3.1 8B · Q4 · 32K ctx: 6.2 GB → 4.9 GB (−1.3 GB)

Qwen 2.5 14B · Q4 · 64K ctx: 11.4 GB → 8.6 GB (−2.8 GB)

Qwen 2.5 72B · Q4 · 32K ctx: 43.8 GB → 40.1 GB (−3.7 GB)
Without TurboQuant: 12.4 GB KV — full KV cache at 64K context for a 14B model

With TurboQuant: 3.1 GB KV — 4× compressed; the same context runs in 8.6 GB total, within reach of a 10 GB GPU

Toggle TurboQuant in the calculator above to see live VRAM savings for your exact model and context length.
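The arithmetic behind the toggle is simple: divide the KV-cache term by 4 and leave weights and overhead alone. A minimal sketch, using the section's 14B / 64K KV figure (the 8.0 GB weights value is an assumed placeholder, not a number from the calculator):

```javascript
// TurboQuant's effect on the memory budget: the KV cache shrinks 4×,
// model weights and runtime overhead are unchanged.
function withTurboQuant(weightsGB, kvGB, overheadGB = 0.5) {
  const compressedKvGB = kvGB / 4; // 4× KV compression
  return {
    before: weightsGB + kvGB + overheadGB,
    after: weightsGB + compressedKvGB + overheadGB,
    savedGB: kvGB - compressedKvGB,
  };
}

// 14B / 64K example from this section: 12.4 GB of full KV cache
// compresses to 3.1 GB, a 9.3 GB saving.
const r = withTurboQuant(8.0, 12.4);
console.log(r.savedGB.toFixed(1)); // "9.3"
```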

Reading the Results

What the GPU labels mean.

Each GPU tier shows how comfortably the model fits. The ratio is (VRAM needed) ÷ (GPU VRAM available).

Perfect: ≤ 70%

Model loads fast, room for longer context and batch inference. Ideal for production use.

e.g. 5 GB model on 8 GB GPU

Good: 71–90%

Comfortable. Works well for most context lengths. May need to reduce context at 64K+.

e.g. 7 GB model on 8 GB GPU

Marginal: 91–100%

Fits but barely. Expect slower load times and limit context to 4K–8K.

e.g. 7.8 GB model on 8 GB GPU

Too Tight: > 100%

Won't load — the model needs more VRAM than the GPU has. Use a smaller model or a more aggressive (lower-bit) quantization.

e.g. 9 GB model on 8 GB GPU
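The four labels map directly onto the stated ratio and thresholds; a minimal sketch of that classification:

```javascript
// Fit label from (VRAM needed) ÷ (GPU VRAM available),
// using the 70% / 90% / 100% thresholds described above.
function fitLabel(neededGB, gpuGB) {
  const ratio = neededGB / gpuGB;
  if (ratio <= 0.70) return "Perfect";
  if (ratio <= 0.90) return "Good";
  if (ratio <= 1.00) return "Marginal";
  return "Too Tight";
}

console.log(fitLabel(5, 8));   // "Perfect"   (62.5%)
console.log(fitLabel(7, 8));   // "Good"      (87.5%)
console.log(fitLabel(7.8, 8)); // "Marginal"  (97.5%)
console.log(fitLabel(9, 8));   // "Too Tight" (112.5%)
```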

Quick Reference

Common model VRAM at Q4_K_M · 8K context.

These are estimates using the calculator above. Enter any model to get your exact figure.

| Model | Params | Q4_K_M VRAM | Minimum GPU | With TurboQuant 64K |
|---|---|---|---|---|
| TinyLlama / Phi-3 Mini | 1–4B | 1–3 GB | 4 GB | 1–3 GB |
| Mistral 7B / Llama 3.1 8B | 7–8B | 4.7–5.3 GB | 6 GB | 3.8–4.5 GB |
| Qwen 2.5 14B / Phi-4 14B | 14B | 8.4 GB | 10 GB | 6.9 GB |
| Gemma 3 27B / Qwen 2.5 32B | 27–32B | 16–19 GB | 20 GB | 13–16 GB |
| Llama 3.1 70B / Qwen 2.5 72B | 70–72B | 40–43 GB | 48 GB | 35–38 GB |
| DeepSeek R1 671B (MoE) | 671B (37B active) | 22.5 GB | 24 GB | 18 GB |
| Llama 3.1 405B | 405B | 229 GB | Multi-GPU cluster | 216 GB |

* MoE models use only active expert params for weight VRAM. TurboQuant figures at 64K context.

Why Use This Calculator

Runs in your browser

No server, no account, no tracking. The entire calculation is client-side JavaScript — works offline once loaded.

TurboQuant-aware

The only public calculator that accounts for 4× KV cache compression from TurboQuant — see real savings at 32K+ context.

MoE models supported

DeepSeek R1, Mixtral, and Qwen3 MoE models are pre-loaded. Only active expert params are counted for weight VRAM.

FAQ

Frequently asked questions.

How much VRAM do I need to run an LLM locally?

It depends on model size and quantization. A 7B model at Q4_K_M needs ~4.7 GB. A 14B model needs ~8.4 GB. A 70B model needs ~40 GB. Use the calculator above with your exact model and context window to get a precise figure.

How much VRAM does a 7B LLM need at Q4 quantization?

A 7B model at Q4_K_M (~4.8 effective bits/weight) requires approximately 4.2 GB for model weights + 0.5 GB overhead ≈ 4.7 GB at an 8K context window. This fits a 6 GB GPU with TurboQuant, or comfortably on an 8 GB GPU.

How much VRAM does a 14B LLM need at Q4 quantization?

A 14B model at Q4_K_M requires approximately 8.4 GB total at 8K context. It fits a 10 GB GPU (like an RTX 3080), or an 8 GB GPU if you use a slightly lower context window. At 64K context with TurboQuant, it fits an 8 GB GPU.

Can you run Llama 3.1 405B with 4-bit quantization?

Yes, but you need a lot of VRAM. Llama 3.1 405B at Q4_K_M requires approximately 229 GB — far beyond a single consumer GPU. You need multiple A100 80GB or H100 GPUs. For most users, 70B models are the practical maximum.

What is the VRAM difference between 7B, 13B, and 70B models?

At Q4_K_M quantization with 8K context: 7B ≈ 4.7 GB, 13B ≈ 8.0 GB, 70B ≈ 40 GB. VRAM scales roughly linearly with parameter count. The jump from 13B to 70B is steep — you need a 48 GB GPU or a multi-GPU setup for 70B.
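A quick sketch of that scaling; 4.8 effective bits per weight is an assumed Q4_K_M rate, and the KV cache is omitted, so the outputs only approximate the figures above:

```javascript
// Weight VRAM grows linearly with parameter count at a fixed
// bits-per-weight. 4.8 bpw is an assumed effective Q4_K_M rate.
const bpw = 4.8;
for (const billions of [7, 13, 70]) {
  const gb = (billions * bpw) / 8 + 0.5; // weights + 0.5 GB overhead, KV omitted
  console.log(`${billions}B ≈ ${gb.toFixed(1)} GB`);
}
```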

What does TurboQuant do and how much VRAM does it save?

TurboQuant compresses the KV cache — the memory that stores conversation context — by 4×. It does not reduce model weight size. At 8K context the savings are small (~0.3 GB), but at 64K context the savings are 2–8+ GB depending on model size. Toggle it in the calculator to see exact figures.

What quantization level should I use for local LLMs?

Q4_K_M is the recommended starting point — it gives a good balance of quality and VRAM savings, with minimal perceptible quality loss. Use Q5_K_M if you have headroom. Only use Q2_K or Q3_K_M if you are severely VRAM-constrained and accept noticeable quality reduction.

How do Mixture-of-Experts (MoE) models affect VRAM?

MoE models like DeepSeek R1 671B or Mixtral 8x7B have a large total parameter count, but only a fraction are "active" at inference time. The calculator uses active expert parameters for weight VRAM. DeepSeek R1 671B (37B active) at Q4_K_M needs only ~22 GB — comparable to a 32B dense model.
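In formula terms, the only change for MoE is which parameter count feeds the weight calculation. A minimal sketch, where the ~4.6 bits-per-weight is an assumed effective Q4_K_M rate, not a figure from the calculator:

```javascript
// MoE weight VRAM as described above: only active expert parameters
// are loaded, so the weight formula uses the active count.
function moeWeightsGB(activeParamsBillions, bitsPerWeight) {
  return (activeParamsBillions * bitsPerWeight) / 8;
}

// DeepSeek R1: 671B total, 37B active. At an assumed ~4.6 effective
// bits per weight the active weights come to ~21.3 GB; adding the
// 0.5 GB overhead plus KV cache lands near the ~22 GB quoted above.
console.log(moeWeightsGB(37, 4.6).toFixed(1)); // "21.3"
```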