VRAM is the single biggest bottleneck for running LLMs locally. Unlike system RAM, you cannot easily swap it out — and running out means the model either crashes or falls back to painfully slow CPU inference. This guide gives you exact numbers so you can match hardware to model before you buy.
Quantization compresses the model weights from full 16-bit floats down to 8, 6, 5, or 4 bits. This dramatically reduces VRAM usage at a small quality cost. A 7B model at FP16 needs ~14GB of VRAM. The same model at Q4 (4-bit) needs only ~4GB.
A rough formula: multiply the number of billions of parameters by the bytes per weight, then add ~10-20% overhead for the KV cache and runtime.
# FP16 (2 bytes per parameter)
7B model = 7,000,000,000 × 2 = ~14 GB
# Q8 (1 byte per parameter)
7B model = 7,000,000,000 × 1 = ~7 GB
# Q4 (0.5 bytes per parameter)
7B model = 7,000,000,000 × 0.5 = ~3.5 GB
+ 10% overhead = ~4 GB total
# 70B at Q4
70B model = 70,000,000,000 × 0.5 = ~35 GB
+ overhead = ~38-40 GB

If you have an Nvidia GPU, use nvidia-smi to check available VRAM before loading a model. On Apple Silicon, unified memory counts — an M3 Max with 64GB can run 70B models comfortably.
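The arithmetic above can be wrapped in a small helper. This is a sketch of the article's rough formula only; the bytes-per-weight values and the overhead factor are the same approximations used above, not exact runtime measurements:

```python
# Approximate bytes per weight for common quantization levels.
# These are rough averages; real GGUF quant formats vary slightly.
BYTES_PER_WEIGHT = {
    "fp16": 2.0,
    "q8": 1.0,
    "q5": 0.625,
    "q4": 0.5,
}

def estimate_vram_gb(params_billions: float, quant: str,
                     overhead: float = 0.10) -> float:
    """Estimate VRAM in GB: parameters x bytes/weight, plus
    ~10-20% overhead for the KV cache and runtime."""
    weights_gb = params_billions * BYTES_PER_WEIGHT[quant]
    return weights_gb * (1 + overhead)

print(f"7B  @ Q4: {estimate_vram_gb(7, 'q4'):.1f} GB")    # ~3.9 GB
print(f"70B @ Q4: {estimate_vram_gb(70, 'q4'):.1f} GB")   # ~38.5 GB
```

Pass `overhead=0.20` for a more conservative estimate with long contexts, since the KV cache grows with context length.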
Most inference runtimes (llama.cpp, Ollama, LM Studio) will automatically offload layers to system RAM when VRAM is full. This works, but tokens-per-second drops dramatically — often 10-20x slower. For chat it's usable; for batch inference it's painful.

Rather than manually cross-referencing model sizes and VRAM specs, Runyard's Model Radar automatically matches models to your exact hardware. Enter your GPU and RAM, and it filters the model catalog to show only what will actually run — with performance estimates. Try it free at runyard.dev.