VRAM is the single biggest bottleneck for running LLMs locally. Unlike system RAM, you cannot easily swap it out — and running out means the model either crashes or falls back to painfully slow CPU inference. This guide gives you exact numbers so you can match hardware to model before you buy.
Quantization compresses the model weights from full 16-bit floats down to 8, 6, 5, or 4 bits. This dramatically reduces VRAM usage at a small quality cost. A 7B model at FP16 needs ~14GB of VRAM. The same model at Q4 (4-bit) needs only ~4GB.
A rough formula: multiply the number of billions of parameters by the bytes per weight, then add ~10-20% overhead for the KV cache and runtime.
# FP16 (2 bytes per parameter)
7B model = 7,000,000,000 × 2 = ~14 GB
# Q8 (1 byte per parameter)
7B model = 7,000,000,000 × 1 = ~7 GB
# Q4 (0.5 bytes per parameter)
7B model = 7,000,000,000 × 0.5 = ~3.5 GB
+ 10% overhead = ~4 GB total
# 70B at Q4
70B model = 70,000,000,000 × 0.5 = ~35 GB
+ overhead = ~38-40 GB

If you have an Nvidia GPU, use nvidia-smi to check available VRAM before loading a model. On Apple Silicon, unified memory counts — an M3 Max with 64GB can run 70B models comfortably.
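The arithmetic above can be wrapped in a small helper. This is a sketch of the article's rough formula only; the bytes-per-weight values and the overhead factor are the same approximations used above, not exact runtime measurements:

```python
# Approximate bytes per weight for common quantization levels.
# These are rough averages; real GGUF quant formats vary slightly.
BYTES_PER_WEIGHT = {
    "fp16": 2.0,
    "q8": 1.0,
    "q5": 0.625,
    "q4": 0.5,
}

def estimate_vram_gb(params_billions: float, quant: str,
                     overhead: float = 0.10) -> float:
    """Estimate VRAM in GB: parameters x bytes/weight, plus
    ~10-20% overhead for the KV cache and runtime."""
    weights_gb = params_billions * BYTES_PER_WEIGHT[quant]
    return weights_gb * (1 + overhead)

print(f"7B  @ Q4: {estimate_vram_gb(7, 'q4'):.1f} GB")    # ~3.9 GB
print(f"70B @ Q4: {estimate_vram_gb(70, 'q4'):.1f} GB")   # ~38.5 GB
```

Pass `overhead=0.20` for a more conservative estimate with long contexts, since the KV cache grows with context length.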
Most inference runtimes (llama.cpp, Ollama, LM Studio) will automatically offload layers to system RAM when VRAM is full. This works, but tokens-per-second drops dramatically — often 10-20x slower. For chat it's usable; for batch inference it's painful.

Rather than manually cross-referencing model sizes and VRAM specs, Runyard's Model Radar automatically matches models to your exact hardware. Enter your GPU and RAM, and it filters the model catalog to show only what will actually run — with performance estimates. Try it free at runyard.dev.