Your System
Configure your model
Apache 2.0 · released 2024-05 · native context up to 32K tokens · HuggingFace
Standard community default — best size/quality tradeoff
Typical: 4K–8K for chat, 32K–128K for long-document work. Longer context means a larger KV cache.
Will it run on my hardware?
| Hardware | Memory | Fit | Headroom | Note |
|---|---|---|---|---|
| NVIDIA RTX 3060 12GB | 12 GB | Fits | +2.58 GB | Budget consumer GPU |
| NVIDIA RTX 4060 Ti 16GB | 16 GB | Fits | +6.58 GB | Mid-range with extra VRAM |
| NVIDIA RTX 3090 24GB | 24 GB | Fits | +14.6 GB | Enthusiast, great for local LLMs |
| NVIDIA RTX 4090 24GB | 24 GB | Fits | +14.6 GB | Top consumer GPU |
| NVIDIA RTX 5090 32GB | 32 GB | Fits | +22.6 GB | Current-gen flagship |
| NVIDIA A100 40GB | 40 GB | Fits | +30.6 GB | Datacenter-class |
| NVIDIA A100 80GB | 80 GB | Fits | +70.6 GB | Datacenter-class, large |
| NVIDIA H100 80GB | 80 GB | Fits | +70.6 GB | Current datacenter flagship |
| Mac M2 16GB | 12 GB (unified) | Fits | +2.58 GB | Unified memory; ~12 GB usable after OS |
| Mac M3 Max 36GB | 30 GB (unified) | Fits | +20.6 GB | Unified memory; ~30 GB usable |
| Mac M3 Max 64GB | 56 GB (unified) | Fits | +46.6 GB | Unified memory; ~56 GB usable |
| Mac M3 Ultra 128GB | 115 GB (unified) | Fits | +105.6 GB | Unified memory; ~115 GB usable |
| Mac M4 Max 128GB | 115 GB (unified) | Fits | +105.6 GB | Unified memory; ~115 GB usable |
How the math works (a runnable sketch follows the list):
- Weights: params × bytes/param. FP16 = 2 bytes, Q4_K_M ≈ 0.58 bytes. For MoE, only active-expert params are loaded on the GPU per forward pass.
- KV cache: 2 × num_layers × hidden_size × context × 2 bytes (FP16). This scales linearly with context length and is often the surprise cost.
- Overhead: activations, framework workspace, CUDA kernels — typically 1.5–3 GB or ~10% of weights, whichever is larger.
- Disk: the full quantized checkpoint. MoE models store every expert on disk even if only a few are active per token.
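Putting the three bullets together, here is a minimal sketch of the arithmetic, assuming decimal gigabytes and a flat 2 GB overhead floor taken from the 1.5–3 GB range above; the function name and the example's architecture numbers (32 layers, hidden size 4096, roughly Llama-3.1-8B-class) are illustrative, not the tool's actual code:

```python
# Sketch of the page's formula: weights + KV cache + overhead, in decimal GB.
def estimate_vram_gb(params: float, num_layers: int, hidden_size: int,
                     context: int, bytes_per_param: float = 0.58) -> float:
    weights = params * bytes_per_param                      # Q4_K_M ratio by default
    kv_cache = 2 * num_layers * hidden_size * context * 2  # K and V tensors, FP16
    overhead = max(2e9, 0.10 * weights)                    # ~2 GB floor or 10% of weights
    return (weights + kv_cache + overhead) / 1e9

# Example: an 8B dense model at Q4_K_M with an 8K context
# (32 layers / hidden size 4096 are assumed Llama-3.1-8B-class numbers):
print(round(estimate_vram_gb(8e9, 32, 4096, 8192), 1))  # -> 10.9
```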
Sources: Architecture numbers (num_hidden_layers, hidden_size, max_position_embeddings) come directly from each model's published HuggingFace config.json — click the HuggingFace link next to the selected model to see the exact config. Quantization byte/param ratios come from the llama.cpp GGUF k-quants spec. Estimates typically land within ±10% of real-world nvidia-smi usage; exact overhead depends on your runtime (llama.cpp, vLLM, TensorRT-LLM, MLX).
Quantization availability: Mathematically, any model can be quantized to any level — but community GGUF releases don't always ship every variant. Q4_K_M, Q5_K_M, Q8_0, and FP16 are near-universal; Q2_K and Q3_K_M are often skipped on smaller models where quality loss is noticeable. To find a specific quant for a specific model, search bartowski, TheBloke, or the official repo on HuggingFace.
Context length: Each model has a native max context defined in its config. Going beyond it (via RoPE/YaRN scaling) works mechanically but degrades quality — the warning banner above flags this whenever the current selection exceeds the model's native limit.
How to estimate LLM requirements
1. Pick the model. Select the open-source model you plan to run — Llama 3.1, Mistral, Qwen 2.5, DeepSeek R1, Gemma 2, Phi, or Code Llama. The dropdown is grouped by family and shows license and release date.
2. Choose a quantization level. Q4_K_M is the community default: about 0.58 bytes per parameter and nearly indistinguishable from FP16 on most tasks. FP16 doubles the size but matches the original training precision. Q2_K is the smallest, with visible quality loss.
3. Set the context length. Context length drives the KV cache, which scales linearly with tokens. A 128K-context session on Llama 3.1 70B adds over 10 GB of VRAM on top of the weights. Use the preset buttons for common sizes (a sketch of this scaling follows the list).
4. Read the verdict. The three cards show total VRAM (weights + KV cache + overhead), the recommended system RAM for CPU-only inference, and the disk space you need to download. The hardware table underneath marks every common GPU and Mac configuration as Fits / Tight / Too small, with exact headroom (a sketch of the verdict rule also follows).
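To make step 3 concrete, here is a short sketch of the KV-cache formula from this page, with 8B-class architecture numbers (32 layers, hidden size 4096) assumed for illustration:

```python
# KV cache per the page's formula: 2 x num_layers x hidden_size x context x 2 bytes.
def kv_cache_gb(num_layers: int, hidden_size: int, context: int) -> float:
    return 2 * num_layers * hidden_size * context * 2 / 1e9

for ctx in (4_096, 8_192, 32_768, 131_072):
    print(f"{ctx:>7} tokens -> {kv_cache_gb(32, 4096, ctx):6.2f} GB")
# 4096 -> 2.15, 8192 -> 4.29, 32768 -> 17.18, 131072 -> 68.72: linear in context.
```

And a sketch of the Fits / Tight / Too small verdict from step 4, assuming a simple headroom rule (the tool's exact "Tight" threshold isn't documented on this page; the 9.42 GB footprint below is inferred from the table's +2.58 GB headroom on a 12 GB card):

```python
# Hypothetical verdict rule: negative headroom is "Too small", a thin
# margin is "Tight", anything else "Fits". The 1 GB threshold is assumed.
def fit_verdict(needed_gb: float, available_gb: float,
                tight_margin_gb: float = 1.0) -> tuple[str, float]:
    headroom = round(available_gb - needed_gb, 2)
    if headroom < 0:
        return "Too small", headroom
    if headroom < tight_margin_gb:
        return "Tight", headroom
    return "Fits", headroom

print(fit_verdict(9.42, 12.0))  # ('Fits', 2.58) -- matches the RTX 3060 row
```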
Who this is for
- Choosing a GPU for local inference
- Sizing a cloud instance
- Picking a quantization for your hardware
- Budgeting disk space
About This Tool
This calculator estimates the VRAM, system RAM, and disk space required to run a given open-source large language model on your own hardware. It covers 30+ popular models across the Llama, Mistral, Qwen, DeepSeek, Gemma, Phi, and Code Llama families, and every common quantization level from full FP32 down to Q2_K. Select a model, pick a quantization, set a context length, and the tool computes the estimated memory footprint plus a fit verdict for every common GPU and Apple Silicon configuration.
The math mirrors what llama.cpp, vLLM, and Ollama actually use. Model weight memory equals the parameter count times bytes-per-parameter, where Q4_K_M sits at about 0.58 bytes/param and FP16 at 2 bytes/param. The KV cache is 2 × num_layers × hidden_size × context × 2 bytes (it stays FP16 even when weights are quantized), which is usually the surprise cost on long-context sessions — a 128K context on a 70B model adds more than 10 GB on top of the weights. Runtime overhead (activations, CUDA workspace, framework buffers) adds another 1.5–3 GB or ~10% of weights, whichever is larger.
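As a quick check of the bytes-per-parameter claim, here is the weight term alone for a 70B dense model, using only the ratios given above (a hedged sketch, not the tool's code):

```python
# Weight memory = parameter count x bytes per parameter (decimal GB).
for name, bytes_per_param in [("FP16", 2.0), ("Q4_K_M", 0.58)]:
    print(f"{name:7s} {70e9 * bytes_per_param / 1e9:6.1f} GB")
# FP16     140.0 GB
# Q4_K_M    40.6 GB  (before adding KV cache and overhead)
```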
For MoE models like Mixtral 8x7B or DeepSeek V3, the calculator distinguishes active parameters (which determine per-token VRAM) from total parameters (which determine disk size). DeepSeek V3 has 671B total parameters but only 37B active per token, so it runs in a fraction of the VRAM the full number suggests — but still needs the full ~400 GB of disk for the quantized weights.
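The same arithmetic, split the way the calculator splits it for MoE, using the DeepSeek V3 counts from this paragraph (names and rounding are illustrative):

```python
# MoE: disk follows TOTAL params (every expert is stored on disk);
# per-token weight VRAM follows ACTIVE params only. Q4_K_M ratio from this page.
def weight_gb(params: float, bytes_per_param: float = 0.58) -> float:
    return params * bytes_per_param / 1e9

total_params, active_params = 671e9, 37e9  # DeepSeek V3, per the paragraph above
print(f"disk : {weight_gb(total_params):5.0f} GB")   # ~389 GB of quantized weights
print(f"VRAM : {weight_gb(active_params):5.0f} GB")  # ~21 GB of weights per token
```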
Pair this calculator with the AI Model Picker to choose a model by quality-per-cost, or with the dnpm Configurator and AI Agent Starter Guide when setting up a local development workflow. For hosted inference, the Cloudflare Cost Calculator estimates Workers AI spend for the same model at inference time.
How It Compares
Memory estimator blog posts and Hugging Face Space demos exist, but most are model-specific, skip the KV cache entirely, or ignore runtime overhead — which is exactly the memory that turns a near-miss into an out-of-memory crash. This calculator is model-agnostic across 30+ curated checkpoints, always includes KV cache and overhead, and lets you see the headroom on 13 specific hardware targets side by side.
The alternative — trying to load the model and watching nvidia-smi or macOS Activity Monitor — works but wastes a 40 GB download when the answer is no. This tool gives you the answer before you download anything, and runs entirely in your browser: the model list and math are static data, so your hardware choices and model preferences are never transmitted anywhere.