What Is Quantization and Why Does It Matter for Running AI Models Locally?

Quantization makes large language models run on consumer hardware by compressing model weights. Learn what Q4_K_M, Q5_K_M, and Q8_0 mean—and which to choose.

The same questions keep surfacing in AI communities like r/LocalLLaMA and r/artificial: "What does Q4_K_M mean? Should I use Q8? Why is this model 40GB in one file and 5GB in another?" If you have been exploring local LLMs, you have encountered the dizzying array of quantization options. Understanding what quantization actually does, and which flavor to choose, can mean the difference between running a 70 billion parameter model on your own machine and staring at an "out of memory" error.

What Is Quantization, Really?

At its core, quantization is a compression technique that reduces the precision of numbers representing a language model's parameters. Think of it like converting a RAW photograph to JPEG. You lose some fine detail, but the file becomes dramatically smaller and faster to work with.

Modern large language models are typically trained and released in 16-bit floating point (FP16 or BF16). Each parameter, one of the adjustable weights the model learns during training, takes two bytes of storage. For a 70 billion parameter model, that means 140 gigabytes of memory just to load the weights. Most consumer GPUs have 8-24GB of VRAM. The math does not work.
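The arithmetic behind that 140GB figure is simple enough to sketch. The function name below is made up for illustration, and the estimate covers the weights only, ignoring context cache and runtime overhead.

```python
def weights_size_gb(params_billion: float, bits_per_param: float) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    total_bytes = params_billion * 1e9 * bits_per_param / 8
    return total_bytes / 1e9


fp16_70b = weights_size_gb(70, 16)
print(f"70B at FP16: {fp16_70b:.0f} GB")

# Compare against the consumer GPU range mentioned above.
for vram_gb in (8, 24):
    print(f"Fits in a {vram_gb} GB GPU: {fp16_70b <= vram_gb}")
```

Swapping in a smaller value for bits_per_param shows how quickly the requirement falls as precision drops.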

Quantization solves this by representing those same parameters with fewer bits. Instead of 16 bits per parameter, you might use 8, 5, or even 4 bits. The model shrinks. Memory requirements drop. Suddenly that 70B parameter behemoth runs on hardware you actually own.
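To make "fewer bits" concrete, here is a minimal sketch of round-to-nearest quantization with a single shared scale. It is a simplified illustration, not the scheme any particular GGUF format uses, and the sample weights are invented.

```python
def quantize(values, bits):
    """Map floats to signed integers of the given width using one shared scale."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits, 7 for 4 bits
    scale = max(abs(v) for v in values) / qmax
    if scale == 0:
        scale = 1.0  # all-zero input: any scale works
    ints = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return ints, scale


def dequantize(ints, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in ints]


def max_error(values, bits):
    """Largest absolute difference after a quantize/dequantize round trip."""
    ints, scale = quantize(values, bits)
    restored = dequantize(ints, scale)
    return max(abs(a - b) for a, b in zip(values, restored))


weights = [0.12, -0.53, 0.98, -0.07, 0.31, -0.88, 0.44, 0.02]  # made-up values
for bits in (8, 4):
    print(f"{bits}-bit round trip, max error: {max_error(weights, bits):.4f}")
```

The 4-bit round trip loses more detail than the 8-bit one, which is the size-versus-fidelity tradeoff the rest of this article is about.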

The GGUF Format and K-Quants Explained

The GGUF format (GPT-Generated Unified Format), developed by the llama.cpp project, has become the standard for running quantized models locally. Within GGUF, you will encounter K-quants: quantization methods that limit quality loss by quantizing weights in small blocks, each with its own scale, and by mixing bit widths so that more sensitive tensors keep more precision. An importance matrix built from calibration data can optionally guide which weights are preserved most carefully.
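The effect of mixed-bit representation on file size comes down to a weighted average: if some tensors are stored at a higher bit width than others, the effective bits per parameter lands somewhere in between. The split below is invented for illustration and is not taken from any real model file.

```python
def average_bits(tensor_groups):
    """tensor_groups: list of (parameter_count, bits_per_parameter) pairs."""
    total_params = sum(count for count, _ in tensor_groups)
    total_bits = sum(count * bits for count, bits in tensor_groups)
    return total_bits / total_params


# Hypothetical 8B model: most parameters at 4.5 bits,
# a sensitive minority kept at 6.5 bits.
groups = [(7.0e9, 4.5), (1.0e9, 6.5)]
print(f"{average_bits(groups):.2f} bits per parameter on average")
```

This is why the bit counts quoted for K-quants are approximate averages rather than whole numbers.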

Here is what those cryptic filenames actually mean:

Q4_K_M — The Sweet Spot

Approximately 4.5 bits per parameter. Reduces model size by roughly 75% compared to FP16. Quality loss estimates range from 5-10%, though for most conversational and creative tasks, the difference is barely perceptible.

At 4.5 bits per parameter, the weights of an 8 billion parameter model take about 4.5GB of RAM. A 70B model needs around 39GB. This is the default recommendation for most users because it delivers approximately 90% of the original quality at roughly a quarter of the size.

Use Q4_K_M when: You have 8-16GB of RAM, want general chat capabilities, and need to balance quality against hardware constraints.

Q5_K_M — For the Discerning User

Approximately 5.5 bits per parameter. Reduces size by about 65% with only 3-5% quality loss. The file is roughly 25% larger than Q4_K_M, but the quality improvement is noticeable for coding tasks, precise reasoning, and professional writing.

An 8B Q5_K_M model consumes about 5.5GB. The 70B variant needs approximately 48GB.

Use Q5_K_M when: You have 16-32GB of RAM, do development work, or need higher fidelity output for professional applications.

Q8_0 — Near-Perfect Preservation

About 8.5 bits per parameter: each weight is stored as an 8-bit integer, plus a small scale factor shared by each block of weights. Nearly halves the model size with only 1-2% quality degradation. Benchmarks and blind comparisons struggle to separate Q8_0 from FP16. For all practical purposes, this is indistinguishable from the original.

The tradeoff is size. An 8B model still requires 8.5GB. The 70B version demands 74GB.
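Those sizes work out to roughly 8.5 bits per parameter rather than a flat 8. That matches a block layout in which every 32 eight-bit weights share one 16-bit scale, which is how llama.cpp's Q8_0 is commonly described. Treat the block size and scale width below as assumptions of this sketch.

```python
def block_bits_per_weight(weights_per_block, bits_per_weight, scale_bits):
    """Effective bits per weight once the shared per-block scale is counted."""
    return (weights_per_block * bits_per_weight + scale_bits) / weights_per_block


def size_gb(params_billion, bits_per_weight):
    """Weights-only size in decimal gigabytes."""
    return params_billion * bits_per_weight / 8


# Assumed layout: 32 weights of 8 bits each, plus one 16-bit scale per block.
Q8_0_BPW = block_bits_per_weight(32, 8, 16)
print(f"Effective bits per weight: {Q8_0_BPW}")

for params in (8, 70):
    print(f"{params}B model: about {size_gb(params, Q8_0_BPW):.1f} GB")
```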

Use Q8_0 when: You have 32GB+ of RAM, are doing research, or require maximum quality for fine-tuning operations.

FP16/BF16 — The Reference Standard

Sixteen bits per parameter. No compression. No quality loss. This is how models are trained and represents the absolute reference point for quality comparisons.

A 70B FP16 model is 140GB. Even an 8B model needs 16GB. These are essentially inaccessible for local use without professional hardware.

Use FP16 when: You are training or fine-tuning models, have datacenter-grade hardware, or need the absolute reference implementation.

The Hardware Decision Matrix

Choosing the right quantization depends mostly on your available memory. Here is a practical guide:

Your RAM    Recommended Quant                               Max Model Size
8 GB        Q4_K_M                                          7-8B parameters
16 GB       Q5_K_M (8B) / Q4_K_M (13-14B)                   13-14B parameters
32 GB       Q5_K_M (32B) / Q8_0 (8B)                        30-32B parameters
64 GB+      Q8_0 for most models; Q4_K_M or Q5_K_M for 70B  70B+ parameters
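A rough way to sanity-check a row of this table is to compare the weight size against the memory left after setting some aside for everything else. The sketch below uses the bits-per-parameter figures quoted in this article (with Q8_0 inferred from its file sizes). The 3GB reserve is an assumption, and the function name is hypothetical.

```python
# Approximate bits per parameter for each quantization discussed here.
BITS = {"Q4_K_M": 4.5, "Q5_K_M": 5.5, "Q8_0": 8.5}


def quants_that_fit(ram_gb, params_billion, reserve_gb=3.0):
    """List quantizations whose weights fit after setting aside reserve_gb.

    reserve_gb is an assumed allowance for the OS, context cache, and runtime.
    """
    budget = ram_gb - reserve_gb
    return [
        name
        for name, bits in BITS.items()
        if params_billion * bits / 8 <= budget
    ]


for ram, params in [(8, 8), (16, 14), (32, 32), (64, 70)]:
    options = quants_that_fit(ram, params) or ["none"]
    print(f"{ram} GB RAM, {params}B model: {', '.join(options)}")
```

The table's picks are sometimes more conservative than this check, which leaves room for longer contexts and other running applications.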

Beyond Quality: Speed and Efficiency

Quantization affects more than just output quality. It directly impacts inference speed. Generating each token means reading the model's weights from memory, so smaller weights mean less data to move and faster generation.

Q4_K_M typically runs 20-30% faster than FP16. Q5_K_M sees 15-20% speedups. Actual gains depend on the hardware and runtime. Modern accelerators, such as NVIDIA's RTX 30 and 40 series with their native INT4 and INT8 support, or Apple Silicon, can achieve 2-3x acceleration compared to running FP16 on CPU.

This means quantization not only makes models accessible but actually improves the user experience through faster token generation. Your quantized model responds quicker while using a fraction of the resources.
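One way to see why smaller weights can speed up generation: if producing a token requires reading every weight from memory once, and memory bandwidth is the limiting factor (an assumption that often holds for single-user inference, but not always), then the ceiling on tokens per second is roughly bandwidth divided by model size. The bandwidth figure below is hypothetical, and the sizes are the approximate 8B figures from this article.

```python
def max_tokens_per_second(bandwidth_gb_s, model_size_gb):
    """Upper bound on generation speed if each token reads all weights once."""
    return bandwidth_gb_s / model_size_gb


BANDWIDTH_GB_S = 100.0  # hypothetical memory bandwidth, not a measured value

# Approximate 8B model sizes in GB for each format.
sizes_gb = {"FP16": 16.0, "Q8_0": 8.5, "Q5_K_M": 5.5, "Q4_K_M": 4.5}
for name, size in sizes_gb.items():
    ceiling = max_tokens_per_second(BANDWIDTH_GB_S, size)
    print(f"{name}: at most about {ceiling:.1f} tokens/s")
```

This is a ceiling, not a prediction. Measured speedups can fall well below it, because unpacking quantized weights and the rest of the computation also take time.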

The Quality Tradeoff: What Actually Suffers?

Not all tasks degrade equally under quantization. Understanding where quality loss appears helps you choose wisely.

Minimal impact: General conversation, creative writing, brainstorming, question answering, and most text generation tasks. The difference between Q4_K_M and FP16 for casual chat is essentially undetectable.

Moderate impact: Coding assistance, mathematical reasoning, and precise factual recall. Q5_K_M or Q8_0 perform noticeably better for programming tasks.

Maximum impact: Complex multi-step reasoning, precise numerical computation, and edge-case prompts designed to break models. Here, the higher-precision options (Q5_K_M, Q8_0) keep a clear advantage over Q4_K_M.

For most users, Q4_K_M handles roughly 90% of use cases well. If you notice your quantized model struggling with specific tasks, particularly code generation or complex reasoning, try stepping up to Q5_K_M before abandoning local inference entirely.

Practical Recommendations for 2026

If you are just starting with local LLMs, default to Q4_K_M. It offers the best balance of accessibility and quality. You can run capable 8B models on modest hardware and experiment with larger models as your setup allows.

For developers and professionals who rely on AI assistance for coding or detailed writing, invest in hardware that supports Q5_K_M for your target model sizes. The quality improvement justifies the RAM upgrade.

Researchers and those doing fine-tuning should target Q8_0 or FP16. The marginal gains in quality matter when you are pushing model capabilities to their limits or training new behaviors.

Remember that quantization is not a one-size-fits-all decision. Many enthusiasts maintain multiple versions of the same model—Q4_K_M for quick tasks, Q5_K_M for serious work. Your hardware and use case should drive the choice, not abstract quality metrics.

The Bottom Line

Quantization is the technology that democratized access to large language models. Without it, local AI would remain the exclusive domain of datacenters and research institutions. Q4_K_M gives you 90%+ of a model's capability at 25% of the resource cost. Q5_K_M pushes that to 95%+ at 35% of the cost.

The question is no longer whether you can run AI locally. It is simply which quantization level matches your hardware and requirements. Start with Q4_K_M, upgrade if you notice quality issues, and enjoy the remarkable fact that consumer hardware now runs models that required supercomputers just a few years ago.

Sources:

  1. LocalClaw — GGUF Quantization Guide: Q4, Q5, Q8 Explained
  2. Enclave AI — LLM Quantization Explained: Run Bigger Models on Less RAM
  3. LLM Hardware — LLM Quantization Explained: Q4, Q8, FP16 and VRAM Tradeoffs