Quantization (bf16, fp8, awq-int4, gguf-q4 …)
Compressing model weights to fewer bits per parameter so a model fits on smaller (or fewer) GPUs. A 70B model in bf16 needs ~140 GB of VRAM; in fp8, ~70 GB; in awq-int4, ~35 GB. Quality degrades slightly with each step down: fp8 is nearly indistinguishable from bf16, while int4 introduces measurable but usually acceptable quality loss for most use cases.
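A minimal sketch of the arithmetic behind those figures; the bytes-per-parameter values and the helper name are illustrative assumptions, and real deployments also need headroom for the KV cache and activations:

```python
# Rough VRAM estimate for the weights alone, under assumed bytes-per-parameter
# figures. Real serving needs extra memory for KV cache, activations, etc.
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "awq-int4": 0.5}

def weight_vram_gb(n_params_billion: float, dtype: str) -> float:
    """Approximate GB of VRAM needed just to hold the weights."""
    return n_params_billion * 1e9 * BYTES_PER_PARAM[dtype] / 1e9

for dtype in BYTES_PER_PARAM:
    print(f"70B @ {dtype}: ~{weight_vram_gb(70, dtype):.0f} GB")
# 70B @ bf16: ~140 GB; fp8: ~70 GB; awq-int4: ~35 GB
```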
-
Mixture-of-Experts (MoE)
An architecture where each token activates only a subset of the model's parameters.
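A toy sketch of the routing idea; the names, sizes, and softmax-over-top-k choice are illustrative assumptions, not any specific model's implementation:

```python
import numpy as np

# Toy top-k router: each token picks k of n_experts, so only a fraction of
# the expert weights are multiplied for any given token.
rng = np.random.default_rng(0)
n_experts, k, d_model = 8, 2, 16

router_w = rng.normal(size=(d_model, n_experts))               # router projection
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]

def moe_layer(token: np.ndarray) -> np.ndarray:
    logits = token @ router_w
    top = np.argsort(logits)[-k:]                              # indices of the top-k experts
    weights = np.exp(logits[top]) / np.exp(logits[top]).sum()  # softmax over the chosen experts
    # Only k of the n_experts matrices run for this token.
    return sum(w * (token @ experts[i]) for w, i in zip(weights, top))

out = moe_layer(rng.normal(size=d_model))
print(out.shape)  # (16,)
```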
-
Self-hosting (TCO)
Running open-weight models on your own (or rented) hardware instead of paying an inference API; the comparison is usually framed as total cost of ownership (TCO).
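A back-of-the-envelope comparison of the two options; every price and throughput number below is a placeholder assumption, not a quote:

```python
# Toy break-even sketch: rented GPU node vs. a per-token API.
# All numbers are assumptions; plug in real quotes and measured throughput.
gpu_cost_per_hour = 2.00             # $/hr for a rented GPU (assumed)
tokens_per_second = 1_500            # sustained batched throughput (assumed)
api_cost_per_million_tokens = 0.60   # $ per 1M tokens from an API (assumed)

self_host_cost_per_million = gpu_cost_per_hour / (tokens_per_second * 3600) * 1e6
print(f"self-host: ${self_host_cost_per_million:.2f} per 1M tokens")
print(f"api:       ${api_cost_per_million_tokens:.2f} per 1M tokens")
# Self-hosting only wins if utilization keeps the GPU busy; idle hours still bill.
```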
-
Batched serving
Running multiple inference requests through the same GPU forward pass.
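A minimal sketch of the idea; the stand-in "model" is just a linear layer, and all names and shapes are assumptions. The point is that one matmul over a stacked batch amortizes the weight reads across requests:

```python
import numpy as np

# Toy illustration: stack several requests into one batch so a single
# matrix multiply (one "forward pass") serves them all.
rng = np.random.default_rng(0)
d_model, vocab = 16, 100
lm_head = rng.normal(size=(d_model, vocab))

requests = [rng.normal(size=d_model) for _ in range(4)]  # 4 concurrent requests

batch = np.stack(requests)            # (4, d_model): one batch, not 4 separate calls
logits = batch @ lm_head              # a single matmul serves all 4 requests
next_tokens = logits.argmax(axis=-1)  # one predicted token per request
print(next_tokens.shape)              # (4,)
```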