Quantization (bf16, fp8, awq-int4, gguf-q4 …)
Compressing model weights to fewer bits per parameter so a model fits on smaller (or fewer) GPUs. A 70B model in bf16 needs ~140 GB of VRAM; in fp8, ~70 GB; in awq-int4, ~35 GB. Quality degrades slightly with each step down: fp8 is nearly indistinguishable from bf16, while int4 introduces measurable but usually acceptable quality loss for most use cases.
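A minimal sketch of the arithmetic behind those figures; the bytes-per-parameter values and the helper name are illustrative assumptions, and real deployments also need headroom for the KV cache and activations:

```python
# Rough VRAM estimate for the weights alone, under assumed bytes-per-parameter
# figures. Real serving needs extra memory for KV cache, activations, etc.
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "awq-int4": 0.5}

def weight_vram_gb(n_params_billion: float, dtype: str) -> float:
    """Approximate GB of VRAM needed just to hold the weights."""
    return n_params_billion * 1e9 * BYTES_PER_PARAM[dtype] / 1e9

for dtype in BYTES_PER_PARAM:
    print(f"70B @ {dtype}: ~{weight_vram_gb(70, dtype):.0f} GB")
# 70B @ bf16: ~140 GB; fp8: ~70 GB; awq-int4: ~35 GB
```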
-
Mixture-of-Experts (MoE)
An architecture where each token activates only a subset of the model's parameters.
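A toy sketch of the routing idea; the names, sizes, and softmax-over-top-k choice are illustrative assumptions, not any specific model's implementation:

```python
import numpy as np

# Toy top-k router: each token picks k of n_experts, so only a fraction of
# the expert weights are multiplied for any given token.
rng = np.random.default_rng(0)
n_experts, k, d_model = 8, 2, 16

router_w = rng.normal(size=(d_model, n_experts))               # router projection
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]

def moe_layer(token: np.ndarray) -> np.ndarray:
    logits = token @ router_w
    top = np.argsort(logits)[-k:]                              # indices of the top-k experts
    weights = np.exp(logits[top]) / np.exp(logits[top]).sum()  # softmax over the chosen experts
    # Only k of the n_experts matrices run for this token.
    return sum(w * (token @ experts[i]) for w, i in zip(weights, top))

out = moe_layer(rng.normal(size=d_model))
print(out.shape)  # (16,)
```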
-
Self-hosting (TCO)
Running open-weight models on your own (or rented) hardware instead of paying an inference API; the comparison is usually framed as total cost of ownership (TCO).
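A back-of-the-envelope comparison of the two options; every price and throughput number below is a placeholder assumption, not a quote:

```python
# Toy break-even sketch: rented GPU node vs. a per-token API.
# All numbers are assumptions; plug in real quotes and measured throughput.
gpu_cost_per_hour = 2.00             # $/hr for a rented GPU (assumed)
tokens_per_second = 1_500            # sustained batched throughput (assumed)
api_cost_per_million_tokens = 0.60   # $ per 1M tokens from an API (assumed)

self_host_cost_per_million = gpu_cost_per_hour / (tokens_per_second * 3600) * 1e6
print(f"self-host: ${self_host_cost_per_million:.2f} per 1M tokens")
print(f"api:       ${api_cost_per_million_tokens:.2f} per 1M tokens")
# Self-hosting only wins if utilization keeps the GPU busy; idle hours still bill.
```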
-
Batched serving
Running multiple inference requests through the same GPU forward pass.
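A minimal sketch of the idea; the stand-in "model" is just a linear layer, and all names and shapes are assumptions. The point is that one matmul over a stacked batch amortizes the weight reads across requests:

```python
import numpy as np

# Toy illustration: stack several requests into one batch so a single
# matrix multiply (one "forward pass") serves them all.
rng = np.random.default_rng(0)
d_model, vocab = 16, 100
lm_head = rng.normal(size=(d_model, vocab))

requests = [rng.normal(size=d_model) for _ in range(4)]  # 4 concurrent requests

batch = np.stack(requests)            # (4, d_model): one batch, not 4 separate calls
logits = batch @ lm_head              # a single matmul serves all 4 requests
next_tokens = logits.argmax(axis=-1)  # one predicted token per request
print(next_tokens.shape)              # (4,)
```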