Self-hosting (TCO)
Running an open-weight model yourself on rented (or owned) hardware instead of paying a per-token inference API. Self-hosting becomes cheaper than the API at sustained high load, especially with batched serving, but it adds operational overhead: GPU provisioning, scaling, monitoring, and model upgrades. A rough break-even sketch follows.
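To make the break-even claim concrete, here is a minimal Python sketch. Every constant (GPU rental rate, batched throughput, API price) is an illustrative assumption, not a quote; substitute your own numbers.

```python
# Rough TCO break-even: dedicated GPU rental vs. per-token inference API.
# All constants are illustrative assumptions -- substitute real quotes.

GPU_COST_PER_HOUR = 2.00    # assumed rental rate for one inference GPU (USD)
GPU_THROUGHPUT_TPS = 1500   # assumed tokens/second with batched serving
API_PRICE_PER_MTOK = 0.50   # assumed API price per million tokens (USD)

def self_host_cost_per_mtok(utilization: float) -> float:
    """Cost per million tokens on an always-on GPU at a given utilization (0..1]."""
    tokens_per_hour = GPU_THROUGHPUT_TPS * 3600 * utilization
    return GPU_COST_PER_HOUR / tokens_per_hour * 1_000_000

for util in (0.05, 0.25, 0.50, 0.90):
    cost = self_host_cost_per_mtok(util)
    verdict = "self-host wins" if cost < API_PRICE_PER_MTOK else "API wins"
    print(f"utilization {util:4.0%}: ${cost:.2f}/Mtok vs API ${API_PRICE_PER_MTOK:.2f}/Mtok -> {verdict}")
```

The sketch models always-on billing, which is why low utilization is so expensive; a scale-to-zero serverless endpoint changes the curve by charging only for tokens actually served (see the list below).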
- Always-on vs. inference-only: Dedicated 24/7 GPU billing vs. scale-to-zero serverless inference.
- Batched serving: Running multiple inference requests through the same GPU forward pass (see the sketch after this list).
- Quantization (bf16, fp8, awq-int4, gguf-q4 …): Compressing model weights to fewer bits per parameter to fit on smaller GPUs (see the quantization sketch after this list).
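To illustrate batched serving, here is a minimal sketch using Hugging Face transformers (an assumed stack; "gpt2" is just a small stand-in model). Several prompts are padded into one tensor and decoded in a single generate() call, so they share the model's forward passes instead of running one at a time.

```python
# Minimal batched-serving sketch with Hugging Face transformers (assumed stack).
# Real deployments use a serving engine (e.g. vLLM) that batches continuously,
# but the core idea is the same: requests share each GPU forward pass.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token   # gpt2 has no pad token by default
tokenizer.padding_side = "left"             # left-pad so generation aligns

model = AutoModelForCausalLM.from_pretrained("gpt2")

prompts = [
    "The total cost of ownership of a GPU is",
    "Batched inference works by",
    "Quantization shrinks a model because",
]

# One padded tensor, one generate() call: all three requests move through
# the model together.
inputs = tokenizer(prompts, return_tensors="pt", padding=True)
outputs = model.generate(**inputs, max_new_tokens=20,
                         pad_token_id=tokenizer.eos_token_id)

for text in tokenizer.batch_decode(outputs, skip_special_tokens=True):
    print(text)
```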
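And to show what "fewer bits per parameter" means concretely, here is a NumPy sketch of simple symmetric int4 weight quantization. It is illustrative only: real formats (fp8, AWQ int4, GGUF Q4 variants) add group-wise scales, calibration data, and packed storage.

```python
# Illustrative symmetric int4 quantization of a weight matrix with NumPy.
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=(4096, 4096)).astype(np.float32)  # fake weights

# Per-row scale maps each row's max magnitude onto the int4 range [-7, 7].
scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)  # 4-bit values in int8 slots
w_hat = q * scale                                        # dequantize for use

fp32_bytes = w.size * 4
int4_bytes = w.size // 2 + scale.size * 4  # two 4-bit weights per byte, plus scales
print(f"fp32: {fp32_bytes/2**20:.1f} MiB, int4: {int4_bytes/2**20:.1f} MiB "
      f"({fp32_bytes/int4_bytes:.1f}x smaller)")
print(f"mean abs rounding error: {np.abs(w - w_hat).mean():.2e}")
```

The roughly 8x shrink versus fp32 (4x versus bf16) is what lets a model that needs multiple GPUs at full precision fit on a single smaller one, at the cost of some rounding error.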