Batched serving
Running multiple inference requests through the same GPU forward pass. Modern stacks (vLLM, TGI, SGLang) do continuous batching: requests join and leave the batch as they arrive. This is why one self-hosted GPU can serve 50–100 concurrent users for roughly the same total cost as a single user, dividing the per-token cost accordingly.
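A minimal sketch of the batched path using vLLM's offline Python API; the model id and sampling settings are illustrative assumptions, and the continuous batching itself happens inside the engine:

```python
# A minimal sketch of batched offline inference with vLLM; the model id is
# an illustrative assumption, and the batching happens inside the engine.
from vllm import LLM, SamplingParams

prompts = [
    "Explain continuous batching in one sentence.",
    "Why does batching lower per-token cost?",
    "What is a KV cache?",
]
params = SamplingParams(temperature=0.7, max_tokens=128)

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # assumed model id
outputs = llm.generate(prompts, params)  # all prompts share forward passes
for out in outputs:
    print(out.outputs[0].text)
```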
-
Self-hosting (TCO)
Running open-weight models on your own (or rented) hardware instead of paying an inference API.
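A back-of-the-envelope sketch of the TCO (total cost of ownership) arithmetic, assuming a hypothetical $2/hr GPU rental, 2,500 aggregate tok/s, and 50% utilization:

```python
# Back-of-the-envelope cost per million output tokens for a rented GPU.
# Every figure below is an illustrative assumption, not a quoted price.
gpu_hourly_usd = 2.00       # assumed rental price for one GPU
throughput_tok_s = 2_500    # assumed aggregate tok/s across the whole batch
utilization = 0.5           # assumed fraction of each hour spent generating

tokens_per_hour = throughput_tok_s * 3600 * utilization
cost_per_million_tokens = gpu_hourly_usd / tokens_per_hour * 1_000_000
print(f"${cost_per_million_tokens:.2f} per 1M output tokens")  # ~$0.44 here
```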
-
tok/s (throughput)
Tokens generated per second after the first one.
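A sketch of how this is typically measured, with a placeholder token generator standing in for a real streaming API; the point is that the first token (and hence prefill latency) is excluded from the calculation:

```python
import time

def fake_stream(n_tokens=50, delay=0.02):
    """Placeholder token source standing in for a real streaming API."""
    for i in range(n_tokens):
        time.sleep(delay)  # pretend per-token decode latency
        yield f"tok{i}"

first_token_at = None
count = 0
for _ in fake_stream():
    count += 1
    if first_token_at is None:
        first_token_at = time.monotonic()  # time-to-first-token boundary
t_end = time.monotonic()

# The first token is excluded, so prefill time does not inflate the number.
decode_tok_s = (count - 1) / (t_end - first_token_at)
print(f"{decode_tok_s:.1f} tok/s")
```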
-
Quantization (bf16, fp8, awq-int4, gguf-q4 …)
Compressing model weights to fewer bits per parameter to fit on smaller GPUs.
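A rough sketch of why this matters for GPU fit, assuming a hypothetical 70B-parameter model and idealized bits per weight (real quantized files carry some per-block overhead):

```python
# Rough weight-memory footprint per precision. Pure arithmetic: assumes a
# hypothetical 70B-parameter model, idealized bits per weight, and ignores
# the KV cache and activation memory.
params = 70e9
bytes_per_param = {"bf16": 2.0, "fp8": 1.0, "awq-int4": 0.5, "gguf-q4": 0.5}

for fmt, b in bytes_per_param.items():
    print(f"{fmt:>8}: {params * b / 1e9:.0f} GB")
# bf16 needs ~140 GB (multiple GPUs); int4 needs ~35 GB (one 48 GB card).
```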