Self-hosting (TCO)
Running an open-weight model yourself on rented (or owned) hardware instead of paying a per-token inference API. Self-hosting becomes cheaper than the API at sustained high load, especially with batched serving, but it adds operational overhead: GPU provisioning, scaling, monitoring, and model upgrades. A rough break-even sketch follows.
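To make the break-even claim concrete, here is a minimal Python sketch. Every constant (GPU rental rate, batched throughput, API price) is an illustrative assumption, not a quote; substitute your own numbers.

```python
# Rough TCO break-even: dedicated GPU rental vs. per-token inference API.
# All constants are illustrative assumptions -- substitute real quotes.

GPU_COST_PER_HOUR = 2.00    # assumed rental rate for one inference GPU (USD)
GPU_THROUGHPUT_TPS = 1500   # assumed tokens/second with batched serving
API_PRICE_PER_MTOK = 0.50   # assumed API price per million tokens (USD)

def self_host_cost_per_mtok(utilization: float) -> float:
    """Cost per million tokens on an always-on GPU at a given utilization (0..1]."""
    tokens_per_hour = GPU_THROUGHPUT_TPS * 3600 * utilization
    return GPU_COST_PER_HOUR / tokens_per_hour * 1_000_000

for util in (0.05, 0.25, 0.50, 0.90):
    cost = self_host_cost_per_mtok(util)
    verdict = "self-host wins" if cost < API_PRICE_PER_MTOK else "API wins"
    print(f"utilization {util:4.0%}: ${cost:.2f}/Mtok vs API ${API_PRICE_PER_MTOK:.2f}/Mtok -> {verdict}")
```

The sketch models always-on billing, which is why low utilization is so expensive; a scale-to-zero serverless endpoint changes the curve by charging only for tokens actually served (see the list below).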
- Always-on vs. inference-only: Dedicated 24/7 GPU billing vs. scale-to-zero serverless inference.
- Batched serving: Running multiple inference requests through the same GPU forward pass (see the sketch after this list).
- Quantization (bf16, fp8, awq-int4, gguf-q4 …): Compressing model weights to fewer bits per parameter to fit on smaller GPUs (see the quantization sketch after this list).
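To illustrate batched serving, here is a minimal sketch using Hugging Face transformers (an assumed stack; "gpt2" is just a small stand-in model). Several prompts are padded into one tensor and decoded in a single generate() call, so they share the model's forward passes instead of running one at a time.

```python
# Minimal batched-serving sketch with Hugging Face transformers (assumed stack).
# Real deployments use a serving engine (e.g. vLLM) that batches continuously,
# but the core idea is the same: requests share each GPU forward pass.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token   # gpt2 has no pad token by default
tokenizer.padding_side = "left"             # left-pad so generation aligns

model = AutoModelForCausalLM.from_pretrained("gpt2")

prompts = [
    "The total cost of ownership of a GPU is",
    "Batched inference works by",
    "Quantization shrinks a model because",
]

# One padded tensor, one generate() call: all three requests move through
# the model together.
inputs = tokenizer(prompts, return_tensors="pt", padding=True)
outputs = model.generate(**inputs, max_new_tokens=20,
                         pad_token_id=tokenizer.eos_token_id)

for text in tokenizer.batch_decode(outputs, skip_special_tokens=True):
    print(text)
```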
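And to show what "fewer bits per parameter" means concretely, here is a NumPy sketch of simple symmetric int4 weight quantization. It is illustrative only: real formats (fp8, AWQ int4, GGUF Q4 variants) add group-wise scales, calibration data, and packed storage.

```python
# Illustrative symmetric int4 quantization of a weight matrix with NumPy.
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=(4096, 4096)).astype(np.float32)  # fake weights

# Per-row scale maps each row's max magnitude onto the int4 range [-7, 7].
scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)  # 4-bit values in int8 slots
w_hat = q * scale                                        # dequantize for use

fp32_bytes = w.size * 4
int4_bytes = w.size // 2 + scale.size * 4  # two 4-bit weights per byte, plus scales
print(f"fp32: {fp32_bytes/2**20:.1f} MiB, int4: {int4_bytes/2**20:.1f} MiB "
      f"({fp32_bytes/int4_bytes:.1f}x smaller)")
print(f"mean abs rounding error: {np.abs(w - w_hat).mean():.2e}")
```

The roughly 8x shrink versus fp32 (4x versus bf16) is what lets a model that needs multiple GPUs at full precision fit on a single smaller one, at the cost of some rounding error.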