LLM Cloud Hub
Self-host vs API

Qwen 2.5 72B Instruct

72B params · Qwen License · Qwen/Qwen2.5-72B-Instruct

Cheapest API option for these weights
Qwen: Qwen2.5 72B Instruct
$0.3600 in / $0.4000 out per 1M tokens
Monthly @ this workload
$276.00
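How that monthly figure follows from the per-token rates, as a minimal sketch. The 600M-in / 150M-out workload below is not stated on the page; it is backed out from the displayed numbers (416.7 GPU-hours × 3600 s × 100 tok/s ≈ 150M output tokens, and $276 at these rates then implies ≈ 600M input tokens), so treat it as illustrative:

```python
def api_monthly_cost(in_tokens_m: float, out_tokens_m: float,
                     in_rate: float = 0.36, out_rate: float = 0.40) -> float:
    """Monthly API cost in USD; token counts in millions, rates in $ per 1M tokens."""
    return in_tokens_m * in_rate + out_tokens_m * out_rate

# Hypothetical workload inferred from the page's figures, not quoted by it.
print(round(api_monthly_cost(600, 150), 2))  # → 276.0
```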

Self-hosted scenarios

Column notes: Quant. (quantization) — bf16 is full precision; fp8 / awq-int4 / gguf-q4 progressively shrink memory at a slight quality cost. tok/s — output tokens generated per second, single-stream; batched serving achieves 5–10× higher aggregate. Inference-only — best case, scale-to-zero serverless (RunPod Serverless, Modal): you pay only for the compute time you actually use. Always-on (24/7) — one dedicated GPU running all month (720 h × $/hr); the floor for any sustained-load deployment.

GPU                    Provider                  Quant.    tok/s  $/hr   GPU hours/mo  Inference-only  Always-on (24/7)
NVIDIA H100 80GB SXM   RunPod (community cloud)  fp8       100    $2.49  416.7         $1,037.50       $1,792.80
NVIDIA A100 80GB SXM   RunPod (community cloud)  awq-int4  45     $1.79  925.9         $1,657.41       $1,288.80
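The quantization column maps directly to whether the weights fit on one card. A rough weights-only sketch (KV cache and activations need additional headroom, so real deployments are tighter than this):

```python
def weight_gb(params_billion: float, bits_per_param: int) -> float:
    """Approximate weight memory in GB (1 GB taken as 1e9 bytes)."""
    return params_billion * bits_per_param / 8

for fmt, bits in [("bf16", 16), ("fp8", 8), ("awq-int4", 4)]:
    print(f"{fmt:9s} ~{weight_gb(72, bits):.0f} GB")
# bf16 ≈ 144 GB (needs multiple GPUs), fp8 ≈ 72 GB (barely fits one
# 80 GB H100), awq-int4 ≈ 36 GB (fits one 80 GB card with KV-cache room)
```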

What these numbers mean. Inference-only assumes you can scale GPUs to zero between requests (RunPod Serverless, Modal, etc.) and only pay for the actual compute time — this is the floor. Always-on is the cost of one dedicated GPU running 24/7 (720 h/month) and is what most teams actually pay with a single VM. Throughput numbers are single-stream from public vLLM / TGI / llama.cpp benchmarks; with batched serving, real production deployments achieve 5–10× higher aggregate throughput. Use these as a starting point, not a quote.
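The two cost columns can be reproduced with a small model. A sketch assuming a single-stream workload of ~150M output tokens per month (which matches the table's 416.7 h at 100 tok/s; hourly rates are taken from the table):

```python
HOURS_PER_MONTH = 720  # 30 days × 24 h, as used for the always-on column

def self_host_costs(output_tokens: float, tok_s: float, usd_per_hr: float):
    """Return (inference_only, always_on) monthly costs in USD."""
    gpu_hours = output_tokens / (tok_s * 3600)   # single-stream hours needed
    inference_only = gpu_hours * usd_per_hr      # scale-to-zero: pay only used hours
    always_on = HOURS_PER_MONTH * usd_per_hr     # one dedicated GPU, 24/7
    return inference_only, always_on

io, ao = self_host_costs(150e6, 100, 2.49)  # H100 fp8 row
print(round(io, 2), round(ao, 2))  # → 1037.5 1792.8
```

Note the A100 row: at 45 tok/s this workload needs ~926 single-stream GPU-hours, more than a month contains, so its inference-only figure exceeds always-on; batching, not serverless scale-to-zero, is what would make the slower card competitive.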

Keyboard shortcuts

?     Show this overlay
/     Focus the first form field
g h   Go to / (home)
g b   Go to /best-llm-for
g c   Go to /cost
g s   Go to /self-hosted
g x   Go to /compliance
Esc   Close any overlay

Inspired by Linear and GitHub conventions. Two-key sequences (g then h) must be completed within ~1 second.