# Llama 3.1 8B Instruct

8B params · Llama 3.1 Community License · meta-llama/Llama-3.1-8B-Instruct

## Self-hosted scenarios
| GPU | Quantization | tok/s | $/hr | GPU hours / mo | Inference-only ($/mo) | Always-on 24/7 ($/mo) |
|---|---|---|---|---|---|---|
| NVIDIA GeForce RTX 4090 24GB (RunPod community cloud) | bf16 | 85 | $0.34 | 490.2 | $166.67 | $244.80 |
| NVIDIA H100 80GB SXM (RunPod community cloud) | bf16 | 250 | $2.49 | 166.7 | $415.00 | $1,792.80 |
| NVIDIA A100 80GB SXM (RunPod community cloud) | bf16 | 160 | $1.79 | 260.4 | $466.15 | $1,288.80 |

Quantization key: bf16 is full precision; fp8, AWQ int4, and GGUF Q4 progressively shrink memory at a slight quality cost.
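To sanity-check which quantization fits which card, weight-only memory is roughly parameter count × bytes per parameter (KV cache and activations add overhead on top). A minimal sketch, using common approximate bytes-per-weight figures (the exact GGUF Q4 density varies by variant):

```python
def weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    """Approximate weight-only footprint in GB; ignores KV cache and activations."""
    # 1B params at 1 byte/param is ~1 GB (using GB = 1e9 bytes)
    return params_billion * bytes_per_param

# Llama 3.1 8B at common precisions (approximate bytes/param):
for name, bpp in [("bf16", 2.0), ("fp8", 1.0), ("awq-int4", 0.5), ("gguf-q4", 0.56)]:
    print(f"{name}: ~{weight_memory_gb(8, bpp):.1f} GB weights")
```

At bf16 the 8B weights alone are ~16 GB, which is why the 24 GB RTX 4090 in the table fits the model but leaves limited headroom for KV cache at long contexts.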
What these numbers mean. Inference-only assumes you can scale GPUs to zero between requests (RunPod Serverless, Modal, etc.) and pay only for actual compute time — this is the floor. Always-on is the cost of one dedicated GPU running 24/7 (720 h/month), and is what most teams actually pay with a single VM. GPU hours / mo is the compute time needed to serve a fixed monthly token volume (monthly tokens ÷ tok/s ÷ 3600); the figures in the table are consistent with roughly 150M output tokens per month. Throughput numbers are single-stream figures from public vLLM / TGI / llama.cpp benchmarks; with batched serving, real production deployments achieve 5–10× higher aggregate throughput. Use these as a starting point, not a quote.
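The two cost columns follow directly from the cost model above. A minimal sketch (the 150M tokens/month workload is inferred from the table's GPU-hours figures, not stated by the source):

```python
def monthly_costs(tok_per_s: float, usd_per_hr: float, monthly_tokens: float):
    """Return (inference_only, always_on) monthly USD cost for one GPU.

    inference_only: pay only for GPU-hours actually spent generating
    (scale-to-zero serverless). always_on: one dedicated GPU for 720 h.
    """
    gpu_hours = monthly_tokens / (tok_per_s * 3600)  # hours of compute needed
    return gpu_hours * usd_per_hr, 720 * usd_per_hr

# RTX 4090 row: 85 tok/s at $0.34/hr, assumed ~150M output tokens/month
inference_only, always_on = monthly_costs(85, 0.34, 150e6)
print(f"${inference_only:.2f} inference-only vs ${always_on:.2f} always-on")
# → $166.67 inference-only vs $244.80 always-on
```

Note the crossover this model implies: at low utilization serverless wins, but once your monthly GPU-hours approach 720 the always-on price becomes the cheaper (and simpler) option.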