Llama 3.3 70B Instruct
70B params · Llama 3.3 Community License · meta-llama/Llama-3.3-70B-Instruct
Self-hosted scenarios
| GPU | Quantization | tok/s | $/hr | GPU hours / mo | Inference-only | Always-on (24/7) |
|---|---|---|---|---|---|---|
| NVIDIA H100 80GB SXM, RunPod (community cloud) | fp8 | 110 | $2.49 | 378.8 | $943.18 | $1792.80 |
| NVIDIA A100 80GB SXM, RunPod (community cloud) | awq-int4 | 50 | $1.79 | 833.3 | $1491.67 | $1288.80 |
| 8× NVIDIA H100 80GB (single node, NVLink), RunPod (secure cloud) | bf16 | 80 | $19.92 | 520.8 | $10375.00 | $14342.40 |

Quantization: bf16 is full precision; fp8, awq-int4, and gguf-q4 progressively shrink the memory footprint at a slight quality cost.
What these numbers mean. Inference-only assumes you can scale GPUs to zero between requests (RunPod Serverless, Modal, etc.) and only pay for the actual compute time — this is the floor. Always-on is the cost of one dedicated GPU running 24/7 (720 h/month) and is what most teams actually pay with a single VM. Throughput numbers are single-stream from public vLLM / TGI / llama.cpp benchmarks; with batched serving, real production deployments achieve 5–10× higher aggregate throughput. Use these as a starting point, not a quote.
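
The table does not state the workload it assumes, but the GPU-hours column is consistent with a demand of roughly 150M output tokens per month at each row's single-stream throughput. Below is a minimal sketch of that arithmetic; the 150M-token figure and the `monthly_costs` helper are assumptions for illustration, not the exact tool used to produce the table.

```python
# Sketch of the cost arithmetic implied by the table.
# Assumption: ~150M output tokens generated per month (derived from the
# GPU-hours column divided by the listed single-stream throughput).

HOURS_PER_MONTH = 720            # 30 days x 24 h, as used for the always-on column
MONTHLY_OUTPUT_TOKENS = 150e6    # assumed workload; adjust to your own traffic

def monthly_costs(tok_per_s: float, usd_per_hr: float) -> dict:
    """Estimate the GPU-hours and the two monthly cost columns for one GPU config."""
    # Hours of compute actually needed to generate the month's tokens, single-stream.
    gpu_hours = MONTHLY_OUTPUT_TOKENS / (tok_per_s * 3600)
    return {
        "gpu_hours_per_month": round(gpu_hours, 1),
        "inference_only_usd": round(gpu_hours * usd_per_hr, 2),   # scale-to-zero floor
        "always_on_usd": round(HOURS_PER_MONTH * usd_per_hr, 2),  # one dedicated GPU, 24/7
    }

# Example: the H100 80GB SXM row (fp8, 110 tok/s, $2.49/hr)
print(monthly_costs(tok_per_s=110, usd_per_hr=2.49))
# -> roughly {'gpu_hours_per_month': 378.8, 'inference_only_usd': 943.18, 'always_on_usd': 1792.8}
```

Batched serving changes the picture: if aggregate throughput is 5–10× the single-stream number, the GPU-hours (and therefore the inference-only cost) shrink by the same factor, while the always-on cost stays fixed at 720 h × $/hr.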