Self-host vs API

Phi-4 14B

Cheapest API option for these weights: $0.0650 input / $0.1400 output per 1M tokens

Monthly cost at this workload: $60.00
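The monthly API figure is per-token pricing multiplied by monthly volume. The sketch below shows that arithmetic. The page does not state the token volumes behind the $60.00 figure, so the 600M input / 150M output split used here is an assumption. It was chosen because it reproduces that total and is consistent with the roughly 150M generated tokens implied by the GPU hours column further down. The function name `api_monthly_cost` is hypothetical.

```python
def api_monthly_cost(input_tokens, output_tokens, usd_per_m_in, usd_per_m_out):
    """Monthly API cost in USD given token volumes and per-1M-token prices."""
    return (input_tokens / 1e6) * usd_per_m_in + (output_tokens / 1e6) * usd_per_m_out


# ASSUMPTION: 600M input and 150M output tokens per month.
# The source page does not state the workload; this split is illustrative.
cost = api_monthly_cost(600e6, 150e6, usd_per_m_in=0.065, usd_per_m_out=0.14)
print(f"${cost:.2f}")
```

Other input/output splits can give the same total, so treat the split as a placeholder and substitute your own measured volumes.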

Self-hosted scenarios

GPU	Provider	Quant.	tok/s	$/hr	GPU hours/mo	Inference-only ($/mo)	Always-on 24/7 ($/mo)
NVIDIA GeForce RTX 4090 24GB	RunPod (community cloud)	gguf-q4	55	0.3400	757.6	$257.58	$244.80
NVIDIA RTX A6000 48GB	RunPod (community cloud)	bf16	65	0.4900	641.0	$314.10	$352.80

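What these numbers mean. Inference-only assumes you can scale GPUs to zero between requests (RunPod Serverless, Modal, etc.) and pay only for actual compute time: GPU hours × $/hr. This is the floor as long as the workload fits within one GPU's 720 hours per month. When the required GPU hours exceed 720, as in the RTX 4090 row (757.6 h), a single GPU cannot keep up at single-stream speed, and the inference-only figure exceeds the always-on one. Always-on is the cost of one dedicated GPU running 24/7 (720 h/month), which is what a team running a single VM pays regardless of utilization. Throughput numbers are single-stream figures from public vLLM / TGI / llama.cpp benchmarks; with batched serving, production deployments can reach 5–10× higher aggregate throughput, which reduces the GPU hours needed proportionally. Use these as a starting point, not a quote.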
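The table columns can be reproduced with a few lines of arithmetic. The following is a minimal sketch, not the calculator's actual code. The 150M tokens per month figure is an assumption inferred from the GPU hours column (757.6 h × 3600 s × 55 tok/s ≈ 150M), and the function names and the `batch_gain` parameter are hypothetical.

```python
import math

HOURS_PER_MONTH = 720  # 24 h x 30 days, as used for the always-on column


def gpu_hours(tokens_per_month, tok_per_s, batch_gain=1.0):
    """GPU hours needed to generate the monthly tokens.

    batch_gain is a hypothetical multiplier for batched serving
    (1.0 = single-stream, as in the table).
    """
    return tokens_per_month / (tok_per_s * batch_gain) / 3600


def inference_only_cost(tokens_per_month, tok_per_s, usd_per_hr, batch_gain=1.0):
    """Pay only for compute time (scale-to-zero)."""
    return gpu_hours(tokens_per_month, tok_per_s, batch_gain) * usd_per_hr


def always_on_cost(usd_per_hr, gpus=1):
    """Dedicated GPU(s) running 24/7."""
    return HOURS_PER_MONTH * usd_per_hr * gpus


def gpus_needed(tokens_per_month, tok_per_s, batch_gain=1.0):
    """Minimum number of always-on GPUs that can cover the workload."""
    return math.ceil(gpu_hours(tokens_per_month, tok_per_s, batch_gain) / HOURS_PER_MONTH)


TOKENS = 150e6  # ASSUMPTION: inferred from the table, not stated on the page

for name, tps, rate in [("RTX 4090", 55, 0.34), ("RTX A6000", 65, 0.49)]:
    print(
        name,
        f"hours={gpu_hours(TOKENS, tps):.1f}",
        f"inference-only=${inference_only_cost(TOKENS, tps, rate):.2f}",
        f"always-on=${always_on_cost(rate):.2f}",
        f"gpus-needed={gpus_needed(TOKENS, tps)}",
    )
```

Under these assumptions the RTX 4090 row needs more than 720 GPU hours, so one always-on card would not cover the workload at single-stream speed. Setting `batch_gain` to a value in the 5–10 range shows how batched serving shrinks the required hours.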
