Glossary
Plain-language definitions for every piece of terminology on the site. Each term has its own page — tooltips throughout the app deep-link to the detail page for that term.
Token
A "chunk" of text the model reads or writes. English averages roughly 1 token ≈ 4 characters or ¾ of a word. Pricing is almost universally expressed per million tokens.
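The 4-characters-per-token rule of thumb can be turned into a quick back-of-envelope estimator. This is a heuristic only; real tokenizers (tiktoken, SentencePiece) will disagree on any given text:

```python
def estimate_tokens(text: str) -> int:
    """Rough token count using the ~4 chars/token English heuristic."""
    return max(1, round(len(text) / 4))

def input_cost(text: str, price_per_mtok: float) -> float:
    """Estimated input cost in dollars at a given $/1M-token price."""
    return estimate_tokens(text) / 1_000_000 * price_per_mtok
```

Expect real counts to deviate by 10–30% depending on language and content (code and non-English text tokenize less efficiently).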
Context window
The maximum number of tokens (input + output) a model can process in a single request. A 128k context window can fit ~96k words — about a 350-page book. Larger windows enable RAG, long-document summarization, and full-codebase reasoning.
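The arithmetic behind that claim, with an assumed page density of ~275 words per paperback page (the density is an assumption, not a standard):

```python
TOKENS = 128_000
WORDS_PER_TOKEN = 0.75   # ~3/4 word per token (English heuristic)
WORDS_PER_PAGE = 275     # assumed paperback density

words = int(TOKENS * WORDS_PER_TOKEN)   # 96,000 words
pages = round(words / WORDS_PER_PAGE)   # ~349 pages
```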
Input vs. output tokens
Input tokens are what you send to the model (prompt + context). Output tokens are what comes back. Output tokens are usually 3–5× more expensive than input, because each one requires its own sequential forward pass (the cost of autoregressive generation).
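A minimal per-request cost function, using hypothetical prices of $3/M input and $15/M output (a 5× ratio):

```python
def request_cost(input_tok: int, output_tok: int,
                 in_price: float, out_price: float) -> float:
    """Dollar cost of one request; prices are per 1M tokens."""
    return input_tok / 1e6 * in_price + output_tok / 1e6 * out_price

# Summarization-style call: 20k tokens in, 1k out -> ~$0.075
cost = request_cost(20_000, 1_000, in_price=3.0, out_price=15.0)
```

Note how even at 5× the output price, input tokens dominate here because the workload is input-heavy, which is typical of RAG and summarization.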
Prompt caching / cache hit %
Most major providers (OpenAI, Anthropic, Google) discount, or entirely waive, the price of input tokens whose prefix matches a recently served request. RAG and chatbot workloads commonly hit 50–80% cache rates because the system prompt + context is repeated across requests. Set this to your realistic hit rate to get an honest cost estimate.
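A blended input price under caching can be sketched like this. The 10% cached-token price is an assumption (providers differ; check the actual pricing page):

```python
def effective_input_price(full_price: float, cache_hit: float,
                          cached_fraction: float = 0.10) -> float:
    """Blended $/1M input tokens under prompt caching.
    cache_hit: fraction of input tokens served from cache (0..1).
    cached_fraction: what a cached token costs relative to full
    price (the 10% default here is an assumption)."""
    return full_price * ((1 - cache_hit) + cache_hit * cached_fraction)

# $3/M list price at a 70% hit rate -> ~$1.11/M effective
blended = effective_input_price(3.0, cache_hit=0.7)
```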
TTFT (time-to-first-token)
How long after you send a request the first response token arrives. Dominated by prefill latency on long inputs. For UX-critical traffic (chat) this matters more than total throughput.
tok/s (throughput)
Tokens generated per second after the first one. Single-stream numbers (one user) differ a lot from batched numbers (many concurrent users) — modern serving stacks like vLLM achieve 5–10× higher aggregate throughput with continuous batching.
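Both metrics above (TTFT and decode tok/s) can be measured from any streaming iterator, such as the token stream an SDK yields; a sketch:

```python
import time

def stream_metrics(token_iter):
    """Measure TTFT and decode throughput from any token stream.
    Returns (ttft_seconds, tokens_per_second)."""
    start = time.perf_counter()
    first = None
    count = 0
    for _ in token_iter:
        if first is None:
            first = time.perf_counter()
        count += 1
    end = time.perf_counter()
    if first is None:
        return float("inf"), 0.0
    ttft = first - start
    decode_time = end - first
    # tok/s counts tokens after the first, per the definition above
    tps = (count - 1) / decode_time if decode_time > 0 and count > 1 else 0.0
    return ttft, tps
```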
Vision capability
The model accepts images alongside text. Quality varies hugely — text-recognition (OCR) is largely solved across major models; nuanced visual reasoning (charts, diagrams, UI screenshots) is not.
Tools / function calling
The model can structure its response as a request to call a developer-defined function with typed arguments. Essential for agents and structured workflows. Reliability differs between providers; some emit malformed JSON, especially when tool calls are chained.
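A tool definition in the JSON-schema shape used by OpenAI-style chat APIs; the get_weather function here is hypothetical, for illustration only:

```python
# Hypothetical tool definition in OpenAI-style JSON-schema shape.
get_weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
                "unit": {"type": "string",
                         "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
}
```

The model returns the function name plus arguments as JSON; your code executes the real function and feeds the result back.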
JSON mode
A constraint that forces the model's output to be valid JSON. Reduces parse errors compared to "please output JSON" in the prompt. Some providers go further and let you constrain to a specific schema.
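Even with JSON mode on, defensive parsing is cheap insurance. This sketch tolerates the markdown code fences some models still wrap around JSON:

```python
import json

def parse_model_json(raw: str, required_keys=()) -> dict:
    """Parse model output that should be JSON, stripping any
    markdown code fences and checking required keys exist."""
    text = raw.strip()
    if text.startswith("```"):
        text = "\n".join(line for line in text.splitlines()
                         if not line.startswith("```"))
    data = json.loads(text)
    missing = [k for k in required_keys if k not in data]
    if missing:
        raise ValueError(f"model output missing keys: {missing}")
    return data
```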
Mixture-of-Experts (MoE)
An architecture where each token only activates a subset of the model's parameters. Mixtral 8×7B has 47B total but only ~13B active per token — so it has the cost of a 13B but the quality of something between 13B and 47B.
Quantization (bf16, fp8, awq-int4, gguf-q4 …)
Compressing model weights to fewer bits per parameter. A 70B model in bf16 needs ~140GB VRAM; in fp8 ~70GB; in awq-int4 ~35GB. Quality degrades slightly with each step down — fp8 is nearly indistinguishable from bf16; int4 introduces measurable but usually acceptable quality loss for most use cases.
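The VRAM figures follow directly from bits per parameter; a quick estimator (weights only; real deployments also need headroom for KV cache and activations):

```python
def weight_vram_gb(params_billion: float, bits_per_param: float) -> float:
    """Approximate VRAM for model weights alone, in GB."""
    return params_billion * 1e9 * bits_per_param / 8 / 1e9

# A 70B model: bf16 -> 140 GB, fp8 -> 70 GB, int4 -> 35 GB
```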
Batched serving
Running multiple inference requests through the same GPU forward pass. Modern stacks (vLLM, TGI, SGLang) do continuous batching: requests join and leave the batch as they arrive. This is why a self-hosted GPU can serve 50–100 concurrent users for roughly the same hourly cost as one, driving the per-token cost down.
RAG (Retrieval-Augmented Generation)
A pattern where you fetch relevant documents from your own corpus (via vector search or keyword), then prepend them to the prompt. The LLM is grounded in your data without retraining. Workloads are heavy on input tokens.
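A toy end-to-end RAG sketch, using keyword overlap in place of vector search (real systems use embeddings or BM25; the corpus below is made up):

```python
def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Toy retrieval: rank documents by query-term overlap."""
    terms = set(query.lower().split())
    scored = sorted(corpus,
                    key=lambda d: len(terms & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def build_prompt(query: str, corpus: list[str]) -> str:
    """Prepend retrieved context to the user question."""
    context = "\n\n".join(retrieve(query, corpus))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```

Note the input-heavy shape: every request carries the retrieved context as input tokens, which is why cache hit rate matters so much for RAG cost.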
HIPAA BAA
A Business Associate Agreement is a contract that lets a U.S. healthcare entity send Protected Health Information (PHI) to a vendor under HIPAA. Without a signed BAA, sending PHI to an LLM API is a HIPAA violation. Most major LLM providers offer one for enterprise customers.
GDPR DPA
A Data Processing Addendum, mandated by GDPR Article 28, contractually binds the processor (the LLM provider) to the controller (you) on how personal data is handled. Required for any production traffic touching EU residents' data.
SOC 2 Type I vs Type II
Type I attests that controls were designed properly at a point in time. Type II attests that controls operated effectively over a 6–12 month observation window — much stronger. Production B2B procurement almost always wants Type II.
ISO/IEC 27001
An international standard for information security management. Globally recognized; often required by European or APAC enterprise customers in addition to or instead of SOC 2.
Data residency
Where your data physically lives at rest and in transit. "EU-hosted" typically means the inference endpoint and any logs run in EU data centers. Some providers offer per-region endpoints; others run globally and don't guarantee residency.
Zero retention
A configuration (sometimes default, sometimes opt-in for enterprise) where the provider does not store your prompts or completions beyond serving the immediate request. Practices vary: some providers retain API traffic for a short abuse-monitoring window (often ~30 days) unless you opt into zero retention, while ChatGPT-style consumer products typically retain longer.
Always-on vs. inference-only
Always-on = a dedicated GPU rented 24/7 (720 h × $/hr). Inference-only = scale-to-zero serverless billing — RunPod Serverless, Modal, etc. — where you pay only for actual compute time. Always-on is the floor at high utilization; inference-only is the floor at low utilization.
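The crossover can be checked in a few lines; both hourly rates below are illustrative assumptions, not quotes from any provider:

```python
def monthly_cost(gpu_hours_used: float,
                 dedicated_rate: float = 2.00,    # $/hr, hypothetical
                 serverless_rate: float = 4.00):  # $/hr-equivalent, hypothetical
    """Compare always-on (720 h/month) vs inference-only billing.
    Serverless is pricier per hour but bills only actual compute."""
    return {
        "always_on": 720 * dedicated_rate,
        "inference_only": gpu_hours_used * serverless_rate,
    }

# Low utilization (50 h/mo): serverless wins ($200 vs $1,440).
# High utilization (600 h/mo): dedicated wins ($1,440 vs $2,400).
```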
Self-hosting (TCO)
Running an open-weight model yourself on rented (or owned) hardware, instead of paying an inference API. Becomes cheaper than API at sustained high load, especially with batched serving — but adds operational overhead: GPU provisioning, scaling, monitoring, model upgrades.
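A rough break-even sketch: divide the monthly GPU bill by the API's per-million-token price, then check whether one GPU can even emit that many tokens. All numbers are illustrative assumptions:

```python
def breakeven_mtok_per_month(gpu_monthly_cost: float,
                             api_price_per_mtok: float,
                             gpu_tok_per_s: float):
    """Returns (break-even Mtok/month, one GPU's Mtok/month capacity).
    Ignores operational overhead, which favors the API."""
    breakeven_mtok = gpu_monthly_cost / api_price_per_mtok
    capacity_mtok = gpu_tok_per_s * 3600 * 720 / 1e6   # tokens in 720 h
    return breakeven_mtok, capacity_mtok

# e.g. a $1,440/mo GPU vs a $2/Mtok API -> break-even at 720 Mtok/mo;
# a batched stack at 1,000 tok/s aggregate can emit ~2,592 Mtok/mo.
```

If your sustained volume sits well below the break-even figure, the API wins; well above it (with batching), self-hosting wins, subject to the operational overhead noted above.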