Glossary
Plain-language definitions for every piece of terminology on the site. Each term has its own page — tooltips throughout the app deep-link to the detail page for that term.
Token
A "chunk" of text the model reads or writes. English averages roughly 1 token ≈ 4 characters or ¾ of a word. Pricing is almost universally expressed per million tokens.
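The 4-characters-per-token rule of thumb can be turned into a quick back-of-envelope estimator. This is a heuristic only; real tokenizers (tiktoken, SentencePiece) will disagree on any given text:

```python
def estimate_tokens(text: str) -> int:
    """Rough token count using the ~4 chars/token English heuristic."""
    return max(1, round(len(text) / 4))

def input_cost(text: str, price_per_mtok: float) -> float:
    """Estimated input cost in dollars at a given $/1M-token price."""
    return estimate_tokens(text) / 1_000_000 * price_per_mtok
```

Expect real counts to deviate by 10–30% depending on language and content (code and non-English text tokenize less efficiently).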
Context window
The maximum number of tokens (input + output) a model can process in a single request. A 128k context window can fit ~96k words — about a 350-page book. Larger windows enable RAG, long-document summarization, and full-codebase reasoning.
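The arithmetic behind that claim, with an assumed page density of ~275 words per paperback page (the density is an assumption, not a standard):

```python
TOKENS = 128_000
WORDS_PER_TOKEN = 0.75   # ~3/4 word per token (English heuristic)
WORDS_PER_PAGE = 275     # assumed paperback density

words = int(TOKENS * WORDS_PER_TOKEN)   # 96,000 words
pages = round(words / WORDS_PER_PAGE)   # ~349 pages
```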
Input vs. output tokens
Input tokens are what you send to the model (prompt + context). Output tokens are what comes back. Output tokens are usually 3–5× more expensive than input, because each one requires its own sequential forward pass (the cost of autoregressive generation).
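A minimal per-request cost function, using hypothetical prices of $3/M input and $15/M output (a 5× ratio):

```python
def request_cost(input_tok: int, output_tok: int,
                 in_price: float, out_price: float) -> float:
    """Dollar cost of one request; prices are per 1M tokens."""
    return input_tok / 1e6 * in_price + output_tok / 1e6 * out_price

# Summarization-style call: 20k tokens in, 1k out -> ~$0.075
cost = request_cost(20_000, 1_000, in_price=3.0, out_price=15.0)
```

Note how even at 5× the output price, input tokens dominate here because the workload is input-heavy, which is typical of RAG and summarization.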
Prompt caching / cache hit %
Most major providers (OpenAI, Anthropic, Google) discount, or entirely waive, the price of input tokens whose prefix matches a recently served request. RAG and chatbot workloads commonly hit 50–80% cache rates because the system prompt + context is repeated across requests. Set this to your realistic hit rate to get an honest cost estimate.
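A blended input price under caching can be sketched like this. The 10% cached-token price is an assumption (providers differ; check the actual pricing page):

```python
def effective_input_price(full_price: float, cache_hit: float,
                          cached_fraction: float = 0.10) -> float:
    """Blended $/1M input tokens under prompt caching.
    cache_hit: fraction of input tokens served from cache (0..1).
    cached_fraction: what a cached token costs relative to full
    price (the 10% default here is an assumption)."""
    return full_price * ((1 - cache_hit) + cache_hit * cached_fraction)

# $3/M list price at a 70% hit rate -> ~$1.11/M effective
blended = effective_input_price(3.0, cache_hit=0.7)
```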
TTFT (time-to-first-token)
How long after you send a request the first response token arrives. Dominated by prefill latency on long inputs. For UX-critical traffic (chat) this matters more than total throughput.
tok/s (throughput)
Tokens generated per second after the first one. Single-stream numbers (one user) differ a lot from batched numbers (many concurrent users) — modern serving stacks like vLLM achieve 5–10× higher aggregate throughput with continuous batching.
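Both metrics above (TTFT and decode tok/s) can be measured from any streaming iterator, such as the token stream an SDK yields; a sketch:

```python
import time

def stream_metrics(token_iter):
    """Measure TTFT and decode throughput from any token stream.
    Returns (ttft_seconds, tokens_per_second)."""
    start = time.perf_counter()
    first = None
    count = 0
    for _ in token_iter:
        if first is None:
            first = time.perf_counter()
        count += 1
    end = time.perf_counter()
    if first is None:
        return float("inf"), 0.0
    ttft = first - start
    decode_time = end - first
    # tok/s counts tokens after the first, per the definition above
    tps = (count - 1) / decode_time if decode_time > 0 and count > 1 else 0.0
    return ttft, tps
```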
Vision capability
The model accepts images alongside text. Quality varies hugely — text-recognition (OCR) is largely solved across major models; nuanced visual reasoning (charts, diagrams, UI screenshots) is not.
Tools / function calling
The model can structure its response as a request to call a developer-defined function with typed arguments. Essential for agents and structured workflows. Reliability differs between providers; some emit malformed JSON, especially when tool calls are chained.
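A tool definition in the JSON-schema shape used by OpenAI-style chat APIs; the get_weather function here is hypothetical, for illustration only:

```python
# Hypothetical tool definition in OpenAI-style JSON-schema shape.
get_weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
                "unit": {"type": "string",
                         "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
}
```

The model returns the function name plus arguments as JSON; your code executes the real function and feeds the result back.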
JSON mode
A constraint that forces the model's output to be valid JSON. Reduces parse errors compared to "please output JSON" in the prompt. Some providers go further and let you constrain to a specific schema.
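Even with JSON mode on, defensive parsing is cheap insurance. This sketch tolerates the markdown code fences some models still wrap around JSON:

```python
import json

def parse_model_json(raw: str, required_keys=()) -> dict:
    """Parse model output that should be JSON, stripping any
    markdown code fences and checking required keys exist."""
    text = raw.strip()
    if text.startswith("```"):
        text = "\n".join(line for line in text.splitlines()
                         if not line.startswith("```"))
    data = json.loads(text)
    missing = [k for k in required_keys if k not in data]
    if missing:
        raise ValueError(f"model output missing keys: {missing}")
    return data
```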
Mixture-of-Experts (MoE)
An architecture where each token only activates a subset of the model's parameters. Mixtral 8×7B has 47B total but only ~13B active per token — so it has the cost of a 13B but the quality of something between 13B and 47B.
Quantization (bf16, fp8, awq-int4, gguf-q4 …)
Compressing model weights to fewer bits per parameter. A 70B model in bf16 needs ~140GB VRAM; in fp8 ~70GB; in awq-int4 ~35GB. Quality degrades slightly with each step down — fp8 is nearly indistinguishable from bf16; int4 introduces measurable but usually acceptable quality loss for most use cases.
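The VRAM figures follow directly from bits per parameter; a quick estimator (weights only; real deployments also need headroom for KV cache and activations):

```python
def weight_vram_gb(params_billion: float, bits_per_param: float) -> float:
    """Approximate VRAM for model weights alone, in GB."""
    return params_billion * 1e9 * bits_per_param / 8 / 1e9

# A 70B model: bf16 -> 140 GB, fp8 -> 70 GB, int4 -> 35 GB
```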
Batched serving
Running multiple inference requests through the same GPU forward pass. Modern stacks (vLLM, TGI, SGLang) do continuous batching: requests join and leave the batch as they arrive. This is why a self-hosted GPU can serve 50–100 concurrent users for roughly the same hourly cost as one, driving the per-token cost down.
RAG (Retrieval-Augmented Generation)
A pattern where you fetch relevant documents from your own corpus (via vector search or keyword), then prepend them to the prompt. The LLM is grounded in your data without retraining. Workloads are heavy on input tokens.
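A toy end-to-end RAG sketch, using keyword overlap in place of vector search (real systems use embeddings or BM25; the corpus below is made up):

```python
def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Toy retrieval: rank documents by query-term overlap."""
    terms = set(query.lower().split())
    scored = sorted(corpus,
                    key=lambda d: len(terms & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def build_prompt(query: str, corpus: list[str]) -> str:
    """Prepend retrieved context to the user question."""
    context = "\n\n".join(retrieve(query, corpus))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```

Note the input-heavy shape: every request carries the retrieved context as input tokens, which is why cache hit rate matters so much for RAG cost.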
HIPAA BAA
A Business Associate Agreement is a contract that lets a U.S. healthcare entity send Protected Health Information (PHI) to a vendor under HIPAA. Without a signed BAA, sending PHI to an LLM API is a HIPAA violation. Most major LLM providers offer one for enterprise customers.
GDPR DPA
A Data Processing Addendum, mandated by GDPR Article 28, contractually binds the processor (the LLM provider) to the controller (you) on how personal data is handled. Required for any production traffic touching EU residents' data.
SOC 2 Type I vs Type II
Type I attests that controls were designed properly at a point in time. Type II attests that controls operated effectively over a 6–12 month observation window — much stronger. Production B2B procurement almost always wants Type II.
ISO/IEC 27001
An international standard for information security management. Globally recognized; often required by European or APAC enterprise customers in addition to or instead of SOC 2.
Data residency
Where your data physically lives at rest and in transit. "EU-hosted" typically means the inference endpoint and any logs run in EU data centers. Some providers offer per-region endpoints; others run globally and don't guarantee residency.
Zero retention
A configuration (sometimes default, sometimes opt-in for enterprise) where the provider does not store your prompts or completions beyond serving the immediate request. Practices vary: some providers retain API traffic for a short abuse-monitoring window (often ~30 days) unless you opt into zero retention, while ChatGPT-style consumer products typically retain longer.
Always-on vs. inference-only
Always-on = a dedicated GPU rented 24/7 (720 h × $/hr). Inference-only = scale-to-zero serverless billing — RunPod Serverless, Modal, etc. — where you pay only for actual compute time. Always-on is the floor at high utilization; inference-only is the floor at low utilization.
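The crossover can be checked in a few lines; both hourly rates below are illustrative assumptions, not quotes from any provider:

```python
def monthly_cost(gpu_hours_used: float,
                 dedicated_rate: float = 2.00,    # $/hr, hypothetical
                 serverless_rate: float = 4.00):  # $/hr-equivalent, hypothetical
    """Compare always-on (720 h/month) vs inference-only billing.
    Serverless is pricier per hour but bills only actual compute."""
    return {
        "always_on": 720 * dedicated_rate,
        "inference_only": gpu_hours_used * serverless_rate,
    }

# Low utilization (50 h/mo): serverless wins ($200 vs $1,440).
# High utilization (600 h/mo): dedicated wins ($1,440 vs $2,400).
```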
Self-hosting (TCO)
Running an open-weight model yourself on rented (or owned) hardware, instead of paying an inference API. Becomes cheaper than API at sustained high load, especially with batched serving — but adds operational overhead: GPU provisioning, scaling, monitoring, model upgrades.
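A rough break-even sketch: divide the monthly GPU bill by the API's per-million-token price, then check whether one GPU can even emit that many tokens. All numbers are illustrative assumptions:

```python
def breakeven_mtok_per_month(gpu_monthly_cost: float,
                             api_price_per_mtok: float,
                             gpu_tok_per_s: float):
    """Returns (break-even Mtok/month, one GPU's Mtok/month capacity).
    Ignores operational overhead, which favors the API."""
    breakeven_mtok = gpu_monthly_cost / api_price_per_mtok
    capacity_mtok = gpu_tok_per_s * 3600 * 720 / 1e6   # tokens in 720 h
    return breakeven_mtok, capacity_mtok

# e.g. a $1,440/mo GPU vs a $2/Mtok API -> break-even at 720 Mtok/mo;
# a batched stack at 1,000 tok/s aggregate can emit ~2,592 Mtok/mo.
```

If your sustained volume sits well below the break-even figure, the API wins; well above it (with batching), self-hosting wins, subject to the operational overhead noted above.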