RAG (Retrieval-Augmented Generation)
Fetching relevant documents and prepending them to the prompt for grounded answers.
A pattern where you fetch relevant documents from your own corpus (via vector or keyword search), then prepend them to the prompt. The LLM is grounded in your data without retraining. RAG workloads are heavy on input tokens, since the retrieved documents dominate the prompt.
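The pattern can be sketched in a few lines. This toy version uses word overlap as a stand-in for vector search; the corpus, scoring, and prompt template are illustrative assumptions, not any particular library's API.

```python
# Minimal RAG sketch: retrieve relevant documents, then prepend them to the prompt.
# CORPUS and the overlap scorer are made-up stand-ins for a real vector store.

CORPUS = [
    "Refunds are processed within 5 business days.",
    "Our support line is open 9am-5pm on weekdays.",
    "Premium plans include priority email support.",
]

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Rank documents by word overlap with the query (stand-in for vector search)."""
    q_words = set(query.lower().split())
    scored = sorted(corpus,
                    key=lambda d: len(q_words & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def build_prompt(query: str, docs: list[str]) -> str:
    """Prepend the retrieved documents so the model answers grounded in them."""
    context = "\n".join(f"- {d}" for d in docs)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

query = "How fast are refunds processed?"
prompt = build_prompt(query, retrieve(query, CORPUS))
print(prompt)
```

In production the overlap scorer would be replaced by embedding similarity over a vector index, but the shape of the request (retrieve, assemble context, ask) stays the same.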
- Context window: The maximum number of tokens an LLM can process in a single request.
- Prompt caching / cache hit %: A discount providers apply to input tokens that repeat a previously seen prompt prefix (e.g., a shared system prompt or document set). Common in RAG and chatbots, where many requests share a long prefix.
- Input vs. output tokens: Input tokens are what you send to the model; output tokens are what it generates back. Providers typically price output tokens higher than input tokens.
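The terms above combine into a simple cost estimate for an input-heavy RAG request. All prices and the cache discount below are made-up assumptions for illustration; check your provider's actual rates.

```python
# Illustrative cost of one RAG-style request: lots of input tokens, few output
# tokens, with a prompt-cache discount on the repeated prefix.
# PRICE_IN, PRICE_OUT, and CACHE_DISCOUNT are assumed values, not real rates.

PRICE_IN = 3.00 / 1_000_000    # assumed $ per input token
PRICE_OUT = 15.00 / 1_000_000  # assumed $ per output token
CACHE_DISCOUNT = 0.9           # assumed: cached prefix tokens cost 90% less

def request_cost(input_tokens: int, output_tokens: int,
                 cached_tokens: int = 0) -> float:
    """Dollar cost of one request, splitting input into cached and uncached parts."""
    uncached = input_tokens - cached_tokens
    input_cost = (uncached * PRICE_IN
                  + cached_tokens * PRICE_IN * (1 - CACHE_DISCOUNT))
    return input_cost + output_tokens * PRICE_OUT

# 10k input tokens (8k of them a cached system prompt + documents), 500 output
print(round(request_cost(10_000, 500, cached_tokens=8_000), 4))
```

Even with output tokens priced 5x higher here, the input side dominates once the retrieved documents run to thousands of tokens, which is why cache hit rate matters so much for RAG workloads.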