📦 Context Window Explained
Simple Explanation
A context window is the total amount of text an LLM can "see" at one time, including both your input and its output. Think of it as the model's short-term memory.
Imagine you're having a conversation, but you can only remember the last 10 minutes. Everything before that? Gone. That's essentially how a context window works: the model can only work with text that fits inside its window.
Why This Matters
The context window is one of the most important practical constraints when working with AI:
- It determines how long a conversation you can have before the AI "forgets"
- It limits how much reference material you can include in a prompt
- It affects whether the AI can process entire documents or only excerpts
- It's a major factor in choosing which model to use for a task
- Running out of context window is one of the most common sources of frustrating AI behavior
Understanding Context Windows in Detail
How Context Windows Are Measured
Context windows are measured in tokens (the units we covered in the previous lesson):
| Model | Context Window | Approximate Words |
|---|---|---|
| GPT-3.5 Turbo | 16K tokens | ~12,000 words |
| GPT-4o | 128K tokens | ~96,000 words |
| Claude 3.5 Sonnet | 200K tokens | ~150,000 words |
| Gemini 1.5 Pro | 1M tokens | ~750,000 words |
| Llama 3 (8B) | 8K tokens | ~6,000 words |
To put this in perspective:
8K tokens ≈ a long blog post
16K tokens ≈ a short story
32K tokens ≈ a research paper
128K tokens ≈ a full novel (like Harry Potter and the Sorcerer's Stone)
200K tokens ≈ 2-3 novels
1M tokens ≈ an entire textbook series
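You can check token counts yourself. Here is a minimal sketch using OpenAI's tiktoken library (one assumption: cl100k_base, the tokenizer used by GPT-4-era OpenAI models; other model families use different tokenizers, so their counts will differ):

```python
import tiktoken

# cl100k_base is the tokenizer used by GPT-4-era OpenAI models.
# Other model families have their own vocabularies, so the same
# text can yield different token counts elsewhere.
enc = tiktoken.get_encoding("cl100k_base")

text = "The context window includes everything: input and output."
tokens = enc.encode(text)

print(f"{len(tokens)} tokens for {len(text.split())} words")
```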
The Context Window Includes EVERYTHING
This is a critical point many people miss. The context window includes:
┌────────────────────────────────────────────┐
│               CONTEXT WINDOW               │
│                                            │
│   ┌────────────────────────────────────┐   │
│   │ System prompt / instructions       │   │
│   ├────────────────────────────────────┤   │
│   │ Conversation history               │   │
│   │ (all previous messages)            │   │
│   ├────────────────────────────────────┤   │
│   │ Your current prompt                │   │
│   ├────────────────────────────────────┤   │
│   │ Any attached documents/context     │   │
│   ├────────────────────────────────────┤   │
│   │ The AI's response                  │   │
│   └────────────────────────────────────┘   │
│                                            │
└────────────────────────────────────────────┘
So if you have a 16K token window:
- System prompt uses 500 tokens
- Conversation history uses 8,000 tokens
- Your new prompt uses 200 tokens
- That leaves only 7,300 tokens for the AI's response
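A quick sanity check of that arithmetic in Python (treating "16K" as a round 16,000; real limits, such as 16,385 for GPT-3.5 Turbo, vary slightly by model):

```python
# Token budget for a single request, using the figures above.
CONTEXT_WINDOW = 16_000  # treating "16K" as a round number; real limits vary

system_prompt = 500
history = 8_000
new_prompt = 200

used = system_prompt + history + new_prompt
response_budget = CONTEXT_WINDOW - used

print(f"Used: {used} tokens")                 # Used: 8700 tokens
print(f"Response budget: {response_budget}")  # Response budget: 7300
```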
What Happens When You Hit the Limit?
When a conversation exceeds the context window, different systems handle it differently:
- Truncation: the oldest messages are dropped (most common)
- Summarization: the system summarizes older messages to save space
- Error: the API returns an error saying the input is too long
- Sliding window: only the most recent N tokens are kept
The dangerous part: The AI won't tell you it's forgotten something. It will just proceed without that information, potentially giving incorrect or inconsistent answers.
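To make the truncation and sliding-window behaviors concrete, here is a minimal sketch of a sliding-window trimmer. The count_tokens helper is a rough stand-in; a real implementation would use a tokenizer like tiktoken, as shown earlier:

```python
def count_tokens(text: str) -> int:
    # Rough stand-in: a real implementation would use a tokenizer.
    return max(1, len(text) // 4)  # heuristic: ~4 characters per token

def trim_to_window(messages: list[dict], max_tokens: int) -> list[dict]:
    """Keep only the most recent messages that fit within max_tokens.

    Note the failure mode described above: older messages are dropped
    silently, and the model never knows they existed."""
    kept, total = [], 0
    for message in reversed(messages):       # walk newest -> oldest
        cost = count_tokens(message["content"])
        if total + cost > max_tokens:
            break                            # everything older is dropped
        kept.append(message)
        total += cost
    return list(reversed(kept))              # restore chronological order
```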
Managing Long Conversations
Here are strategies to work effectively within context window limits:
Strategy 1: Start Fresh When Needed
If a conversation is getting long and the AI starts giving inconsistent answers, start a new conversation with a clear summary of what was decided so far.
Strategy 2: Front-Load Important Information
Put the most critical information at the beginning of your prompt. Research shows LLMs pay more attention to the start and end of the context, sometimes losing focus in the middle (the "lost in the middle" problem).
Strategy 3: Use Summaries Instead of Full Text
Instead of pasting an entire 50-page document, paste a summary or the relevant sections.
Strategy 4: Be Explicit About What to Remember
Tell the AI what's important:
Key context to remember throughout this conversation:
- We're building a React app for healthcare
- The budget is $50,000
- Deadline is March 2026
- Must be HIPAA compliant
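One way to make this robust is to pin those facts into the system message on every request, so they survive even if older turns get trimmed. A sketch using the OpenAI-style chat message format (the role/content message shape is the only assumption here):

```python
PINNED_FACTS = """Key context to remember throughout this conversation:
- We're building a React app for healthcare
- The budget is $50,000
- Deadline is March 2026
- Must be HIPAA compliant"""

def build_messages(history: list[dict], user_prompt: str) -> list[dict]:
    """Prepend the pinned facts on every request so they sit at the
    start of the context, outside any history that may get trimmed."""
    return (
        [{"role": "system", "content": PINNED_FACTS}]
        + history
        + [{"role": "user", "content": user_prompt}]
    )
```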
Prompt Example
❌ Bad Example
[Pastes entire 200-page document]
What does this document say about pricing?
Even if the model's context window is large enough, pasting hundreds of pages and asking a vague question leads to poor results. The model may miss the relevant section or give a superficial overview. You're also wasting tokens and money.
✅ Improved Example
I have a 200-page vendor contract. Here is the section about pricing
(pages 45-52):
[Paste only the relevant 8 pages]
Based on this pricing section, answer these specific questions:
1. What is the base monthly fee?
2. Are there any volume discounts? If so, at what thresholds?
3. What are the overage charges per unit?
4. When is the first price review date?
Quote the exact contract language for each answer.
By pasting only the relevant section and asking specific questions, you use the context window efficiently and get precise, useful answers.
Try It Yourself
Context Window Management Exercise:
- Start a long conversation with an AI (at least 20 back-and-forth messages) on a complex topic
- At message 5, establish a specific fact (e.g., "Remember: the budget is exactly $47,500")
- At message 15, reference that detail casually
- At message 20+, ask the AI to recall the exact budget number
Did it remember? If not, you've just experienced context window limitations firsthand.
Now try the fix: Start a new conversation, include the key facts at the top of your message, and see the difference.
Real-World Scenario
Scenario: You're a developer tasked with building an AI-powered document analysis tool. Understanding context windows is essential for your architecture decisions.
I'm building a document analysis tool that needs to process legal
contracts ranging from 10 to 500 pages. Help me design an approach
that works within LLM context window limitations.
Consider:
1. Most contracts are too long for a single LLM call
2. Important clauses can reference other sections of the document
3. Users want to ask questions about the entire document
4. We need to keep costs reasonable
Propose an architecture that includes:
- How to chunk the documents
- How to maintain cross-reference context
- A retrieval strategy (RAG or similar)
- Which model(s) to use and why
- Estimated token usage per query
Present this as a technical design document with diagrams described
in text format.
"How would you handle a task that requires processing a document larger than the model's context window?"
Strong Answer: There are several established approaches for handling documents that exceed the context window. The most common is Retrieval-Augmented Generation (RAG), where you split the document into chunks, create embeddings for each chunk, store them in a vector database, and at query time retrieve only the most relevant chunks to include in the prompt. Another approach is map-reduce summarization: you process each chunk independently (map), then combine the results (reduce). For tasks requiring holistic understanding, you can create a hierarchical summary โ summarize sections, then summarize the summaries. You can also use models with larger context windows like Claude (200K tokens) or Gemini (1M tokens) for moderately large documents. The right approach depends on the task: RAG works best for question-answering, map-reduce for summarization, and long-context models for tasks requiring full document understanding. In practice, I often combine approaches โ using a long-context model with the most relevant sections pre-selected via embedding similarity.
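To make the RAG retrieval step concrete, here is a minimal, self-contained sketch. The embed function is a toy stand-in (a hashed bag-of-words); a real system would call an embedding model and would typically store vectors in a vector database rather than recomputing them per query:

```python
import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    """Toy stand-in for an embedding model (hashed bag-of-words).
    A real system would call an embedding model or API here."""
    vec = np.zeros(dim)
    for word in text.lower().split():
        vec[hash(word) % dim] += 1.0
    return vec

def chunk(document: str, size: int = 1_000, overlap: int = 200) -> list[str]:
    """Split a document into overlapping chunks; the overlap helps
    preserve context that straddles chunk boundaries."""
    step = size - overlap
    return [document[i:i + size] for i in range(0, len(document), step)]

def top_k_chunks(question: str, chunks: list[str], k: int = 3) -> list[str]:
    """Return the k chunks most similar to the question (cosine similarity)."""
    chunk_vecs = np.stack([embed(c) for c in chunks])
    q = embed(question)
    sims = chunk_vecs @ q / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q) + 1e-9
    )
    best = np.argsort(sims)[-k:][::-1]
    return [chunks[int(i)] for i in best]

# Only the retrieved chunks -- not the whole contract -- go into the prompt,
# which keeps each query inside the context window and controls cost.
```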
Key Takeaways
- The context window is the total tokens an LLM can process at once (input + output)
- Different models have different window sizes, from 8K to 1M+ tokens
- The window includes system prompts, conversation history, your input, AND the response
- When the window fills up, the model loses older information (usually silently)
- Strategies: start fresh, front-load key info, use summaries, be explicit about priorities
- Paste only relevant sections of documents, not the whole thing
- For large documents, use techniques like RAG or chunking
- Context window management is a critical skill for production AI applications