📦 Context Window Explained
Simple Explanation
A context window is the total amount of text an LLM can "see" at one time, including both your input and its output. Think of it as the model's short-term memory.
Imagine you're having a conversation, but you can only remember the last 10 minutes. Everything before that? Gone. That's essentially how a context window works: the model can only work with text that fits inside its window.
Why This Matters
The context window is one of the most important practical constraints when working with AI:
- It determines how long a conversation you can have before the AI "forgets"
- It limits how much reference material you can include in a prompt
- It affects whether the AI can process entire documents or only excerpts
- It's a major factor in choosing which model to use for a task
- Running out of context window is one of the most common sources of frustrating AI behavior
Understanding Context Windows in Detail
How Context Windows Are Measured
Context windows are measured in tokens (the units we covered in the previous lesson):
| Model | Context Window | Approximate Words |
|---|---|---|
| GPT-3.5 Turbo | 16K tokens | ~12,000 words |
| GPT-4o | 128K tokens | ~96,000 words |
| Claude 3.5 Sonnet | 200K tokens | ~150,000 words |
| Gemini 1.5 Pro | 1M tokens | ~750,000 words |
| Llama 3 (8B) | 8K tokens | ~6,000 words |
To put this in perspective:
8K tokens ≈ a long blog post
16K tokens ≈ a short story
32K tokens ≈ a research paper
128K tokens ≈ a full novel (like Harry Potter and the Sorcerer's Stone)
200K tokens ≈ 2-3 novels
1M tokens ≈ an entire textbook series
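You can check token counts yourself. Here is a minimal sketch using OpenAI's tiktoken library (one assumption: cl100k_base, the tokenizer used by GPT-4-era OpenAI models; other model families use different tokenizers, so their counts will differ):

```python
import tiktoken

# cl100k_base is the tokenizer used by GPT-4-era OpenAI models.
# Other model families have their own vocabularies, so the same
# text can yield different token counts elsewhere.
enc = tiktoken.get_encoding("cl100k_base")

text = "The context window includes everything: input and output."
tokens = enc.encode(text)

print(f"{len(tokens)} tokens for {len(text.split())} words")
```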
The Context Window Includes EVERYTHING
This is a critical point many people miss. The context window includes:
┌────────────────────────────────────────────┐
│               CONTEXT WINDOW               │
│                                            │
│   ┌────────────────────────────────────┐   │
│   │ System prompt / instructions       │   │
│   ├────────────────────────────────────┤   │
│   │ Conversation history               │   │
│   │ (all previous messages)            │   │
│   ├────────────────────────────────────┤   │
│   │ Your current prompt                │   │
│   ├────────────────────────────────────┤   │
│   │ Any attached documents/context     │   │
│   ├────────────────────────────────────┤   │
│   │ The AI's response                  │   │
│   └────────────────────────────────────┘   │
│                                            │
└────────────────────────────────────────────┘
So if you have a 16K token window:
- System prompt uses 500 tokens
- Conversation history uses 8,000 tokens
- Your new prompt uses 200 tokens
- That leaves only 7,300 tokens for the AI's response
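A quick sanity check of that arithmetic in Python (treating "16K" as a round 16,000; real limits, such as 16,385 for GPT-3.5 Turbo, vary slightly by model):

```python
# Token budget for a single request, using the figures above.
CONTEXT_WINDOW = 16_000  # treating "16K" as a round number; real limits vary

system_prompt = 500
history = 8_000
new_prompt = 200

used = system_prompt + history + new_prompt
response_budget = CONTEXT_WINDOW - used

print(f"Used: {used} tokens")                 # Used: 8700 tokens
print(f"Response budget: {response_budget}")  # Response budget: 7300
```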
What Happens When You Hit the Limit?
When a conversation exceeds the context window, different systems handle it differently:
- Truncation: the oldest messages are dropped (most common)
- Summarization: the system summarizes older messages to save space
- Error: the API returns an error saying the input is too long
- Sliding window: only the most recent N tokens are kept
The dangerous part: The AI won't tell you it's forgotten something. It will just proceed without that information, potentially giving incorrect or inconsistent answers.
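To make the truncation and sliding-window behaviors concrete, here is a minimal sketch of a sliding-window trimmer. The count_tokens helper is a rough stand-in; a real implementation would use a tokenizer like tiktoken, as shown earlier:

```python
def count_tokens(text: str) -> int:
    # Rough stand-in: a real implementation would use a tokenizer.
    return max(1, len(text) // 4)  # heuristic: ~4 characters per token

def trim_to_window(messages: list[dict], max_tokens: int) -> list[dict]:
    """Keep only the most recent messages that fit within max_tokens.

    Note the failure mode described above: older messages are dropped
    silently, and the model never knows they existed."""
    kept, total = [], 0
    for message in reversed(messages):       # walk newest -> oldest
        cost = count_tokens(message["content"])
        if total + cost > max_tokens:
            break                            # everything older is dropped
        kept.append(message)
        total += cost
    return list(reversed(kept))              # restore chronological order
```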
Managing Long Conversations
Here are strategies to work effectively within context window limits:
Strategy 1: Start Fresh When Needed
If a conversation is getting long and the AI starts giving inconsistent answers, start a new conversation with a clear summary of what was decided so far.
Strategy 2: Front-Load Important Information
Put the most critical information at the beginning of your prompt. Research shows LLMs pay more attention to the start and end of the context, sometimes losing focus in the middle (the "lost in the middle" problem).
Strategy 3: Use Summaries Instead of Full Text
Instead of pasting an entire 50-page document, paste a summary or the relevant sections.
Strategy 4: Be Explicit About What to Remember
Tell the AI what's important:
Key context to remember throughout this conversation:
- We're building a React app for healthcare
- The budget is $50,000
- Deadline is March 2026
- Must be HIPAA compliant
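One way to make this robust is to pin those facts into the system message on every request, so they survive even if older turns get trimmed. A sketch using the OpenAI-style chat message format (the role/content message shape is the only assumption here):

```python
PINNED_FACTS = """Key context to remember throughout this conversation:
- We're building a React app for healthcare
- The budget is $50,000
- Deadline is March 2026
- Must be HIPAA compliant"""

def build_messages(history: list[dict], user_prompt: str) -> list[dict]:
    """Prepend the pinned facts on every request so they sit at the
    start of the context, outside any history that may get trimmed."""
    return (
        [{"role": "system", "content": PINNED_FACTS}]
        + history
        + [{"role": "user", "content": user_prompt}]
    )
```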
Prompt Example
❌ Bad Example
[Pastes entire 200-page document]
What does this document say about pricing?
Even if the model's context window is large enough, pasting hundreds of pages and asking a vague question leads to poor results. The model may miss the relevant section or give a superficial overview. You're also wasting tokens and money.
✅ Improved Example
I have a 200-page vendor contract. Here is the section about pricing
(pages 45-52):
[Paste only the relevant 8 pages]
Based on this pricing section, answer these specific questions:
1. What is the base monthly fee?
2. Are there any volume discounts? If so, at what thresholds?
3. What are the overage charges per unit?
4. When is the first price review date?
Quote the exact contract language for each answer.
By pasting only the relevant section and asking specific questions, you use the context window efficiently and get precise, useful answers.
Try It Yourself
Context Window Management Exercise:
- Start a long conversation with an AI (at least 20 back-and-forth messages) on a complex topic
- At message 5, establish a specific fact (e.g., "Remember: the budget is exactly $47,500")
- At message 15, reference that detail casually
- At message 20+, ask the AI to recall the exact budget number
Did it remember? If not, you've just experienced context window limitations firsthand.
Now try the fix: Start a new conversation, include the key facts at the top of your message, and see the difference.
Real-World Scenario
Scenario: You're a developer tasked with building an AI-powered document analysis tool. Understanding context windows is essential for your architecture decisions.
I'm building a document analysis tool that needs to process legal
contracts ranging from 10 to 500 pages. Help me design an approach
that works within LLM context window limitations.
Consider:
1. Most contracts are too long for a single LLM call
2. Important clauses can reference other sections of the document
3. Users want to ask questions about the entire document
4. We need to keep costs reasonable
Propose an architecture that includes:
- How to chunk the documents
- How to maintain cross-reference context
- A retrieval strategy (RAG or similar)
- Which model(s) to use and why
- Estimated token usage per query
Present this as a technical design document with diagrams described
in text format.
"How would you handle a task that requires processing a document larger than the model's context window?"
Strong Answer: There are several established approaches for handling documents that exceed the context window. The most common is Retrieval-Augmented Generation (RAG), where you split the document into chunks, create embeddings for each chunk, store them in a vector database, and at query time retrieve only the most relevant chunks to include in the prompt. Another approach is map-reduce summarization: you process each chunk independently (map), then combine the results (reduce). For tasks requiring holistic understanding, you can create a hierarchical summary โ summarize sections, then summarize the summaries. You can also use models with larger context windows like Claude (200K tokens) or Gemini (1M tokens) for moderately large documents. The right approach depends on the task: RAG works best for question-answering, map-reduce for summarization, and long-context models for tasks requiring full document understanding. In practice, I often combine approaches โ using a long-context model with the most relevant sections pre-selected via embedding similarity.
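To make the RAG retrieval step concrete, here is a minimal, self-contained sketch. The embed function is a toy stand-in (a hashed bag-of-words); a real system would call an embedding model and would typically store vectors in a vector database rather than recomputing them per query:

```python
import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    """Toy stand-in for an embedding model (hashed bag-of-words).
    A real system would call an embedding model or API here."""
    vec = np.zeros(dim)
    for word in text.lower().split():
        vec[hash(word) % dim] += 1.0
    return vec

def chunk(document: str, size: int = 1_000, overlap: int = 200) -> list[str]:
    """Split a document into overlapping chunks; the overlap helps
    preserve context that straddles chunk boundaries."""
    step = size - overlap
    return [document[i:i + size] for i in range(0, len(document), step)]

def top_k_chunks(question: str, chunks: list[str], k: int = 3) -> list[str]:
    """Return the k chunks most similar to the question (cosine similarity)."""
    chunk_vecs = np.stack([embed(c) for c in chunks])
    q = embed(question)
    sims = chunk_vecs @ q / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q) + 1e-9
    )
    best = np.argsort(sims)[-k:][::-1]
    return [chunks[int(i)] for i in best]

# Only the retrieved chunks -- not the whole contract -- go into the prompt,
# which keeps each query inside the context window and controls cost.
```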
Key Takeaways
- The context window is the total tokens an LLM can process at once (input + output)
- Different models have different window sizes, from 8K to 1M+ tokens
- The window includes system prompts, conversation history, your input, AND the response
- When the window fills up, the model loses older information (usually silently)
- Strategies: start fresh, front-load key info, use summaries, be explicit about priorities
- Paste only relevant sections of documents, not the whole thing
- For large documents, use techniques like RAG or chunking
- Context window management is a critical skill for production AI applications