Introduction
Context window size has grown from 2K tokens in early GPT models to 128K tokens in GPT-4 Turbo, 200K in Claude 3, and 1 million tokens in Gemini 1.5 Pro. Despite this growth, effective context management remains critical: models attend less effectively to information in the middle of long contexts, token costs scale with context length, and response latency increases. This guide covers strategies for managing context windows in production.
The "Lost in the Middle" Problem
Research such as "Lost in the Middle" (Liu et al., 2023) consistently shows that LLMs perform best when relevant information appears at the beginning or end of the context window. Information in the middle is more likely to be ignored or incorrectly processed.
This has direct implications for RAG systems: placing the most relevant documents at the beginning and end of the context improves answer quality even if less relevant documents are included in the middle.
Context Budgeting
Treat context like a budget. Allocate tokens deliberately:
Total context: 100K tokens (example)
- System prompt: 2K tokens (2%)
- Conversation history: 10K tokens (10%)
- Retrieved documents: 80K tokens (80%)
- Current query + formatting: 8K tokens (8%)
For each allocation, ask whether it earns its tokens: could the system prompt be tightened, the conversation history summarized more aggressively, or the retrieved set pruned?
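As a starting point, here is a minimal sketch of enforcing such a split, using tiktoken for counting. The `enforce_budget` helper and its blunt head-truncation are illustrative assumptions, not a production policy:

```python
import tiktoken

ENC = tiktoken.get_encoding("cl100k_base")

def count_tokens(text: str) -> int:
    return len(ENC.encode(text))

def enforce_budget(sections: dict[str, str], budgets: dict[str, int]) -> dict[str, str]:
    """Truncate each section to its token allocation (blunt head-truncation)."""
    trimmed = {}
    for name, text in sections.items():
        tokens = ENC.encode(text)
        trimmed[name] = ENC.decode(tokens[: budgets[name]])
    return trimmed

# Allocation mirroring the 100K example above
budgets = {"system": 2_000, "history": 10_000, "documents": 80_000, "query": 8_000}
```

In practice you would trim history from the oldest turns and documents by relevance rank, as the sections below show, rather than truncating each section from the head.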
Strategies for Long Conversations
Sliding Window
Keep only the most recent N turns of conversation:
```python
def get_conversation_context(conversation, max_turns=10):
    """Keep only the most recent max_turns of conversation."""
    trimmed = conversation[-max_turns:]
    # Prepend a summary of the earlier turns if any were dropped
    if len(conversation) > max_turns:
        summary = summarize_conversation(conversation[:-max_turns])  # assumed helper
        trimmed = [{"role": "system", "content": f"Earlier summary: {summary}"}] + trimmed
    return trimmed
```
Conversation Summarization
Periodically summarize the conversation and replace older messages:
```python
def summarize_and_trim(messages, summary_threshold=20):
    if len(messages) <= summary_threshold:
        return messages
    to_summarize = messages[: len(messages) - summary_threshold]
    summary_prompt = (
        "Summarize the key points from this conversation, preserving "
        "any critical information the user has provided:"
    )
    summary = call_llm(summary_prompt, to_summarize)  # call_llm: your LLM wrapper
    remaining = messages[-summary_threshold:]
    return [{"role": "system", "content": f"Conversation summary: {summary}"}] + remaining
```
Hierarchical Summarization
For very long conversations, maintain a hierarchy of summaries:
Level 0: Full conversation (raw messages)
Level 1: Hourly summaries
Level 2: Daily summaries
Level 3: Conversation summary (per session)
When context is full, replace Level 0 messages with Level 1 summaries, then Level 2, etc.
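One way to implement that collapse, as a minimal sketch: the level layout and `compact_context` are illustrative, and the summaries at each level are assumed to be precomputed (e.g. by a background job).

```python
def compact_context(levels, max_tokens, count_tokens):
    """Drop the oldest, most detailed material first.

    levels[0] holds raw messages; levels[1], levels[2], ... hold
    progressively coarser summaries assumed to already cover whatever
    is dropped from the level below.
    """
    def total_tokens():
        return sum(count_tokens(item) for level in levels for item in level)

    for level in levels[:-1]:          # never drop the coarsest summaries
        while level and total_tokens() > max_tokens:
            level.pop(0)               # oldest item at this level goes first
    # Assemble coarsest (oldest) material first, raw recent messages last.
    return [item for level in reversed(levels) for item in level]
```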
RAG Context Management
Document Ranking for Context
When multiple documents are retrieved but context is limited:
1. Rank the retrieved documents by relevance using a cross-encoder
2. Fill the context window starting with the most relevant documents
3. Place the top document at the END of the context (consistently among the strongest positions)
4. Place the second-best document at the BEGINNING
5. Fill the middle with the remaining documents
Chunk-Level Re-Ranking
Instead of ranking entire documents, rank individual chunks. A single relevant paragraph from a marginal document may be more useful than the entire top document:
```python
def fill_context(chunks, query, max_tokens, token_len=len):
    """Select ranked chunks to fill the context window.

    token_len defaults to character length; swap in a real tokenizer
    for accurate budgeting.
    """
    ranked = cross_encoder_rank(chunks, query)  # assumed re-ranking helper
    selected = []
    token_count = 0
    # Greedily take chunks in rank order while they fit
    for chunk in ranked:
        if token_count + token_len(chunk) <= max_tokens:
            selected.append(chunk)
            token_count += token_len(chunk)
        else:
            break
    # Reorder: best chunk last, second best first, rest in the middle
    return reorder_for_positioning(selected)
```
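`reorder_for_positioning` is referenced above but never defined. A minimal sketch consistent with the placement steps from the ranking list, assuming the input arrives in descending relevance order (as it does from `fill_context`):

```python
def reorder_for_positioning(ranked_chunks):
    """Place the best chunk at the end, the second best at the start,
    and everything else in the middle."""
    if len(ranked_chunks) < 2:
        return list(ranked_chunks)
    best, second, *rest = ranked_chunks
    return [second] + rest + [best]
```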
Long Document Processing
Map-Reduce for Very Long Documents
Split the document, process each section independently, then combine:
```python
def process_long_document(document, chunk_size=4000):
    chunks = split_into_chunks(document, chunk_size)  # assumed helper
    summaries = []
    for chunk in chunks:
        summary = call_llm("Summarize this section:", chunk)
        summaries.append(summary)
    final_summary = call_llm(
        "Combine these section summaries into a coherent overview:",
        "\n\n".join(summaries),
    )
    return final_summary
```
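`split_into_chunks` is left undefined above. A minimal token-based sketch, assuming tiktoken and ignoring refinements like sentence-boundary splitting or overlap between chunks:

```python
import tiktoken

def split_into_chunks(document: str, chunk_size: int) -> list[str]:
    """Split a document into chunks of at most chunk_size tokens."""
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(document)
    return [
        enc.decode(tokens[i : i + chunk_size])
        for i in range(0, len(tokens), chunk_size)
    ]
```

Production chunkers usually split on paragraph or sentence boundaries and overlap adjacent chunks slightly so that no fact is cut in half.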
Iterative Refinement
For analysis of long documents, iterate with targeted queries:
1. Generate an initial analysis from a high-level summary of the document
2. Query the model for what is missing or unclear
3. Retrieve the specific sections that fill those gaps
4. Generate a refined analysis
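A minimal sketch of this loop; `summarize()` and `retrieve_sections()` are hypothetical helpers, and `call_llm` is the same wrapper used in the earlier examples:

```python
def iterative_analysis(document, rounds=3):
    """Refine an analysis over several targeted-retrieval rounds."""
    # summarize() and retrieve_sections() are hypothetical helpers
    analysis = call_llm("Analyze this document overview:", summarize(document))
    for _ in range(rounds):
        gaps = call_llm("List what is missing or unclear in this analysis:", analysis)
        sections = retrieve_sections(document, gaps)
        if not sections:
            break  # nothing left to fill in
        analysis = call_llm(
            "Refine the analysis using these additional sections:",
            analysis + "\n\n" + sections,
        )
    return analysis
```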
Monitoring Context Usage
Track these metrics per request:
- Total input tokens and the fraction of the context window used
- Tokens consumed by each budget category (system prompt, history, documents, query)
- How often truncation or summarization is triggered
- Token cost and response latency
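A minimal sketch of computing these numbers, reusing `count_tokens` from the budgeting sketch above; the metric names and plain-dict output are illustrative assumptions:

```python
def context_metrics(sections: dict[str, str], window_size: int) -> dict:
    """Compute per-request context-usage metrics."""
    counts = {name: count_tokens(text) for name, text in sections.items()}
    total = sum(counts.values())
    return {
        "context_tokens": total,
        "window_utilization": total / window_size,
        "tokens_per_section": counts,
    }
```

Feed the result into your existing metrics pipeline and alert when utilization approaches 1.0; that is when truncation starts silently dropping content.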
Conclusion
Effective context management is essential for building reliable LLM applications, regardless of context window size. Prioritize important information, use hierarchical summarization for long conversations, rank and position documents carefully in RAG systems, and monitor context usage in production. Million-token windows are impressive, but they work best when you treat attention as a scarce resource rather than the window as unlimited storage.