Introduction


Context window sizes have grown from 2K tokens in early GPT models to 128K tokens in GPT-4 Turbo, 200K in Claude 3, and 1 million tokens in Gemini 1.5 Pro. Despite this growth, effective context management remains critical: models attend less effectively to information in the middle of long contexts, token costs scale with context length, and response latency increases. This guide covers strategies for managing context windows in production.


The "Lost in the Middle" Problem


Research consistently shows that LLMs perform best when relevant information appears at the beginning or end of the context window. Information in the middle is more likely to be ignored or incorrectly processed.


**Performance by position (approximate):**

  • Beginning (first 20%): Best recall, 90%+ accuracy
  • Middle (20-80%): Worst recall, 60-70% accuracy
  • End (last 20%): Good recall, 85%+ accuracy

This has direct implications for RAG systems: placing the most relevant documents at the beginning and end of the context improves answer quality, even if less relevant documents end up in the middle.
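
A minimal sketch of this positional reordering, used later in this guide as reorder_for_positioning (assuming the input list arrives ranked best-first):

    def reorder_for_positioning(ranked_items):
        """Place the best item last, the second best first, and the rest
        in the middle, matching the U-shaped recall pattern above."""
        if len(ranked_items) < 2:
            return ranked_items
        best, second, *rest = ranked_items
        return [second] + rest + [best]

For example, reorder_for_positioning(["a", "b", "c", "d"]) returns ["b", "c", "d", "a"]: the second-best item leads, the best item closes, and the rest fill the middle.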


Context Budgeting


Treat context like a budget. Allocate tokens deliberately:


    
    Total context: 100K tokens (example)
    - System prompt: 2K tokens (2%)
    - Conversation history: 10K tokens (10%)
    - Retrieved documents: 80K tokens (80%)
    - Current query + formatting: 8K tokens (8%)

For each allocation, ask:


  • Does every token in the system prompt need to be there?
  • Can conversation history be summarized instead of included verbatim?
  • Are all retrieved documents equally relevant?
  • Can the query be compressed?
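
One way to keep yourself honest is to measure each component per request. A minimal sketch, assuming tiktoken for counting and using the example allocations above as the budget:

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    def check_budget(system_prompt, history, documents, query, budget):
        """Count tokens per component and flag any that exceed its budget."""
        usage = {
            "system": len(enc.encode(system_prompt)),
            "history": sum(len(enc.encode(m["content"])) for m in history),
            "documents": sum(len(enc.encode(d)) for d in documents),
            "query": len(enc.encode(query)),
        }
        for component, used in usage.items():
            if used > budget[component]:
                print(f"{component}: {used} tokens over budget ({budget[component]})")
        return usage

Here budget would be a dict like {"system": 2000, "history": 10000, "documents": 80000, "query": 8000}.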

Strategies for Long Conversations


Sliding Window


Keep only the most recent N turns of conversation:


    
    def get_conversation_context(conversation, max_turns=10):
        """Keep only the most recent max_turns of conversation."""
        trimmed = conversation[-max_turns:]
        # Prepend a summary of the earlier turns that were dropped
        if len(conversation) > max_turns:
            summary = summarize_conversation(conversation[:-max_turns])
            trimmed = [{"role": "system", "content": f"Earlier summary: {summary}"}] + trimmed
        return trimmed
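
The call_llm and summarize_conversation helpers are left abstract in these snippets. A minimal sketch against the OpenAI chat API (the model name is an assumption; any chat-completion provider works):

    from openai import OpenAI

    client = OpenAI()

    def call_llm(instruction, content):
        """One-shot call: instruction as the system message, content as the user message."""
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # assumption: substitute your model
            messages=[
                {"role": "system", "content": instruction},
                {"role": "user", "content": str(content)},
            ],
        )
        return response.choices[0].message.content

    def summarize_conversation(messages):
        """Flatten messages to a transcript and summarize it."""
        transcript = "\n".join(f"{m['role']}: {m['content']}" for m in messages)
        return call_llm("Summarize this conversation concisely:", transcript)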
    
    

Conversation Summarization


Periodically summarize the conversation and replace older messages:


    
    def summarize_and_trim(messages, summary_threshold=20):
        if len(messages) <= summary_threshold:
            return messages

        # Summarize everything except the most recent messages
        to_summarize = messages[:len(messages) - summary_threshold]
        summary_prompt = (
            "Summarize the key points from this conversation, "
            "preserving any critical information the user has provided:"
        )
        summary = call_llm(summary_prompt, to_summarize)

        remaining = messages[-summary_threshold:]
        return [{"role": "system", "content": f"Conversation summary: {summary}"}] + remaining
    
    

Hierarchical Summarization


For very long conversations, maintain a hierarchy of summaries:


    
    Level 0: Full conversation (raw messages)
    Level 1: Hourly summaries
    Level 2: Daily summaries
    Level 3: Conversation summary (per session)
    
    

When context is full, replace Level 0 messages with Level 1 summaries, then Level 2, and so on.
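
A minimal sketch of that fallback, where levels holds candidate contexts from finest to coarsest and count_tokens is whatever tokenizer count you use (both names are illustrative):

    def pick_context_level(levels, budget, count_tokens):
        """levels: candidate contexts, finest first (raw messages,
        hourly summaries, daily summaries, session summary).
        Return the finest level that fits the token budget."""
        for level in levels:
            if sum(count_tokens(msg) for msg in level) <= budget:
                return level
        return levels[-1]  # fall back to the coarsest summary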


RAG Context Management


Document Ranking for Context


When multiple documents are retrieved but context is limited:


1. Retrieve many documents (high recall)
2. Rank by relevance using a cross-encoder
3. Fill the context window starting with the most relevant documents
4. Place the top document at the END of the context (recency makes the end a strong position)
5. Place the second-best document at the BEGINNING
6. Fill the middle with the remaining documents


Chunk-Level Re-Ranking


Instead of ranking entire documents, rank individual chunks. A single relevant paragraph from a marginal document may be more useful than the entire top document:


    
    def fill_context(chunks, query, max_tokens):
        """Select chunks to fill the context window."""
        ranked = cross_encoder_rank(chunks, query)
        selected = []
        token_count = 0

        # Always include the top chunk
        # (len() is a character-count stand-in; use a real tokenizer count in production)
        top = ranked[0]
        if token_count + len(top) <= max_tokens:
            selected.append(top)
            token_count += len(top)

        # Greedily add remaining chunks while they fit
        for chunk in ranked[1:]:
            if token_count + len(chunk) <= max_tokens:
                selected.append(chunk)
                token_count += len(chunk)
            else:
                break

        # Reorder: best chunk last, second best first, rest in middle
        # (see reorder_for_positioning, sketched earlier)
        return reorder_for_positioning(selected)
    
    

Long Document Processing


Map-Reduce for Very Long Documents


Split the document, process each section independently, then combine:


    
    def process_long_document(document, chunk_size=4000):
        chunks = split_into_chunks(document, chunk_size)
        summaries = []

        # Map: summarize each chunk independently
        for chunk in chunks:
            summary = call_llm("Summarize this section:", chunk)
            summaries.append(summary)

        # Reduce: combine the per-section summaries
        final_summary = call_llm(
            "Combine these section summaries into a coherent overview:",
            "\n\n".join(summaries)
        )
        return final_summary
    
    

Iterative Refinement


For analysis of long documents, iterate with targeted queries:


1. Generate an initial summary
2. Query the model for what's missing or unclear
3. Retrieve specific sections to fill gaps
4. Generate a refined analysis
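
A minimal sketch of this loop, reusing the call_llm helper from earlier; retrieve_sections stands in for whatever retrieval you have over the document:

    def refine_analysis(document, rounds=2):
        # Crude initial pass over the head of the document (character slice)
        analysis = call_llm("Summarize this document:", document[:8000])
        for _ in range(rounds):
            gaps = call_llm(
                "List specific questions about what is missing or unclear in this summary:",
                analysis,
            )
            sections = retrieve_sections(document, gaps)  # stand-in retriever
            analysis = call_llm(
                "Refine this analysis using the additional sections:",
                f"Analysis:\n{analysis}\n\nSections:\n{sections}",
            )
        return analysis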


Monitoring Context Usage


Track these metrics per request:


  • **Total tokens used** vs. model maximum
  • **Effective context ratio**: useful tokens / total tokens
  • **Position of relevant information**: where in the context the key data appeared
  • **Truncation events**: how often context is being cut off
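
A minimal sketch of a per-request record (field names are illustrative; wire it into whatever metrics pipeline you already run):

    from dataclasses import dataclass

    @dataclass
    class ContextMetrics:
        total_tokens: int        # tokens sent with this request
        model_max: int           # model's context limit
        useful_tokens: int       # tokens judged relevant after the fact
        relevant_position: float # fractional position of the key info (0=start, 1=end)
        truncated: bool          # was anything cut off?

        @property
        def utilization(self) -> float:
            return self.total_tokens / self.model_max

        @property
        def effective_ratio(self) -> float:
            return self.useful_tokens / max(self.total_tokens, 1)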

Conclusion


Effective context management is essential for building reliable LLM applications, regardless of context window size. Prioritize important information, use hierarchical summarization for long conversations, rank documents carefully in RAG systems, and monitor context usage in production. Million-token windows are impressive, but they work best when you treat attention as a scarce resource rather than unlimited storage.