Introduction


Context window sizes have grown from 2K tokens in early GPT models to 128K tokens in GPT-4 Turbo, 200K in Claude 3, and 1 million tokens in Gemini 1.5 Pro. Despite this growth, effective context management remains critical: models attend less effectively to information in the middle of long contexts, token costs scale with context length, and response latency increases. This guide covers strategies for managing context windows in production.


The "Lost in the Middle" Problem


Research consistently shows that LLMs perform best when relevant information appears at the beginning or end of the context window. Information in the middle is more likely to be ignored or incorrectly processed.


**Performance by position (approximate):**

  • Beginning (first 20%): Best recall, 90%+ accuracy
  • Middle (20-80%): Worst recall, 60-70% accuracy
  • End (last 20%): Good recall, 85%+ accuracy

This has direct implications for RAG systems: placing the most relevant documents at the beginning and end of the context improves answer quality, even if less relevant documents end up in the middle.
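
A minimal sketch of this positional reordering, used later in this guide as reorder_for_positioning (assuming the input list arrives ranked best-first):

    def reorder_for_positioning(ranked_items):
        """Place the best item last, the second best first, and the rest
        in the middle, matching the U-shaped recall pattern above."""
        if len(ranked_items) < 2:
            return ranked_items
        best, second, *rest = ranked_items
        return [second] + rest + [best]

For example, reorder_for_positioning(["a", "b", "c", "d"]) returns ["b", "c", "d", "a"]: the second-best item leads, the best item closes, and the rest fill the middle.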


Context Budgeting


Treat context like a budget. Allocate tokens deliberately:


    
    Total context: 100K tokens (example)
    - System prompt: 2K tokens (2%)
    - Conversation history: 10K tokens (10%)
    - Retrieved documents: 80K tokens (80%)
    - Current query + formatting: 8K tokens (8%)

For each allocation, ask:


  • Does every token in the system prompt need to be there?
  • Can conversation history be summarized instead of included verbatim?
  • Are all retrieved documents equally relevant?
  • Can the query be compressed?
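
One way to keep yourself honest is to measure each component per request. A minimal sketch, assuming tiktoken for counting and using the example allocations above as the budget:

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    def check_budget(system_prompt, history, documents, query, budget):
        """Count tokens per component and flag any that exceed its budget."""
        usage = {
            "system": len(enc.encode(system_prompt)),
            "history": sum(len(enc.encode(m["content"])) for m in history),
            "documents": sum(len(enc.encode(d)) for d in documents),
            "query": len(enc.encode(query)),
        }
        for component, used in usage.items():
            if used > budget[component]:
                print(f"{component}: {used} tokens over budget ({budget[component]})")
        return usage

Here budget would be a dict like {"system": 2000, "history": 10000, "documents": 80000, "query": 8000}.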

Strategies for Long Conversations


Sliding Window


Keep only the most recent N turns of conversation:


    
    def get_conversation_context(conversation, max_turns=10):
        """Keep only the most recent max_turns of conversation."""
        trimmed = conversation[-max_turns:]
        # Prepend a summary of the earlier turns that were dropped
        if len(conversation) > max_turns:
            summary = summarize_conversation(conversation[:-max_turns])
            trimmed = [{"role": "system", "content": f"Earlier summary: {summary}"}] + trimmed
        return trimmed
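
The call_llm and summarize_conversation helpers are left abstract in these snippets. A minimal sketch against the OpenAI chat API (the model name is an assumption; any chat-completion provider works):

    from openai import OpenAI

    client = OpenAI()

    def call_llm(instruction, content):
        """One-shot call: instruction as the system message, content as the user message."""
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # assumption: substitute your model
            messages=[
                {"role": "system", "content": instruction},
                {"role": "user", "content": str(content)},
            ],
        )
        return response.choices[0].message.content

    def summarize_conversation(messages):
        """Flatten messages to a transcript and summarize it."""
        transcript = "\n".join(f"{m['role']}: {m['content']}" for m in messages)
        return call_llm("Summarize this conversation concisely:", transcript)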
    
    

Conversation Summarization


Periodically summarize the conversation and replace older messages:


    
    def summarize_and_trim(messages, summary_threshold=20):
        if len(messages) <= summary_threshold:
            return messages

        # Summarize everything except the most recent messages
        to_summarize = messages[:len(messages) - summary_threshold]
        summary_prompt = (
            "Summarize the key points from this conversation, "
            "preserving any critical information the user has provided:"
        )
        summary = call_llm(summary_prompt, to_summarize)

        remaining = messages[-summary_threshold:]
        return [{"role": "system", "content": f"Conversation summary: {summary}"}] + remaining
    
    

Hierarchical Summarization


For very long conversations, maintain a hierarchy of summaries:


    
    Level 0: Full conversation (raw messages)
    Level 1: Hourly summaries
    Level 2: Daily summaries
    Level 3: Conversation summary (per session)
    
    

When context is full, replace Level 0 messages with Level 1 summaries, then Level 2, and so on.
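
A minimal sketch of that fallback, where levels holds candidate contexts from finest to coarsest and count_tokens is whatever tokenizer count you use (both names are illustrative):

    def pick_context_level(levels, budget, count_tokens):
        """levels: candidate contexts, finest first (raw messages,
        hourly summaries, daily summaries, session summary).
        Return the finest level that fits the token budget."""
        for level in levels:
            if sum(count_tokens(msg) for msg in level) <= budget:
                return level
        return levels[-1]  # fall back to the coarsest summary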


RAG Context Management


Document Ranking for Context


When multiple documents are retrieved but context is limited:


1. Retrieve many documents (high recall)
2. Rank by relevance using a cross-encoder
3. Fill the context window starting with the most relevant documents
4. Place the top document at the END of the context (recency makes the end a strong position)
5. Place the second-best document at the BEGINNING
6. Fill the middle with the remaining documents


Chunk-Level Re-Ranking


Instead of ranking entire documents, rank individual chunks. A single relevant paragraph from a marginal document may be more useful than the entire top document:


    
    def fill_context(chunks, query, max_tokens):
        """Select chunks to fill the context window."""
        ranked = cross_encoder_rank(chunks, query)
        selected = []
        token_count = 0

        # Always include the top chunk
        # (len() is a character-count stand-in; use a real tokenizer count in production)
        top = ranked[0]
        if token_count + len(top) <= max_tokens:
            selected.append(top)
            token_count += len(top)

        # Greedily add remaining chunks while they fit
        for chunk in ranked[1:]:
            if token_count + len(chunk) <= max_tokens:
                selected.append(chunk)
                token_count += len(chunk)
            else:
                break

        # Reorder: best chunk last, second best first, rest in middle
        # (see reorder_for_positioning, sketched earlier)
        return reorder_for_positioning(selected)
    
    

Long Document Processing


Map-Reduce for Very Long Documents


Split the document, process each section independently, then combine:


    
    def process_long_document(document, chunk_size=4000):
        chunks = split_into_chunks(document, chunk_size)
        summaries = []

        # Map: summarize each chunk independently
        for chunk in chunks:
            summary = call_llm("Summarize this section:", chunk)
            summaries.append(summary)

        # Reduce: combine the per-section summaries
        final_summary = call_llm(
            "Combine these section summaries into a coherent overview:",
            "\n\n".join(summaries)
        )
        return final_summary
    
    

Iterative Refinement


For analysis of long documents, iterate with targeted queries:


1. Generate an initial summary
2. Query the model for what's missing or unclear
3. Retrieve specific sections to fill gaps
4. Generate a refined analysis
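
A minimal sketch of this loop, reusing the call_llm helper from earlier; retrieve_sections stands in for whatever retrieval you have over the document:

    def refine_analysis(document, rounds=2):
        # Crude initial pass over the head of the document (character slice)
        analysis = call_llm("Summarize this document:", document[:8000])
        for _ in range(rounds):
            gaps = call_llm(
                "List specific questions about what is missing or unclear in this summary:",
                analysis,
            )
            sections = retrieve_sections(document, gaps)  # stand-in retriever
            analysis = call_llm(
                "Refine this analysis using the additional sections:",
                f"Analysis:\n{analysis}\n\nSections:\n{sections}",
            )
        return analysis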


Monitoring Context Usage


Track these metrics per request:


  • **Total tokens used** vs. model maximum
  • **Effective context ratio**: useful tokens / total tokens
  • **Position of relevant information**: where in the context the key data appeared
  • **Truncation events**: how often context is being cut off
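
A minimal sketch of a per-request record (field names are illustrative; wire it into whatever metrics pipeline you already run):

    from dataclasses import dataclass

    @dataclass
    class ContextMetrics:
        total_tokens: int        # tokens sent with this request
        model_max: int           # model's context limit
        useful_tokens: int       # tokens judged relevant after the fact
        relevant_position: float # fractional position of the key info (0=start, 1=end)
        truncated: bool          # was anything cut off?

        @property
        def utilization(self) -> float:
            return self.total_tokens / self.model_max

        @property
        def effective_ratio(self) -> float:
            return self.useful_tokens / max(self.total_tokens, 1)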

Conclusion


Effective context management is essential for building reliable LLM applications, regardless of context window size. Prioritize important information, use hierarchical summarization for long conversations, rank documents carefully in RAG systems, and monitor context usage in production. Million-token windows are impressive, but they work best when you treat attention as a scarce resource rather than unlimited storage.