GAIR-NLP/ContextRot


Diagnosing and Mitigating Context Rot in Long-horizon Search

This repository contains the open-source resources for the paper Diagnosing and Mitigating Context Rot in Long-horizon Search. We release the complete experimental infrastructure for both the Web Search and Local Search settings, including agent scaffolds with seven context management strategies, diagnostic analysis tools, and evaluation benchmarks for studying context rot in long-horizon search agents.

News

  • [2026/06] Our paper is available on arXiv.

Overview

Extensive context has become the norm as Large Language Models (LLMs) are increasingly deployed in long-horizon tasks. The concern that increasing context length degrades model capabilities, known as context rot, has become a central issue for these applications. In this paper, we focus on deep search scenarios, aiming to investigate the rot phenomenon and its mitigation strategies. By evaluating four flagship open-source models across three benchmarks, we reveal a prevalent but unnoticed rot phenomenon: extensive context causes models to directly give up or prematurely provide uncertain answers, and this issue is exacerbated as the context grows. Through pruning experiments, we demonstrate the relationship between the accumulated context and the rot phenomenon. Furthermore, we investigate mitigating this issue through context management and post-hoc rejection sampling. For context management, we systematically evaluate seven different methods across three categories, based on performance, cost, and impact on context rot, providing clear guidance for strategy selection and usage. For rejection sampling, we develop a rot-aware filtering strategy and demonstrate its effectiveness across three aggregation methods. Finally, we show that these two approaches can be combined for further performance improvements.

Overview of context rot in long-horizon search

Web Search

Web Search Setup

  1. Create an environment

    conda create -n websearch python=3.11 -y
    conda activate websearch

  2. Install dependencies

    pip install openai qwen-agent transformers requests tqdm tiktoken pandas dashscope soundfile jinja2

  3. Configure model and tool services

    export TOKENIZER_PATH="/path/to/local-tokenizer"
    export MODEL_NAME="model-name-served-by-agent-endpoint"
    export MODEL_API_KEY="your-model-api-key"
    export AGENT_URL="https://your-openai-compatible-agent-endpoint/v1"
    
    export SERPER_API_KEY="your-serper-key"
    
    export SUMMARY_API_KEY="your-summary-api-key"
    export SUMMARY_API_BASE="https://your-openai-compatible-summary-endpoint/v1"
    export SUMMARY_MODEL_NAME="/path/or/name/of/summary-model"
    
    export MAX_LLM_CALL_PER_RUN=100
    export CONTEXT_LENGTH=$((198 * 1024))

    SERPER_API_KEY is used by the search tool and can be obtained from Serper. The SUMMARY_* variables configure the visit/page-summary tool. MODEL_NAME, MODEL_API_KEY, and AGENT_URL configure the main agent model endpoint. TOKENIZER_PATH is used only locally, for token counting and context-length control.

    To reduce search API cost, Web Search caches historical search/visit tool-call results. When the same query is requested again, the cached result is returned directly. The cache is stored under ContextRot/websearch/cache/serper_cache.sqlite by default, and you can override it with SERPER_CACHE_DB_PATH.

  4. Prepare input data

    websearch/src/main.py supports built-in dataset names:

    • xbench-deepsearch
    • browsecomp

    The built-in dataset files are stored under ContextRot/data/:

    • ContextRot/data/xbench-deepsearch.json
    • ContextRot/data/browsecomp.json

    You can also pass a custom JSON or JSONL file with --input. Each record should contain:

    {"question": "Question text", "answer": "Reference answer"}
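The caching behaviour described in step 3 can be pictured as a small SQLite-backed lookup table keyed by tool and query. The sketch below is not the repository's implementation: the `ToolCache` class, the table layout, and the `fetch` callable are hypothetical, and only the `SERPER_CACHE_DB_PATH` variable name and the `serper_cache.sqlite` file name come from the text above.

```python
import json
import os
import sqlite3
from typing import Callable


class ToolCache:
    """Hypothetical cache for search/visit tool-call results, backed by SQLite."""

    def __init__(self, db_path: str) -> None:
        self.conn = sqlite3.connect(db_path)
        # Assumed schema: one row per (tool, query) pair, result stored as JSON text.
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS cache ("
            "tool TEXT NOT NULL, query TEXT NOT NULL, result TEXT NOT NULL, "
            "PRIMARY KEY (tool, query))"
        )

    def get_or_fetch(self, tool: str, query: str, fetch: Callable[[str], dict]) -> dict:
        row = self.conn.execute(
            "SELECT result FROM cache WHERE tool = ? AND query = ?", (tool, query)
        ).fetchone()
        if row is not None:
            return json.loads(row[0])  # cache hit: no API call is made
        result = fetch(query)  # cache miss: call the real tool once
        self.conn.execute(
            "INSERT OR REPLACE INTO cache (tool, query, result) VALUES (?, ?, ?)",
            (tool, query, json.dumps(result)),
        )
        self.conn.commit()
        return result


def cache_db_path(default: str = "serper_cache.sqlite") -> str:
    """Return the override from SERPER_CACHE_DB_PATH if set, else the given default."""
    return os.environ.get("SERPER_CACHE_DB_PATH", default)
```

With a cache of this kind, rerunning the same benchmark with a different agent only pays for queries that no earlier run has issued.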

Web Search Usage

Run from the Web Search source directory:

cd ContextRot/websearch/src

python main.py \
  --dataset browsecomp \
  --model "$MODEL_NAME" \
  --tokenizer_path "$TOKENIZER_PATH" \
  --agent react \
  --max_workers 4

By default, outputs are written to ContextRot/websearch/output/<dataset>/<model_name>/<agent>.json, for example ContextRot/websearch/output/browsecomp/Qwen3.5-397B-A17B-FP8/react.json. Use --output /path/to/output.jsonl to override this path.

Add --num_samples N to run only the first N samples. If this argument is omitted, the full dataset is evaluated.
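How a custom --input file and --num_samples interact can be illustrated with a small loader. This is a sketch rather than the code in main.py: `load_records` is a hypothetical helper, and it assumes that a .json file holds a top-level list of records while a .jsonl file holds one record per line.

```python
import json
from pathlib import Path
from typing import Optional


def load_records(path: str, num_samples: Optional[int] = None) -> list[dict]:
    """Load question/answer records from a .json (list) or .jsonl file.

    Hypothetical helper: the file layouts are assumptions, not taken from main.py.
    """
    text = Path(path).read_text(encoding="utf-8")
    if path.endswith(".jsonl"):
        # One JSON object per non-empty line.
        records = [json.loads(line) for line in text.splitlines() if line.strip()]
    else:
        records = json.loads(text)
        if not isinstance(records, list):
            raise ValueError("expected a top-level JSON list of records")

    # Every record needs the two fields named in the setup section.
    for i, rec in enumerate(records):
        missing = {"question", "answer"} - set(rec)
        if missing:
            raise ValueError(f"record {i} is missing fields: {sorted(missing)}")

    # Omitting num_samples keeps the full dataset; otherwise take the first N.
    return records if num_samples is None else records[:num_samples]
```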

Current Web Search agent choices:

  • react
  • discard
  • summary_semantic
  • summary_length
  • summary_turn
  • keep_k_latest
  • keep_k_latest_wo_any
  • keep_k_latest_wo_reasoning
  • fold
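To give a feel for what a context management strategy does, here is a generic "keep the k latest tool outputs" pruning policy. It is only an illustration suggested by the name keep_k_latest. The message format, the placeholder text, and the function itself are assumptions, and the repository's agents may prune differently (for example, the _wo_any and _wo_reasoning variants are not modelled here).

```python
from typing import TypedDict


class Message(TypedDict):
    role: str      # "system", "user", "assistant", or "tool" (assumed chat format)
    content: str


# Hypothetical marker left where an old tool output used to be.
PLACEHOLDER = "[earlier tool output omitted]"


def keep_k_latest_tool_outputs(messages: list[Message], k: int) -> list[Message]:
    """Return a copy of `messages` in which all but the k latest tool outputs are elided."""
    tool_positions = [i for i, m in enumerate(messages) if m["role"] == "tool"]
    # Guard k == 0 explicitly: tool_positions[-0:] would keep every position.
    keep = set(tool_positions[-k:]) if k > 0 else set()

    pruned: list[Message] = []
    for i, m in enumerate(messages):
        if m["role"] == "tool" and i not in keep:
            pruned.append({"role": "tool", "content": PLACEHOLDER})
        else:
            pruned.append({"role": m["role"], "content": m["content"]})
    return pruned
```

Replacing old outputs with a placeholder, rather than deleting the messages, keeps the assistant/tool turn structure intact, which some chat APIs require.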

summary_semantic uses summary_agent_semantic_based.py and requires an additional classifier LLM to decide when summarization should be triggered. Configure it with:

export CLASSIFIER_API_BASE="https://your-openai-compatible-classifier-endpoint/v1"
export CLASSIFIER_API_KEY="your-classifier-api-key"
export CLASSIFIER_MODEL_NAME="gpt-oss-120b"

In our experiments, we use gpt-oss-120b as the classifier LLM.

Web Search Analysis

Use the analysis utilities on generated trajectory files:

terminal_state.py classifies the terminal state of each trajectory. struggle_behavior.py analyzes struggle patterns in agent trajectories at the process level.

export OPENAI_MODEL="your-llm-for-analysis"
export OPENAI_API_KEY="your-openai-api-key"
export OPENAI_BASE_URL="https://your-openai-compatible-endpoint/v1"

python ContextRot/websearch/analysis/terminal_state.py \
  --input /path/to/output.jsonl \
  --agent-type react

python ContextRot/websearch/analysis/struggle_behavior.py \
  --input /path/to/output.jsonl \
  --label-csv /path/to/terminal_state.csv

terminal_state.py requires the tokenizer path configured in the Web Search setup section, plus OPENAI_MODEL, OPENAI_API_KEY, and OPENAI_BASE_URL for the voting-based labeler. Set --agent-type to the agent used for generation. struggle_behavior.py uses the same LLM environment variables, takes the label CSV produced by terminal_state.py, and writes its results to a CSV file. In our experiments, we use gpt-oss-120b as the analysis LLM.
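One common way to build a voting-based labeler is to query the judge LLM several times and keep the strict-majority label. Whether terminal_state.py works exactly this way is an assumption. In the sketch below, `vote_label`, the `judge` callable, and the example label names are all hypothetical.

```python
from collections import Counter
from typing import Callable, Optional


def vote_label(
    trajectory: str,
    judge: Callable[[str], str],
    n_votes: int = 5,
) -> Optional[str]:
    """Ask `judge` for a label n_votes times; return the strict-majority label or None.

    `judge` stands in for one LLM call that maps a trajectory to a label string.
    """
    if n_votes < 1:
        raise ValueError("n_votes must be at least 1")
    votes = Counter(judge(trajectory) for _ in range(n_votes))
    label, count = votes.most_common(1)[0]
    # Require more than half of the votes so that ties yield no label.
    return label if count > n_votes / 2 else None
```

Returning None on a split vote makes ambiguous trajectories easy to set aside for manual review instead of silently assigning them a label.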

Local Search

Local Search Setup

  1. Create an environment

    conda create -n foldagent python=3.11 -y
    conda activate foldagent

  2. Install dependencies

    pip install \
      accelerate codetiming datasets dill hydra-core liger-kernel "numpy<2.0.0" \
      pandas peft "pyarrow>=19.0.0" pybind11 pylatexenc pre-commit "ray[default]" \
      "tensordict>=0.8.0,<=0.10.0,!=0.9.0" torchdata transformers wandb \
      "packaging>=20.0" uvicorn fastapi latex2sympy2_extended math_verify tensorboard openai

  3. Start the local search server

    Start this on the machine that hosts retrieval:

    cd ContextRot/localsearch/src/envs
    
    python search_server.py \
      --model Qwen/Qwen3-Embedding-8B \
      --corpus Tevatron/browsecomp-plus-corpus \
      --corpus-embedding-dataset miaolu3/browsecomp-plus \
      --host 0.0.0.0 \
      --port 8000

  4. Configure evaluation workers

    export LOCAL_SEARCH_URL="http://<search-server-ip>:8000"
    export OPENAI_API_KEY="your-api-key"
    export OPENAI_BASE_URL="https://your-openai-compatible-endpoint/v1"
    export TOKENIZER_PATH="/path/to/tokenizer"
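Before launching many evaluation workers, it can help to confirm that the four variables from step 4 are set and that no template placeholder was left in place. The check below is an optional sketch and not part of the repository. `check_env` is a hypothetical helper, and its URL validation rules are assumptions.

```python
import os
from typing import Mapping, Optional
from urllib.parse import urlparse

# Variable names taken from step 4 of the Local Search setup.
REQUIRED_VARS = ("LOCAL_SEARCH_URL", "OPENAI_API_KEY", "OPENAI_BASE_URL", "TOKENIZER_PATH")
URL_VARS = ("LOCAL_SEARCH_URL", "OPENAI_BASE_URL")


def check_env(env: Optional[Mapping[str, str]] = None) -> list[str]:
    """Return human-readable problems; an empty list means the config looks usable."""
    env = os.environ if env is None else env
    problems = [f"{name} is not set" for name in REQUIRED_VARS if not env.get(name)]

    for name in URL_VARS:
        value = env.get(name)
        if not value:
            continue  # already reported as unset
        if "<" in value or ">" in value:
            problems.append(f"{name} still contains a placeholder: {value!r}")
            continue
        parsed = urlparse(value)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            problems.append(f"{name} is not a valid http(s) URL: {value!r}")
    return problems


if __name__ == "__main__":
    for problem in check_env():
        print("config problem:", problem)
```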

Local Search Usage

Run evaluation from localsearch/:

cd ContextRot/localsearch

python src/eval_bc.py \
  --dataset browsecomp-plus \
  --model_name your-model-name \
  --workflow fold \
  --num_workers 32 \
  --context_length 131072 \
  --max_turn 100 \
  --max_session 100

Current Local Search workflows:

  • react
  • discard
  • summary_semantic
  • summary_length
  • summary_turn
  • keep_k_latest
  • keep_k_latest_wo_any
  • keep_k_latest_wo_reasoning
  • fold

Local Search Analysis

Use the analysis utilities on generated trajectory files:

terminal_state.py classifies the terminal state of each trajectory. struggle_score.py analyzes struggle patterns in agent trajectories at the process level.

export TOKENIZER_PATH="/path/to/tokenizer"
export OPENAI_MODEL="your-llm-for-analysis"
export OPENAI_API_KEY="your-openai-api-key"
export OPENAI_BASE_URL="https://your-openai-compatible-endpoint/v1"

python ContextRot/localsearch/analysis/terminal_state.py \
  --json /path/to/output.json \
  --agent-type react

python ContextRot/localsearch/analysis/struggle_score.py \
  --json /path/to/output.json \
  --label-csv /path/to/terminal_state.csv

Acknowledgment

We sincerely thank the authors of FoldAgent and DeepResearch. The Local Search code in this repository is based on FoldAgent, and the Web Search code is based on DeepResearch.

Citation

Please cite the paper if this repository or the paper is helpful to your work.

@misc{xia2026diagnosingmitigatingcontextrot,
      title={Diagnosing and Mitigating Context Rot in Long-horizon Search}, 
      author={Shijie Xia and Yikun Wang and Zhen Huang and Pengfei Liu},
      year={2026},
      eprint={2606.29718},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2606.29718}, 
}
