Diagnosing and Mitigating Context Rot in Long-horizon Search

This repository presents the open-source resources associated with the paper Diagnosing and Mitigating Context Rot in Long-horizon Search. We release the complete experimental infrastructure for both Web Search and Local Search settings, including agent scaffolds with seven context management strategies, diagnostic analysis tools, and evaluation benchmarks for studying context rot in long-horizon search agents.

News

[2026/06] Our paper is available on arXiv.

Overview

Extensive context has become the norm as Large Language Models (LLMs) are increasingly deployed in long-horizon tasks. The concern that increasing context length degrades model capabilities, known as context rot, has become a central issue for these applications. In this paper, we focus on deep search scenarios, aiming to investigate the rot phenomenon and its mitigation strategies. By evaluating four flagship open-source models across three benchmarks, we reveal a prevalent but unnoticed rot phenomenon: extensive context causes models to directly give up or prematurely provide uncertain answers, and this issue is exacerbated as the context grows. Through pruning experiments, we demonstrate the relationship between the accumulated context and the rot phenomenon. Furthermore, we investigate mitigating this issue through context management and post-hoc rejection sampling. For context management, we systematically evaluate seven different methods across three categories, based on performance, cost, and impact on context rot, providing clear guidance for strategy selection and usage. For rejection sampling, we develop a rot-aware filtering strategy and demonstrate its effectiveness across three aggregation methods. Finally, we show that these two approaches can be combined for further performance improvements.

Web Search

Web Search Setup

Create an environment

conda create -n websearch python=3.11 -y
conda activate websearch

Install dependencies

pip install openai qwen-agent transformers requests tqdm tiktoken pandas dashscope soundfile jinja2

Configure model and tool services
```
export TOKENIZER_PATH="/path/to/local-tokenizer"
export MODEL_NAME="model-name-served-by-agent-endpoint"
export MODEL_API_KEY="your-model-api-key"
export AGENT_URL="https://your-openai-compatible-agent-endpoint/v1"

export SERPER_API_KEY="your-serper-key"

export SUMMARY_API_KEY="your-summary-api-key"
export SUMMARY_API_BASE="https://your-openai-compatible-summary-endpoint/v1"
export SUMMARY_MODEL_NAME="/path/or/name/of/summary-model"

export MAX_LLM_CALL_PER_RUN=100
export CONTEXT_LENGTH=$((198 * 1024))
```
SERPER_API_KEY is used by the search tool and can be obtained from Serper. The SUMMARY_* variables configure the visit/page-summary tool. MODEL_NAME, MODEL_API_KEY, and AGENT_URL configure the main agent model endpoint. TOKENIZER_PATH is used only locally, for token counting and context-length control.
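
A missing variable is easier to catch before a long run starts than halfway through it. The following is a minimal sketch of such a pre-flight check, not part of the repository: the variable names come from the export block above, while `missing_vars`, `context_length`, and the fallback of 198 * 1024 tokens (the value exported above, not a documented default) are assumptions made for illustration.

```python
import os

# Variable names are taken from the export block above.
# The helpers themselves are hypothetical and not part of the repository.
REQUIRED_VARS = [
    "TOKENIZER_PATH", "MODEL_NAME", "MODEL_API_KEY", "AGENT_URL",
    "SERPER_API_KEY",
    "SUMMARY_API_KEY", "SUMMARY_API_BASE", "SUMMARY_MODEL_NAME",
]


def missing_vars(env=os.environ, required=REQUIRED_VARS):
    """Return the required variable names that are unset or empty."""
    return [name for name in required if not env.get(name)]


def context_length(env=os.environ, default=198 * 1024):
    """Read CONTEXT_LENGTH as an integer token budget.

    The fallback mirrors the value exported above; it is an assumption,
    not a default documented by the repository.
    """
    return int(env.get("CONTEXT_LENGTH", default))
```

Calling `missing_vars()` in a fresh shell lists whatever still needs to be exported, and `context_length()` returns the token budget as an integer (198 * 1024 = 202752 with the export shown above).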

To reduce search API cost, Web Search caches historical search/visit tool-call results. When the same query is requested again, the cached result is returned directly. The cache is stored under ContextRot/websearch/cache/serper_cache.sqlite by default, and you can override it with SERPER_CACHE_DB_PATH.
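
The caching behavior described above can be pictured with a small SQLite-backed lookup. This is only a sketch of the general technique under stated assumptions: the table schema, the `QueryCache` class, and the `get_or_fetch` method are hypothetical and do not describe the layout of serper_cache.sqlite or the repository's code.

```python
import sqlite3


class QueryCache:
    """Tiny SQLite-backed cache keyed by (tool, query).

    The schema is hypothetical; the repository's cache file may differ.
    """

    def __init__(self, db_path=":memory:"):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS cache ("
            "tool TEXT NOT NULL, query TEXT NOT NULL, result TEXT NOT NULL, "
            "PRIMARY KEY (tool, query))"
        )

    def get_or_fetch(self, tool, query, fetch):
        """Return the cached result, calling fetch(query) only on a miss."""
        row = self.conn.execute(
            "SELECT result FROM cache WHERE tool = ? AND query = ?",
            (tool, query),
        ).fetchone()
        if row is not None:
            return row[0]  # cache hit: no API call is made
        result = fetch(query)
        self.conn.execute(
            "INSERT OR REPLACE INTO cache (tool, query, result) VALUES (?, ?, ?)",
            (tool, query, result),
        )
        self.conn.commit()
        return result
```

Keying on both the tool name and the query keeps a search result and a page visit for the same string from overwriting each other.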

Prepare input data

websearch/src/main.py supports the following built-in dataset names:
- xbench-deepsearch
- browsecomp

The built-in dataset files are stored under ContextRot/data/:
- ContextRot/data/xbench-deepsearch.json
- ContextRot/data/browsecomp.json

You can also pass a custom JSON or JSONL file with --input. Each record should contain:
```
{"question": "Question text", "answer": "Reference answer"}
```
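
Before passing a custom file to --input, it may be worth checking that every record carries both fields. The sketch below is a hypothetical helper, not repository code; it assumes that a .json file holds a top-level list of records and that a .jsonl file holds one record per line.

```python
import json


def load_records(path):
    """Load question/answer records from a .json (list) or .jsonl file.

    Assumes a .json file contains a top-level list; this layout is an
    assumption, not something the repository documents.
    """
    with open(path, encoding="utf-8") as f:
        if path.endswith(".jsonl"):
            # One JSON object per non-empty line.
            records = [json.loads(line) for line in f if line.strip()]
        else:
            records = json.load(f)
    for i, rec in enumerate(records):
        missing = {"question", "answer"} - rec.keys()
        if missing:
            raise ValueError(f"record {i} is missing fields: {sorted(missing)}")
    return records
```

A call such as `load_records("/path/to/custom.jsonl")` either returns the list of records or raises a ValueError naming the first incomplete one.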

Web Search Usage

Run from the Web Search source directory:

cd ContextRot/websearch/src

python main.py \
  --dataset browsecomp \
  --model "$MODEL_NAME" \
  --tokenizer_path "$TOKENIZER_PATH" \
  --agent react \
  --max_workers 4

By default, outputs are written to ContextRot/websearch/output/<dataset>/<model_name>/<agent>.json, for example ContextRot/websearch/output/browsecomp/Qwen3.5-397B-A17B-FP8/react.json. Use --output /path/to/output.jsonl to override this path.

Add --num_samples N to run only the first N samples. If this argument is omitted, the full dataset is evaluated.

Current Web Search agent choices:

react
discard
summary_semantic
summary_length
summary_turn
keep_k_latest
keep_k_latest_wo_any
keep_k_latest_wo_reasoning
fold
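
To give a feel for what a pruning-style strategy does to the conversation history, here is a minimal sketch of one plausible reading of the name keep_k_latest: retain the K most recent tool results and blank out older ones. This is an illustration only. The function, the placeholder text, and the OpenAI-style message dictionaries are assumptions, and the repository's agents may behave differently.

```python
def keep_k_latest(messages, k, placeholder="[tool result omitted]"):
    """Return a copy of messages with all but the k newest tool results blanked.

    Hypothetical illustration; not the repository's implementation.
    """
    tool_idx = [i for i, m in enumerate(messages) if m["role"] == "tool"]
    # Guard k == 0 explicitly: tool_idx[-0:] would select every index.
    keep = set(tool_idx[-k:]) if k > 0 else set()
    return [
        {**m, "content": placeholder}
        if m["role"] == "tool" and i not in keep
        else m
        for i, m in enumerate(messages)
    ]
```

The input list is left unmodified, so the full trajectory can still be logged while only the pruned copy is sent to the model.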

summary_semantic uses summary_agent_semantic_based.py and requires an additional classifier LLM to decide when summarization should be triggered. Configure it with:

export CLASSIFIER_API_BASE="https://your-openai-compatible-classifier-endpoint/v1"
export CLASSIFIER_API_KEY="your-classifier-api-key"
export CLASSIFIER_MODEL_NAME="gpt-oss-120b"

In our experiments, we use gpt-oss-120b as the classifier LLM.

Web Search Analysis

Use the analysis utilities on generated trajectory files:

terminal_state.py classifies the terminal state of each trajectory. struggle_behavior.py analyzes struggle patterns in agent trajectories at the process level.

export OPENAI_MODEL="your-llm-for-analysis"
export OPENAI_API_KEY="your-openai-api-key"
export OPENAI_BASE_URL="https://your-openai-compatible-endpoint/v1"

python ContextRot/websearch/analysis/terminal_state.py \
  --input /path/to/output.jsonl \
  --agent-type react

python ContextRot/websearch/analysis/struggle_behavior.py \
  --input /path/to/output.jsonl \
  --label-csv /path/to/terminal_state.csv

terminal_state.py requires the tokenizer path configured in the Web Search Setup section, as well as OPENAI_MODEL, OPENAI_API_KEY, and OPENAI_BASE_URL for the voting-based labeler. Set --agent-type to the agent used for generation. struggle_behavior.py uses the same LLM environment variables plus the label CSV produced by terminal_state.py, and writes its results to a CSV file. In our experiments, we use gpt-oss-120b as the analysis LLM.
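
Once a label CSV exists, a quick tally of terminal states gives a first impression of how often trajectories end in each state. The sketch below is a hypothetical helper, not repository code; in particular, the column name `label` is an assumption, so check the header of the CSV that terminal_state.py actually writes.

```python
import csv
from collections import Counter


def tally_labels(csv_file, column="label"):
    """Count how often each value appears in one column of an open CSV file.

    The default column name is an assumption about the label CSV layout.
    """
    reader = csv.DictReader(csv_file)
    return Counter(row[column] for row in reader)
```

A typical call would be `with open("/path/to/terminal_state.csv", newline="") as f: print(tally_labels(f))`.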

Local Search

Local Search Setup

Create an environment

conda create -n foldagent python=3.11 -y
conda activate foldagent

Install dependencies

pip install \
  accelerate codetiming datasets dill hydra-core liger-kernel "numpy<2.0.0" \
  pandas peft "pyarrow>=19.0.0" pybind11 pylatexenc pre-commit "ray[default]" \
  "tensordict>=0.8.0,<=0.10.0,!=0.9.0" torchdata transformers wandb \
  "packaging>=20.0" uvicorn fastapi latex2sympy2_extended math_verify tensorboard openai

Start the local search server

Start this on the machine that hosts retrieval:

cd ContextRot/localsearch/src/envs

python search_server.py \
  --model Qwen/Qwen3-Embedding-8B \
  --corpus Tevatron/browsecomp-plus-corpus \
  --corpus-embedding-dataset miaolu3/browsecomp-plus \
  --host 0.0.0.0 \
  --port 8000

Configure evaluation workers

export LOCAL_SEARCH_URL="http://<search-server-ip>:8000"
export OPENAI_API_KEY="your-api-key"
export OPENAI_BASE_URL="https://your-openai-compatible-endpoint/v1"
export TOKENIZER_PATH="/path/to/tokenizer"
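
Because the search server and the evaluation workers may run on different machines, it can help to confirm that the address in LOCAL_SEARCH_URL accepts TCP connections before launching a large run. The sketch below uses only the Python standard library and does not call any endpoint of the search server; `host_and_port` and `is_reachable` are hypothetical helper names.

```python
import socket
from urllib.parse import urlparse


def host_and_port(url):
    """Split a URL into (hostname, port), defaulting to 443 for https, else 80."""
    parsed = urlparse(url)
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    return parsed.hostname, port


def is_reachable(url, timeout=3.0):
    """Return True if a TCP connection to the URL's host and port succeeds."""
    host, port = host_and_port(url)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False
```

For example, `is_reachable(os.environ["LOCAL_SEARCH_URL"])` (after `import os`) returns False when the server has not been started or a firewall blocks the port.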

Local Search Usage

Run evaluation from localsearch/:

cd ContextRot/localsearch

python src/eval_bc.py \
  --dataset browsecomp-plus \
  --model_name your-model-name \
  --workflow fold \
  --num_workers 32 \
  --context_length 131072 \
  --max_turn 100 \
  --max_session 100

Current Local Search workflows:

react
discard
summary_semantic
summary_length
summary_turn
keep_k_latest
keep_k_latest_wo_any
keep_k_latest_wo_reasoning
fold
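
To compare workflows under identical settings, the evaluation command above can be issued once per workflow. The following sketch builds the argument list from the flags shown in the usage example; `build_eval_cmd` and `run_all` are hypothetical helpers, and the working directory assumes the repository is checked out as ContextRot/.

```python
import subprocess

# Workflow names as listed above.
WORKFLOWS = [
    "react", "discard", "summary_semantic", "summary_length", "summary_turn",
    "keep_k_latest", "keep_k_latest_wo_any", "keep_k_latest_wo_reasoning", "fold",
]


def build_eval_cmd(workflow, model_name, num_workers=32, context_length=131072):
    """Build the eval_bc.py command line using the flags from the usage example."""
    return [
        "python", "src/eval_bc.py",
        "--dataset", "browsecomp-plus",
        "--model_name", model_name,
        "--workflow", workflow,
        "--num_workers", str(num_workers),
        "--context_length", str(context_length),
        "--max_turn", "100",
        "--max_session", "100",
    ]


def run_all(model_name, cwd="ContextRot/localsearch"):
    """Run every workflow in sequence, stopping at the first failure."""
    for workflow in WORKFLOWS:
        subprocess.run(build_eval_cmd(workflow, model_name), cwd=cwd, check=True)
```

Passing the command as a list avoids shell quoting issues, and `check=True` stops the sweep as soon as one run exits with an error.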

Local Search Analysis

Use the analysis utilities on generated trajectory files:

terminal_state.py classifies the terminal state of each trajectory. struggle_score.py analyzes struggle patterns in agent trajectories at the process level.

export TOKENIZER_PATH="/path/to/tokenizer"
export OPENAI_MODEL="your-llm-for-analysis"
export OPENAI_API_KEY="your-openai-api-key"
export OPENAI_BASE_URL="https://your-openai-compatible-endpoint/v1"

python ContextRot/localsearch/analysis/terminal_state.py \
  --json /path/to/output.json \
  --agent-type react

python ContextRot/localsearch/analysis/struggle_score.py \
  --json /path/to/output.json \
  --label-csv /path/to/terminal_state.csv

Acknowledgment

We sincerely thank the authors of FoldAgent and DeepResearch. The Local Search code in this repository is based on FoldAgent, and the Web Search code is based on DeepResearch.

Citation

Please cite the paper if this repository or the paper is helpful to your work.

@misc{xia2026diagnosingmitigatingcontextrot,
      title={Diagnosing and Mitigating Context Rot in Long-horizon Search}, 
      author={Shijie Xia and Yikun Wang and Zhen Huang and Pengfei Liu},
      year={2026},
      eprint={2606.29718},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2606.29718}, 
}
