Recursive Language Models: Enabling Large Models to Handle Infinite-Length Texts at Lower Costs
MIT CSAIL introduces Recursive Language Models (RLMs), which let large models decompose and recursively process unbounded input through a REPL environment. In a 132k-token test, an RLM built on GPT-5-mini more than doubles GPT-5's accuracy while costing less.

Handling long texts with large models has always been a challenge: even with a large context window, performance degrades once you stuff in hundreds of thousands of tokens. Alex Zhang and Omar Khattab from MIT CSAIL recently proposed a new approach, Recursive Language Models (RLMs).
The core idea of RLMs is simple: instead of having the model swallow the entire long text at once, the model works the way a programmer does in a REPL environment, decomposing the input and processing it interactively. From the user's perspective it still looks like a single model call, but internally the model recursively spawns sub-calls for intermediate computations.

The implementation is essentially a Jupyter-like REPL environment. The user's prompt is stored in a Python variable, and the root model interacts with it through REPL loops instead of reading the entire content directly.
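The loop above can be sketched roughly as follows. This is a minimal toy, not the authors' implementation: the `llm` function is a hypothetical stand-in for a real model API, and the fixed chunk-and-summarize strategy stands in for the arbitrary Python the root model can write in the real REPL.

```python
# Toy sketch of an RLM-style loop. `llm` is a hypothetical model call,
# stubbed here so the example runs without any API.

def llm(prompt: str) -> str:
    """Hypothetical model call; this stub just echoes by keyword."""
    if "summarize" in prompt:
        return prompt.split(":", 1)[1][:40]  # pretend-summary: first 40 chars
    return "FINAL: " + prompt[-20:]

def rlm(user_prompt: str, chunk_size: int = 1000) -> str:
    # 1. The (possibly huge) prompt lives in a variable in the REPL,
    #    not in the root model's context window.
    env = {"prompt": user_prompt}

    # 2. The root model inspects it piecewise instead of reading it whole.
    chunks = [env["prompt"][i:i + chunk_size]
              for i in range(0, len(env["prompt"]), chunk_size)]

    # 3. Recursive sub-calls process each slice; only their short outputs
    #    flow back to the root call.
    partials = [llm("summarize: " + c) for c in chunks]

    # 4. A final call composes the answer from the intermediate results.
    return llm("combine: " + " | ".join(partials))
```

In the actual system the root model decides for itself how to slice, filter, and recurse over the variable; the fixed pipeline here only illustrates the shape of the recursion.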

How does it perform? On the OOLONG long-context benchmark, an RLM paired with GPT-5-mini achieves over 110% higher accuracy than GPT-5 on 132k-token sequences, more than doubling the performance. More counterintuitively, the cost is even lower.

The reason is that GPT-5 must process the entire context (e.g., 270k tokens) in full, whereas an RLM selectively decides which parts to feed to sub-model calls, so fewer tokens are processed in total. This also lets it scale easily to more than 1,000 documents.
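One way to see why the token count drops: much of the narrowing can happen as cheap code execution in the REPL, so most of the corpus never reaches any model. The snippet below is an illustrative sketch, not the paper's code; `select_relevant`, `docs`, and the keyword filter are all made up, standing in for the arbitrary filtering logic the root model can write itself.

```python
# Hedged sketch: programmatic narrowing before any sub-model call.
# In the real system the root model writes this kind of filter itself
# inside the REPL; here it is hard-coded for illustration.

def select_relevant(docs, keyword):
    """Cheap Python-level filter; costs zero model tokens to run."""
    return [d for d in docs if keyword in d]

# 1,001 documents, only one of which mentions the query topic.
docs = [f"doc {i}: filler text" for i in range(1000)]
docs.append("doc x: quantum annealing results")

relevant = select_relevant(docs, "quantum")
# Only `relevant` would be passed to a sub-model call,
# instead of all 1,001 documents.
```

The design point is that the expensive resource (model context) is spent only on the slices the code-level search could not resolve on its own.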
In the BrowseComp-Plus test, after ingesting prompts of over 10 million tokens, the RLM's ability to answer compound queries not only did not degrade but actually improved, even outperforming explicit indexing/retrieval methods.

Of course, challenges remain. Latency variance is large: the lower bound is roughly two model-call latencies, while the upper bound can stretch to minutes, depending on how fast the generated code runs and how complex the sub-queries are. The output mechanism also needs work, as the model sometimes confuses whether to emit the answer directly or to reference a variable in the REPL environment.
Still, the idea that a small model with a recursive strategy can beat a large model is itself worth pondering. The paper and more details are available [on the project blog](https://alexzhang13.github.io/blog/2025/rlm/).

Published: 2025-10-15 22:32