Wink Pings

Recursive Language Models: Letting Large Models Handle Unbounded-Length Text at Lower Cost

MIT CSAIL introduces Recursive Language Models (RLMs), an inference strategy that lets a model decompose unbounded input and process it recursively through a REPL environment. On a 132k-token test, an RLM built on GPT-5-mini more than doubles GPT-5's accuracy while costing less per query.

![Academic paper graphic titled Recursive Language Models detailing a proposed inference strategy for LLMs to decompose and recursively interact with unbounded input context through REPL environments. Authors listed as Alex Zhang and Omar Khattab from MIT CSAIL, published October 15 2025. Two bar charts showing OOLONG tree accuracy on 132k-Context for August, with green bars for RLMs with GPT-5-mini outperforming others like GPT-5 in various length categories from 10% worst to 90% best.](https://wink.run/image?url=https%3A%2F%2Fpbs.twimg.com%2Fmedia%2FG3TuAPxWYAATrbO%3Fformat%3Djpg%26name%3Dlarge)

Handling long texts has always been a challenge for large models. Even with a large context window, performance degrades once you stuff in hundreds of thousands of tokens. Alex Zhang and Omar Khattab of MIT CSAIL recently proposed a new approach: Recursive Language Models (RLMs).

The core idea of RLMs is simple: instead of having the model swallow the entire long text at once, the model works the way a programmer does in a REPL environment, decomposing the input and processing it interactively. From the user's perspective it still looks like a regular model call, but internally the model recursively spawns sub-calls for intermediate computations.
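The outside-in shape of this idea can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: `llm_call` is a placeholder for any chat-completion API, and the chunking and depth limits are arbitrary choices for the sketch.

```python
# Hypothetical sketch of an RLM-style call: from the caller's side it is
# one function call, but internally the root model may recurse over
# slices of the context before combining the partial answers.

def llm_call(prompt: str) -> str:
    """Placeholder for a real language-model API call."""
    return f"<answer to: {prompt[:40]}...>"

def rlm_call(context: str, query: str, depth: int = 0, max_depth: int = 1) -> str:
    # Small inputs (or maximum depth reached): answer directly.
    if len(context) < 2_000 or depth >= max_depth:
        return llm_call(f"Context:\n{context}\n\nQuestion: {query}")

    # Otherwise decompose: recurse on chunks of the context, then
    # combine the partial answers with one final call.
    chunk_size = len(context) // 4
    chunks = [context[i:i + chunk_size] for i in range(0, len(context), chunk_size)]
    partials = [rlm_call(c, query, depth + 1, max_depth) for c in chunks]
    return llm_call(
        "Combine these partial answers:\n" + "\n".join(partials)
        + f"\n\nQuestion: {query}"
    )
```

In the actual paper the root model decides for itself how to slice and query the context via the REPL, rather than following a fixed split like this one.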

![This is a flowchart showing the process of a language model from a user/API perspective. The chart is divided into two parts, each showing the input-output process. The top part displays a simple flow where 'context' and 'query' are passed as inputs to a module labeled 'Language Model', then output as 'response'. The bottom part shows a similar flow but includes 'RLM' (likely referring to a type of language model or related component), also receiving 'context' and 'query' as inputs and generating 'response'. The chart uses distinct colors to differentiate elements, such as yellow for 'context', pink for 'query' and 'response', green for 'Language Model', and blue for 'RLM'.](https://wink.run/image?url=https%3A%2F%2Fpbs.twimg.com%2Fmedia%2FG3TuJwDWkAAb5Ic%3Fformat%3Dpng%26name%3Dlarge)

The implementation is essentially a Jupyter-like REPL environment. The user's prompt is stored in a Python variable, and the root model interacts with it through REPL loops instead of reading the entire content directly.
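The mechanic can be sketched as follows, assuming a sandboxed `exec` over an environment dict (the names `env` and `execute_code` are illustrative, not the authors' code). The key point is that the root model only ever sees the results of code it asks the environment to run, never the raw context.

```python
# Sketch of the REPL loop: the full prompt lives in a Python variable,
# and the model emits code to inspect it piecewise.
env = {"context": "The quick brown fox jumps over the lazy dog. " * 1000}

def execute_code(code: str) -> str:
    """Run model-emitted code against the environment, return its output."""
    local = dict(env)
    exec(code, {}, local)
    return str(local.get("result", ""))

# A root-model turn might emit calls like these instead of reading everything:
observation = execute_code("result = len(context)")         # how big is it?
preview     = execute_code("result = context[:200]")        # peek at the start
hits        = execute_code("result = context.count('fox')") # cheap search
```

A real deployment would sandbox the `exec` call and cap output sizes; the point here is only the interaction pattern.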

![This is a flowchart about 'RLM and REPL Environment in Python Notebook'. The chart shows the process from context to query to response. There is a rectangular box labeled 'context' (context), followed by a pink button-style rectangular box labeled 'query' (query). These two elements are connected by an arrow pointing to a green rectangular box containing 'Language Model' (language model) and labeled 'RLM'. To the right of the green rectangle, there is a red button-style rectangular box labeled 'response' (response). The background of the flowchart is white, and the connection lines between modules are black solid lines. Additionally, there are several small text boxes placed at different positions in the flowchart, such as 'Root LM (depth=0)', 'You are trying to answer {query}. Interact with the REPL environment, which contains the context...', 'Root LM Output: execute_code(...)', etc. These small text boxes have different colors, some with a yellow background and white text, others with a pink background and black text.](https://wink.run/image?url=https%3A%2F%2Fpbs.twimg.com%2Fmedia%2FG3TuOYOXQAA3m39%3Fformat%3Djpg%26name%3Dlarge)

How does it perform? On the OOLONG long-context benchmark with 132k-token sequences, an RLM built on GPT-5-mini scores more than double GPT-5's accuracy. Counterintuitively, it also costs less per query.

![This is a chart showing the performance comparison of different methods on the OOLONG benchmark. The chart is divided into two parts, the top part showing the percentage score (left bar chart) and average query cost (right bar chart) for 132k Token Context, and the bottom part showing the percentage score (left bar chart) and average query cost (right bar chart) for 263k Token Context. The names of each method and their corresponding scores or costs are clearly labeled on the chart. The left chart in the top part shows the score performance of GPT-5, GPT-5-mini, RLM(GPT-5-mini), RLM(GPT-5 w/o sub-calls), and ReACT + GPT-5 + BM25. Among them, RLM(GPT-5-mini) achieves the highest score of 64.9%. The right chart compares the average query costs of these methods, showing GPT-5mini with the lowest cost of $0.033. The left chart in the bottom part shows the score performance of the same five methods under 263k Token Context, where RLM(GPT-5-mini) still performs excellently with a score of 51.1%. The right chart compares their average query costs, again showing GPT-5mini with the lowest cost of $0.041.](https://wink.run/image?url=https%3A%2F%2Fpbs.twimg.com%2Fmedia%2FG3TuS8xXgAA1wwK%3Fformat%3Djpg%26name%3Dlarge)

The reason is that GPT-5 attends over the entire context (e.g., 270k tokens) in full, while the RLM selects which parts to feed to each sub-model, so far fewer tokens are processed in total. This also lets it scale easily to more than 1,000 documents.
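Back-of-the-envelope arithmetic makes the cost argument concrete. The 270k figure comes from the text; the chunk counts and sizes below are illustrative assumptions, not numbers from the paper.

```python
# Tokens processed: one direct call vs. an RLM-style decomposition.
CONTEXT_TOKENS = 270_000

# Direct call: the big model attends over everything.
direct_tokens = CONTEXT_TOKENS

# RLM (illustrative): the root model's peeks and search results total
# ~5k tokens, and only 3 relevant chunks of ~8k each go to sub-calls.
root_tokens = 5_000
sub_tokens = 3 * 8_000
rlm_tokens = root_tokens + sub_tokens

print(direct_tokens, rlm_tokens)  # 270000 29000 (roughly 9x fewer)
```

Since sub-calls can also run on a cheaper model like GPT-5-mini, the per-token savings compound with the per-call savings.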

In the BrowseComp-Plus test, with prompts exceeding 10 million tokens, the RLM's ability to answer composite queries not only held up but improved, outperforming even explicit indexing/retrieval methods.

![This is a chart containing two sub-charts. The left sub-chart is titled 'Score on BrowseComp-Plus vs # Context Docs' and shows how the correct answer rate changes as the number of context documents increases. The right sub-chart is titled 'Average API Cost($) per Query vs # Context Docs' and displays how the average API cost per query changes as the number of context documents increases. The left sub-chart has the x-axis representing the number of context documents (ranging from 10 to 1000) and the y-axis representing the correct answer rate (in percentage). Different lines represent different models or methods, including GPT-5, GPT-5 (Truncated), GPT-5 + Pre-query BM25(k=40), ReACT(GPT-5)+BM25, RLM(GPT-5), and RLM(GPT-5) w/o sub-calls. These lines show the performance of each model as the number of context documents increases. The right sub-chart has the same x-axis and y-axis as the left one, representing the number of context documents (ranging from 10 to 1000) and the average API cost (in dollars), respectively. Similar to the left sub-chart, different lines represent different models or methods and also show how the API cost of each model changes as the number of context documents increases. Overall, these two sub-charts together illustrate the trade-offs between performance and cost of various models under different numbers of context documents.](https://wink.run/image?url=https%3A%2F%2Fpbs.twimg.com%2Fmedia%2FG3TuXkuWQAASWku%3Fformat%3Djpg%26name%3Dlarge)

Of course, there are also challenges. Latency variance is large: the lower bound is roughly the latency of two model calls, while the upper bound can stretch to minutes, depending on how fast the emitted code runs and how complex the sub-queries are. The output mechanism also needs work, as the model sometimes gets confused about whether to output the answer directly or to reference a variable in the REPL environment.

Still, the idea that a small model armed with a recursive strategy can beat a large model is itself worth pondering. The paper and more details are available [on the project blog](https://alexzhang13.github.io/blog/2025/rlm/).


Published: 2025-10-15 22:32