Wink Pings

A Small Model Trained for $196 Outperforms GPT-4.1 on Tabular Data Extraction

A new study reveals that a 7-billion parameter model, specialized for document information extraction, outperformed large general models like GPT-4.1 on 1,000 tasks, with a training cost of only $196. The key breakthrough is its ability to solve the problem of scattered information within long documents.

A small model specializing in tables and documents has outperformed giants like GPT-4.1 on the task of extracting structured data.

The research team trained a 7-billion parameter model named Extract-0 for just $196. On 1,000 held-out test tasks, it achieved an average reward score of 0.573 and produced valid JSON in 89% of its outputs, surpassing the performance of GPT-4.1 and other models.
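To make the task concrete, here is a toy illustration of what document information extraction looks like: given free text and a target schema, the model must emit JSON that both parses and matches the schema. The document, schema, and field names below are invented for illustration; they are not from the paper.

```python
import json

# Hypothetical extraction task: free text in, schema-conforming JSON out.
document = (
    "Invoice #4821 was issued to Acme Corp on 2024-03-15. "
    "The total amount due is $12,400."
)
schema = {"invoice_number": "string", "customer": "string",
          "issue_date": "string", "total": "string"}

# A correct model completion for this task might be:
output = ('{"invoice_number": "4821", "customer": "Acme Corp", '
          '"issue_date": "2024-03-15", "total": "$12,400"}')

parsed = json.loads(output)        # must parse (the paper reports 89% valid JSON)
assert set(parsed) == set(schema)  # and must cover exactly the requested fields
print(parsed["customer"])          # → Acme Corp
```

The two checks mirror the two failure modes the benchmark cares about: malformed JSON, and JSON that parses but does not answer the schema.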

![Screenshot of the paper "Extract-0: A Specialized Language Model for Document Information Extraction" (arxiv.org/abs/2509.22906), authored by Henrique Godoy, São Paulo, Brazil.](https://wink.run/image?url=https%3A%2F%2Fpbs.twimg.com%2Fmedia%2FG2iFDDsXgAAoVEZ%3Fformat%3Djpg%26name%3Dlarge)

The key to its success lies in solving a core challenge of long-document information extraction: correlating information scattered across different pages. For example, a name might appear on the first page, while the associated date and amount are on the fifth.

Their method involved generating specialized synthetic training data, which gave the model a memory of information across document 'chunks.' Think of it like sticky notes a librarian uses to remember key points from different sections. This allows the model to connect names, dates, and values that appear far apart.
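The cross-chunk idea can be sketched in a few lines. This is not the paper's implementation: regexes stand in for the model, and the function names and fields are invented. The point is the running "memory" dict, which lets a field found on an early page be joined with fields found pages later.

```python
import re

def extract_from_chunk(chunk: str, memory: dict) -> dict:
    """Toy extractor: the memory dict carries earlier findings forward,
    so a name on page 1 can be linked to an amount on page 5."""
    if m := re.search(r"Payee:\s*([A-Za-z ]+)", chunk):
        memory["payee"] = m.group(1).strip()
    if m := re.search(r"\$([\d,]+)", chunk):
        memory["amount"] = m.group(1)
    if m := re.search(r"\d{4}-\d{2}-\d{2}", chunk):
        memory["date"] = m.group(0)
    return memory

chunks = [
    "Page 1. Payee: Jane Doe, see schedule below.",
    "Page 5. Payment of $12,400 is due on 2024-03-15.",
]
memory: dict = {}
for chunk in chunks:
    memory = extract_from_chunk(chunk, memory)
print(memory)  # one record assembled from fields scattered across chunks
```

Processing each chunk independently would lose the payee–amount link; threading state through is what the synthetic training data teaches the model to do implicitly.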

Technically, the team fine-tuned the model with Low-Rank Adaptation (LoRA), training only 0.53% of its parameters. They then applied Group Relative Policy Optimization (GRPO) with a semantic reward and strict JSON validation. The reward is flexible: it accepts different phrasings as long as the meaning is consistent.
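A minimal sketch of such a reward function, under stated assumptions: strict JSON validation gates a soft, field-by-field comparison, so exact wording is not required. Here `difflib.SequenceMatcher` stands in for the paper's semantic similarity measure, which is not specified in the article.

```python
import json
from difflib import SequenceMatcher

def reward(completion: str, target: dict) -> float:
    """Sketch of a GRPO-style reward: invalid JSON scores zero; valid JSON
    is scored by per-field string similarity (a stand-in for semantics)."""
    try:
        pred = json.loads(completion)      # strict validation gate
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(pred, dict) or not target:
        return 0.0
    scores = []
    for key, want in target.items():
        got = str(pred.get(key, ""))
        # different phrasings earn partial credit instead of zero
        scores.append(SequenceMatcher(None, got.lower(),
                                      str(want).lower()).ratio())
    return sum(scores) / len(scores)

target = {"customer": "Acme Corp", "total": "$12,400"}
print(reward('{"customer": "Acme Corporation", "total": "$12,400"}', target))
print(reward("not json at all", target))  # → 0.0
```

GRPO would sample a group of completions per prompt, score each with a reward like this, and update the policy toward the higher-scoring ones relative to the group average.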

This result points to a broader trend: the focus of the AI race may be shifting from 'who is bigger' to 'who is more specialized.' For companies with specific document processing workflows, this could mean no longer paying for general, token-based APIs and instead having a low-cost expert model tailored to their business.

Of course, questions remain, such as how the model handles ambiguous data or edge cases not present in its training data, and its code has not yet been released on GitHub. Nevertheless, this 'small model, big power' concept opens up new possibilities for the efficient application of AI in specific business scenarios, especially on-device AI tasks.

The paper is titled "Extract-0: A Specialized Language Model for Document Information Extraction" and is available on arXiv.

Published: 2025-10-06 12:08