AI Writing Scaffolding for 300 Classic Novels
The LongPage dataset provides a hierarchical reasoning framework for long-form AI writing. It contains complete reasoning traces for 300 classic novels, covering everything from character archetypes to scene decomposition.
The current bottleneck for large language models in long-form writing is their lack of hierarchical planning. The LongPage dataset, at a scale of 400,000 to 600,000 tokens, annotates each classic novel with a multi-level reasoning framework: character archetypes, story arcs, world rules, scene decomposition, and metadata such as dialogue density and narrative focus.
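To make the layered annotation concrete, here is a minimal sketch of how one book's plan could be represented. The class names, field names, and the 0-to-1 dialogue density scale are illustrative assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class Scene:
    # Hypothetical per-scene metadata; the real dataset may use other names and units.
    summary: str
    dialogue_density: float  # assumed share of dialogue, 0.0 to 1.0
    narrative_focus: str


@dataclass
class BookPlan:
    # Hypothetical top-level plan mirroring the levels named in the article.
    title: str
    character_archetypes: dict[str, str]
    story_arcs: list[str]
    world_rules: list[str]
    scenes: list[Scene] = field(default_factory=list)

    def outline(self) -> str:
        """Flatten the plan from the coarsest level down to individual scenes."""
        lines = [f"Title: {self.title}"]
        lines.append(
            "Characters: "
            + "; ".join(f"{n} ({a})" for n, a in self.character_archetypes.items())
        )
        lines.append("Arcs: " + "; ".join(self.story_arcs))
        lines.append("World rules: " + "; ".join(self.world_rules))
        for i, s in enumerate(self.scenes, start=1):
            lines.append(
                f"Scene {i}: {s.summary} "
                f"[dialogue={s.dialogue_density:.2f}, focus={s.narrative_focus}]"
            )
        return "\n".join(lines)
```

Ordering the outline from book-level entries down to scenes reflects the top-down planning the article describes.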
This effectively gives AI writing a chain of thought: the traces show explicitly how to plan character development, advance plotlines, and maintain thematic coherence. For training, a model can be fine-tuned on a three-part structure (prompt, thinking process, completed book); at inference time, the same kind of trace can serve as a creative blueprint.
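The three-part structure might be assembled into a fine-tuning sample roughly as follows. This is a sketch under assumptions: the `<think>` delimiters and the `prompt`/`completion` keys are common conventions chosen for illustration, not a format confirmed by the dataset.

```python
def build_training_example(prompt: str, thinking: str, book_text: str) -> dict:
    """Combine prompt, reasoning trace, and finished book into one sample.

    The <think> tags and dictionary keys are assumed conventions.
    """
    completion = f"<think>\n{thinking.strip()}\n</think>\n\n{book_text.strip()}"
    return {"prompt": prompt.strip(), "completion": completion}


def split_completion(completion: str) -> tuple[str, str]:
    """Recover (thinking, book_text) from a completion built above."""
    head, _, tail = completion.partition("</think>")
    return head.replace("<think>", "", 1).strip(), tail.strip()
```

Placing the reasoning trace before the book text means the model learns to write out its plan first and then generate the story conditioned on it.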
Producing the dataset was not easy. The team spent two months hand-designing agent workflows and iteratively validating them with Qwen3-32B, and spent $20,000 on compute just to generate the reasoning traces for 300 books. The dataset is now open-sourced on Hugging Face, with plans to expand it to 100,000 books.
Interestingly, in the comments some researchers pointed out an inherent flaw in how LLMs analyze novels: they over-focus on surface details and neglect deeper connections, which may explain why current AI writing always feels like it is missing something. The dataset creators responded that they address this by decomposing the task down to an atomic level and designing a separate agent for each reasoning component.
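The "one agent per reasoning component" idea can be sketched as a simple dispatch over named components. Everything here is hypothetical: the function names are invented for illustration, and the dialogue counter is a trivial stand-in for what would in practice be an LLM call, not the creators' actual pipeline.

```python
from typing import Callable

# An agent takes the book text and returns its annotation for one component.
Agent = Callable[[str], str]


def run_component_agents(book_text: str, agents: dict[str, Agent]) -> dict[str, str]:
    """Run each narrowly scoped agent independently and collect the results."""
    return {name: agent(book_text) for name, agent in agents.items()}


def count_dialogue_lines(text: str) -> str:
    """Stand-in agent: counts non-empty lines that open with a double quote."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    quoted = sum(1 for ln in lines if ln.lstrip().startswith('"'))
    return f"{quoted}/{len(lines)} lines open with dialogue"
```

Because each agent sees only one narrow task, its output can be validated or replaced without touching the others.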
Beyond the technical details, the more interesting point is the insight behind the attempt: good stories are not stacked up linearly; like buildings, they need scaffolding. Once AI begins to learn how humans construct narrative structure, it may actually be able to write long-form works worth reading. The prerequisite, of course, is that it first understands the difference between scenes and acts. As one commenter complained, current AI cannot even distinguish a scene in the sense of a time and place from a scene in the sense of a plot unit.
Published: 2025-09-06 00:11