
Web Documents Reformulated into Synthetic Data to Overcome AI Training Limits
Datology AI has launched BeyondWeb, a new framework that uses synthetic data to train language models. The approach is meant to address the shortage of high-quality training data and is claimed to be more efficient than previous methods.
As training budgets for large language models reach trillions of tokens, high-quality web data is becoming increasingly scarce. Datology AI identifies this 'wall of data' as a central challenge and presents BeyondWeb as a solution. The framework reformulates existing web documents to be more information-dense, gives them a more educational tone, and reorganizes their content to make training more effective.
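The reformulation idea can be sketched in a few lines. The snippet below is a hypothetical illustration, not Datology AI's actual pipeline: `rephrase` stands in for a call to a rephraser language model, and the style prompts are invented examples of the "information-dense" and "educational" rewrites the article describes.

```python
# Hypothetical sketch of web-document reformulation for synthetic data.
# In a real pipeline, `rephrase` would send the instruction plus the
# document to a small language model; here the call is simulated with a
# simple string tag so the data flow is visible and testable.

STYLE_PROMPTS = {
    "dense": "Rewrite the passage to be more information-dense.",
    "educational": "Rewrite the passage in a clear, educational tone.",
}

def rephrase(document: str, style: str) -> str:
    """Placeholder for an LLM call that rewrites `document` in `style`."""
    instruction = STYLE_PROMPTS[style]  # would be part of the model prompt
    return f"[{style}] {document.strip()}"

def reformulate_corpus(documents, styles=("dense", "educational")):
    """Yield one synthetic rewrite per (document, style) pair."""
    for doc in documents:
        for style in styles:
            yield rephrase(doc, style)

corpus = [
    "The mitochondrion produces ATP.",
    "Rivers erode rock over time.",
]
# Two source documents x two styles -> four synthetic documents.
synthetic = list(reformulate_corpus(corpus))
```

Generating several stylistic rewrites per source document is one way to stretch a fixed pool of web text into a larger, more varied training set.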
According to Datology AI, BeyondWeb increases accuracy by 5.1 percentage points on 8B parameter models compared to Hugging Face's Cosmopedia and by 2.6 percentage points over Nvidia's Nemotron-CC dataset.
The study also found that BeyondWeb trains significantly faster: 7.7 times quicker than open web data and 2.7 times faster than Nemotron Synthetic. In one test, a 3B parameter model trained on BeyondWeb outperformed an 8B model trained on Cosmopedia using the same token budget.
Researchers explored seven core questions around synthetic data generation. A key takeaway is that diversity is essential for sustained progress. Standard methods may aid early training, but their lack of stylistic variety leads to diminishing returns.
Another finding is that conversational style is underrepresented in web data, comprising less than 2.7 percent, despite chat being a primary use case for LLMs. Adding more conversational data helps, but gains plateau quickly.
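One way to add conversational data is to rewrite existing passages into question-and-answer form. The sketch below is a hypothetical illustration under that assumption; `ask_model` is an invented stand-in for a rephraser LLM, simulated here with a canned response so the structure is clear.

```python
# Hypothetical sketch: reformulating a web passage into conversational
# (Q&A) style, one of the formats the study found underrepresented.

def ask_model(prompt: str) -> str:
    """Placeholder for an LLM call; returns a canned Q&A for illustration."""
    passage = prompt.split("\n\n")[-1]
    return "Q: What does the passage explain?\nA: " + passage

def to_conversation(passage: str) -> str:
    """Ask the (simulated) model to turn a passage into a short dialogue."""
    prompt = (
        "Turn the passage into a short question-and-answer exchange.\n\n"
        + passage
    )
    return ask_model(prompt)

dialogue = to_conversation("Glaciers carve valleys as they move.")
```

Since gains from conversational data reportedly plateau quickly, such rewrites would be one style among several in a diverse mix rather than the whole dataset.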
Testing different model sizes, the team found that small language models can effectively generate high-quality synthetic data. Moving from 1B to 3B parameters increased data quality by 1.5 percentage points, but improvements flattened at 8B. This suggests that organizations with fewer resources can still create strong synthetic datasets.
The researchers also tested different families of reformulator models and found all produced similarly strong synthetic data. In other words, a model's overall benchmark score doesn't predict the quality of its synthetic data.