Synthetic data could help evaluate RAG systems, researchers find

A team of Dutch researchers says synthetic data generated by large language models could ease an impending shortage of high-quality data—at least for evaluating retrieval-augmented generation (RAG) systems. Their early results suggest machine-made benchmarks can be good enough to tune RAGs, provided the synthetic tasks closely mirror real-world ones.

The work, led by Jonas van Elburg of the University of Amsterdam’s IR Lab, will be presented at the SynDAiTE workshop in Porto next week. The event will debate whether synthetic datasets can help offset a projected decline in “fresh” training data: the supply of text data may become constrained by 2050, and image data could face similar limits by 2060, potentially slowing AI progress.

What the study tested

The researchers asked whether synthetic question–answer (QA) sets produced by LLMs can stand in for human-labeled benchmarks when those are scarce or expensive. The short answer: often, but not always. According to the paper, synthetic benchmarks “provide a reliable signal when tuning retrieval parameters” when the generated tasks match human tasks in both format and difficulty. Where things got shaky was cross-model consistency: results diverged when comparing benchmarks produced by different generator architectures.

“As such, synthetic benchmarks should not be treated as universally reliable, but rather as tools whose validity depends on the alignment between task design, metric choice, and evaluation target,” the authors write.

Why it matters for RAG builders

RAG systems augment a model’s responses by retrieving relevant information from a document corpus—insurance policies, HR manuals, or technical documentation—before generating an answer. That makes them attractive for enterprises that want domain-specific assistants without end-to-end model retraining.
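For readers new to the pattern, the sketch below shows the retrieve-then-generate loop in miniature. The toy corpus, the term-overlap retriever, and the placeholder `generate_answer` function are illustrative assumptions only; a production system would use a proper vector index and a real LLM call.

```python
# Minimal retrieve-then-generate sketch (illustrative only).
# The corpus, retriever, and generate_answer() are stand-ins,
# not any particular vendor's implementation.

from collections import Counter

CORPUS = {
    "policy.txt": "Claims must be filed within 30 days of the incident.",
    "hr_manual.txt": "Employees accrue 1.5 vacation days per month worked.",
    "tech_doc.txt": "The backup job runs nightly at 02:00 UTC.",
}

def retrieve(query: str, corpus: dict, top_k: int = 2) -> list[str]:
    """Rank documents by crude term overlap with the query."""
    q_terms = Counter(query.lower().split())
    scored = []
    for name, text in corpus.items():
        d_terms = Counter(text.lower().split())
        overlap = sum((q_terms & d_terms).values())
        scored.append((overlap, name))
    scored.sort(reverse=True)
    return [name for _, name in scored[:top_k]]

def generate_answer(query: str, context: list[str]) -> str:
    """Placeholder for the generation step: a real system would call an LLM
    with the retrieved passages prepended to the prompt."""
    passages = " ".join(CORPUS[name] for name in context)
    return f"[LLM answer grounded in: {passages}]"

query = "How many vacation days do employees get?"
sources = retrieve(query, CORPUS)
print(sources)
print(generate_answer(query, sources))
```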

“RAGs are very useful systems, because you can ask questions about a particular topic, as long as you have a stack of documents,” Pegasystems AI lab director and chief scientist Peter van der Putten told Blocks and Files. “The appeal is that you can build these kind of smart chatbots without too much effort. Just find a stack of documents and put a RAG on top of it.”

The catch: you still need trustworthy evaluation to know whether your RAG retrieves the right sources and produces faithful answers. That typically means building and maintaining “golden” test sets—labor-intensive, expensive, and often too narrow to cover real user questions.

“Maintaining such a reference set of golden truth answers is a lot of work,” van der Putten said. “Because it’s a lot of work, there’s also poor coverage.” He added that this bottleneck frequently delays deploying “knowledge buddies” in production.

Synthetic QA data could accelerate that evaluation step. By automatically generating large volumes of questions and ground-truth answers from a known corpus, teams can quickly tune retrieval parameters, compare reranking strategies, and spot regressions—so long as the synthetic tasks faithfully reflect the real ones they aim to measure. Van der Putten said the team has “some hunches” on how to improve generation methods for greater consistency across different LLMs, and expects the best results to come from mixing synthetic and human-curated data.
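As a rough illustration of that workflow, the sketch below sweeps a single retrieval parameter (top_k) against a handful of synthetic question–answer pairs and reports a hit rate. The corpus, the questions, and the trivial keyword retriever are made-up stand-ins, not the researchers’ benchmark or their generation method.

```python
# Sketch: using a synthetic QA set to tune a retrieval parameter (top_k).
# All data and the retriever below are illustrative placeholders.

import re
from collections import Counter

CORPUS = {
    "policy.txt": "Claims must be filed within 30 days of the incident.",
    "hr_manual.txt": "Employees accrue 1.5 vacation days per month worked.",
    "tech_doc.txt": "The backup job runs nightly at 02:00 UTC.",
}

# Synthetic QA pairs: (generated question, document that should be retrieved).
SYNTHETIC_QA = [
    ("Within how many days must a claim be filed?", "policy.txt"),
    ("How many vacation days accrue each month?", "hr_manual.txt"),
    ("When does the nightly backup job run?", "tech_doc.txt"),
]

def tokens(text: str) -> Counter:
    """Lowercase, punctuation-free bag of words."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query: str, top_k: int) -> list[str]:
    q = tokens(query)
    scored = sorted(
        ((sum((q & tokens(t)).values()), n) for n, t in CORPUS.items()),
        reverse=True,
    )
    return [n for _, n in scored[:top_k]]

def hit_rate(top_k: int) -> float:
    """Fraction of synthetic questions whose expected source shows up in the top_k results."""
    hits = sum(expected in retrieve(q, top_k) for q, expected in SYNTHETIC_QA)
    return hits / len(SYNTHETIC_QA)

for k in (1, 2, 3):
    print(f"top_k={k}: hit rate {hit_rate(k):.2f}")
```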

The looming data crunch

Beyond RAG evaluation, the researchers’ work touches a broader worry: the supply of high-quality, diverse, legally usable data for training frontier models. Van der Putten pointed to several pressures, including copyright constraints, limited availability of sensitive or specialized datasets, and the need for deeper historical coverage in certain domains where records are incomplete.

Those headwinds could amplify demand for synthetic data as a stopgap or supplement. But enterprises shouldn’t confuse “more data” with “solved.” Synthetic corpora introduce new governance obligations.

Data governance will have to evolve

If synthetic data becomes another major category of enterprise information, it must be managed like any other critical asset. That means clear labeling, provenance tracking, and auditability.

“You need to know that this was synthetic data and not real,” van der Putten said. “And you also need to know how it was generated.”

In practice, that implies adding synthetic-aware metadata and lineage to data catalogs; documenting the generator model, prompts, seeds, and post-processing steps; and ensuring downstream systems can differentiate synthetic from human-derived inputs. Data professionals will need new skills that blend MLOps, data engineering, evaluation science, and compliance.
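Concretely, such a lineage record could be as small as a structured object attached to each synthetic dataset and serialized into the catalog. The field names below are an assumption about what a record might capture, not an established standard or any particular researcher’s or vendor’s schema.

```python
# Sketch of a lineage record for a synthetic dataset.
# Field names are illustrative assumptions, not a formal standard.

from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
from typing import Optional
import json

@dataclass
class SyntheticDatasetRecord:
    dataset_id: str
    is_synthetic: bool            # explicit flag so downstream systems can filter
    generator_model: str          # which LLM produced the data
    prompt_template: str          # how generation was instructed
    random_seed: Optional[int]    # for reproducibility where the generator supports it
    post_processing: list[str] = field(default_factory=list)
    source_corpus: str = ""       # provenance: which real documents were used
    created_at: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

record = SyntheticDatasetRecord(
    dataset_id="rag-eval-qa-v1",
    is_synthetic=True,
    generator_model="example-llm-7b",  # hypothetical model name
    prompt_template="Generate a question answerable from: {passage}",
    random_seed=42,
    post_processing=["dedup", "length-filter"],
    source_corpus="hr-manuals-2024",
)

# Emit as JSON so it can be attached to a data catalog entry.
print(json.dumps(asdict(record), indent=2))
```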

Practical takeaways

  • Use synthetic QA to speed up RAG evaluation, but benchmark against at least a small, high-quality human set to validate alignment (a minimal check is sketched after this list).
  • Keep generator choice consistent when comparing runs; different LLMs can yield materially different synthetic benchmarks.
  • Design synthetic tasks to mirror production queries in domain, difficulty, and format to avoid misleading signals.
  • Institutionalize lineage: tag synthetic data, record generation details, and surface this metadata to all consumers.
  • Plan for a hybrid pipeline: synthetic data for breadth and iteration speed, human curation for fidelity and coverage gaps.
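On the first point, a minimal alignment check might simply ask whether the synthetic and human benchmarks rank candidate retrieval configurations the same way. The scores below are invented placeholders; in practice they would come from running each configuration against both evaluation sets.

```python
# Sketch: do a synthetic benchmark and a small human-labeled set agree on
# how retrieval configurations rank? Scores are made-up placeholders.

from itertools import combinations

# Hypothetical hit rates per configuration on each benchmark.
synthetic_scores = {"top_k=1": 0.61, "top_k=3": 0.78, "top_k=5": 0.74}
human_scores     = {"top_k=1": 0.58, "top_k=3": 0.80, "top_k=5": 0.71}

def pairwise_agreement(a: dict, b: dict) -> float:
    """Fraction of configuration pairs ordered the same way by both benchmarks
    (a simple Kendall-style agreement measure)."""
    pairs = list(combinations(a, 2))
    agree = sum((a[x] - a[y]) * (b[x] - b[y]) > 0 for x, y in pairs)
    return agree / len(pairs)

print(f"rank agreement: {pairwise_agreement(synthetic_scores, human_scores):.2f}")
print("same winner:", max(synthetic_scores, key=synthetic_scores.get) ==
      max(human_scores, key=human_scores.get))
```

If the two sets disagree on ordering, the synthetic benchmark needs rework before it can be trusted for tuning.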

The bottom line: synthetic data isn’t a universal substitute for human-labeled benchmarks, but it’s already good enough to make RAG evaluation faster and cheaper—if you respect its limits and invest in governance. With the AI data supply tightening, that disciplined blend of synthetic and human signals may be the difference between shipping robust assistants and getting stuck in evaluation limbo.
