

Validation Agent?

A while back, my team and I were exploring how to use the most lightweight model possible to perform quick fact-checking before delivering responses to end users. Our goal was to reach that final 99.9% accuracy in our overall system. At the time, we were thinking about building a small, specialized AI assistant whose only job would be to verify facts against our data sources.

Then I came across a paper from Microsoft Research that takes a completely different approach to this same challenge. Let's break down what makes this research so interesting.

The paper is called "Towards Effective Extraction and Evaluation of Factual Claims" and it tackles a fundamental problem: when large language models create long pieces of text, how do we effectively pull out the factual claims that need to be checked? Even more importantly, how do we determine whether our extraction methods are actually any good?

Think of it like trying to identify specific ingredients in a complex recipe. You need not only to find them but also to make sure you're identifying them correctly and completely.

The authors address this challenge with two main solutions:

First, they create what you might think of as a "grading rubric" for fact extraction. Just as a teacher needs a standardized way to grade student essays, researchers need a consistent framework to evaluate how well different methods extract factual claims. This framework includes automated ways to test these methods at scale, which is crucial when you're dealing with large volumes of text.

Second, they introduce novel ways to measure two critical aspects: coverage and decontextualization. Let me explain these concepts:

  • Coverage is like asking, "Did we find all the important facts?" It measures how completely the extracted claims represent the factual statements in the original text.
  • Decontextualization ensures that each extracted claim can stand on its own. Think of it as making sure a quote makes sense even when you take it out of its original paragraph.
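To make coverage concrete, here is a toy sketch (my own illustration, not the paper's actual metric): treat coverage as the fraction of reference facts that show up among the extracted claims. The paper evaluates this with far more sophisticated, automated LLM-based checks; this version uses exact string matching purely to convey the idea.

```python
def coverage(extracted_claims, reference_facts):
    """Toy coverage score: fraction of reference facts found among
    the extracted claims. Exact-match is used only for illustration;
    a real evaluator would check semantic equivalence/entailment."""
    if not reference_facts:
        return 1.0
    found = sum(1 for fact in reference_facts if fact in extracted_claims)
    return found / len(reference_facts)

reference = [
    "Paris is the capital of France",
    "France is a member of the EU",
]
extracted = ["Paris is the capital of France"]
print(coverage(extracted, reference))  # 0.5 -> half the facts were recovered
```

A decontextualization check would then ask, for each extracted claim, whether it still makes sense with the surrounding paragraph removed (e.g., "It performed well" fails; "The 2023 model performed well on benchmark X" passes).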

The paper also presents "Claimify," which is their extraction method. What makes Claimify special is its cautious approach to ambiguity. Rather than making questionable extractions, it only pulls out claims when it's highly confident about the correct interpretation. This is like a careful student who would rather leave a question blank than risk writing something incorrect.
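That cautious policy can be sketched in a few lines. Everything below (the `Candidate` type, the confidence field, the 0.9 threshold) is my own illustrative assumption, not Claimify's actual implementation; the point is simply that ambiguous candidates are dropped rather than guessed at.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    claim: str
    confidence: float  # hypothetical: model's confidence in its interpretation

def extract_claims(candidates, threshold=0.9):
    """Sketch of a Claimify-style policy: emit a claim only when
    confidence in the interpretation clears a high bar; ambiguous
    candidates are skipped rather than extracted incorrectly."""
    return [c.claim for c in candidates if c.confidence >= threshold]

candidates = [
    Candidate("The central bank raised rates in 2023", 0.95),
    Candidate("It performed well", 0.40),  # ambiguous referent -> skipped
]
print(extract_claims(candidates))  # ['The central bank raised rates in 2023']
```

The design trade-off is deliberate: a lower threshold extracts more claims but risks misinterpretations; a higher one sacrifices some coverage for precision.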

The real breakthrough here is that by creating both a standardized evaluation framework and a high-performing extraction method, the researchers are helping the entire field work more effectively and reliably. This is particularly valuable for improving how we identify and verify factual claims in AI-generated text.

Claimify's Key Feature: A notable characteristic of Claimify is its ability to handle ambiguity in the source text. It is designed to extract claims "only when there is high confidence in the correct interpretation of the source text." In other words, Claimify prioritizes accuracy and avoids extracting potential misinterpretations.

