If you’ve ever used a generative artificial intelligence tool, it has lied to you. Probably multiple times.

These recurring fabrications are often called AI hallucinations, and developers are feverishly working to make generative AI tools more reliable by reining in these unfortunate fibs. One of the most popular approaches to reducing AI hallucinations—and one that is quickly growing more popular in Silicon Valley—is called retrieval augmented generation.

The RAG process is quite complicated, but on a basic level it augments your prompts by gathering info from a custom database, and then the large language model generates an answer based on that data. For example, a company could upload all of its HR policies and benefits to a RAG database and have the AI chatbot just focus on answers that can be found in those documents.
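To make the retrieve-then-generate idea concrete, here is a minimal sketch in Python. It is only an illustration: the HR snippets, the document ids, and the function names are all made up for this example, and the word-overlap scoring is a toy stand-in for the search engines real RAG systems use. The sketch stops at building the augmented prompt; sending it to a language model is left out.

```python
import re
from collections import Counter

# Hypothetical HR snippets standing in for a company's uploaded documents.
DOCUMENTS = {
    "pto-policy": "Full-time employees accrue 15 days of paid time off per year.",
    "remote-work": "Employees may work remotely up to three days per week.",
    "health-plan": "The health plan covers dental and vision after 90 days.",
}

def tokenize(text):
    """Split text into lowercase words and numbers."""
    return re.findall(r"[a-z0-9]+", text.lower())

def retrieve(question, documents, top_k=2):
    """Rank documents by how many question words they share (toy scoring)."""
    question_words = Counter(tokenize(question))
    scored = []
    for doc_id, text in documents.items():
        overlap = sum((question_words & Counter(tokenize(text))).values())
        if overlap > 0:
            scored.append((overlap, doc_id))
    # Highest overlap first; ties broken alphabetically by id.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:top_k]]

def build_prompt(question, documents, top_k=2):
    """Augment the question with the retrieved passages, labeled by id."""
    ids = retrieve(question, documents, top_k)
    context = "\n".join(f"[{doc_id}] {documents[doc_id]}" for doc_id in ids)
    return (
        "Answer using only the sources below and cite them by id.\n"
        f"Sources:\n{context}\n"
        f"Question: {question}"
    )

prompt = build_prompt("How many days of paid time off do employees get?", DOCUMENTS)
print(prompt)
```

The language model never sees the whole database, only the handful of passages the retrieval step selects, which is why the quality of that step matters so much.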

So, how is this process different from a standard ChatGPT output? I asked Pablo Arredondo, a vice president of CoCounsel at Thomson Reuters, who has been using the RAG method to develop aspects of an AI tool for legal professionals. “Rather than just answering based on the memories encoded during the initial training of the model,” he says, “you utilize the search engine to pull in real documents, whether it’s case law, articles, or whatever you want, and then anchor the response of the model to those documents.”
For instance, we could upload the entirety of WIRED’s history, all of the print magazines and web articles since 1993, to a private database and build a RAG implementation that references these documents when answering reader questions. By giving the AI tool a narrow focus as well as quality information, the RAG-supplemented chatbot would be more adept than a general purpose chatbot at answering questions about WIRED and relevant topics. Would it still make mistakes and sometimes misinterpret the data? Absolutely. But the odds of it fabricating entire articles that never existed would definitely go down.
“You’re rewarding it, in the way that you train the model, to try to write something where every factual claim can be attributed back to a source,” says Patrick Lewis, an AI modeling lead at Cohere who helped develop the concept of RAG a few years ago. If you teach the model to effectively sort through the provided data and use citations in every output, then the AI tool is less likely to make egregious mistakes.

Exactly how much RAG reduces AI hallucinations, though, is a point of contention for researchers and developers. Lewis carefully chose his words during our conversation, describing RAG outputs as “low hallucination” rather than hallucination-free. The process is definitely not some panacea that eliminates every mistake made by AI.

During conversations with multiple experts, it became clear that just how much RAG lowers hallucinations depends on two core things: the quality of the overall RAG implementation, and how you decide to define AI hallucinations, a sometimes fuzzy term without a firm definition.
To start off, not all RAGs are of the same caliber. The accuracy of the content in the custom database is critical for solid outputs, but that isn’t the only variable. “It’s not just the quality of the content itself,” says Joel Hron, a global head of AI at Thomson Reuters. “It’s the quality of the search, and retrieval of the right content based on the question.” Mastering each step in the process is critical since one misstep can throw the model completely off.

“Any lawyer who’s ever tried to use a natural language search within one of the research engines will see that there are often instances where semantic similarity leads you to completely irrelevant materials,” says Daniel Ho, a Stanford professor and senior fellow at the Institute for Human-Centered AI. Ho’s research into AI legal tools that rely on RAG found a higher rate of mistakes in outputs than the companies building the models found.

Which brings us to the thorniest question in the discussion: How do you define hallucinations within a RAG implementation? Is it only when the chatbot generates a citation-less output and makes up information? Is it also when the tool may overlook relevant data or misinterpret aspects of a citation?
According to Lewis, hallucinations in a RAG system boil down to whether the output is consistent with what’s found by the model during data retrieval. The Stanford research into AI tools for lawyers broadens this definition a bit, examining whether the output is grounded in the provided data as well as whether it’s factually correct. That is a high bar for legal professionals, who are often parsing complicated cases and navigating complex hierarchies of precedent.
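The narrower of those two definitions, consistency with the retrieved text, can be roughly approximated in code. The Python sketch below is an assumption-laden toy, not how Lewis or the Stanford team measure anything: the `[source-id]` citation format, the function names, the sample policy, and the 0.5 overlap threshold are all invented for illustration. It flags sentences that carry no citation and sentences whose key words barely appear in the source they cite.

```python
import re

def content_words(text):
    """Lowercase words longer than three characters, a crude proxy for key terms."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if len(w) > 3}

def check_grounding(answer, sources, min_overlap=0.5):
    """Label each sentence 'grounded', 'unsupported', or 'uncited'."""
    report = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        cited = re.findall(r"\[([\w-]+)\]", sentence)
        if not cited:
            report.append((sentence, "uncited"))
            continue
        # Compare the claim's key words with the words of the cited sources.
        claim = content_words(re.sub(r"\[[\w-]+\]", " ", sentence))
        evidence = set()
        for source_id in cited:
            evidence |= content_words(sources.get(source_id, ""))
        ratio = len(claim & evidence) / len(claim) if claim else 0.0
        label = "grounded" if ratio >= min_overlap else "unsupported"
        report.append((sentence, label))
    return report

# Hypothetical retrieved source and model answer.
sources = {"pto-policy": "Full-time employees accrue 15 days of paid time off per year."}
answer = (
    "Employees accrue 15 days of paid time off per year [pto-policy]. "
    "Unused days convert to a cash bonus every December [pto-policy]. "
    "The office closes on Fridays."
)
labels = [label for _, label in check_grounding(answer, sources)]
print(labels)
```

A check like this says nothing about whether the source itself is correct or whether a relevant document was missed, which is the gap the broader Stanford definition is meant to capture.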
While a RAG system attuned to legal issues is clearly better at answering questions on case law than OpenAI’s ChatGPT or Google’s Gemini, it can still overlook the finer details and make random mistakes. All of the AI experts I spoke with emphasized the continued need for thoughtful human involvement throughout the process to double-check citations and verify the overall accuracy of the results.

Law is an area where there’s a lot of activity around RAG-based AI tools, but the process’s potential is not limited to a single white-collar job. “Take any profession or any business. You need to get answers that are anchored on real documents,” says Arredondo. “So, I think RAG is going to become the staple that is used across basically every professional application, at least in the near to mid-term.” Risk-averse executives seem excited about the prospect of using AI tools to better understand their proprietary data without having to upload sensitive info to a standard, public chatbot.

It’s critical, though, for users to understand the limitations of these tools, and for AI-focused companies to refrain from overpromising the accuracy of their answers. Anyone using an AI tool should still avoid trusting the output entirely, and they should approach its answers with a healthy sense of skepticism even if the answer is improved through RAG.
“Hallucinations are here to stay,” says Ho. “We do not yet have ready ways to really eliminate hallucinations.” Even when RAG reduces the prevalence of errors, human judgment remains paramount. And that’s no lie.

Read the full article at: www.wired.com