Large language models (LLMs) are increasingly used for tasks that require processing large amounts of information. Several companies have built specialized tools that combine LLMs with information retrieval systems to assist in legal research. However, a recent study by Stanford University researchers finds that these tools still suffer from significant rates of "hallucinations": outputs that are demonstrably false.
The study, described by the authors as the first “preregistered empirical evaluation of AI-driven legal research tools,” tested products from major legal research providers and compared them to OpenAI’s GPT-4 using over 200 manually constructed legal queries. They discovered that although hallucinations decreased compared to general-purpose chatbots, the legal AI tools still produced false information at an alarming rate.
Many legal AI tools employ retrieval-augmented generation (RAG) to address the hallucination issue. Unlike plain LLM systems, which rely solely on knowledge absorbed during training, RAG systems retrieve relevant documents from a knowledge base and use them as context for their responses. Although RAG has proven effective in many domains, legal queries often lack a single definitive answer, which makes it harder to apply. The researchers note that even with RAG, deciding what information to retrieve can be challenging, especially for novel or legally indeterminate queries.
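As a rough illustration of the pattern the study describes, and not of any vendor's actual pipeline, a retrieval-augmented query can be sketched in a few lines of Python. The corpus, the keyword-overlap scorer (a stand-in for the dense-embedding search production systems typically use) and the prompt format are all invented for this example, and the final LLM call is omitted.

```python
# Minimal sketch of a retrieval-augmented generation (RAG) loop over a tiny
# in-memory corpus. Documents, scorer and prompt shape are illustrative only.

CORPUS = {
    "smith_v_jones.txt": "holding on the statute of limitations for contract claims",
    "state_tax_code.txt": "provisions governing corporate franchise tax filings",
    "overtime_commentary.txt": "commentary on overtime exemptions for salaried employees",
}

def score(query: str, document: str) -> float:
    """Toy relevance score: share of query terms found in the document.
    Production systems typically rank with dense embeddings instead."""
    q_terms = set(query.lower().split())
    d_terms = set(document.lower().split())
    return len(q_terms & d_terms) / max(len(q_terms), 1)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the names of the k highest-scoring documents."""
    ranked = sorted(CORPUS, key=lambda name: score(query, CORPUS[name]), reverse=True)
    return ranked[:k]

def build_prompt(query: str) -> str:
    """Ground the model in the retrieved documents; the actual LLM call is omitted."""
    context = "\n\n".join(f"[{name}]\n{CORPUS[name]}" for name in retrieve(query))
    return (
        "Answer using only the sources below and cite them by name.\n\n"
        f"{context}\n\nQuestion: {query}"
    )

print(build_prompt("What is the statute of limitations for contract claims?"))
```

The grounding step is the whole point: the model is asked to answer from the retrieved sources rather than from memory, which is why retrieval quality ends up determining answer quality.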
The researchers define hallucinations in legal research as responses that are either incorrect or misgrounded—correct facts that do not apply to the specific legal context. A model is considered to produce a hallucination if it makes false statements or incorrectly asserts that a source supports a statement.
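To make those two failure modes concrete, here is a hypothetical annotation rubric in the same spirit; the class names, fields and labels below are our own and do not come from the study's materials.

```python
from dataclasses import dataclass
from enum import Enum

class Verdict(Enum):
    CORRECT = "correct"          # true statement, properly supported by its source
    INCORRECT = "incorrect"      # the statement itself is false
    MISGROUNDED = "misgrounded"  # the statement may be true, but the cited source does not back it

@dataclass
class AnnotatedResponse:
    """A manually reviewed answer from a legal research tool (hypothetical schema)."""
    statement_is_true: bool
    source_supports_statement: bool

def classify(response: AnnotatedResponse) -> Verdict:
    """A hallucination is either a false statement or a claim whose cited
    source does not actually support it."""
    if not response.statement_is_true:
        return Verdict.INCORRECT
    if not response.source_supports_statement:
        return Verdict.MISGROUNDED
    return Verdict.CORRECT

# A true statement attributed to a source that does not support it is misgrounded.
print(classify(AnnotatedResponse(statement_is_true=True, source_supports_statement=False)))
```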
The study highlights that document relevance in legal contexts isn't based solely on text similarity, which is how many RAG systems judge it. Retrieving documents that seem textually relevant but are contextually irrelevant can degrade performance. Previous research showed that general-purpose AI tools are prone to legal hallucinations, which led the researchers to test the legal tech industry's claims about supposedly "hallucination-free" RAG tools. The study found that, despite the marketing, legal RAG tools still struggle with hallucinations.
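A small, invented example shows why similarity-only retrieval can go wrong: a superseded decision that shares wording with the query can outrank the note describing the rule that actually controls. The documents, dates and overlap scorer below are fabricated for illustration.

```python
# Why text similarity alone can mislead retrieval: the older, superseded case
# shares more words with the query than the note about the controlling 2015
# amendment does. Documents and dates are invented for illustration.

DOCS = {
    "case_1998": "the limitations period for fraud claims under the statute is three years",
    "amendment_note_2015": "the 2015 amendment shortened filing deadlines for fraud actions",
}

def overlap(query: str, text: str) -> float:
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / max(len(q), 1)

query = "what is the limitations period for fraud claims under the 2015 statute"
ranked = sorted(DOCS, key=lambda name: overlap(query, DOCS[name]), reverse=True)
print(ranked[0])  # "case_1998": textually closest, yet it predates the controlling amendment
```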
The researchers created a diverse set of legal queries to simulate real-life research scenarios and tested three AI-powered legal research tools: Lexis+ AI by LexisNexis, and Westlaw AI-Assisted Research and Ask Practical Law AI, both from Thomson Reuters. Although the tools are not open source, all three rely on some form of RAG. A manual review of their outputs showed that while the tools performed better than GPT-4, they still hallucinated on 17% to 33% of the queries.
The tools also had difficulty with basic legal comprehension tasks that require close analysis of the sources they cite. And because the tools are closed, it is hard for lawyers to judge how reliable their outputs are. Despite these limitations, AI-assisted legal research still offers value over traditional keyword search methods, especially when used as an initial step.
One of the significant findings is that RAG reduces legal hallucinations compared to general-purpose AI, but it is not a cure-all. Errors can still occur in the RAG pipeline, particularly when inappropriate documents are retrieved, posing unique challenges in legal contexts.
The paper emphasizes the need for transparency and benchmarking in legal AI. Unlike general AI research, legal technology remains largely closed, with little technical information or performance data available from providers. This lack of transparency is a significant risk for legal practitioners.
The study calls for public benchmarking, an idea that has found support among industry providers. For instance, in a blog post, Mike Dahn of Westlaw Product Management at Thomson Reuters wrote that the company has rigorously tested its tool and recognizes the importance of such evaluations, despite some differences in perceived accuracy rates.
LexisNexis, also responding to the study, agrees on the need for transparency but clarifies that it does not promise perfection; rather, it focuses on minimizing hallucinations in linked legal citations and designs its tools to complement, not replace, human judgment. Jeff Pfeifer of LexisNexis pointed to ongoing development work addressing the issues raised in the study, noting that improvements are being rolled out continually.
Overall, the call for enhanced transparency and benchmarking in legal AI is a crucial step toward ensuring these tools’ reliability and effectiveness, making them a valuable asset rather than a potential liability for legal professionals.