Hallucination Leaderboard: Is it hallucinating?
Vectara released a hallucination leaderboard built on their Hallucination Evaluation Model, which measures how often an LLM introduces hallucinations when summarizing a document.
But the eval might be problematic for a couple of reasons, which we highlight below.
Focus on Factual Consistency Only: The test only checks whether the information in the model's summary matches the original article; it ignores the overall quality of the summary. For example, a model could simply copy parts of the article and never make a mistake, but that doesn't mean it's producing good summaries.
Questionable Evaluation Method: They use another language model to judge whether a hallucination has occurred, but there's little detail about how this judge model works. It's unclear how it decides what's right or wrong, and whether its judgments align with what a human would think.
Problem with Adding True Facts: If the model adds true information not present in the original article (like saying "Madrid, capital of Spain" when the article only mentions "Madrid"), it's unclear whether this counts as a hallucination.
Penalizing Better Summaries: Models that produce more detailed, paraphrased summaries may be unfairly penalized. Models that mostly copy text appear to score better on this test, but that doesn't make them better summarizers.
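To see why a consistency-only metric can reward copying, here's a minimal sketch using a naive word-overlap score as a stand-in judge. This is not Vectara's actual model (their method isn't fully documented); the function name and scoring rule are our own hypothetical illustration.

```python
# Hypothetical toy "consistency" judge: fraction of summary words that
# appear in the source. NOT Vectara's model; it only illustrates how a
# consistency-only metric can favor verbatim copying over paraphrase.

def consistency_score(source: str, summary: str) -> float:
    """Return the fraction of summary tokens also found in the source."""
    source_words = set(source.lower().split())
    summary_words = summary.lower().split()
    if not summary_words:
        return 1.0  # empty summary trivially contradicts nothing
    supported = sum(1 for word in summary_words if word in source_words)
    return supported / len(summary_words)

article = "Madrid hosted the climate summit this week."
copied = "Madrid hosted the climate summit this week."
paraphrase = "The Spanish capital welcomed delegates to a climate conference."

print(consistency_score(article, copied))      # verbatim copy scores a perfect 1.0
print(consistency_score(article, paraphrase))  # accurate paraphrase scores lower
```

A verbatim copy gets a perfect score while an accurate paraphrase is penalized, which is exactly the failure mode the last two points describe.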