ME News reports that on April 14 (UTC+8), according to monitoring by 1M AI News, AI programming agents that run the same task multiple times often produce different solutions, some correct and some not. If the best solution can be selected automatically, the overall success rate can exceed that of a single run. The challenge is the selection itself: the current mainstream approach uses another model as a judge to score the solutions (LLM-as-a-Judge), but this method lacks granularity and frequently assigns identical scores to different solutions, making it hard to rank them.

Stanford AI Lab and Berkeley Sky Computing Lab, in collaboration with NVIDIA, have proposed LLM-as-a-Verifier to improve this selection process. Instead of relying solely on the judge's final score, the Verifier reads the model's probability distribution across all scoring levels and computes a continuous reward value. The judge also evaluates each solution multiple times and averages the results to reduce random bias, and the overall assessment is broken into three independent dimensions: whether the task requirements are met, whether the output format is correct, and whether any error signals are present.

In experiments using Gemini 2.5 Flash as the verifier, single-run selection accuracy reached 74.7%, compared to 57.0% for the traditional judge; after 16 repetitions, the verifier achieved 77.4% versus 70.2% for the judge. The traditional judge produced ties in 26.5% of comparisons, whereas the verifier had a 0% tie rate across all configurations.

In practical applications: on Terminal-Bench 2, running GPT-5.4 five times on the same task and selecting one at random yielded an 81.8% success rate; using the Verifier to select raised this to 86.4%. On SWE-Bench Verified, selecting among one solution each from Claude Opus 4.5, Claude Opus 4.6, and Gemini 3 Flash (three solutions in total) raised the success rate from 76.1% to 77.8%.
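The mechanism described above can be sketched in a few lines. This is a minimal illustration, not the released framework: the score scale, the number of evaluations, and the `judge_distribution` stub (which stands in for reading a real judge model's token probabilities, e.g. via API logprobs) are all assumptions for illustration.

```python
import random

SCORE_LEVELS = [1, 2, 3, 4, 5]  # assumed discrete rating scale

def judge_distribution(solution, dimension):
    # Stand-in for the judge model's probability distribution over the
    # score levels for one dimension; a real system would derive this
    # from the model's token logprobs rather than random numbers.
    weights = [random.random() for _ in SCORE_LEVELS]
    total = sum(weights)
    return [w / total for w in weights]

def expected_score(level_probs):
    # Continuous reward: the expectation over the full score-level
    # distribution, instead of the single discrete score a plain
    # LLM-as-a-Judge would emit.
    return sum(level * p for level, p in zip(SCORE_LEVELS, level_probs))

def verify(solution, n_evals=4):
    # Average several evaluations across three independent dimensions
    # (requirements met, format correct, no error signals) to reduce
    # random bias in any single judgment.
    dimensions = ["requirements", "format", "errors"]
    total = 0.0
    for _ in range(n_evals):
        for dim in dimensions:
            total += expected_score(judge_distribution(solution, dim))
    return total / (n_evals * len(dimensions))

def select_best(solutions, n_evals=4):
    # Best-of-N selection: keep the candidate with the highest reward.
    return max(solutions, key=lambda s: verify(s, n_evals))
```

Because the reward is a real-valued expectation rather than a discrete label, two candidates almost never receive exactly the same score, which matches the reported drop in tie rate from 26.5% to 0%.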
As of the method's release on April 9, both results ranked first on their respective benchmarks. The framework has been open-sourced. (Source: BlockBeats)
Stanford and Berkeley propose LLM-as-a-Verifier, topping Terminal-Bench and SWE-Bench
News on April 14 (UTC+8) highlights Stanford AI Lab and Berkeley Sky Computing Lab, in collaboration with NVIDIA, proposing LLM-as-a-Verifier to improve AI solution selection. The method derives continuous rewards from rating distributions across repeated evaluations, achieving 77.4% accuracy after 16 repetitions, outperforming the traditional LLM-as-a-Judge at 70.2%. On Terminal-Bench 2 and SWE-Bench Verified, success rates reached 86.4% and 77.8%, respectively, making it the top-performing approach as of April 9. The framework has now been open-sourced.
Disclaimer: The information on this page may have been obtained from third parties and does not necessarily reflect the views or opinions of KuCoin. This content is provided for general informational purposes only, without any representation or warranty of any kind, nor shall it be construed as financial or investment advice. KuCoin shall not be liable for any errors or omissions, or for any outcomes resulting from the use of this information.
Investments in digital assets can be risky. Please carefully evaluate the risks of a product and your risk tolerance based on your own financial circumstances. For more information, please refer to our Terms of Use and Risk Disclosure.