ME News reports that on April 14 (UTC+8), according to monitoring by 1M AI News, AI programming agents that run the same task multiple times often produce different solutions, some correct and some not. If the best solution can be selected automatically, the overall success rate can exceed that of a single run. The challenge is the selection itself: the current mainstream approach uses another model as a judge to score the solutions (LLM-as-a-Judge), but this method lacks granularity and frequently assigns identical scores to different solutions, making it hard to rank them.

Stanford AI Lab and Berkeley Sky Computing Lab, in collaboration with NVIDIA, have proposed LLM-as-a-Verifier to improve this selection process. Instead of relying solely on the judge's final score, the Verifier reads the model's probability distribution across all scoring levels and computes a continuous reward value. The judge also evaluates each solution multiple times and averages the results to reduce random bias, and the overall assessment is broken into three independent dimensions: whether the task requirements are met, whether the output format is correct, and whether any error signals are present.

In experiments using Gemini 2.5 Flash as the verifier, single-run selection accuracy reached 74.7%, compared to 57.0% for the traditional judge; after 16 repetitions, the verifier achieved 77.4% versus 70.2% for the judge. The traditional judge produced ties in 26.5% of comparisons, whereas the verifier had a 0% tie rate across all configurations.

In practical applications: on Terminal-Bench 2, running GPT-5.4 five times on the same task and selecting one at random yielded an 81.8% success rate; using the Verifier to select raised this to 86.4%. On SWE-Bench Verified, selecting among one solution each from Claude Opus 4.5, Claude Opus 4.6, and Gemini 3 Flash (three solutions in total) raised the success rate from 76.1% to 77.8%.
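The mechanism described above can be sketched in a few lines. This is a minimal illustration, not the released framework: the score scale, the number of evaluations, and the `judge_distribution` stub (which stands in for reading a real judge model's token probabilities, e.g. via API logprobs) are all assumptions for illustration.

```python
import random

SCORE_LEVELS = [1, 2, 3, 4, 5]  # assumed discrete rating scale

def judge_distribution(solution, dimension):
    # Stand-in for the judge model's probability distribution over the
    # score levels for one dimension; a real system would derive this
    # from the model's token logprobs rather than random numbers.
    weights = [random.random() for _ in SCORE_LEVELS]
    total = sum(weights)
    return [w / total for w in weights]

def expected_score(level_probs):
    # Continuous reward: the expectation over the full score-level
    # distribution, instead of the single discrete score a plain
    # LLM-as-a-Judge would emit.
    return sum(level * p for level, p in zip(SCORE_LEVELS, level_probs))

def verify(solution, n_evals=4):
    # Average several evaluations across three independent dimensions
    # (requirements met, format correct, no error signals) to reduce
    # random bias in any single judgment.
    dimensions = ["requirements", "format", "errors"]
    total = 0.0
    for _ in range(n_evals):
        for dim in dimensions:
            total += expected_score(judge_distribution(solution, dim))
    return total / (n_evals * len(dimensions))

def select_best(solutions, n_evals=4):
    # Best-of-N selection: keep the candidate with the highest reward.
    return max(solutions, key=lambda s: verify(s, n_evals))
```

Because the reward is a real-valued expectation rather than a discrete label, two candidates almost never receive exactly the same score, which matches the reported drop in tie rate from 26.5% to 0%.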
As of the method's release on April 9, both results ranked first on their respective benchmarks. The framework has been open-sourced. (Source: BlockBeats)
Stanford and Berkeley propose LLM-as-a-Verifier, topping Terminal-Bench and SWE-Bench
News on April 14 (UTC+8) highlights Stanford AI Lab and Berkeley Sky Computing Lab, in collaboration with NVIDIA, proposing LLM-as-a-Verifier to improve AI solution selection. The method derives continuous rewards from rating distributions across repeated evaluations, achieving 77.4% accuracy after 16 repetitions, outperforming the traditional LLM-as-a-Judge at 70.2%. On Terminal-Bench 2 and SWE-Bench Verified, success rates reached 86.4% and 77.8%, respectively, making it the top-performing approach as of April 9. The framework has now been open-sourced.
Disclaimer: The information on this page may have been obtained from third parties and does not necessarily reflect the views or opinions of KuCoin. This content is provided for general informational purposes only, without any representation or warranty of any kind, nor shall it be construed as financial or investment advice. KuCoin shall not be liable for any errors or omissions, or for any outcomes resulting from the use of this information.
Investments in digital assets can be risky. Please carefully evaluate the risks of a product and your risk tolerance based on your own financial circumstances. For more information, please refer to our Terms of Use and Risk Disclosure.