ProgramBench uses a not so useful / weird metric like ARC-AGI

> headline score of all models -> 0%
> looks inside
> Opus 4.6 and 4.7 pass on average >50% of tests per task
> why?
> they only count a task as passed if 100% of tests are successful

and as we all know software ships perfectly within the first iteration

it's still a very good benchmark, but I guess the headline score will be pretty useless. at least they have other good metrics you can track
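
To make the gap between the two numbers concrete, here is a minimal sketch in Python. The task names and test counts are made up for illustration and are not ProgramBench data, and the function names are hypothetical, not part of any ProgramBench tooling. It only shows how an all-or-nothing task score can sit at 0% while the average share of tests passed per task is well above 50%.

```python
from statistics import mean

# Hypothetical per-task results: (tests_passed, tests_total).
# These numbers are invented for illustration only.
results = {
    "task_a": (9, 10),
    "task_b": (6, 10),
    "task_c": (3, 10),
}


def strict_task_pass_rate(results):
    """All-or-nothing: a task counts only if every one of its tests passed."""
    return mean(1.0 if passed == total else 0.0
                for passed, total in results.values())


def mean_test_pass_rate(results):
    """Average fraction of tests passed per task."""
    return mean(passed / total for passed, total in results.values())


if __name__ == "__main__":
    print(f"headline (100% of tests required): {strict_task_pass_rate(results):.0%}")
    print(f"average tests passed per task:     {mean_test_pass_rate(results):.0%}")
```

With these made-up numbers no task passes every test, so the strict score is 0%, while the per-task average is 60%.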
