benchmark

OpenAI’s o3 AI model scores lower on a benchmark than the company initially implied

admin · 7 months ago · 04 mins

A discrepancy between first- and third-party benchmark results for OpenAI’s o3 AI model is raising questions about the company’s transparency and model testing practices. When OpenAI unveiled o3 in December, the company claimed the model could answer just over a fourth of questions on FrontierMath, a challenging set of math problems. That score blew the…

These researchers used NPR Sunday Puzzle questions to benchmark AI ‘reasoning’ models

admin · 10 months ago · 04 mins

Every Sunday, NPR host Will Shortz, The New York Times’ crossword puzzle guru, gets to quiz thousands of listeners in a long-running segment called the Sunday Puzzle. While written to be solvable without too much foreknowledge, the brainteasers are usually challenging even for skilled contestants. That’s why some experts think they’re a promising way to…

Chief Editor

RK
