A team of researchers from several universities and the startup Cursor has used riddles from NPR's Sunday Puzzle as a benchmark for assessing AI reasoning models. The Sunday Puzzle, a challenging weekly segment on NPR, poses brainteasers that can be solved with general knowledge and careful reasoning rather than specialized expertise, making them a useful test of AI problem-solving.
The researchers, including Arjun Guha of Northeastern University, wanted a benchmark that reflects everyday problem-solving rather than esoteric knowledge such as PhD-level math. They found that while models such as OpenAI's o1 excelled at the puzzles, some models behaved unexpectedly, for example "giving up" and offering incorrect answers.
The researchers noted that AI models often struggle with tasks that demand insight and the ability to rule out wrong options. DeepSeek's R1, for instance, sometimes declared "I give up" before producing an incorrect answer, in a way that mimicked human frustration.
The study revealed that reasoning models generally outperformed other types because they fact-check their own solutions more thoroughly, though this comes at the cost of longer response times. Even so, some models still faltered, offering incorrect solutions or getting stuck in endless thought loops.
The team tested the models on over 600 Sunday Puzzle riddles; the best performer, o1, scored 59%. The researchers next plan to extend their tests to additional reasoning models, hoping to pinpoint where performance can be improved.
They believe such benchmarks provide valuable insight into AI's capabilities and limitations, particularly as these models become more integrated into everyday applications. Guha emphasized the importance of accessible benchmarks, arguing that broader understanding is needed to drive improvements in AI's problem-solving abilities.