A new AI coding challenge has revealed just how far artificial intelligence still has to go in replacing software engineers. The K Prize, created by Databricks and Perplexity co-founder Andy Konwinski and managed by the nonprofit Laude Institute, announced its first winner this week. Eduardo Rocha de Andrade, a Brazilian prompt engineer, took home the $50,000 prize — but he did so with correct answers to only 7.5% of the test questions.
The K Prize is modeled after SWE-Bench, a popular benchmark that evaluates AI on real-world GitHub coding problems. Unlike SWE-Bench, however, the K Prize uses new GitHub issues submitted after models are locked in, ensuring there’s no chance to train on the dataset ahead of time. This approach aims to eliminate “contamination,” a growing concern with AI evaluation benchmarks.
Konwinski said the low score was by design: the test is meant to be difficult and to favor smaller, open-source models over massive proprietary systems. He also pledged $1 million to the first open-source AI model that can achieve a 90% score.
Top SWE-Bench scores currently stand at 75% on its easier test and 34% on its harder version, raising the question of whether contamination or the K Prize's fresher, harder problems accounts for the drastic drop to 7.5%. Researchers say experiments like the K Prize are essential to improving how AI models are evaluated.
Despite advances in AI-powered coding tools, the results highlight how far these systems are from handling complex, real-world software development on their own. As Konwinski put it, “If we can’t even get more than 10% on a contamination-free benchmark, that’s the reality check for me.”