
AI Coding Challenge Exposes How Far Models Are From Replacing Engineers

A new AI coding challenge has revealed just how far artificial intelligence still has to go before it can replace software engineers. The K Prize, created by Databricks and Perplexity co-founder Andy Konwinski and run by the nonprofit Laude Institute, announced its first winner this week. Eduardo Rocha de Andrade, a Brazilian prompt engineer, took home the $50,000 prize despite answering only 7.5% of the test questions correctly.

The K Prize is modeled after SWE-Bench, a popular benchmark that evaluates AI on real-world GitHub coding problems. Unlike SWE-Bench, however, the K Prize uses new GitHub issues submitted after models are locked in, ensuring there’s no chance to train on the dataset ahead of time. This approach aims to eliminate “contamination,” a growing concern with AI evaluation benchmarks.
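To make that filtering idea concrete, here is a minimal sketch of how an evaluation harness might enforce it. This is an illustration only: the `Issue` record, the `contamination_free` helper, and the deadline date are hypothetical, not the K Prize's actual implementation.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Hypothetical record for a candidate benchmark task scraped from GitHub.
@dataclass
class Issue:
    repo: str
    number: int
    created_at: datetime

# Model-submission deadline (illustrative date). Only issues filed *after*
# this moment are eligible, so no submitted model could have seen them
# in its training data.
SUBMISSION_DEADLINE = datetime(2025, 3, 12, tzinfo=timezone.utc)

def contamination_free(issues: list[Issue]) -> list[Issue]:
    """Keep only issues created after all models were locked in."""
    return [i for i in issues if i.created_at > SUBMISSION_DEADLINE]

issues = [
    Issue("octo/repo", 101, datetime(2025, 2, 1, tzinfo=timezone.utc)),   # pre-deadline: excluded
    Issue("octo/repo", 202, datetime(2025, 4, 20, tzinfo=timezone.utc)),  # post-deadline: kept
]
print(contamination_free(issues))  # -> only issue #202 survives the filter
```

The key design choice is temporal, not secrecy-based: rather than hiding a test set, the benchmark simply uses problems that did not exist when models were frozen.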

Konwinski said the low score was intentional, as the test was designed to be challenging and favor smaller, open-source models over massive proprietary systems. He also pledged $1 million to the first open-source AI model that can achieve a 90% score.

By comparison, top SWE-Bench scores currently reach 75% on its easier "Verified" test and 34% on its harder "Full" version, raising the question of whether contamination or the K Prize's fresh-issue design accounts for the drastic performance gap. Researchers say experiments like the K Prize are essential to improving how AI models are evaluated.

Despite advances in AI-powered coding tools, the results highlight how far these systems are from handling complex, real-world software development on their own. As Konwinski put it, “If we can’t even get more than 10% on a contamination-free benchmark, that’s the reality check for me.”
