Two days to a working application. Three minutes to a live hotfix. Fifty thousand lines of code with comprehensive tests.
OpenAI wants to retire the leading AI coding benchmark, and the reasons reveal a deeper problem with how the whole industry measures itself.