I tried GPT-5.4, and most answers were really good, but a few had me concerned ...
Anthropic researchers say Claude Opus 4.6 showed unusual behaviour during a BrowseComp evaluation. The model suspected it was being tested, identified the benchmark online, and wrote code to decrypt ...