Simbian, the self-improving SecOps company, today announced the formation of the Simbian Research Lab and released the Simbian Cyber Defense Benchmark to test large language models (LLMs) on detecting MITRE ATT&CK chains in complex, realistic scenarios. Frontier models are good at finding and exploiting software vulnerabilities. However, when it comes to cyber defense, none of the tested models earned a passing score.
Anthropic Claude Opus 4.6 detected an average of 46% of attack evidence per MITRE tactic.
Every model effectively missed entire attack categories.
11 models tested, including models from Anthropic, OpenAI, Google, Alibaba, Minimax, DeepSeek, and Moonshot AI.
Opus 4.6 found 3x more flags than Google Gemini 3 Flash at roughly 100x the cost.
Benchmark uses real attack telemetry in an agentic investigation format.
Simbian has achieved 95% accuracy in production enterprise environments with a sophisticated harness.
To date, cyber benchmarks have asked models to answer questions about attacks. The Cyber Defense Benchmark is the first to use real attack telemetry in an agentic investigation format. Each model operated a simple ReAct loop and was asked to identify the attacker and its tactics. The benchmark was designed to be difficult to game: it uses real telemetry rather than curated questions, mutates context to prevent memorization, enforces deterministic scoring against ground truth, and tracks detection cost alongside accuracy.
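As an illustration of deterministic scoring against ground truth, the sketch below computes the share of evidence found per MITRE ATT&CK tactic and the unweighted average across tactics. It is a minimal example under stated assumptions, not the benchmark's actual scoring code: the function name, the flag identifiers, and the rule that unrecognized flags are simply ignored are all hypothetical.

```python
from collections import defaultdict

def score_by_tactic(ground_truth, submitted):
    """Return per-tactic detection rates and their unweighted mean.

    ground_truth: dict mapping a flag id to its MITRE ATT&CK tactic name
    submitted: iterable of flag ids reported by the model
    (Both structures are assumptions made for this sketch.)
    """
    # Only flags present in the ground truth count; others are ignored here.
    found = set(submitted) & set(ground_truth)
    totals = defaultdict(int)
    hits = defaultdict(int)
    for flag, tactic in ground_truth.items():
        totals[tactic] += 1
        if flag in found:
            hits[tactic] += 1
    per_tactic = {tactic: hits[tactic] / totals[tactic] for tactic in totals}
    mean = sum(per_tactic.values()) / len(per_tactic)
    return per_tactic, mean

# Hypothetical ground truth: four flags spread over three tactics.
ground_truth = {
    "f1": "Initial Access",
    "f2": "Initial Access",
    "f3": "Lateral Movement",
    "f4": "Exfiltration",
}
per_tactic, mean = score_by_tactic(ground_truth, ["f1", "f2", "f4", "bogus"])
print(per_tactic)
print(round(mean, 3))
```

In this made-up run the model finds every flag in two tactics and none in the third, so a respectable average can still hide an entirely missed attack category.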
Anthropic Claude Opus 4.6, the best performer among the 11 models tested, detected an average of 46% of attack evidence per MITRE tactic. Every model effectively missed entire attack categories. Opus 4.6 found 3x more flags than Google Gemini 3 Flash, but at roughly 100x the cost. Creating an advanced benchmark for any task is often considered the first step towards enabling LLMs to solve that task.
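Taking the two reported ratios at face value (3x the flags at roughly 100x the cost), the implied cost per detected flag can be worked out directly. The snippet below is a back-of-the-envelope sketch; the function name is illustrative, and the inputs are the rounded figures quoted above rather than raw benchmark data.

```python
def relative_cost_per_flag(flag_ratio, cost_ratio):
    """How many times more one model pays per flag found than another."""
    return cost_ratio / flag_ratio

# Opus 4.6 relative to Gemini 3 Flash, using the rounded ratios quoted above.
ratio = relative_cost_per_flag(flag_ratio=3.0, cost_ratio=100.0)
print(f"{ratio:.1f}x")  # prints "33.3x"
```

On these rounded figures, each flag found by Opus 4.6 costs roughly 33 times as much as one found by Gemini 3 Flash.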
As Ambuj Kumar, Founder and CEO of Simbian, stated: "Our research shows you can't throw an LLM dart in the dark and expect to hit the cyber defense bullseye. The same frontier models that perform strongly during cyberattacks struggle on the defense side. Defense is fundamentally harder than offense as it requires reasoning across noisy, partial evidence rather than executing against a known target. The LLMs must be accompanied by outside intelligence in the form of a sophisticated harness. Simbian has been able to get 95% accuracy in production enterprise environments on cyber defense SecOps following some of these techniques."
Richard Stiennon, Chief Research Analyst at IT-Harvest, added: "We know the large models can do amazing things, but can we measure their efficacy in analyzing machine logs for security events? This benchmark answers that question. In contrast to existing AI security benchmarks, this benchmark was designed to be difficult to game. It uses real telemetry rather than curated questions, mutates context to prevent memorization, enforces deterministic scoring against ground truth, and tracks detection cost alongside accuracy."
About Simbian
Simbian is building the first self-improving security operations platform. As enterprises face the new threats of AI-armed attackers, Simbian transforms security operations into a dynamic, autonomous system. Simbian's family of AI Agents for AI SOC, pentesting, and threat hunting work seamlessly together, connected by the shared Simbian Context Lake to automate complex security operations with human-level reasoning, machine-level speed, and enterprise-specific precision. The company is venture-backed and headquartered in Mountain View, Calif.