Quesma, Inc. has introduced BinaryAudit, an independent, open-source benchmark designed to evaluate whether leading AI models can identify hidden malicious threats in software binaries before they are exploited. Developed in collaboration with world-class reverse engineer Michał "Redford" Kowalczyk, the benchmark reveals both encouraging potential and clear current limitations in AI-powered binary analysis for supply-chain security.
Supply-chain attacks continue to inflict significant damage across industries. Recent high-profile incidents include the state-sponsored hijacking of Notepad++ binaries; the Shai Hulud 2.0 campaign, which compromised thousands of organizations, including Fortune 500 companies and governments, to steal credentials; and the XZ Utils backdoor, inserted by a long-term contributor who had gained maintainer access. Additional risks stem from the vendor side, such as manufacturer-planted code used to disable trains and hardcoded credentials discovered in Cisco devices. These known cases represent only a portion of the broader threat landscape.
Traditional binary reverse engineering remains a reactive, resource-intensive technique, practiced by a limited number of specialists and typically performed only after a breach or major incident. AI could turn this approach into a proactive security layer, enabling organizations to inspect binaries routinely: before deployment, during updates, prior to procurement, or even years after initial release. This shift could make supply-chain security more preventive and scalable.
“We were genuinely surprised that today’s LLMs can detect malicious code at all. At current performance levels, it’s an assistant, not a solution,” said Jacek Migdał, CEO of Quesma. “AI binary analysis could be a new layer of defence in supply-chain security. We hope new AI models released in the next 1-2 years will make binary analysis go mainstream. BinaryAudit helps to track and encourage progress in this field.”
BinaryAudit provides a standardized way to measure AI performance in detecting hidden threats within binaries. While current frontier models demonstrate some capability to identify malicious patterns, the 49% success rate of the top performer—coupled with frequent false positives—indicates that AI remains an assistive tool rather than a standalone solution. The benchmark is publicly available to foster ongoing development and improvement in this emerging area of cybersecurity.
About Quesma
Quesma is a technology company that evaluates and tests advanced AI models. It creates benchmarks that measure how frontier LLMs perform across critical domains such as DevOps, security, and database migrations. Quesma is backed by Heartcore Capital, Inovo, Firestreak Ventures, and several angels, including Christian Beedgen, co-founder of Sumo Logic.