Companies drowning in AI security marketing can now apply a more empirical eye to the tools. CrowdStrike and Meta have published CyberSOCEval, an open-source suite of benchmarks that measures how well large language models perform real security operations center (SOC) tasks. The aim: to replace glossy demos with reproducible tests so that teams can pick models that actually help in investigations, not just in slide decks.
What CyberSOCEval actually measures
CyberSOCEval addresses the kind of work practitioners perform every day: incident response, threat intelligence comprehension and malware-centric reasoning. Unlike generic NLP benchmarks, it tests practical skills: extracting indicators of compromise from noisy logs, summarizing multi-alert investigations, reasoning over attacker TTPs and recommending containment steps based on common playbooks.

The framework emphasizes measurable outcomes. Evaluations consider accuracy and completeness (did the model find the right IOCs and key facts?), hallucination control (did it invent tools or techniques?), time to a useful answer, and adherence to the structured formats SOC tools require, such as JSON fields. These are the traits that separate a model that merely sounds smart from one that can quietly work through an alert queue without breaking downstream automation.
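To make that concrete, here is a minimal sketch of how such criteria could be scored programmatically. It is illustrative only: the response fields (iocs, attacker_tools), the ground-truth sets and the scoring rules are assumptions, not CyberSOCEval's actual schema or API.

```python
# Hypothetical SOC-style scoring sketch; field names and rules are illustrative.
import json


def score_response(raw_output: str, expected_iocs: set[str], known_tools: set[str]) -> dict:
    """Score one model answer on format adherence, IOC accuracy, and hallucination."""
    # 1. Structured-format adherence: downstream automation expects valid JSON.
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return {"valid_json": False, "precision": 0.0, "recall": 0.0, "hallucinated_tools": []}

    found_iocs = set(parsed.get("iocs", []))
    cited_tools = set(parsed.get("attacker_tools", []))

    # 2. Accuracy and completeness: did the model surface the right indicators?
    precision = len(found_iocs & expected_iocs) / len(found_iocs) if found_iocs else 0.0
    recall = len(found_iocs & expected_iocs) / len(expected_iocs) if expected_iocs else 1.0

    # 3. Hallucination control: tools the model named that never appear in the ground truth.
    hallucinated = sorted(cited_tools - known_tools)

    return {
        "valid_json": True,
        "precision": round(precision, 2),
        "recall": round(recall, 2),
        "hallucinated_tools": hallucinated,
    }


# Example: a response that found one real IOC, missed another, and invented a tool.
print(score_response(
    '{"iocs": ["45.13.7.9"], "attacker_tools": ["Mimikatz", "MadeUpTool"]}',
    expected_iocs={"45.13.7.9", "evil.example.com"},
    known_tools={"Mimikatz", "Cobalt Strike"},
))
```

Even a simple harness like this makes the trade-offs visible: a model can have perfect recall and still fail the format check that downstream automation depends on.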
Because the suite is open source, security teams can adapt the tests to their own telemetry stack. That flexibility matters: a model that shines on endpoint triage might lag on cloud identity or email threats, and a strong benchmark suite lets buyers run focused bake-offs on their highest-volume or highest-risk tasks.
Why quality metrics matter today
The security industry is in an AI arms race, and “AI-washing” is a reality. Without common yardsticks, there is no way to know which models truly confer an advantage against live attacks. Measurement and transparency are precisely the concerns NIST’s AI Risk Management Framework emphasizes. In cyber, that means validating model performance on tasks rooted in MITRE ATT&CK tactics and informed by the adversarial ML insights catalogued in MITRE ATLAS.
The stakes are high. The most recent Verizon Data Breach Investigations Report found that most breaches still involve a human element, which is why triage quality, fast summarization and decision support are critical. If a model can reduce analyst toil while maintaining high precision on severe alerts, the operational payoff is immediate.
Open source, faster iteration
Meta’s involvement signals an effort to standardize transparent evaluation across open models such as the Llama family, so that buyers can compare them fairly against proprietary systems. CrowdStrike contributes SOC expertise and access to meaningful tasks. Together, the partnership promotes a vendor-neutral scorecard that the community can extend with new workloads as threats emerge.
The security community has a long tradition of community-driven benchmarks. MITRE ATT&CK redefined detection engineering by giving teams a shared language, and independent endpoint protection tests changed how products were built and bought. CyberSOCEval is intended to do the same for AI copilots and automation, aligning model behavior with how SOCs actually work.
The bigger picture: AI vs. AI
Attackers are already using AI to scale password spraying, craft convincing lures and iterate on varied malware samples. At DEF CON’s AI red-teaming events, prompt injection, data exfiltration and guardrail bypasses are demonstrated routinely against the most advanced models. On defense, financial services executives say AI-driven fraud and intrusion detection has already saved them millions. A trustworthy benchmark suite keeps the defensive tools honest.
Any evaluation should be motivated by failure modes. The OWASP Top 10 for LLM Applications highlights vulnerabilities such as insecure output handling and overreliance on model assertions. A benchmark that rewards grounded citations, tool-use discipline and honest uncertainty handling pushes vendors toward more secure designs and gives buyers a clearer sense of operational risk.
How buyers and builders can use it
For security leaders: start by listing the top five SOC use cases where AI could relieve the most toil, such as alert summarization, phishing triage, EDR case notes, cloud permission analysis or playbook drafting. Run CyberSOCEval jobs that mirror those workflows, establish a non-AI baseline, and score models on precision, latency and end-to-end cost. Then fold the findings into requests for proposals so vendors compete on documented performance, not promises.
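A bake-off along those lines does not need heavy tooling. The sketch below uses a hypothetical evaluate() helper, toy phishing-triage cases and placeholder cost figures; the point is the shape of the comparison, not the numbers.

```python
# Hedged sketch of a buyer-side bake-off: compare candidates (including a non-AI
# baseline such as existing correlation rules) on precision, latency and cost.
import time
from statistics import mean


def evaluate(name: str, run_task, cases: list[dict], cost_per_call: float) -> dict:
    scores, latencies = [], []
    for case in cases:
        start = time.perf_counter()
        answer = run_task(case["input"])          # model call or rule-based baseline
        latencies.append(time.perf_counter() - start)
        scores.append(1.0 if answer == case["expected"] else 0.0)
    return {
        "candidate": name,
        "precision": round(mean(scores), 2),
        "p50_latency_s": round(sorted(latencies)[len(latencies) // 2], 3),
        "cost_total_usd": round(cost_per_call * len(cases), 4),
    }


# Toy cases standing in for a high-volume workflow such as phishing triage.
cases = [
    {"input": "urgent wire transfer request from 'CEO'", "expected": "phishing"},
    {"input": "monthly newsletter from HR", "expected": "benign"},
]

baseline = evaluate("rules-baseline", lambda text: "phishing" if "wire" in text else "benign", cases, 0.0)
print(baseline)
```

Running each candidate model through the same harness as the baseline yields a like-for-like scorecard to attach to the RFP.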
For model and product teams: treat the suite as a continuous integration check. Add structured outputs and retrieval, log uncertainty, and evaluate against adversarial prompts. Publish your evaluation results and methodology alongside release notes; that transparency builds the trust enterprise adoption depends on.
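As one way to wire that into a pipeline, the sketch below gates a release on benchmark scores. The results file name, metric names and thresholds are assumptions for illustration; adapt them to whatever your evaluation harness actually emits.

```python
# Minimal CI gate: fail the build if benchmark scores regress past set thresholds.
import json
import sys

# Metric names and thresholds here are assumptions for illustration.
THRESHOLDS = {"ioc_extraction_f1": 0.80, "hallucination_rate": 0.05}


def main(results_path: str) -> int:
    with open(results_path) as fh:
        results = json.load(fh)

    failures = []
    if results.get("ioc_extraction_f1", 0.0) < THRESHOLDS["ioc_extraction_f1"]:
        failures.append("IOC extraction F1 below threshold")
    if results.get("hallucination_rate", 1.0) > THRESHOLDS["hallucination_rate"]:
        failures.append("hallucination rate above threshold")

    for msg in failures:
        print("FAIL:", msg)
    return 1 if failures else 0  # a non-zero exit code fails the CI job


if __name__ == "__main__":
    # e.g. `python eval_gate.py eval_results.json` as a step after the benchmark run
    sys.exit(main(sys.argv[1] if len(sys.argv) > 1 else "eval_results.json"))
```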
Bottom line
CyberSOCEval will not stop attackers, but it should stop much of the hand-waving. By giving the industry a common, extensible way to measure what matters in a SOC, CrowdStrike and Meta have delivered the right yardstick for the job and raised the bar for every vendor that claims its security is now “AI-powered.” The benchmarks are freely available to practitioners and researchers, and both companies have invited contributions as the threat landscape evolves.
As leaders of Meta’s GenAI team have emphasized, open benchmarks let the community iterate faster. In security, that speed can now come with rigor.