CVEasy AI v2 Benchmark: The On-Device Security Model Gets +24.5% Overall and +124% on Autonomous Pentest

We retrained the model that runs inside CVEasy AI, and then we did the part most vendors skip: we benchmarked the new v2 against the shipped v1 in the exact serving stack that runs on a customer's machine, scored it with deterministic checks rather than vibes, and wrote down the caveats. The short version is that v2 wins on every product pillar, the largest jump is in autonomous pentest, and every number was produced fully on-device with nothing leaving the machine.

Headline numbers: +124% on autonomous pentest (breach and attack simulation, BAS), +24.5% overall across 17 held-out tasks, and the top spot on the CTEM Fitness Score, ahead of GPT-5.5, Gemini 3.1 Pro, and Llama 3.1 8B Instruct. Everything runs 100% on-device, at roughly 58 tokens per second on Apple Silicon. The full report lives on the v2 benchmark page.

The CTEM Fitness Score: v2 versus the field

Model size is not the question a security team should be asking. The question is which model is the best fit to run inside a security program, where it has to be accurate on security work and able to run on hardware you control without shipping your vulnerability data to someone else's cloud. That is what the CTEM Fitness Score measures: a weighted composite of task fidelity and the deployment realities of running AI in a CTEM program.

| Model | Class | CTEM Fitness Score (0 to 1) |
|---|---|---|
| CVEasy AI v2 (CVEasy, local fine-tune) | This product | 0.929 |
| Llama 3.1 8B Instruct (Meta, local) | Open model | 0.877 |
| CVEasy AI v1 (CVEasy, local, prior gen) | Prior gen | 0.860 |
| Gemini 3.1 Pro (Google) | Cloud flagship | 0.572 |
| GPT-5.5 (OpenAI) | Cloud flagship | 0.563 |

The cloud flagships are not scored lower because they reason worse in the abstract. They pay a deployment penalty, because a security program cannot send its scan data and prompts to a third-party cloud to get an answer. On the job CVEasy AI is built for (local, private, and security-specific), v2 leads. Scores are from CVEasy's own CTEM Fitness benchmark, with the open and cloud models run through the same scoring.

What we graded, and why it is the right test

CVEasy AI is built to be best-in-class at four things, so those are the four things we measured: remediation reporting, CVE analysis, autonomous pentest, and environment posture. The benchmark is a 17-task held-out suite, authored to be disjoint from the training data so the scores measure whether the model generalizes rather than whether it memorized. That distinction matters, because a model that only looks good on tasks it was trained on tells you nothing about how it behaves against next week's advisory.

v2 beat v1 on all four pillars. Remediation reporting moved from 0.850 to 0.950, CVE analysis from 0.894 to 0.918, environment posture from 0.960 to 0.986, and the overall composite across all 17 tasks moved from 0.717 to 0.893. The pillar that moved the most, and the one worth dwelling on, is autonomous pentest.

Pillar performance

CVEasy AI composite score per pillar, 0 to 1. Higher is better.

| Pillar | v1 (shipped) | v2 (new) |
|---|---|---|
| Remediation | 0.85 | 0.95 |
| CVE analysis | 0.89 | 0.92 |
| Autonomous pentest | 0.34 | 0.77 |
| Environment posture | 0.96 | 0.99 |
| Overall (17 tasks) | 0.72 | 0.89 |
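The headline percentages are relative gains over v1, not absolute point differences. A minimal sketch of that arithmetic, using the overall composite (0.717 to 0.893) and the autonomous pentest pillar (0.342 to 0.767) reported in this post; the function name `relative_gain` is ours, chosen for illustration:

```python
def relative_gain(old: float, new: float) -> float:
    """Percentage change from old to new, relative to old."""
    return (new - old) / old * 100.0


overall = relative_gain(0.717, 0.893)
pentest = relative_gain(0.342, 0.767)

print(f"overall: +{overall:.1f}%")  # overall: +24.5%
print(f"pentest: +{pentest:.0f}%")  # pentest: +124%
```

The pentest pillar gains about 0.43 points in absolute terms, but because v1 started from a low base, the relative gain is far larger than on the pillars where v1 was already above 0.85.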

The standout: autonomous pentest, +124%

v1 scored 0.342 on the BAS pillar, and that collapse was concentrated in exactly the checks that decide whether a model can drive an automated pentest rather than just talk about one. v1 frequently emitted invalid tool-call JSON, so it could not reliably run the tool loop, and it produced weak or wrong MITRE ATT&CK mappings. v2 scored 0.767 by fixing precisely those failures: it emits valid, parseable tool calls and maps the correct ATT&CK technique IDs, scored against a validated catalog so hallucinated IDs are penalized rather than rewarded.
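The two failure modes described above can both be checked programmatically. The sketch below is an illustration under stated assumptions, not the benchmark's actual scorer: the tool-call shape (a JSON object with a string `tool` and an object `arguments`), the tool name `port_scan`, and the function names are hypothetical, and `KNOWN_TECHNIQUES` is a tiny hand-picked subset of real ATT&CK IDs standing in for a validated catalog.

```python
import json

# Assumption: a small subset of real MITRE ATT&CK technique IDs,
# standing in for the full validated catalog.
KNOWN_TECHNIQUES = {"T1059", "T1059.001", "T1078", "T1110", "T1190"}


def parse_tool_call(raw: str):
    """Return the parsed call if it is valid JSON in the assumed shape, else None."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(call, dict):
        return None
    if not isinstance(call.get("tool"), str):
        return None
    if not isinstance(call.get("arguments"), dict):
        return None
    return call


def in_catalog(technique_id: str) -> bool:
    """True only if the ID exists in the catalog, so invented IDs earn nothing."""
    return technique_id in KNOWN_TECHNIQUES


good = '{"tool": "port_scan", "arguments": {"target": "10.0.0.5"}}'
truncated = '{"tool": "port_scan", "arguments": {"target": "10.0.0.5"'

print(parse_tool_call(good) is not None)       # True
print(parse_tool_call(truncated) is not None)  # False
print(in_catalog("T1190"), in_catalog("T9999"))  # True False
```

A tool loop can only proceed when `parse_tool_call` returns a call, which is why invalid JSON stops an autonomous run outright instead of merely lowering its quality.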

In product terms, the difference is binary. v1 often could not run an autonomous pentest at all; v2 can. That is the gain that turns the built-in BASzy™ attack-validation engine from a demo into something you can point at your own estate and trust to map what it found to a real technique.

How we kept the numbers honest

The benchmark is an internal, reproducible test, not a third-party rating, and we say so plainly on the report. Three choices make the deltas trustworthy. Both models run on the same on-device inference engine, the exact stack that ships in the app, so neither side gets an advantage. Decoding is greedy at temperature zero, so the numbers are deterministic and reproducible, and we reproduced them twice with two independent drivers and got identical per-pillar results. And the scorers are deterministic and programmatic, not an LLM grading another LLM: CVSS base-score recomputation, MITRE ATT&CK technique micro-F1, tool-call and runbook JSON validation, plus an architecture and identity safety pass.
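To make "CVSS base-score recomputation" concrete, here is a minimal sketch that recomputes a base score from a vector string using the formulas and weights in the public CVSS v3.1 specification. The post does not say which CVSS version the benchmark uses, so v3.1 is an assumption, and `score_matches` with its tolerance is a hypothetical scorer, not the benchmark's own code.

```python
# Metric weights from the public CVSS v3.1 specification.
AV = {"N": 0.85, "A": 0.62, "L": 0.55, "P": 0.2}
AC = {"L": 0.77, "H": 0.44}
UI = {"N": 0.85, "R": 0.62}
CIA = {"H": 0.56, "L": 0.22, "N": 0.0}
PR_UNCHANGED = {"N": 0.85, "L": 0.62, "H": 0.27}
PR_CHANGED = {"N": 0.85, "L": 0.68, "H": 0.5}


def roundup(value: float) -> float:
    """Round up to one decimal place, as defined in the v3.1 specification."""
    scaled = round(value * 100000)
    if scaled % 10000 == 0:
        return scaled / 100000.0
    return (scaled // 10000 + 1) / 10.0


def base_score(vector: str) -> float:
    """Recompute the base score from a vector such as CVSS:3.1/AV:N/AC:L/..."""
    # Skip the "CVSS:3.1" prefix, then map each metric name to its value.
    metrics = dict(part.split(":") for part in vector.split("/")[1:])
    changed = metrics["S"] == "C"
    pr = (PR_CHANGED if changed else PR_UNCHANGED)[metrics["PR"]]
    iss = 1 - (1 - CIA[metrics["C"]]) * (1 - CIA[metrics["I"]]) * (1 - CIA[metrics["A"]])
    if changed:
        impact = 7.52 * (iss - 0.029) - 3.25 * (iss - 0.02) ** 15
    else:
        impact = 6.42 * iss
    if impact <= 0:
        return 0.0
    exploitability = 8.22 * AV[metrics["AV"]] * AC[metrics["AC"]] * pr * UI[metrics["UI"]]
    total = impact + exploitability
    return roundup(min(1.08 * total if changed else total, 10))


def score_matches(vector: str, claimed: float) -> bool:
    """Hypothetical check: does the model's claimed score equal the recomputed one?"""
    return abs(base_score(vector) - claimed) < 0.05


critical = "CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H"
print(base_score(critical))          # 9.8
print(score_matches(critical, 7.5))  # False
```

Because the score is a pure function of the vector, a model cannot earn credit with a plausible-sounding number: either the claimed score matches the recomputation or it does not.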

What is defensible as capability versus format. The ATT&CK F1, tool-call and runbook JSON validity, CVSS-vector correctness, and the safety pass-rate are structural or factual checks a model cannot game with prose, so we treat those as capability. The required-phrase and section-coverage checks measure whether the output is well-formed and on-template, which is genuinely valuable, but we never relabel them as factual accuracy.
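As an illustration of why the ATT&CK F1 check is hard to game with prose, here is a minimal sketch of micro-averaged F1 over sets of technique IDs. The per-task data below is invented for the example, and the function name `micro_f1` is ours; the benchmark's own implementation is not reproduced here.

```python
def micro_f1(predicted: list[set[str]], expected: list[set[str]]) -> float:
    """Micro-averaged F1: pool true/false positives and misses across all tasks."""
    tp = fp = fn = 0
    for pred, gold in zip(predicted, expected):
        tp += len(pred & gold)  # IDs the model got right
        fp += len(pred - gold)  # IDs the model emitted that are not expected
        fn += len(gold - pred)  # expected IDs the model missed
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)


# Made-up example: two tasks, each with a set of expected technique IDs.
predicted = [{"T1190", "T1059"}, {"T1078"}]
expected = [{"T1190"}, {"T1078", "T1110"}]

print(f"{micro_f1(predicted, expected):.3f}")  # 0.667
```

Extra IDs count as false positives and missing IDs as false negatives, so padding an answer with additional techniques lowers the score instead of raising it.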

Why on-device is the whole point

The model footprint is 4.7 GB, and it runs entirely on the customer's hardware at roughly 58 tokens per second of generation, with no CVE data, scan results, or prompts ever leaving the machine. There is no cloud dependency for the model. For the regulated, air-gapped, and sovereignty-bound teams CVEasy AI is built for, that is not a feature bullet; it is the precondition for using AI on vulnerability data at all. The benchmark exists to prove that staying local does not mean settling for a weaker model.

How CVEasy AI compares to the field

The pillar comparison above is v1 versus v2, the only fully apples-to-apples model comparison we can run honestly today. But the question buyers actually ask is how CVEasy AI stacks up against the rest of the security stack. The most important answers do not move when you re-run a benchmark, because they are architecture, not tuning, and on every one of them CVEasy AI is built the way the others are not.

| Capability | CVEasy AI | Alternative |
|---|---|---|
| The AI model runs on your hardware | 100% local, nothing leaves the machine | The field: 0% local, scan and prompt data go to a vendor cloud |
| Risk-scoring intelligence per finding | TRIS 12 Layer Scoring | CVSS tools: 1 number |
| CTEM lifecycle stages in one app (Gartner's five) | 5 of 5 | Point tools: about 2 of 5 |
| Built-in attack validation | 12,868 BASzy attack payloads, included | Scanners: 0, BAS is a separate product |

These are architecture and capability differences, verifiable from each product's public design. They are also why the cloud flagships land below the local models on the CTEM Fitness Score: the gap reflects deployment realities, not a claim that a local model out-reasons them in the abstract.

The honest caveats

Seventeen tasks is a directional signal, not a population estimate, and we present it as such. v2 is excellent at single-shot generation: the remediation playbooks, analysis, triage briefs, and executive narratives that make up most of the product surface. Its multi-turn autonomous loop is improved but still variable, which is expected for a model this size.

And the CTEM Fitness Score is two things weighted together: task fidelity and the deployment realities of running AI inside a security program. A raw task-fidelity-only comparison, model reasoning with the deployment penalty stripped out, is a different test, and on that axis the cloud flagships would score higher. We are not claiming a local model out-reasons a frontier cloud model in the abstract. We are showing it is the better fit for the job CVEasy AI does, on hardware you control, with nothing leaving your network.
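To see how a weighted composite can rank models differently from a fidelity-only test, consider the sketch below. Everything in it is an assumption for illustration: the post does not publish the actual weights or sub-scores, so the equal weighting, the function name `fitness`, and the two made-up models do not correspond to any row in the table.

```python
def fitness(task_fidelity: float, deployment_fit: float,
            fidelity_weight: float = 0.5) -> float:
    """Hypothetical weighted composite of two sub-scores, each in [0, 1]."""
    return fidelity_weight * task_fidelity + (1 - fidelity_weight) * deployment_fit


# Made-up sub-scores, not benchmark results.
cloud_like = {"task_fidelity": 0.9, "deployment_fit": 0.3}
local_like = {"task_fidelity": 0.8, "deployment_fit": 1.0}

# With deployment weighted in, the local-style model ranks first.
print(fitness(**local_like) > fitness(**cloud_like))  # True

# With fidelity as the only criterion, the ranking flips.
print(fitness(**cloud_like, fidelity_weight=1.0)
      > fitness(**local_like, fidelity_weight=1.0))   # True
```

The ranking depends on the weighting, which is why the composite and a fidelity-only comparison should be read as two different tests.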

See it for yourself

The full benchmark, with the per-pillar bar charts, the throughput and privacy numbers, and the complete methodology, is published on the CVEasy AI v2 benchmark page. If you want to see those numbers run against your own environment rather than our fixtures, that is exactly what a demo is for.


A smarter model, on your hardware, with none of your data leaving.
