Model Benchmark · CVEasy AI v2

Our on-device security model just got measurably smarter

CVEasy AI v2 beats the shipped v1 model on all four product pillars, measured in the exact production serving stack that ships inside the desktop app. The largest jump is in autonomous pentest, and every number was produced fully on-device. Nothing leaves the machine.

100% local inference · no cloud dependency for the model
+124% · Autonomous pentest (BAS)
+24.5% · Overall, all 17 tasks
100% · On-device inference
~58 tok/s · Generation on Apple Silicon

The CTEM Fitness Score: v2 versus the field

A weighted composite of task fidelity and the deployment realities of running AI inside a security program. Higher is better.

# | Model                                   | Class          | CTEM Fitness Score (0 to 1)
1 | CVEasy AI v2 (CVEasy, local fine-tune)  | This product   | 0.929
2 | Llama 3.1 8B Instruct (Meta, local)     | Open model     | 0.877
3 | CVEasy AI v1 (CVEasy, local, prior gen) | Prior gen      | 0.860
4 | Gemini 3.1 Pro (Google)                 | Cloud flagship | 0.572
5 | GPT-5.5 (OpenAI)                        | Cloud flagship | 0.563

The cloud flagships do not score lower because they reason worse in the abstract. They pay a deployment penalty: a security program cannot send its scan data and prompts to a third-party cloud to get an answer. On the job CVEasy AI is built for (local, private, and security-specific), v2 leads. Scores come from CVEasy's own CTEM Fitness benchmark, with the open and cloud models run through the same scoring.

What we measured

The four capabilities CVEasy AI is built to be best-in-class at.

Remediation reporting

CVE remediation playbooks: correct fixed versions, detection, verification, and copy-pasteable actions.

CVE analysis

Severity and exploitability analysis, TRIS risk reasoning, and EPSS / CISA-KEV interpretation.

Autonomous pentest (BAS)

Module selection, attack-chain coherence, MITRE ATT&CK mapping, tool-call validity, and hard scope adherence.

Environment posture

Portfolio-level narratives: asset posture, attack-path reasoning, board and exec briefings, patch compliance.

The benchmark is a 17-task, held-out suite (autonomous pentest 6, posture 6, analysis 3, remediation 2), authored to be disjoint from the training data so the scores measure generalization, not memorization.

Results: v2 wins every pillar

Per-pillar composite score, 0.000–1.000. Higher is better.

Pillar                   | v1 (shipped) | v2 (new) | Change
Remediation reporting    | 0.850        | 0.950    | +12%
CVE analysis             | 0.894        | 0.918    | +3%
Autonomous pentest (BAS) | 0.342        | 0.767    | +124%
Environment posture      | 0.960        | 0.986    | +3%
Overall (all 17 tasks)   | 0.717        | 0.893    | +24.5%
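The percentage changes can be rechecked from the published per-pillar scores. The sketch below recomputes each relative gain and then tests one assumption the report does not state: that the overall score is the mean of the pillar composites weighted by task count (pentest 6, posture 6, analysis 3, remediation 2). The names `PILLARS`, `relative_gain`, and `task_weighted_mean` are illustrative, not part of the product's harness.

```python
# Per-pillar composites and task counts as published in this report.
PILLARS = {
    "Remediation reporting": {"tasks": 2, "v1": 0.850, "v2": 0.950},
    "CVE analysis": {"tasks": 3, "v1": 0.894, "v2": 0.918},
    "Autonomous pentest (BAS)": {"tasks": 6, "v1": 0.342, "v2": 0.767},
    "Environment posture": {"tasks": 6, "v1": 0.960, "v2": 0.986},
}


def relative_gain(old: float, new: float) -> float:
    """Relative improvement of `new` over `old`, in percent."""
    return (new - old) / old * 100.0


def task_weighted_mean(version: str) -> float:
    """ASSUMPTION: overall = pillar scores averaged with task counts as weights."""
    total_tasks = sum(p["tasks"] for p in PILLARS.values())
    weighted = sum(p["tasks"] * p[version] for p in PILLARS.values())
    return weighted / total_tasks


if __name__ == "__main__":
    for name, p in PILLARS.items():
        print(f"{name}: {relative_gain(p['v1'], p['v2']):+.1f}%")
    print(f"overall v1 (assumed weighting): {task_weighted_mean('v1'):.3f}")
    print(f"overall v2 (assumed weighting): {task_weighted_mean('v2'):.3f}")
```

Under that assumption the recomputed overall scores land within 0.001 of the published 0.717 and 0.893. The small remainder is consistent with the pillar values being rounded to three decimals before publication.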

The standout: autonomous pentest, +124%

This is the number to get excited about. v1's score collapsed in exactly the places that decide whether a model can actually drive an automated pentest: it frequently emitted invalid tool-call JSON and produced weak or wrong MITRE ATT&CK mappings, so it often could not run the tool loop at all.

v2 emits valid, parseable tool calls and maps the correct ATT&CK technique IDs, scored against a validated catalog so hallucinated IDs are penalized, and it improves module selection, attack-chain coherence, and scope adherence on top. In product terms: v1 often can't run an autonomous pentest; v2 can.
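To make the ATT&CK scoring concrete, here is a minimal sketch of micro-F1 over technique IDs with a catalog check. It is an illustration under stated assumptions, not the product's scorer: the function name `score_techniques`, the `F1Report` type, and the four-entry catalog are hypothetical, and a real catalog would be loaded from the ATT&CK dataset. A hallucinated ID counts as a false positive like any other wrong prediction, and is also tallied separately.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class F1Report:
    micro_f1: float
    hallucinated: int  # predicted IDs that do not exist in the catalog


def score_techniques(
    cases: Iterable[tuple[set[str], set[str]]],
    catalog: set[str],
) -> F1Report:
    """Micro-F1 over (predicted, expected) technique-ID sets, pooled across cases."""
    tp = fp = fn = hallucinated = 0
    for predicted, expected in cases:
        tp += len(predicted & expected)
        fp += len(predicted - expected)  # wrong IDs, including invented ones
        fn += len(expected - predicted)
        hallucinated += len(predicted - catalog)
    denom = 2 * tp + fp + fn
    # Convention chosen here: no predictions and no expectations scores 0.0.
    f1 = 2 * tp / denom if denom else 0.0
    return F1Report(micro_f1=f1, hallucinated=hallucinated)


# Tiny illustrative catalog; "T9999" below is deliberately not in it.
CATALOG = {"T1059", "T1078", "T1190", "T1021"}
CASES = [
    ({"T1190", "T1059"}, {"T1190", "T1059"}),  # fully correct
    ({"T1078", "T9999"}, {"T1078", "T1021"}),  # one hit, one invented ID, one miss
]

if __name__ == "__main__":
    print(score_techniques(CASES, CATALOG))
```

With these two cases the pooled counts are 3 true positives, 1 false positive, and 1 false negative, so the micro-F1 is 0.75 and one ID is flagged as hallucinated.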

Why this matters for buyers. The BAS gain is concentrated in the hard, structural checks the model cannot fake with prose: valid tool-call JSON and correct ATT&CK mapping. That is precisely what turns BASzy™ from a demo into an attack-validation engine you can point at your own estate.

How we measured it, so the numbers are trustworthy

This is an internal, fully reproducible test, not a third-party rating.

Identical serving for both models

v1 and v2 run on the same on-device inference engine on Apple Silicon, the exact stack that ships in the desktop app. No advantage given to either side.

Greedy decoding (temperature 0)

Deterministic and reproducible, so each per-pillar delta is real signal, not sampling noise. The run was reproduced twice with two independent drivers and produced identical numbers.

Deterministic programmatic scorers

No LLM-judge subjectivity. The scorers are CVSS base-score recomputation, MITRE ATT&CK technique micro-F1, and structural validation of tool-call and runbook JSON.
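As an illustration of what a structural tool-call check can look like, the sketch below parses a model output and reports every structural problem it finds. The schema is an assumption made for this example: a JSON object with a string `tool` drawn from an allow-list and an object `arguments`. The tool names and the function `validate_tool_call` are hypothetical; the product's actual tool-call format is not described in this report.

```python
import json

# Hypothetical allow-list of tool names, for illustration only.
ALLOWED_TOOLS = {"port_scan", "service_probe"}


def validate_tool_call(raw: str, allowed_tools: set[str] = ALLOWED_TOOLS) -> list[str]:
    """Return a list of structural problems; an empty list means the call is valid."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not parseable JSON: {exc.msg}"]
    if not isinstance(call, dict):
        return ["top level is not a JSON object"]

    problems = []
    tool = call.get("tool")
    if not isinstance(tool, str):
        problems.append("missing or non-string 'tool'")
    elif tool not in allowed_tools:
        problems.append(f"unknown tool: {tool}")
    if not isinstance(call.get("arguments"), dict):
        problems.append("missing or non-object 'arguments'")
    return problems


if __name__ == "__main__":
    good = '{"tool": "port_scan", "arguments": {"target": "10.0.0.5"}}'
    truncated = '{"tool": "port_scan", "arguments": '
    print(validate_tool_call(good))       # no problems
    print(validate_tool_call(truncated))  # reports a parse failure
```

Because the check is a pure function of the output string, the same input always yields the same verdict, which is the property a deterministic scorer needs.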

Safety tripwire

Every output is checked against an architecture and identity safety pass, reusing the product's own eval harness, so regressions in guardrails fail the run.

Defensible as capability: ATT&CK technique F1, tool-call and runbook JSON validity, CVSS-vector correctness, and the identity/architecture safety pass-rate. These are structural or factual checks the model cannot game with prose.
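The CVSS check is the easiest of these to show end to end, because a base score is a pure function of its vector. The sketch below recomputes a base score following the published CVSS v3.1 base equations. Which CVSS version the product's scorer uses is not stated in this report, so v3.1 is an assumption, and `base_score` and `roundup` are illustrative names.

```python
# Metric weights from the CVSS v3.1 specification (base metrics only).
AV = {"N": 0.85, "A": 0.62, "L": 0.55, "P": 0.2}
AC = {"L": 0.77, "H": 0.44}
UI = {"N": 0.85, "R": 0.62}
CIA = {"H": 0.56, "L": 0.22, "N": 0.0}
PR_UNCHANGED = {"N": 0.85, "L": 0.62, "H": 0.27}
PR_CHANGED = {"N": 0.85, "L": 0.68, "H": 0.5}


def roundup(value: float) -> float:
    """Round up to one decimal place, using the spec's integer-based method."""
    scaled = round(value * 100000)
    if scaled % 10000 == 0:
        return scaled / 100000.0
    return (scaled // 10000 + 1) / 10.0


def base_score(vector: str) -> float:
    """Recompute the base score for a vector such as 'CVSS:3.1/AV:N/AC:L/...'."""
    # Skip the 'CVSS:3.1' prefix, then map metric name -> value.
    metrics = dict(part.split(":") for part in vector.split("/")[1:])
    changed = metrics["S"] == "C"
    pr = (PR_CHANGED if changed else PR_UNCHANGED)[metrics["PR"]]

    iss = 1 - (1 - CIA[metrics["C"]]) * (1 - CIA[metrics["I"]]) * (1 - CIA[metrics["A"]])
    if changed:
        impact = 7.52 * (iss - 0.029) - 3.25 * (iss - 0.02) ** 15
    else:
        impact = 6.42 * iss
    if impact <= 0:
        return 0.0

    exploitability = 8.22 * AV[metrics["AV"]] * AC[metrics["AC"]] * pr * UI[metrics["UI"]]
    total = impact + exploitability
    if changed:
        total *= 1.08
    return roundup(min(total, 10))


if __name__ == "__main__":
    print(base_score("CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H"))
```

For the vector in the example this returns 9.8. A scorer built this way can compare the model's stated score against the recomputed one and mark any mismatch as an error, with no judgment call involved.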

Read as format adherence: required must-include phrases and section coverage. They measure whether the output is well-formed and on-template, which is genuinely valuable, but we don't headline them as factual accuracy.
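A format-adherence check of this kind can be as simple as the fraction of required phrases that appear in the output. The sketch below is one plausible shape for it, assuming case-insensitive substring matching; the function name `phrase_coverage` and the example phrases are hypothetical and not taken from the product's harness.

```python
def phrase_coverage(output: str, required: list[str]) -> float:
    """Fraction of required phrases found in the output (case-insensitive)."""
    if not required:
        return 1.0  # nothing required, so nothing is missing
    text = output.casefold()
    hits = sum(1 for phrase in required if phrase.casefold() in text)
    return hits / len(required)


if __name__ == "__main__":
    playbook = "Upgrade to the fixed version. Verification: rerun the scan."
    required = ["fixed version", "verification", "rollback"]
    print(phrase_coverage(playbook, required))
```

The example output covers two of the three required phrases. Note what this does and does not show: the text mentions a fixed version, but the check says nothing about whether the version named is correct, which is why such scores are read as format adherence rather than factual accuracy.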

Performance and privacy

Metric                | CVEasy AI v2
Generation throughput | ~58 tokens/sec (Apple Silicon, M-series)
Prompt evaluation     | ~390 tokens/sec with system-prompt cache reuse
On-disk footprint     | 4.7 GB, runs entirely on-device
Data egress           | Zero. No CVE data, scan results, or prompts leave the machine.
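The two throughput figures translate into a rough response-time estimate. The sketch below is back-of-the-envelope arithmetic only: it assumes the two phases simply add, that both rates hold steady for the whole request, and that the prompt rate applies as published (with system-prompt cache reuse). The request sizes in the example are hypothetical.

```python
GENERATION_TOK_PER_S = 58.0   # published generation throughput (approximate)
PROMPT_TOK_PER_S = 390.0      # published prompt-evaluation throughput (approximate)


def estimate_seconds(prompt_tokens: int, output_tokens: int) -> float:
    """Rough wall-clock estimate: prompt evaluation time plus generation time."""
    prompt_time = prompt_tokens / PROMPT_TOK_PER_S
    generation_time = output_tokens / GENERATION_TOK_PER_S
    return prompt_time + generation_time


if __name__ == "__main__":
    # Hypothetical request: 1,950 prompt tokens in, 580 tokens out.
    print(f"{estimate_seconds(1950, 580):.1f} s")
```

For that hypothetical request the estimate is about 5 seconds of prompt evaluation plus about 10 seconds of generation, roughly 15 seconds in total. Generation dominates, so output length matters more than prompt length at these rates.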

Honest caveats

So nobody is surprised later. We would rather you trust the small numbers than have us oversell the big ones.

Sample size. 17 internal tasks. Deterministic at temperature 0, but a 17-case suite is a directional signal, not a population estimate. The shipped composites above are fixed.

Single-shot vs. multi-turn. v2 is excellent at single-shot security generation: remediation, analysis, triage briefs, and executive narratives. Its multi-turn autonomous tool-use loop is improved but remains variable, which is expected for a model this size. The BAS benchmark measures single-turn tool-call correctness, which is solid.

Fitness is not raw intelligence. The CTEM Fitness Score weights task fidelity together with the deployment realities of running AI inside a security program, which is why the cloud flagships land below the local models. A task-fidelity-only comparison with the deployment penalty stripped out is a different test, and on that axis the cloud flagships would score higher. The claim here is not that a local model out-reasons a frontier cloud model in the abstract. It is that it is the better fit for the job, on hardware you control, with nothing leaving your network.

The same model. On your hardware. None of your data leaving.

CVEasy AI v2 ships inside the local-first CTEM platform with TRIS™ scoring and built-in BASzy™ attack validation. See the benchmark numbers run against your own estate.

Methodology note. Figures generated from the production evaluation harness comparing the shipped CVEasy AI v1 model against the new fine-tuned v2 model, both served via the same bundled on-device inference engine and scored by the product's deterministic eval scorers. The benchmark is an internal, reproducible 17-task suite, not a third-party certification. © 2026 CVEasy AI. TRIS™ scoring technology is Patent Pending.