CVEasy AI v2 beats the shipped v1 model on all four product pillars, measured in the exact production serving stack that ships inside the desktop app. The largest jump is in autonomous pentest, and every number was produced fully on-device. Nothing leaves the machine.
The CTEM Fitness Score is a weighted composite of task fidelity and the deployment realities of running AI inside a security program. Higher is better.
The cloud flagships are not scored lower because they reason worse in the abstract. They pay a deployment penalty: a security program cannot send its scan data and prompts to a third-party cloud to get an answer. On the job CVEasy AI is built for (local, private, and security-specific), v2 leads. Scores come from CVEasy's own CTEM Fitness benchmark, with the open and cloud models run through the same scoring.
The four capabilities CVEasy AI is built to be best-in-class at.
CVE remediation playbooks: correct fixed versions, detection, verification, and copy-pasteable actions.
Severity and exploitability analysis, TRIS risk reasoning, and EPSS / CISA-KEV interpretation.
Autonomous pentest: module selection, attack-chain coherence, MITRE ATT&CK mapping, tool-call validity, and hard scope adherence.
Portfolio-level narratives: asset posture, attack-path reasoning, board and exec briefings, patch compliance.
The benchmark is a 17-task, held-out suite (autonomous pentest 6, posture 6, analysis 3, remediation 2), authored to be disjoint from the training data so the scores measure generalization, not memorization.
Per-pillar composite score, 0.000–1.000. Higher is better.
This is the number to get excited about. v1's score collapsed in exactly the places that decide whether the model can actually drive an automated pentest: it frequently emitted invalid tool-call JSON and produced weak or wrong MITRE ATT&CK mappings, so it often could not run the tool loop at all.
v2 emits valid, parseable tool calls and maps the correct ATT&CK technique IDs, scored against a validated catalog so hallucinated IDs are penalized, and it improves module selection, attack-chain coherence, and scope adherence on top. In product terms: v1 often can't run an autonomous pentest; v2 can.
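As a rough illustration of what "valid, parseable tool calls" and "scored against a validated catalog" can mean in practice, here is a minimal sketch. The field names (`tool`, `arguments`, `technique`) and the four-entry catalog are assumptions made for this example, not the product's actual schema or scorer.

```python
import json
import re

# Hypothetical mini-catalog for illustration only; a real check would load
# the full validated ATT&CK technique catalog.
KNOWN_TECHNIQUES = {"T1046", "T1110", "T1190", "T1059.001"}

# ATT&CK technique IDs look like T1234, sub-techniques like T1234.001.
TECHNIQUE_ID = re.compile(r"^T\d{4}(\.\d{3})?$")


def validate_tool_call(raw: str) -> list[str]:
    """Return a list of problems; an empty list means the call passed."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not parseable JSON: {exc.msg}"]
    if not isinstance(call, dict):
        return ["top-level value must be an object"]

    problems = []
    tool = call.get("tool")
    if not isinstance(tool, str) or not tool:
        problems.append("missing or non-string 'tool'")
    if not isinstance(call.get("arguments"), dict):
        problems.append("missing or non-object 'arguments'")

    technique = call.get("technique")
    if not isinstance(technique, str) or not TECHNIQUE_ID.match(technique):
        problems.append("malformed technique ID")
    elif technique not in KNOWN_TECHNIQUES:
        # Well-formed but absent from the catalog: treated as hallucinated.
        problems.append(f"technique {technique} not in catalog")
    return problems
```

In this sketch a truncated JSON string fails at the parse step, while a well-formed call that cites `T9999` parses cleanly but is still rejected because the ID is not in the catalog.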
This is an internal, fully reproducible test, not a third-party rating.
v1 and v2 run on the same on-device inference engine on Apple Silicon, the exact stack that ships in the desktop app. No advantage given to either side.
Deterministic and reproducible, so each per-pillar delta is real signal, not sampling noise. Reproduced twice with two independent drivers, identical numbers.
No LLM-judge subjectivity. Scoring relies on CVSS base-score recomputation, MITRE ATT&CK technique micro-F1, and structural validation of tool-call and runbook JSON.
Every output is checked against an architecture and identity safety pass, reusing the product's own eval harness, so regressions in guardrails fail the run.
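For readers unfamiliar with the micro-F1 metric named above, the following is a minimal sketch of how it can be computed over predicted versus expected technique IDs. It pools true positives, false positives, and false negatives across all tasks before computing F1. The function name, the input shape, and the choice to return 1.0 when nothing is expected and nothing is predicted are assumptions for this example, not the harness's actual code.

```python
def micro_f1(predicted: list[set[str]], expected: list[set[str]]) -> float:
    """Micro-averaged F1 over per-task sets of technique IDs."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must cover the same tasks")

    tp = fp = fn = 0
    for pred, gold in zip(predicted, expected):
        tp += len(pred & gold)   # IDs the model got right
        fp += len(pred - gold)   # IDs the model added that were not expected
        fn += len(gold - pred)   # expected IDs the model missed

    denominator = 2 * tp + fp + fn
    # Assumption: an empty prediction against an empty reference is a perfect match.
    return 2 * tp / denominator if denominator else 1.0


# Two hypothetical tasks: one extra ID in the first, one missed ID in the second.
score = micro_f1(
    predicted=[{"T1046", "T1110"}, {"T1190"}],
    expected=[{"T1046"}, {"T1190", "T1059.001"}],
)
print(round(score, 3))  # 0.667
```

Because the inputs are sets and the arithmetic is exact counting, the same outputs always yield the same score, which is the property a deterministic scorer needs.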
Defensible as capability: ATT&CK technique F1, tool-call and runbook JSON validity, CVSS-vector correctness, and the identity/architecture safety pass-rate. These are structural or factual checks the model cannot game with prose.
Read as format adherence: must-include phrases and section coverage. They measure whether the output is well-formed and on-template, which is genuinely valuable, but we don't headline them as factual accuracy.
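To make "CVSS-vector correctness" concrete, here is a minimal sketch of recomputing a base score from a vector string and comparing it with the score a model claimed. It assumes CVSS v3.1 base metrics only (the text above does not say which CVSS version the benchmark uses), and the helper names are hypothetical. The weights and rounding rule follow the public v3.1 specification.

```python
import math

# Metric weights from the CVSS v3.1 specification.
AV = {"N": 0.85, "A": 0.62, "L": 0.55, "P": 0.2}
AC = {"L": 0.77, "H": 0.44}
UI = {"N": 0.85, "R": 0.62}
CIA = {"H": 0.56, "L": 0.22, "N": 0.0}
PR_UNCHANGED = {"N": 0.85, "L": 0.62, "H": 0.27}
PR_CHANGED = {"N": 0.85, "L": 0.68, "H": 0.5}


def roundup(value: float) -> float:
    """Round up to one decimal place, as the v3.1 specification defines it."""
    scaled = round(value * 100000)
    if scaled % 10000 == 0:
        return scaled / 100000.0
    return (math.floor(scaled / 10000) + 1) / 10.0


def base_score(vector: str) -> float:
    """Recompute the base score for a 'CVSS:3.1/...' base vector."""
    metrics = dict(part.split(":", 1) for part in vector.split("/"))
    if metrics.get("CVSS") != "3.1":
        raise ValueError("expected a CVSS:3.1 vector")
    if metrics["S"] not in ("U", "C"):
        raise ValueError("scope must be U or C")

    changed = metrics["S"] == "C"
    pr = (PR_CHANGED if changed else PR_UNCHANGED)[metrics["PR"]]
    iss = 1 - (1 - CIA[metrics["C"]]) * (1 - CIA[metrics["I"]]) * (1 - CIA[metrics["A"]])
    if changed:
        impact = 7.52 * (iss - 0.029) - 3.25 * (iss - 0.02) ** 15
    else:
        impact = 6.42 * iss
    exploitability = 8.22 * AV[metrics["AV"]] * AC[metrics["AC"]] * pr * UI[metrics["UI"]]

    if impact <= 0:
        return 0.0
    total = impact + exploitability
    return roundup(min(1.08 * total if changed else total, 10))


def score_is_consistent(vector: str, claimed: float) -> bool:
    """True when the claimed score matches the recomputed one."""
    return abs(base_score(vector) - claimed) < 1e-9


print(base_score("CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H"))  # 9.8
```

A check like this cannot be satisfied by persuasive prose: the vector either recomputes to the claimed score or it does not.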
| Metric | CVEasy AI v2 |
|---|---|
| Generation throughput | ~58 tokens/sec (Apple Silicon, M-series) |
| Prompt evaluation | ~390 tokens/sec with system-prompt cache reuse |
| On-disk footprint | 4.7 GB, runs entirely on-device |
| Data egress | Zero. No CVE data, scan results, or prompts leave the machine. |
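
The two throughput figures in the table can be combined into a back-of-the-envelope latency estimate. The sketch below treats both rates as constants, although the table gives them as approximate, and it ignores model load time and any cache misses. The request sizes are hypothetical.

```python
# Approximate rates taken from the table above.
PROMPT_TOKENS_PER_SEC = 390.0
GENERATION_TOKENS_PER_SEC = 58.0


def estimate_seconds(prompt_tokens: int, output_tokens: int) -> float:
    """Rough wall-clock estimate: prompt evaluation plus generation time."""
    return (
        prompt_tokens / PROMPT_TOKENS_PER_SEC
        + output_tokens / GENERATION_TOKENS_PER_SEC
    )


# Hypothetical request: 1,950 prompt tokens in, 580 tokens out.
print(estimate_seconds(1950, 580))  # 15.0
```

For this example request, about 5 seconds go to reading the prompt and about 10 seconds to writing the answer, so generation speed dominates for long outputs.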
So nobody is surprised later. We'd rather you trust the small numbers than have us oversell the big ones.
Sample size. 17 internal tasks. Deterministic at temperature 0, but a 17-case suite is a directional signal, not a population estimate. The shipped composites above are fixed.
Single-shot vs. multi-turn. v2 is excellent at single-shot security generation: remediation, analysis, triage briefs, and executive narratives. Its multi-turn autonomous tool-use loop is improved but remains variable, which is expected for a model this size. The BAS benchmark measures single-turn tool-call correctness, which is solid.
Fitness is not raw intelligence. The CTEM Fitness Score weights task fidelity together with the deployment realities of running AI inside a security program, which is why the cloud flagships land mid-table. A task-fidelity-only comparison with the deployment penalty stripped out is a different test, and on that axis the cloud flagships would score higher. The claim here is not that a local model out-reasons a frontier cloud model in the abstract. It is that it is the better fit for the job, on hardware you control, with nothing leaving your network.
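The effect of the deployment penalty on ranking can be shown with a toy calculation. Everything in this sketch is invented for illustration: the linear weighting, the 0.5 weight, and the two models' sub-scores are assumptions, not the CTEM Fitness Score's real formula or any measured result.

```python
def fitness(task_fidelity: float, deployment_fit: float,
            fidelity_weight: float = 0.5) -> float:
    """Toy weighted composite of two sub-scores in the 0..1 range."""
    return fidelity_weight * task_fidelity + (1 - fidelity_weight) * deployment_fit


# Invented sub-scores for two placeholder models.
local_model = {"task_fidelity": 0.70, "deployment_fit": 1.00}
cloud_model = {"task_fidelity": 0.85, "deployment_fit": 0.30}

# With deployment realities weighted in, the local model ranks higher.
blended_local = fitness(**local_model)
blended_cloud = fitness(**cloud_model)

# With the deployment term stripped out, the order flips.
fidelity_only_local = fitness(**local_model, fidelity_weight=1.0)
fidelity_only_cloud = fitness(**cloud_model, fidelity_weight=1.0)

print(blended_local > blended_cloud)              # True
print(fidelity_only_local > fidelity_only_cloud)  # False
```

The same two models swap places depending on whether deployment fit is counted, which is why a fitness ranking and a task-fidelity-only ranking should be read as answers to different questions.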
CVEasy AI v2 ships inside the local-first CTEM platform with TRIS™ scoring and built-in BASzy™ attack validation. See the benchmark numbers run against your own estate.
Methodology note. Figures generated from the production evaluation harness comparing the shipped CVEasy AI v1 model against the new fine-tuned v2 model, both served via the same bundled on-device inference engine and scored by the product's deterministic eval scorers. The benchmark is an internal, reproducible 17-task suite, not a third-party certification. © 2026 CVEasy AI. TRIS™ scoring technology is Patent Pending.