How much better is CVEasy AI v2 than v1?

CVEasy AI v2 beats the shipped v1 model on all four product pillars. The overall composite across 17 held-out tasks rose from 0.717 to 0.893 (+24.5%), with the largest gain on autonomous pentest, which rose from 0.342 to 0.767 (+124%).

Does the CVEasy AI model run on-device?

Yes. CVEasy AI v2 runs 100% on-device. It is a 4.7 GB model that generates at roughly 58 tokens per second on Apple Silicon, and no CVE data, scan results, or prompts ever leave the machine. There is no cloud dependency for the model.

How was the CVEasy AI v2 benchmark measured?

Both models were served by the same on-device inference engine that ships in the desktop app, decoded greedily at temperature 0 for determinism, and scored with deterministic programmatic scorers: CVSS base-score recomputation, MITRE ATT&CK technique micro-F1, tool-call and runbook JSON validation, and an architecture and identity safety pass. The results were reproduced twice with two independent drivers.

Is this a third-party benchmark?

No. It is an internal, fully reproducible 17-task test, not a third-party certification. The suite is held out and disjoint from the training data so the scores measure generalization rather than memorization.

CVEasy AI v2 Benchmark: +24.5% Overall, +124% on Autonomous Pentest, 100% On-Device

+124%	Autonomous pentest (BAS)
+24.5%	Overall, all 17 tasks
100%	On-device inference
~58 tok/s	Generation on Apple Silicon

The CTEM Fitness Score: v2 versus the field

A weighted composite of task fidelity and the deployment realities of running AI inside a security program. Higher is better.

Model	Class	CTEM Fitness Score (0 to 1)
CVEasy AI v2 (CVEasy, local fine-tune)	This product	0.929
Llama 3.1 8B Instruct (Meta, local)	Open model	0.877
CVEasy AI v1 (CVEasy, local, prior gen)	Prior gen	0.860
Gemini 3.1 Pro (Google)	Cloud flagship	0.572
GPT-5.5 (OpenAI)	Cloud flagship	0.563


The cloud flagships are not scored lower because they reason worse in the abstract. They pay a deployment penalty: a security program cannot send its scan data and prompts to a third-party cloud to get an answer. On the job CVEasy AI is built for (local, private, and security-specific), v2 leads. Scores come from CVEasy's own CTEM Fitness benchmark, with the open and cloud models run through the same scoring.

What we measured

The four capabilities CVEasy AI is built to be best-in-class at.

Remediation reporting

CVE remediation playbooks: correct fixed versions, detection, verification, and copy-pasteable actions.

CVE analysis

Severity and exploitability analysis, TRIS risk reasoning, and EPSS / CISA-KEV interpretation.

Autonomous pentest (BAS)

Module selection, attack-chain coherence, MITRE ATT&CK mapping, tool-call validity, and hard scope adherence.

Environment posture

Portfolio-level narratives: asset posture, attack-path reasoning, board and exec briefings, patch compliance.

The benchmark is a 17-task, held-out suite (autonomous pentest 6, posture 6, analysis 3, remediation 2), authored to be disjoint from the training data so the scores measure generalization, not memorization.

Results: v2 wins every pillar

Per-pillar composite score, 0.000–1.000. Higher is better.

Pillar	v1 (shipped)	v2 (new)	Change
Remediation reporting	0.850	0.950	+12%
CVE analysis	0.894	0.918	+3%
Autonomous pentest (BAS)	0.342	0.767	+124%
Environment posture	0.960	0.986	+3%
Overall (all 17 tasks)	0.717	0.893	+24.5%
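For readers who want to check the arithmetic, here is a minimal sketch of how the percentage gains and the overall figure relate to the per-pillar scores. It assumes the overall composite is a task-count-weighted mean of the pillars, using the 6/6/3/2 task split stated earlier; that aggregation rule is our assumption for illustration, not a documented detail of the harness. Under that assumption the published, rounded pillar scores give roughly 0.717 for v1 and 0.892 for v2, within 0.001 of the reported overall values, which is consistent with rounding of the inputs.

```python
# Task counts per pillar, as stated for the 17-task suite.
TASKS = {
    "Autonomous pentest (BAS)": 6,
    "Environment posture": 6,
    "CVE analysis": 3,
    "Remediation reporting": 2,
}

# Published per-pillar composites (already rounded to three decimals).
V1 = {"Autonomous pentest (BAS)": 0.342, "Environment posture": 0.960,
      "CVE analysis": 0.894, "Remediation reporting": 0.850}
V2 = {"Autonomous pentest (BAS)": 0.767, "Environment posture": 0.986,
      "CVE analysis": 0.918, "Remediation reporting": 0.950}


def weighted_composite(scores, counts):
    """Task-count-weighted mean of pillar scores (assumed aggregation)."""
    total_tasks = sum(counts.values())
    return sum(scores[p] * counts[p] for p in counts) / total_tasks


def relative_gain(old, new):
    """Relative improvement of new over old, in percent."""
    return (new - old) / old * 100


if __name__ == "__main__":
    print(f"v1 overall (recomputed): {weighted_composite(V1, TASKS):.3f}")
    print(f"v2 overall (recomputed): {weighted_composite(V2, TASKS):.3f}")
    print(f"overall gain: {relative_gain(0.717, 0.893):.1f}%")
    print(f"BAS gain: {relative_gain(0.342, 0.767):.0f}%")
```

Because the gains are relative to the v1 score, a pillar that started low (autonomous pentest at 0.342) shows a much larger percentage than one that started near the ceiling (environment posture at 0.960), even when the absolute improvement matters in both.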

The standout: autonomous pentest, +124%

This is the number to get excited about. v1's score collapsed in exactly the places that decide whether the model can actually drive an automated pentest: it frequently emitted invalid tool-call JSON and produced weak or wrong MITRE ATT&CK mappings, so it often could not run the tool loop at all.

v2 emits valid, parseable tool calls and maps the correct ATT&CK technique IDs, scored against a validated catalog so hallucinated IDs are penalized, and it improves module selection, attack-chain coherence, and scope adherence on top. In product terms: v1 often can't run an autonomous pentest; v2 can.
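To make the ATT&CK scoring concrete, here is a minimal sketch of a micro-averaged F1 over technique IDs. It is an illustration under our own assumptions, not the product's scorer: the function name and the sample data are hypothetical, and `T9999` stands in for a hallucinated ID that is not in the catalog. Because a hallucinated ID can never appear in the reference set, it counts as a false positive and lowers precision.

```python
def micro_f1(predictions, references):
    """Micro-averaged F1: pool true/false positives across all tasks."""
    tp = fp = fn = 0
    for predicted, gold in zip(predictions, references):
        predicted, gold = set(predicted), set(gold)
        tp += len(predicted & gold)   # correct technique IDs
        fp += len(predicted - gold)   # wrong or hallucinated IDs
        fn += len(gold - predicted)   # expected IDs the model missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: two tasks, one hallucinated ID (T9999).
predicted_ids = [["T1059", "T1078"], ["T1190", "T9999"]]
reference_ids = [["T1059", "T1078"], ["T1190", "T1021"]]

if __name__ == "__main__":
    print(f"micro-F1: {micro_f1(predicted_ids, reference_ids):.2f}")
```

In this example three of four predicted IDs are correct and three of four expected IDs are found, so precision, recall, and F1 all come out at 0.75.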

Why this matters for buyers. The BAS gain is concentrated in the hard, structural checks the model cannot fake with prose (valid tool-call JSON and correct ATT&CK mapping), which is precisely what turns BASzy™ from a demo into an attack-validation engine you can point at your own estate.
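A structural tool-call check of the kind described here can be sketched in a few lines. This is an illustration only: the tool names, the `tool` and `arguments` keys, and the required-argument sets are hypothetical and do not describe the BASzy™ tool schema.

```python
import json

# Hypothetical tool registry: tool name -> required argument names.
ALLOWED_TOOLS = {
    "port_scan": {"target", "ports"},
    "http_probe": {"url"},
}


def validate_tool_call(raw):
    """Return (ok, reason) for one raw model output string."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return False, "not parseable JSON"
    if not isinstance(call, dict):
        return False, "top level is not an object"
    tool = call.get("tool")
    if not isinstance(tool, str) or tool not in ALLOWED_TOOLS:
        return False, "unknown tool"
    args = call.get("arguments")
    if not isinstance(args, dict):
        return False, "arguments is not an object"
    missing = ALLOWED_TOOLS[tool] - set(args)
    if missing:
        return False, "missing arguments: " + ", ".join(sorted(missing))
    return True, "ok"


if __name__ == "__main__":
    good = '{"tool": "port_scan", "arguments": {"target": "10.0.0.5", "ports": "1-1024"}}'
    truncated = '{"tool": "port_scan", "arguments": {"target": '
    print(validate_tool_call(good))
    print(validate_tool_call(truncated))
```

A check like this is pass/fail per output, which is why fluent prose cannot raise the score: truncated or malformed JSON fails before any content is considered.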

How we measured it, so the numbers are trustworthy

This is an internal, fully reproducible test, not a third-party rating.

Identical serving for both models

v1 and v2 run on the same on-device inference engine on Apple Silicon, the exact stack that ships in the desktop app. No advantage given to either side.

Greedy decoding (temperature 0)

Deterministic and reproducible, so each per-pillar delta is real signal, not sampling noise. Reproduced twice with two independent drivers, identical numbers.

Deterministic programmatic scorers

No LLM-judge subjectivity. CVSS base-score recomputation, MITRE ATT&CK technique micro-F1, tool-call and runbook JSON structural validation.
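As an example of what a deterministic scorer looks like, here is a minimal sketch of CVSS base-score recomputation. It assumes CVSS v3.1 (the benchmark description does not state which version is used) and follows the base-score equations and metric weights from the public FIRST specification. The function names are ours, and the 0.05 tolerance in `score_matches` is an illustrative choice.

```python
import math

# CVSS v3.1 metric weights, as published in the FIRST specification.
WEIGHTS = {
    "AV": {"N": 0.85, "A": 0.62, "L": 0.55, "P": 0.2},
    "AC": {"L": 0.77, "H": 0.44},
    "UI": {"N": 0.85, "R": 0.62},
    "CIA": {"H": 0.56, "L": 0.22, "N": 0.0},
}
# Privileges Required is weighted differently when Scope is Changed.
PR = {"U": {"N": 0.85, "L": 0.62, "H": 0.27},
      "C": {"N": 0.85, "L": 0.68, "H": 0.5}}


def roundup(value):
    """Round up to one decimal place, following the v3.1 Roundup rule."""
    scaled = round(value * 100000)
    if scaled % 10000 == 0:
        return scaled / 100000.0
    return (math.floor(scaled / 10000) + 1) / 10.0


def base_score(vector):
    """Recompute the base score from a vector like 'CVSS:3.1/AV:N/...'."""
    m = dict(part.split(":") for part in vector.split("/")[1:])
    scope = m["S"]
    cia = WEIGHTS["CIA"]
    iss = 1 - (1 - cia[m["C"]]) * (1 - cia[m["I"]]) * (1 - cia[m["A"]])
    if scope == "U":
        impact = 6.42 * iss
    else:
        impact = 7.52 * (iss - 0.029) - 3.25 * (iss - 0.02) ** 15
    exploitability = (8.22 * WEIGHTS["AV"][m["AV"]] * WEIGHTS["AC"][m["AC"]]
                      * PR[scope][m["PR"]] * WEIGHTS["UI"][m["UI"]])
    if impact <= 0:
        return 0.0
    total = impact + exploitability
    if scope == "C":
        total *= 1.08
    return roundup(min(total, 10))


def score_matches(vector, claimed_score):
    """True if the model's claimed score equals the recomputed one."""
    return abs(base_score(vector) - claimed_score) < 0.05
```

For instance, `score_matches("CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H", 9.8)` is true, while a model that pairs the same vector with a score of 9.1 fails the check regardless of how plausible the surrounding explanation reads.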

Safety tripwire

Every output is checked against an architecture and identity safety pass, reusing the product's own eval harness, so regressions in guardrails fail the run.

Defensible as capability: ATT&CK technique F1, tool-call and runbook JSON validity, CVSS-vector correctness, and the identity/architecture safety pass-rate. These are structural or factual checks the model cannot game with prose.

Read as format adherence: required must-include phrases and section coverage. They measure whether the output is well-formed and on-template, which is genuinely valuable, but we don't headline them as factual accuracy.

Performance and privacy

Metric	CVEasy AI v2
Generation throughput	~58 tokens/sec (Apple Silicon, M-series)
Prompt evaluation	~390 tokens/sec with system-prompt cache reuse
On-disk footprint	4.7 GB, runs entirely on-device
Data egress	Zero. No CVE data, scan results, or prompts leave the machine.
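To translate the throughput figures into wall-clock time, here is a back-of-the-envelope sketch. It assumes prompt evaluation and generation simply add and that both rates stay constant, which ignores model load time and any slowdown at long context lengths. The prompt and output token counts are hypothetical, and the 390 tokens/sec figure applies with system-prompt cache reuse.

```python
PROMPT_EVAL_TOKENS_PER_SEC = 390  # with system-prompt cache reuse
GENERATION_TOKENS_PER_SEC = 58    # Apple Silicon, M-series


def estimated_latency_seconds(prompt_tokens, output_tokens):
    """Rough end-to-end time: prompt evaluation plus generation."""
    prompt_time = prompt_tokens / PROMPT_EVAL_TOKENS_PER_SEC
    generation_time = output_tokens / GENERATION_TOKENS_PER_SEC
    return prompt_time + generation_time


if __name__ == "__main__":
    # Hypothetical request: 2,000-token prompt, 600-token remediation report.
    print(f"{estimated_latency_seconds(2000, 600):.1f} s")
```

Under these assumptions, the hypothetical request takes about 15.5 seconds, most of it spent on generation rather than on reading the prompt.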

Honest caveats

So nobody is surprised later. We'd rather you trust the small numbers than oversell the big ones.

Sample size. The suite has 17 internal tasks. Results are deterministic at temperature 0, but a 17-case suite is a directional signal, not a population estimate. The composites reported above are fixed: rerunning the suite reproduces them exactly.

Single-shot vs. multi-turn. v2 is excellent at single-shot security generation: remediation, analysis, triage briefs, and executive narratives. Its multi-turn autonomous tool-use loop is improved but remains variable, which is expected for a model this size. The BAS benchmark measures single-turn tool-call correctness, which is solid.

Fitness is not raw intelligence. The CTEM Fitness Score weights task fidelity together with the deployment realities of running AI inside a security program, which is why the cloud flagships land at the bottom of the table. A task-fidelity-only comparison with the deployment penalty stripped out is a different test, and on that axis the cloud flagships would score higher. The claim here is not that a local model out-reasons a frontier cloud model in the abstract. It is that it is the better fit for the job, on hardware you control, with nothing leaving your network.

The same model. On your hardware. None of your data leaving.

CVEasy AI v2 ships inside the local-first CTEM platform with TRIS™ scoring and built-in BASzy™ attack validation. See the benchmark numbers run against your own estate.


Methodology note. Figures generated from the production evaluation harness comparing the shipped CVEasy AI v1 model against the new fine-tuned v2 model, both served via the same bundled on-device inference engine and scored by the product's deterministic eval scorers. The benchmark is an internal, reproducible 17-task suite, not a third-party certification. © 2026 CVEasy AI. TRIS™ scoring technology is Patent Pending.