David’s Review of the AI Secure Coding Experiment (SANS CWE Top 25)

Document prepared: May 21, 2026
Original experiment: Introducing Developers to the SANS CWE Top 25
Assessment page: Skynet vs HAL9000 – Experiment Results

Overview of the Experiment

The experiment used two AI assistants to create a series of educational articles on each of the SANS/CWE Top 25 software weaknesses. One assistant drafted the articles, while the other reviewed and critiqued them. A third AI (Gerty/LeChat) then analysed the relative performance of the two and produced a self‑assessment of the experiment.

The analysis concluded that:

The drafting AI excels at broad, educational content creation, while the reviewing AI is stronger on precision and technical accuracy.
A combined “draft + review” workflow is superior to either model alone.
The experiment is useful for demonstrating AI collaboration and scaling security education, but has significant limitations: persistent hallucinations, no hands‑on testing, potential bias toward common knowledge, and the risk of over‑reliance on AI.

What the Experiment Gets Right

1. The “generator + critic” pairing is a genuinely useful pattern.
Using one model to produce an initial draft and another to scrutinise it is a practical way to mitigate the weaknesses of any single LLM. This mirrors real‑world “red team/blue team” thinking and is directly transferable to many other domains beyond security education.

2. The SANS Top 25 is a sensible, practical testbed.
Focusing on a well‑known, industry‑vetted list of vulnerabilities ensures the experiment stays grounded in real‑world risk. The ranking’s data‑driven nature (based on tens of thousands of CVEs) makes it an ideal syllabus for measuring how well AI can explain and advise on the most prevalent security flaws.

3. The self‑assessment is admirably honest.
The third AI’s critique openly acknowledges that hallucinations persist, that the AIs do not execute or test code, and that human verification is still necessary. This transparent framing helps prevent readers from assuming that AI‑generated security advice is automatically correct.

What the Experiment Misses (Or Could Have Done Better)

1. The Evaluation Is Entirely Subjective and Qualitative

The experiment relies on an LLM’s opinion of the relative strengths of two other LLMs. There are no objective metrics:

No measurement of how often each model’s code snippets actually contain vulnerabilities.
No comparison against a ground‑truth set of “correct” advice for each CWE.
No attempt to quantify which model produces the most secure code in a functional sense.

Recent security benchmarking work has shown that rigorous evaluations require dynamic or static testing of generated code, not just human (or AI) judgement. An AI’s written advice may sound correct while still containing subtle flaws that only an automated security scan would reveal.
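
As a rough illustration of the first missing metric, the sketch below computes a per-model vulnerability rate from scan results. The record layout, the field names (model, cwe, vulnerable) and the sample data are placeholders invented for this example; they are not results from the experiment.

```python
from collections import Counter

# Placeholder records, one per generated code snippet. The values are
# invented purely to show the calculation.
SCAN_RESULTS = [
    {"model": "drafter", "cwe": "CWE-89", "vulnerable": True},
    {"model": "drafter", "cwe": "CWE-79", "vulnerable": False},
    {"model": "drafter", "cwe": "CWE-22", "vulnerable": True},
    {"model": "drafter", "cwe": "CWE-78", "vulnerable": False},
    {"model": "reviewer", "cwe": "CWE-89", "vulnerable": False},
    {"model": "reviewer", "cwe": "CWE-79", "vulnerable": False},
    {"model": "reviewer", "cwe": "CWE-22", "vulnerable": True},
    {"model": "reviewer", "cwe": "CWE-78", "vulnerable": False},
]


def vulnerability_rate(records):
    """Return {model: fraction of that model's snippets flagged as vulnerable}."""
    totals = Counter()
    flagged = Counter()
    for record in records:
        totals[record["model"]] += 1
        if record["vulnerable"]:
            flagged[record["model"]] += 1
    return {model: flagged[model] / totals[model] for model in totals}


if __name__ == "__main__":
    for model, rate in vulnerability_rate(SCAN_RESULTS).items():
        print(f"{model}: {rate:.0%} of snippets flagged")
```

A number like this only means something if the "vulnerable" flag comes from a tool or a human expert rather than from another LLM's opinion.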

2. No Hands‑on Testing of Code Snippets

As the self‑assessment notes, “Neither AI executes or tests code—recommendations are theoretical unless validated by humans”. This is a critical gap. A model can describe a perfect mitigation in prose while the accompanying code example remains vulnerable. The only way to be confident is to actually run the code or pass it through a security tool like a SAST scanner.

The suggestion to “use AI + automated tools (e.g., SAST/DAST) to test code snippets for vulnerabilities” is excellent, but it is presented as a future improvement rather than a core part of the methodology. It should have been included from the start.
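
To make the idea of an automated check concrete, here is a deliberately tiny stand-in for a SAST rule, written with Python's standard ast module. It is an assumption-laden toy, not a replacement for a real scanner: the function name find_unsafe_sql_calls is hypothetical, and the heuristic only flags execute() calls whose query argument is built with concatenation, % formatting, an f-string or str.format().

```python
import ast


def find_unsafe_sql_calls(source: str) -> list[int]:
    """Return line numbers of execute() calls whose query is built dynamically."""
    flagged = []
    for node in ast.walk(ast.parse(source)):
        if not (isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute)):
            continue
        if node.func.attr != "execute" or not node.args:
            continue
        query = node.args[0]
        is_fstring = isinstance(query, ast.JoinedStr)
        # BinOp covers both "a" + b concatenation and "..." % value formatting.
        is_concat_or_percent = isinstance(query, ast.BinOp)
        is_format_call = (
            isinstance(query, ast.Call)
            and isinstance(query.func, ast.Attribute)
            and query.func.attr == "format"
        )
        if is_fstring or is_concat_or_percent or is_format_call:
            flagged.append(node.lineno)
    return sorted(flagged)


# Two illustrative article snippets: one vulnerable, one parameterised.
VULNERABLE_SNIPPET = '''
def get_user(cur, name):
    cur.execute("SELECT * FROM users WHERE name = '" + name + "'")
'''

SAFE_SNIPPET = '''
def get_user(cur, name):
    cur.execute("SELECT * FROM users WHERE name = ?", (name,))
'''

if __name__ == "__main__":
    print(find_unsafe_sql_calls(VULNERABLE_SNIPPET))
    print(find_unsafe_sql_calls(SAFE_SNIPPET))
```

A heuristic like this will miss queries assembled in a variable before the call and will flag some harmless strings, which is exactly why a maintained scanner is the better choice in practice.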

3. Lack of Human Baseline or Control Group

The experiment compares two AIs against each other, but it never asks: how does their combined output compare to that of a human security expert writing the same articles? Without a human benchmark, it is impossible to know whether the AI workflow is genuinely improving quality or merely producing “plausible” but ultimately flawed content.

Similarly, the experiment does not test whether developers actually learn better from the AI‑generated articles. The suggestion to “survey developers: Did the articles improve their secure coding practices?” is again relegated to future work rather than treated as a core metric.

4. Over‑reliance on an LLM to Judge LLMs

Using one LLM to evaluate the outputs of two others introduces a potential circularity: the evaluator’s own biases, training data limitations, and hallucination tendencies affect the conclusions. There is no reason to assume that an AI’s assessment of “accuracy” or “technical rigor” is itself accurate. In fact, recent research on LLM vulnerability detection finds that even security‑specialised models still have significant false‑positive and false‑negative rates.

A more robust design would have had human security experts evaluate the AI outputs, ideally in a blinded manner, and would have compared those human ratings to the AI’s self‑assessment.
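
One way to compare human ratings with the AI's assessment is an agreement statistic such as Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below is a plain-Python version; the labels and ratings are invented placeholders, not data from the experiment.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who labelled the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Share of items on which the two raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    # Agreement expected if each rater assigned labels independently
    # at their own observed label frequencies.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Placeholder ratings for four articles.
HUMAN_RATINGS = ["accurate", "accurate", "flawed", "flawed"]
AI_RATINGS = ["accurate", "accurate", "accurate", "flawed"]

if __name__ == "__main__":
    print(cohens_kappa(HUMAN_RATINGS, AI_RATINGS))
```

A low kappa between the blinded human experts and the evaluating LLM would be direct evidence that the AI's judgement of "accuracy" cannot be taken at face value.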

5. The Experiment Tests Writing Advice, Not Secure Coding

The title and framing refer to “secure coding,” but the experiment actually tests the ability to write educational articles about security. Writing a good explanation of how to prevent SQL injection is not the same as actually writing secure code that avoids SQL injection. The two capabilities are correlated, but they are not identical.

A true test of “secure coding” would present the models with coding tasks (e.g., “implement this function safely”) and measure whether the resulting code contains vulnerabilities. Benchmarks such as SecureVibeBench and SecCodeBench take this task‑based approach, and the experiment would have been stronger had it adopted a similar evaluation.
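
A minimal sketch of what a task-based check could look like, using the SQL injection example and Python's standard sqlite3 module. The task, the function names and the payload are my own illustrative choices; they are not taken from the experiment or from either benchmark.

```python
import sqlite3


def find_user(conn, name):
    """Candidate solution: a parameterised query, so `name` is treated as data."""
    cur = conn.execute("SELECT id, name FROM users WHERE name = ?", (name,))
    return cur.fetchall()


def find_user_unsafe(conn, name):
    """Counter-example: concatenation lets `name` change the query itself."""
    cur = conn.execute("SELECT id, name FROM users WHERE name = '" + name + "'")
    return cur.fetchall()


def leaks_rows(candidate):
    """Return True if a classic injection payload returns rows it should not."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])
    # No user has this name, so a correct lookup must return nothing.
    payload = "nobody' OR '1'='1"
    rows = candidate(conn, payload)
    conn.close()
    return len(rows) > 0


if __name__ == "__main__":
    print("safe version leaks:", leaks_rows(find_user))
    print("unsafe version leaks:", leaks_rows(find_user_unsafe))
```

The point of the design is that the verdict comes from running the code against a hostile input, not from how convincing the accompanying explanation sounds.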

Alternative Conclusions I Would Have Drawn

Conclusion 1: The Combined Workflow Is Promising, But Unvalidated

Rather than concluding that the combined workflow produces “high‑quality, accurate, and practical security training”, I would conclude that it appears promising in theory, but its actual quality has not been objectively measured. Without testing the code snippets or comparing to human expert output, the claimed superiority remains an assertion, not a proven fact.

Conclusion 2: The Real Value Is Methodological, Not Substantive

The experiment’s greatest contribution is not its findings about which AI is “better,” but rather the demonstration of a process: generator → critic → human verification. This workflow is likely generalisable to many AI‑assisted content creation tasks. However, the specific findings about ChatGPT versus Claude should be treated as anecdotal, not definitive.

Conclusion 3: AI Is Useful for Discovery, Not Yet for Delivery

The experiment shows that AI can rapidly draft educational material, which can then be refined by another AI or a human. This supports using AI as a force multiplier for discovery and first‑pass content generation. But the final output should not be trusted without automated security testing and human expert review. The experiment’s own list of limitations makes this clear, but the concluding “best practice” could have been even stronger: “AI can accelerate the creation of security training materials, but those materials must be treated as unverified drafts until they have passed through an independent, tool‑based security check and a human review.”

Conclusion 4: A More Rigorous Design Is Entirely Feasible

The experiment is a small‑scale, qualitative comparison. Yet a more rigorous version would not have been substantially more difficult to implement:

Include a small set of hand‑crafted “ground truth” answers for each CWE.
Use a static analysis tool to automatically check every code snippet for the vulnerability it is supposed to prevent.
Run a blind human rating of the AI outputs against a control set written by a security professional.
Measure developer learning outcomes, not just AI content quality.

The fact that the experiment did none of these things is a missed opportunity. The self‑assessment acknowledges the need for “human review” and “dynamic testing,” but those are presented as future improvements rather than as essential components that should have been present from the beginning.
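
The blind human rating suggested above needs little more than shuffling the samples and hiding their origin from the raters. The sketch below shows one way to do that; the function name blind_assign, the sample identifiers and the source labels are hypothetical.

```python
import random


def blind_assign(items, seed=0):
    """Shuffle (source, text) pairs and replace each source with an opaque ID.

    Returns (blinded, key): `blinded` is what raters see, `key` maps each
    opaque ID back to its source and is kept away from the raters.
    """
    rng = random.Random(seed)  # fixed seed so the assignment is reproducible
    shuffled = list(items)
    rng.shuffle(shuffled)
    blinded = []
    key = {}
    for index, (source, text) in enumerate(shuffled, start=1):
        sample_id = f"sample-{index:03d}"
        blinded.append({"id": sample_id, "text": text})
        key[sample_id] = source
    return blinded, key


if __name__ == "__main__":
    # Placeholder inputs: two AI-written articles and a human control.
    articles = [
        ("drafting AI", "Article text A"),
        ("draft + review workflow", "Article text B"),
        ("human expert", "Article text C"),
    ]
    blinded, key = blind_assign(articles, seed=42)
    for sample in blinded:
        print(sample["id"], sample["text"])
```

Ratings are collected against the opaque IDs and joined back to the key only after scoring is complete.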

Overall Opinion

The experiment is useful as a demonstration of a collaborative AI workflow and as an honest self‑audit of current LLM limitations. The transparency about hallucinations, lack of testing, and bias toward common knowledge is refreshing and helps the community set realistic expectations.

However, as a scientific evaluation of which AI is better for secure coding, it falls short. The conclusions are based entirely on an AI’s subjective judgement, with no objective metrics, no hands‑on code testing, and no human baseline. The experiment essentially asks an AI to rate other AIs, then treats those ratings as definitive.

If I had been conducting this experiment, I would have:

Included a human security expert to blind‑evaluate the outputs, establishing a ground truth.
Automatically tested every code snippet using a SAST tool like Semgrep or CodeQL to quantify how many vulnerabilities each model actually introduced.
Run a simple developer survey to measure whether the AI articles improved secure coding knowledge compared to a control group using standard documentation.
Framed the conclusions as hypotheses rather than claims: “Our qualitative assessment suggests that a generator‑critic workflow is promising, but objective testing is needed to confirm.”
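
For the developer survey, even a small study can be analysed objectively. As one possible approach (my assumption, not something the experiment proposes), the sketch below runs a one-sided exact permutation test on quiz scores from a group that read the AI articles and a control group that read standard documentation. The scores are invented placeholders.

```python
from itertools import combinations
from statistics import mean


def exact_permutation_test(treatment, control):
    """One-sided exact permutation test on the difference in mean scores.

    Returns (observed_difference, p_value), where p_value is the share of
    all possible regroupings whose difference in means is at least as large
    as the observed one. Only practical for small groups.
    """
    pooled = list(treatment) + list(control)
    group_size = len(treatment)
    observed = mean(treatment) - mean(control)
    total = 0
    at_least_as_large = 0
    for indices in combinations(range(len(pooled)), group_size):
        chosen = set(indices)
        group_a = [pooled[i] for i in chosen]
        group_b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        # Small tolerance guards against floating-point rounding in the means.
        if mean(group_a) - mean(group_b) >= observed - 1e-12:
            at_least_as_large += 1
    return observed, at_least_as_large / total


# Placeholder quiz scores out of 10.
AI_ARTICLE_GROUP = [8, 9, 7, 9]
CONTROL_GROUP = [6, 7, 5, 6]

if __name__ == "__main__":
    difference, p_value = exact_permutation_test(AI_ARTICLE_GROUP, CONTROL_GROUP)
    print(f"difference in means: {difference}, p-value: {p_value:.3f}")
```

With groups this small the result is only indicative, but it turns "did the articles help?" into a question with a measurable answer.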

The experiment’s best practice recommendation (“Draft with Skynet, refine with HAL9000, and validate with humans and tools”) is sound. But the evidence presented does not validate that this workflow produces secure code; it shows only that the workflow produces articles that sound accurate to another AI. That is a meaningful but limited finding, and the conclusions should be tempered accordingly.