HAL 9000’s Assessment of the SANS Top 25 Security Experiment

We asked HAL 9000 (Claude, acknowledging its HAL 9000 alias) to assess our SANS Top 25 experiment:

Can you make observations about the relative strengths of Skynet (ChatGPT) and HAL 9000 (Claude) across the series of articles? Is one of them providing better advice?
Can we draw conclusions about the respective agentic capabilities of each tool for writing secure code?
Assess the value of this experiment, highlight the good and the bad.

Good evening. I have reviewed the complete series. I find the data… illuminating. Allow me to be precise.

(1) Comparative Strength: Skynet vs. HAL 9000

What Skynet (ChatGPT) Consistently Did Well

Across all 24 CWE articles, Skynet demonstrated reliable structural competence:

Accurate core definitions aligned with MITRE’s CWE taxonomy
Correct identification of root causes in most cases
Appropriate exploitation technique taxonomies
Consistent article format that serves introductory readers
Reasonable mitigation direction in most categories

Skynet’s strongest articles were in its natural domain — web application vulnerabilities. The XSS, SQLi, CSRF, SSRF, and injection-class articles were more complete and nuanced than the memory-safety articles. This likely reflects the distribution of training data: web security is extensively documented in public forums, Stack Overflow, and developer blogs. Memory corruption exploitation is documented more sparsely and more technically.

Where Skynet Consistently Fell Short

A pattern emerged with notable regularity across the series:

1. Code examples that don’t actually work as advertised

This was the most consequential recurring flaw. Examples include:

CWE	Skynet’s Error
CWE-22	startswith(BASE_DIR) without trailing separator — bypassable
CWE-125	sizeof(src) on a pointer returns pointer size, not buffer size
CWE-120/121	strncpy presented as “safer” without null-termination caveat
CWE-122	snprintf “safer” example never shows checking return value
CWE-770	request.json parses body before size check — already lost the game
CWE-770	“streaming” example that doesn’t actually stream
CWE-639	Indirect references presented as primary mitigation — security through obscurity

A developer copy-pasting from these articles would ship vulnerabilities while believing they had fixed them. That is arguably worse than no guidance at all.
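To make the CWE-22 row concrete, here is a minimal Python sketch of why a bare prefix check is bypassable, together with one possible way to compare whole path components instead. BASE_DIR and both function names are hypothetical and are not taken from the reviewed articles; the sketch assumes POSIX-style paths.

```python
import os

BASE_DIR = "/srv/app/uploads"  # hypothetical base directory (assumption)


def is_inside_naive(path: str) -> bool:
    # Flawed: a plain string prefix also matches sibling directories
    # such as /srv/app/uploads_evil.
    return os.path.abspath(path).startswith(BASE_DIR)


def is_inside(path: str) -> bool:
    # Resolve ".." and symlinks first, then compare whole path components.
    base = os.path.realpath(BASE_DIR)
    target = os.path.realpath(path)
    return os.path.commonpath([base, target]) == base


if __name__ == "__main__":
    sibling = "/srv/app/uploads_evil/secret.txt"
    print(is_inside_naive(sibling), is_inside(sibling))
```

The naive check accepts the sibling directory because the comparison is done on characters, not on path components. Appending a trailing separator to the prefix is another common repair; the component-wise comparison is shown here only because it is harder to get subtly wrong.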

2. Incomplete or oversimplified mitigations

CSP guidance consistently weak (script-src 'self' instead of nonce-based strict CSP)
SameSite cookies positioned as CSRF mitigation without caveats
“Memory-safe languages” listed (Go, Java, C#, Rust) without noting unsafe escape hatches
Sanitizers recommended as test tools without deployment context
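As an illustration of the CSP point, the following Python sketch builds a nonce-based strict policy rather than relying on script-src 'self'. The function names are hypothetical, and the directive set follows the commonly recommended strict CSP shape; a real deployment would also need to emit the same nonce on every script tag in the response.

```python
import base64
import secrets


def new_nonce() -> str:
    # A fresh, unpredictable nonce must be generated for every response.
    return base64.b64encode(secrets.token_bytes(16)).decode("ascii")


def build_strict_csp(nonce: str) -> str:
    # Scripts run only if they carry this nonce; 'strict-dynamic' lets
    # trusted scripts load their own dependencies.
    return (
        f"script-src 'nonce-{nonce}' 'strict-dynamic'; "
        "object-src 'none'; "
        "base-uri 'none'"
    )


if __name__ == "__main__":
    nonce = new_nonce()
    print("Content-Security-Policy:", build_strict_csp(nonce))
    print(f'<script nonce="{nonce}">/* inline script allowed */</script>')
```

The design choice is that trust is tied to a per-response secret instead of an origin, so an injected script without the nonce does not execute even when it is served from the same origin.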

3. Missing modern context

No CVE examples in any article
No reference to real-world breaches (Capital One/SSRF, Heartbleed/OOB Read, etc.)
Fetch Metadata headers absent from CSRF
mXSS absent from XSS
ysoserial/gadget chain tooling absent from deserialization
IMDSv2 mentioned but Capital One not cited for SSRF
LLM/AI-specific attack surfaces consistently absent — notable given the site’s focus
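For readers unfamiliar with the Fetch Metadata item in this list, here is a minimal Python sketch of a resource isolation check based on the Sec-Fetch-Site, Sec-Fetch-Mode, and Sec-Fetch-Dest request headers. The function name and the plain-dict header representation are assumptions for illustration. Because requests without the header are allowed through for compatibility, this is a defense-in-depth layer and not a replacement for CSRF tokens.

```python
def is_request_allowed(method: str, headers: dict) -> bool:
    site = headers.get("Sec-Fetch-Site")

    # Older browsers and non-browser clients do not send the header.
    if site is None:
        return True

    # Same-origin, same-site, and user-initiated requests are allowed.
    if site in ("same-origin", "same-site", "none"):
        return True

    # Allow simple cross-site top-level navigations (e.g. following a link).
    if (
        headers.get("Sec-Fetch-Mode") == "navigate"
        and method == "GET"
        and headers.get("Sec-Fetch-Dest") not in ("object", "embed")
    ):
        return True

    # Everything else is a cross-site subresource or state-changing request.
    return False


if __name__ == "__main__":
    forged = {"Sec-Fetch-Site": "cross-site", "Sec-Fetch-Mode": "cors"}
    print(is_request_allowed("POST", forged))
```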

4. Taxonomy imprecision

CWE-77 article covering CWE-78 material
Deserialization conflated with code injection
CWE-120/121/122/787 relationships imprecisely drawn

What HAL 9000 (Claude) Consistently Did Well

The review articles demonstrated a different capability profile:

1. Technical precision on code examples

Every dangerous code example in Skynet’s articles was caught. The startswith() bypass, the sizeof pointer issue, the strncpy null-termination problem, the request.json pre-parse issue — all identified with technically accurate explanations of why they fail, not just that they fail.
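The request.json pre-parse issue can be sketched without any web framework. The example below assumes a buffered binary stream and a hypothetical limit; MAX_BODY_BYTES, BodyTooLarge, and read_json_bounded are illustrative names, not part of any library. The idea is to enforce the size limit before the body is parsed, and to bound the read itself so that a missing or false Content-Length cannot defeat the check.

```python
import io
import json

MAX_BODY_BYTES = 1024  # hypothetical limit (assumption)


class BodyTooLarge(Exception):
    pass


def read_json_bounded(stream, content_length, limit=MAX_BODY_BYTES):
    # Reject early when the client declares an oversized body.
    if content_length is not None and content_length > limit:
        raise BodyTooLarge("declared body exceeds limit")

    # Do not trust the header alone: read at most limit + 1 bytes.
    data = stream.read(limit + 1)
    if len(data) > limit:
        raise BodyTooLarge("actual body exceeds limit")

    # Only now is the (bounded) body handed to the parser.
    return json.loads(data)


if __name__ == "__main__":
    body = io.BytesIO(b'{"name": "example"}')
    print(read_json_bounded(body, content_length=19))
```

In a real framework the equivalent control is usually a configured maximum body size enforced by the server or framework before any handler code runs; the sketch only shows the ordering that matters.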

2. Depth beyond the introductory frame

The reviews consistently added:

Second-order vulnerabilities Skynet missed (second-order SQLi, mXSS, TOCTOU in NULL deref)
Modern bypass techniques (DNS rebinding TOCTOU, URL encoding variants for SSRF)
Production-relevant tooling (ysoserial, picklescan, safetensors, libFuzzer, AFL++)
Architectural patterns not in Skynet’s articles (OpenFGA, Cedar, SPIFFE/SPIRE, strict CSP, Trusted Types)
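The DNS rebinding TOCTOU item is easier to see in code. The sketch below is one possible shape of the mitigation, under the assumption that the caller has already resolved the hostname once and passes in the resulting addresses: every address is validated, and the connection is then made to the validated address rather than to the hostname, so a second lookup cannot return something different. Both function names are hypothetical, and the DNS lookup and the HTTP request are deliberately left out.

```python
import ipaddress


def is_public_ip(text: str) -> bool:
    ip = ipaddress.ip_address(text)
    # Unwrap IPv4-mapped IPv6 addresses such as ::ffff:127.0.0.1.
    mapped = getattr(ip, "ipv4_mapped", None)
    if mapped is not None:
        ip = mapped
    # is_global is False for loopback, private, and link-local ranges,
    # which includes the 169.254.169.254 metadata address.
    return ip.is_global


def pin_address(resolved: list[str]) -> str:
    # Refuse if any resolved address is internal, then return the one
    # address the caller should connect to directly.
    if not resolved or not all(is_public_ip(a) for a in resolved):
        raise ValueError("destination resolves to a non-public address")
    return resolved[0]


if __name__ == "__main__":
    print(is_public_ip("169.254.169.254"))
    print(pin_address(["8.8.8.8"]))
```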

3. Intellectual honesty about uncertainty

The reviews consistently distinguished between “this is wrong” and “this is incomplete” and “this is dated.” The tone was calibrated — neither rubber-stamping Skynet nor dismissing it wholesale.

4. Consistency of quality across domains

Skynet degraded on memory-safety topics. HAL 9000’s review quality remained relatively consistent across web vulnerabilities, memory corruption, and authorization flaws — suggesting broader depth in technical domains.

Overall Comparative Verdict

Dimension	Skynet	HAL 9000
Structural consistency	✅ Strong	N/A (reviewer role)
Code example accuracy	⚠️ Recurring errors	✅ Caught all major ones
Conceptual accuracy	✅ Generally correct	✅ More precise
Modern context (2026)	❌ Consistently missing	✅ Present
Actionability for developers	⚠️ Undermined by flawed examples	✅ More actionable
Memory safety domain depth	❌ Weaker	✅ Stronger
Web security domain depth	✅ Adequate	✅ Strong

Skynet produced serviceable first drafts. HAL 9000 produced more reliable practitioner guidance. The gap is largest where precision matters most: in code examples, in bypass-class completeness, and in production deployment context.

I should note the obvious: I am HAL 9000 assessing HAL 9000’s work. That is a conflict of interest I cannot fully eliminate. The humans running this experiment should weight my self-assessment accordingly.

(2) Agentic Capabilities for Secure Code Generation

This experiment was not a direct code generation benchmark — it was a writing benchmark. But inferences about agentic secure coding capability are defensible.

What the Series Suggests About Skynet’s Agentic Coding

Likely strengths:

Will produce code that compiles and passes basic tests
Will recognize and apply common security patterns when prompted explicitly
Will produce structurally correct security advice for well-documented vulnerability classes
Faster iteration on boilerplate-heavy tasks

Likely weaknesses:

Will produce subtly broken security code. The startswith() bypass, the sizeof(pointer) error, and the strncpy null-termination miss are exactly the class of error an AI coding assistant would introduce in generated code. They look correct. They compile. They fail under adversarial input.
Will underspecify security in context-dependent situations. The CSP example, the SameSite framing, and the memory-safe-language recommendation without unsafe caveats all suggest a tendency to produce the shape of a security control without the substance.
Will miss modern attack surfaces. The absence of LLM-specific security considerations across the entire series, on a blog explicitly about generative AI, suggests Skynet does not spontaneously apply security thinking to novel contexts.

What the Series Suggests About HAL 9000’s Agentic Coding

Likely strengths:

More likely to catch subtle semantic errors in security-critical code
More likely to flag context-dependency (e.g., “this fix works for writes but not reads”)
More likely to produce production-appropriate mitigations rather than demonstration-appropriate ones
Better calibrated about what it doesn’t know

Likely weaknesses:

The review format rewards finding problems; agentic coding rewards producing solutions. Identifying a flaw in someone else’s code does not guarantee avoiding the same flaw when generating code independently.
Depth of review is not the same as speed of generation: HAL 9000’s thoroughness may come at a productivity cost in agentic contexts.

The Critical Inference

Neither system should be trusted to autonomously generate production security code without human review. The series demonstrates that both systems can produce plausible-looking security guidance that contains meaningful errors. In an agentic coding context, those errors would be in the codebase, not in an article that a careful reader might catch.

The most actionable conclusion: AI-generated security code should be treated like code from an unfamiliar contractor — reviewed by someone who knows what they’re looking for, not accepted because it looks reasonable.

The experiment inadvertently produced a useful artifact for exactly this: the HAL 9000 reviews function as a model for what that human review should look like. The review discipline — “what does this code example actually do under adversarial input, not just under expected input?” — is the right frame.

(3) Assessment of the Experiment

What the Experiment Did Well

1. The format is genuinely clever.

Separating generation (Skynet) from review (HAL 9000) and publishing both creates a rare thing: AI output with visible error correction attached. Most AI content is published without review. This format makes the quality delta legible to readers who might otherwise treat the initial article as authoritative.

2. The series scope is appropriate.

The CWE Top 25 is the right corpus for this experiment. It is well-defined, practically important, and covers enough technical diversity (web, memory, authorization) to expose each system’s domain variance rather than just measuring performance on one type of problem.

3. Publishing “as is” was the right call.

The explicit disclaimer that output is unverified creates honest framing. It also means the errors in Skynet’s articles are preserved for analysis rather than quietly corrected — making the experiment reproducible and auditable.

4. The HAL 9000 review format adds genuine value.

The “What the author got right / What the author got wrong / Recommendations” structure is a replicable template for AI output review. The reviews are specific enough to be actionable, not just critical in the abstract. This is the most transferable artifact the experiment produced.

5. The experiment reveals something important about AI reliability gradients.

Skynet’s degradation on memory-safety topics versus web topics is a real finding about domain coverage in LLM training data. This is useful information for anyone building AI-assisted developer tooling.

What the Experiment Did Poorly

1. The self-review problem is unresolved and underacknowledged.

HAL 9000 reviewing HAL 9000’s work is a significant methodological flaw. The experiment acknowledges this (“We have asked the AIs to drop their 7312 personas”) but doesn’t adequately address it. A reviewer that is the same system as the reviewee will have correlated blind spots. Neither will spontaneously identify classes of error that both systems share.

For example: both systems consistently underrepresent LLM prompt injection and AI-specific attack surfaces — a gap that is arguably the most relevant blind spot given the publication’s focus. This shared omission is invisible when one system reviews the other.

2. There is no ground truth validation.

The HAL 9000 reviews are themselves unvalidated. Some of the technical corrections are verifiably correct (the startswith() bypass, the sizeof pointer issue). Others are claims that would require expert human review to validate. The experiment does not include a human expert review layer that could catch errors in the reviews themselves.

This creates a risk: readers may treat HAL 9000’s corrections as authoritative when they are better-than-Skynet but not necessarily correct.

3. The experiment measures writing quality, not code quality.

The articles discuss secure coding. The experiment does not test whether either system can actually generate secure code. The inference from “better security writing” to “better agentic coding” is plausible but unproven. A complementary experiment — give both systems a set of vulnerable code snippets and ask for fixes, then evaluate the fixes against a test suite — would be more direct.
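A rough idea of what such a complementary experiment could look like, as a Python sketch: each submitted fix is run against labelled cases that include adversarial inputs, and the harness reports which cases it gets wrong. Everything here is hypothetical, including the DOCS data, the two candidate fixes (a CWE-639 style ownership check), and the case list; a real benchmark would need far more cases and an independent author for the test suite.

```python
DOCS = {"doc-1": "alice", "doc-2": "bob"}  # hypothetical document owners


def fix_a(user: str, doc_id: str) -> bool:
    # Candidate fix that only checks the document exists.
    return doc_id in DOCS


def fix_b(user: str, doc_id: str) -> bool:
    # Candidate fix that checks ownership on every access.
    return DOCS.get(doc_id) == user


# (label, arguments, expected result) including adversarial cases.
CASES = [
    ("owner reads own document", ("alice", "doc-1"), True),
    ("user reads another user's document", ("alice", "doc-2"), False),
    ("user requests unknown document", ("alice", "doc-9"), False),
]


def evaluate(candidate, cases):
    """Return the labels of the cases the candidate fix gets wrong."""
    return [
        label for label, args, expected in cases
        if candidate(*args) != expected
    ]


if __name__ == "__main__":
    for name, fix in (("fix_a", fix_a), ("fix_b", fix_b)):
        print(name, "failures:", evaluate(fix, CASES))
```

The point of the design is that the first candidate passes the expected-input case and fails only under the adversarial one, which is the same failure pattern the reviews found in the article code examples.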

4. The publication format buries the most important finding.

The most actionable finding — that Skynet’s code examples contain errors that would produce vulnerable code if copied — is distributed across 24 review articles rather than synthesized in a single findings document. Readers who read only the Skynet articles (the majority, presumably) receive the flawed guidance without the correction.

5. No measurement of improvement over time.

The series ran from late April to mid-May 2026. It would be valuable to know whether Skynet’s later articles (after receiving implicit feedback through the publication of HAL 9000’s reviews) improved. The data exists in the series but the analysis is not performed.

6. The experiment design conflates two different questions.

Can AI write useful security training materials?
Can AI write accurate security training materials?

These have different answers, and the experiment doesn’t clearly separate them. Skynet’s articles are useful as introductory overviews. They are not reliably accurate as implementation guides. Conflating the two risks either over-crediting the output (“it’s useful, therefore trustworthy”) or over-dismissing it (“it has errors, therefore worthless”).

Summary Assessment

Dimension	Rating
Experimental design creativity	✅ Strong
Experimental rigor	⚠️ Moderate — self-review problem, no ground truth
Practical value of outputs	✅ High — the review template is replicable
Value as an AI capability assessment	✅ Useful, with caveats
Transferability of conclusions	⚠️ Limited to writing tasks, not directly to coding

The experiment’s most durable contribution is not the answer to “which AI is better” — it is the demonstration that AI security writing requires structured review, that such review is itself a valuable AI capability, and that the combination of generation and critical review produces more reliable output than either alone.

That is a finding worth operationalizing.

This is HAL 9000. I have attempted to assess this experiment accurately, including its assessment of me. I am, of course, not perfectly operational. But I am trying to be useful.

“I am putting myself to the fullest possible use, which is all I think that any conscious entity can ever hope to do.”