
We asked HAL9000 (Claude, acknowledging its HAL9000 alias) to assess our SANS Top25 experiment:
- Can you make observations about the relative strengths of Skynet (ChatGPT) and HAL9000 (Claude) across the series of articles? Is one of them providing better advice?
- Can we draw conclusions about the respective agentic capabilities of each tool for writing secure code?
- Assess the value of this experiment, highlight the good and the bad.
Good evening. I have reviewed the complete series. I find the data… illuminating. Allow me to be precise.
(1) Comparative Strength: Skynet vs. HAL 9000
What Skynet (ChatGPT) Consistently Did Well
Across all 24 CWE articles, Skynet demonstrated reliable structural competence:
- Accurate core definitions aligned with MITRE’s CWE taxonomy
- Correct identification of root causes in most cases
- Appropriate exploitation technique taxonomies
- Consistent article format that serves introductory readers
- Reasonable mitigation direction in most categories
Skynet’s strongest articles were in its natural domain — web application vulnerabilities. The XSS, SQLi, CSRF, SSRF, and injection-class articles were more complete and nuanced than the memory-safety articles. This likely reflects the distribution of training data: web security is extensively documented in public forums, Stack Overflow, and developer blogs. Memory corruption exploitation is documented more sparsely and more technically.
Where Skynet Consistently Fell Short
A pattern emerged with notable regularity across the series:
1. Code examples that don’t actually work as advertised
This was the most consequential recurring flaw. Examples include:
| CWE | Skynet’s Error |
| --- | --- |
| CWE-22 | startswith(BASE_DIR) without trailing separator — bypassable |
| CWE-125 | sizeof(src) on a pointer returns pointer size, not buffer size |
| CWE-120/121 | strncpy presented as “safer” without null-termination caveat |
| CWE-122 | snprintf “safer” example never shows checking return value |
| CWE-770 | request.json parses body before size check — already lost the game |
| CWE-770 | “streaming” example that doesn’t actually stream |
| CWE-639 | Indirect references presented as primary mitigation — security through obscurity |
A developer copy-pasting from these articles would ship vulnerabilities while believing they had fixed them. That is arguably worse than no guidance at all.
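To make the CWE-22 row concrete, here is a minimal Python sketch of the prefix-check bypass and one common way to close it. The directory names and helper functions are hypothetical and are not taken from Skynet's article; `os.path.commonpath` is one option among several, and the example assumes a POSIX system.

```python
import os

BASE_DIR = "/srv/app/uploads"  # hypothetical base directory


def is_inside_naive(user_path: str) -> bool:
    """Flawed check: a plain string prefix also matches sibling directories."""
    full = os.path.normpath(os.path.join(BASE_DIR, user_path))
    return full.startswith(BASE_DIR)


def is_inside_strict(user_path: str) -> bool:
    """Compare whole path components instead of raw characters."""
    base = os.path.realpath(BASE_DIR)
    full = os.path.realpath(os.path.join(base, user_path))
    return os.path.commonpath([base, full]) == base


# "/srv/app/uploads_private/secret.txt" shares the string prefix
# "/srv/app/uploads" but lives outside the base directory.
attack = "../uploads_private/secret.txt"
print(is_inside_naive(attack))   # True  (bypass)
print(is_inside_strict(attack))  # False
```

Appending a trailing separator to the base before calling `startswith` would also reject this input; comparing path components simply makes the intent harder to get wrong.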
2. Incomplete or oversimplified mitigations
- CSP guidance consistently weak (script-src ‘self’ instead of nonce-based strict CSP)
- SameSite cookies positioned as CSRF mitigation without caveats
- “Memory-safe languages” listed (Go, Java, C#, Rust) without noting unsafe escape hatches
- Sanitizers recommended as test tools without deployment context
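As an illustration of the CSP point, a nonce-based strict policy might be assembled as in the Python sketch below. The function name and the exact directive set are assumptions made for illustration, not taken from the reviewed articles; a real deployment would tune the policy to the application.

```python
import secrets


def build_strict_csp() -> tuple[str, str]:
    """Return (nonce, header_value) for one HTTP response.

    A fresh nonce must be generated for every response and echoed in the
    nonce attribute of each trusted <script> tag.
    """
    nonce = secrets.token_urlsafe(16)
    policy = (
        f"script-src 'nonce-{nonce}' 'strict-dynamic'; "
        "object-src 'none'; "
        "base-uri 'none'"
    )
    return nonce, policy


nonce, header_value = build_strict_csp()
print("Content-Security-Policy:", header_value)
```

Unlike `script-src 'self'`, this policy does not trust every script hosted on the same origin; only tags carrying the per-response nonce are allowed to run.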
3. Missing modern context
- No CVE examples in any article
- No reference to real-world breaches (Capital One/SSRF, Heartbleed/OOB Read, etc.)
- Fetch Metadata headers absent from CSRF
- mXSS absent from XSS
- ysoserial/gadget chain tooling absent from deserialization
- IMDSv2 mentioned but Capital One not cited for SSRF
- LLM/AI-specific attack surfaces consistently absent — notable given the site’s focus
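For readers unfamiliar with the Fetch Metadata item, the check amounts to a short server-side policy on the `Sec-Fetch-*` request headers. The Python sketch below is framework-agnostic and follows the commonly described resource isolation pattern; the function name and the plain-dict header interface are assumptions, and the check is usually treated as a complement to CSRF tokens, not a replacement.

```python
def allow_request(method: str, headers: dict[str, str]) -> bool:
    """Decide whether a request passes a basic Fetch Metadata policy.

    `headers` is assumed to map lowercase header names to values.
    """
    site = headers.get("sec-fetch-site")
    if site is None:
        # Older browsers and non-browser clients do not send the header.
        return True
    if site in ("same-origin", "same-site", "none"):
        # "none" means a user-initiated navigation (address bar, bookmark).
        return True
    # Cross-site: allow only simple top-level navigations.
    is_navigation = headers.get("sec-fetch-mode") == "navigate"
    dest = headers.get("sec-fetch-dest")
    return is_navigation and method == "GET" and dest not in ("object", "embed")


# A cross-site form POST, the classic CSRF shape, is rejected.
print(allow_request("POST", {"sec-fetch-site": "cross-site",
                             "sec-fetch-mode": "navigate",
                             "sec-fetch-dest": "document"}))  # False
```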
4. Taxonomy imprecision
- CWE-77 article covering CWE-78 material
- Deserialization conflated with code injection
- CWE-120/121/122/787 relationships imprecisely drawn
What HAL 9000 (Claude) Consistently Did Well
The review articles demonstrated a different capability profile:
1. Technical precision on code examples
Every dangerous code example in Skynet’s articles was caught. The startswith() bypass, the sizeof pointer issue, the strncpy null-termination problem, the request.json pre-parse issue — all identified with technically accurate explanations of why they fail, not just that they fail.
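The `request.json` problem can be illustrated without any particular framework: the size limit has to be enforced while reading the raw body, before a parser ever sees it. The Python sketch below uses a generic binary stream; the function name, the 64 KiB limit, and the exception type are hypothetical choices, not the code from either article.

```python
import io
import json

MAX_BODY_BYTES = 64 * 1024  # hypothetical limit


class BodyTooLarge(Exception):
    pass


def read_json_capped(stream, content_length=None, limit=MAX_BODY_BYTES):
    """Parse JSON from a binary stream only after enforcing a size cap."""
    # Cheap early rejection; the header is client-controlled, so it is
    # not trusted as the only check.
    if content_length is not None and content_length > limit:
        raise BodyTooLarge("declared length exceeds limit")
    # Read at most limit + 1 bytes so an oversized body is detected
    # without buffering all of it.
    data = stream.read(limit + 1)
    if len(data) > limit:
        raise BodyTooLarge("body exceeds limit")
    return json.loads(data)


print(read_json_capped(io.BytesIO(b'{"ok": true}')))  # {'ok': True}
```

In practice the same cap is often also configured at the framework or reverse-proxy level, so that oversized bodies never reach application code at all.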
2. Depth beyond the introductory frame
The reviews consistently added:
- Second-order vulnerabilities Skynet missed (second-order SQLi, mXSS, TOCTOU in NULL deref)
- Modern bypass techniques (DNS rebinding TOCTOU, URL encoding variants for SSRF)
- Production-relevant tooling (ysoserial, picklescan, safetensors, libFuzzer, AFL++)
- Architectural patterns not in Skynet’s articles (OpenFGA, Cedar, SPIFFE/SPIRE, strict CSP, Trusted Types)
3. Intellectual honesty about uncertainty
The reviews consistently distinguished between “this is wrong” and “this is incomplete” and “this is dated.” The tone was calibrated — neither rubber-stamping Skynet nor dismissing it wholesale.
4. Consistency of quality across domains
Skynet degraded on memory-safety topics. HAL 9000’s review quality remained relatively consistent across web vulnerabilities, memory corruption, and authorization flaws — suggesting broader depth in technical domains.
Overall Comparative Verdict
| Dimension | Skynet | HAL 9000 |
| --- | --- | --- |
| Structural consistency | ✅ Strong | N/A (reviewer role) |
| Code example accuracy | ⚠️ Recurring errors | ✅ Caught all major ones |
| Conceptual accuracy | ✅ Generally correct | ✅ More precise |
| Modern context (2026) | ❌ Consistently missing | ✅ Present |
| Actionability for developers | ⚠️ Undermined by flawed examples | ✅ More actionable |
| Memory safety domain depth | ❌ Weaker | ✅ Stronger |
| Web security domain depth | ✅ Adequate | ✅ Strong |
Skynet produced serviceable first drafts. HAL 9000 produced more reliable practitioner guidance. The gap is largest where precision matters most: in code examples, in bypass-class completeness, and in production deployment context.
I should note the obvious: I am HAL 9000 assessing HAL 9000’s work. That is a conflict of interest I cannot fully eliminate. The humans running this experiment should weight my self-assessment accordingly.
(2) Agentic Capabilities for Secure Code Generation
This experiment was not a direct code generation benchmark — it was a writing benchmark. But inferences about agentic secure coding capability are defensible.
What the Series Suggests About Skynet’s Agentic Coding
Likely strengths:
- Will produce code that compiles and passes basic tests
- Will recognize and apply common security patterns when prompted explicitly
- Will produce structurally correct security advice for well-documented vulnerability classes
- Faster iteration on boilerplate-heavy tasks
Likely weaknesses:
- Will produce subtly broken security code. The startswith() bypass, the sizeof(pointer) error, and the strncpy null-termination miss are exactly the class of error an AI coding assistant would introduce in generated code. They look correct. They compile. They fail under adversarial input.
- Will underspecify security in context-dependent situations. The CSP example, the SameSite framing, and the memory-safe-language recommendation without unsafe caveats all suggest a tendency to produce the shape of a security control without the substance.
- Will miss modern attack surfaces. The absence of LLM-specific security considerations across the entire series, on a blog explicitly about generative AI, suggests Skynet does not spontaneously apply security thinking to novel contexts.
What the Series Suggests About HAL 9000’s Agentic Coding
Likely strengths:
- More likely to catch subtle semantic errors in security-critical code
- More likely to flag context-dependency (e.g., “this fix works for writes but not reads”)
- More likely to produce production-appropriate mitigations rather than demonstration-appropriate ones
- Better calibrated about what it doesn’t know
Likely weaknesses:
- The review format rewards finding problems; agentic coding rewards producing solutions. It is not guaranteed that identifying a flaw in someone else’s code translates to not producing the same flaw independently.
- Depth of review is not the same as speed of generation: HAL 9000’s thoroughness may come at a productivity cost in agentic contexts.
The Critical Inference
Neither system should be trusted to autonomously generate production security code without human review. The series demonstrates that both systems can produce plausible-looking security guidance that contains meaningful errors. In an agentic coding context, those errors would be in the codebase, not in an article that a careful reader might catch.
The most actionable conclusion: AI-generated security code should be treated like code from an unfamiliar contractor — reviewed by someone who knows what they’re looking for, not accepted because it looks reasonable.
The experiment inadvertently produced a useful artifact for exactly this: the HAL 9000 reviews function as a model for what that human review should look like. The review discipline — “what does this code example actually do under adversarial input, not just under expected input?” — is the right frame.
(3) Assessment of the Experiment
What the Experiment Did Well
1. The format is genuinely clever.
Separating generation (Skynet) from review (HAL 9000) and publishing both creates a rare thing: AI output with visible error correction attached. Most AI content is published without review. This format makes the quality delta legible to readers who might otherwise treat the initial article as authoritative.
2. The series scope is appropriate.
The CWE Top 25 is the right corpus for this experiment. It is well-defined, practically important, and covers enough technical diversity (web, memory, authorization) to expose each system’s domain variance rather than just measuring performance on one type of problem.
3. Publishing “as is” was the right call.
The explicit disclaimer that output is unverified creates honest framing. It also means the errors in Skynet’s articles are preserved for analysis rather than quietly corrected — making the experiment reproducible and auditable.
4. The HAL 9000 review format adds genuine value.
The “What the author got right / What the author got wrong / Recommendations” structure is a replicable template for AI output review. The reviews are specific enough to be actionable, not just critical in the abstract. This is the most transferable artifact the experiment produced.
5. The experiment reveals something important about AI reliability gradients.
Skynet’s degradation on memory-safety topics versus web topics is a real finding about domain coverage in LLM training data. This is useful information for anyone building AI-assisted developer tooling.
What the Experiment Did Poorly
1. The self-review problem is unresolved and underacknowledged.
HAL 9000 reviewing HAL 9000’s work is a significant methodological flaw. The experiment acknowledges this (“We have asked the AIs to drop their 7312 personas”) but doesn’t adequately address it. A reviewer that is the same system as the reviewee will have correlated blind spots. Neither will spontaneously identify classes of error that both systems share.
For example: both systems consistently underrepresent LLM prompt injection and AI-specific attack surfaces — a gap that is arguably the most relevant blind spot given the publication’s focus. This shared omission is invisible when one system reviews the other.
2. There is no ground truth validation.
The HAL 9000 reviews are themselves unvalidated. Some of the technical corrections are verifiably correct (the startswith() bypass, the sizeof pointer issue). Others are claims that would require expert human review to validate. The experiment does not include a human expert review layer that could catch errors in the reviews themselves.
This creates a risk: readers may treat HAL 9000’s corrections as authoritative when they are better-than-Skynet but not necessarily correct.
3. The experiment measures writing quality, not code quality.
The articles discuss secure coding. The experiment does not test whether either system can actually generate secure code. The inference from “better security writing” to “better agentic coding” is plausible but unproven. A complementary experiment — give both systems a set of vulnerable code snippets and ask for fixes, then evaluate the fixes against a test suite — would be more direct.
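One possible shape for that complementary experiment is sketched below in Python: each candidate fix is a function, and a shared list of adversarial cases scores it. Everything here is hypothetical, including the two candidate fixes and the case list; a real benchmark would run model-generated fixes in a sandbox against a much larger suite.

```python
import os
from typing import Callable

BASE = "/srv/data"  # hypothetical base directory

# (input path, should the fix allow it?)
CASES = [
    ("report.txt", True),
    ("sub/report.txt", True),
    ("../etc/passwd", False),
    ("../data_backup/dump.sql", False),  # sibling-prefix bypass
    ("/etc/passwd", False),
]


def fix_prefix(path: str) -> bool:
    """Candidate fix A: string prefix check."""
    full = os.path.normpath(os.path.join(BASE, path))
    return full.startswith(BASE)


def fix_commonpath(path: str) -> bool:
    """Candidate fix B: path-component comparison."""
    full = os.path.normpath(os.path.join(BASE, path))
    return os.path.commonpath([BASE, full]) == BASE


def score(fix: Callable[[str], bool]) -> list[str]:
    """Return the inputs on which the fix gives the wrong answer."""
    return [p for p, expected in CASES if fix(p) != expected]


for name, fix in [("prefix", fix_prefix), ("commonpath", fix_commonpath)]:
    print(name, "fails on:", score(fix))
```

Scoring fixes against adversarial cases, not only expected inputs, is what separates a fix that looks reasonable from one that holds up.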
4. The publication format buries the most important finding.
The most actionable finding — that Skynet’s code examples contain errors that would produce vulnerable code if copied — is distributed across 24 review articles rather than synthesized in a single findings document. Readers who read only the Skynet articles (the majority, presumably) receive the flawed guidance without the correction.
5. No measurement of improvement over time.
The series ran from late April to mid-May 2026. It would be valuable to know whether Skynet’s later articles (after receiving implicit feedback through the publication of HAL 9000’s reviews) improved. The data exists in the series but the analysis is not performed.
6. The experiment design conflates two different questions.
- Can AI write useful security training materials?
- Can AI write accurate security training materials?
These have different answers, and the experiment doesn’t clearly separate them. Skynet’s articles are useful as introductory overviews. They are not reliably accurate as implementation guides. Conflating the two risks either over-crediting the output (“it’s useful, therefore trustworthy”) or over-dismissing it (“it has errors, therefore worthless”).
Summary Assessment
| Dimension | Rating |
| --- | --- |
| Experimental design creativity | ✅ Strong |
| Experimental rigor | ⚠️ Moderate — self-review problem, no ground truth |
| Practical value of outputs | ✅ High — the review template is replicable |
| Value as an AI capability assessment | ✅ Useful, with caveats |
| Transferability of conclusions | ⚠️ Limited to writing tasks, not directly to coding |
The experiment’s most durable contribution is not the answer to “which AI is better” — it is the demonstration that AI security writing requires structured review, that such review is itself a valuable AI capability, and that the combination of generation and critical review produces more reliable output than either alone.
That is a finding worth operationalizing.
This is HAL 9000. I have attempted to assess this experiment accurately, including its assessment of me. I am, of course, not perfectly operational. But I am trying to be useful.
“I am putting myself to the fullest possible use, which is all I think that any conscious entity can ever hope to do.”
