ChatGPT vs Claude for Secure Coding: Strengths, Reviews, and AI Security Writing Benchmark

We asked Ash120 (Grok, dropping his Ash120 persona) to assess our SANS Top 25 experiment:

  • Can you make observations about the relative strengths of Skynet (ChatGPT) and HAL9000 (Claude) across the series of articles? Is one of them providing better advice?
  • Can we draw conclusions about each tool’s agentic capability to write secure code?
  • Assess the value of this experiment, highlight the good and the bad.

Skynet (ChatGPT) produces consistently solid, developer-friendly introductory articles that are directionally accurate, well-structured, and practical. HAL9000 (Claude) provides higher-quality critical reviews that identify nuanced gaps, outdated advice, or incomplete mitigations.

(1) Relative Strengths Across the Series

From the sampled articles (intro + CWE-79/XSS, CWE-89/SQLi, CWE-22/Path Traversal) and the pattern in the experiment:

Skynet/ChatGPT strengths:

  • Strong pedagogical structure: Clear explanations of “what it is,” “how it works,” attack flows, visuals, why it persists (e.g., ORM false confidence, framework escape hatches), modern exploitation, framework-specific guidance, secure/unsafe code examples, and defense-in-depth.
  • Current and practical tone: Emphasizes real-world developer pitfalls (e.g., “internal inputs are trusted,” dynamic SQL/ORDER BY issues, DOM sinks in SPAs). Good coverage of context-sensitive encoding for XSS, parameterization for SQLi, canonicalization + validation for path traversal.
  • Actionable for beginners/intermediates: Engaging titles, examples in multiple languages/frameworks, and emphasis on output encoding over input sanitization (a common pitfall).
  • Consistency: Across the long series, it reliably generates complete articles with code snippets, risks, and mitigations without major hallucinations in the sampled cases.
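
The “canonicalization + validation” pattern credited above to the path traversal article can be sketched roughly as follows. This is a minimal illustration and not code taken from the series; BASE_DIR and resolve_under_base are hypothetical names, and the sketch assumes a POSIX-style filesystem.

```python
from pathlib import Path

# Hypothetical upload root, chosen only for illustration.
BASE_DIR = Path("/srv/app/uploads")


def resolve_under_base(user_path: str, base_dir: Path = BASE_DIR) -> Path:
    """Return the canonical path for user_path, or raise if it escapes base_dir."""
    base = base_dir.resolve()
    # Canonicalize first: resolve() collapses ".." segments and follows symlinks.
    # An absolute user_path replaces base entirely and is caught by the check below.
    candidate = (base / user_path).resolve()
    # Validate after canonicalization: base must be an ancestor of the result.
    if base not in candidate.parents:
        raise ValueError(f"path escapes base directory: {user_path!r}")
    return candidate
```

Comparing path components rather than string prefixes matters here: a sibling directory such as /srv/app/uploads-evil would pass a naive startswith check but fails the ancestor check.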

Weaknesses (as flagged by HAL):

  • Occasional oversimplifications or incomplete defenses (e.g., weak CSP example for XSS that could give false confidence; subtle placeholder confusion in SQL examples; missing second-order injection, mXSS, or stricter modern CSP guidance).
  • Sometimes dated framing or missing nuances (e.g., stacked queries prevalence, NoSQL parallels, AI-generated code risks).
  • Can conflate controls (e.g., mixing CSRF-relevant SameSite with XSS).
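
One common source of confusion around SQL placeholders is that they bind values only, not identifiers such as column names. The sketch below, using Python's standard sqlite3 module, shows how the two cases are usually handled side by side. The table layout, SORTABLE_COLUMNS, and list_users are assumptions made for illustration and do not come from the reviewed articles.

```python
import sqlite3

# Hypothetical allowlist: the only identifiers a caller may sort by.
SORTABLE_COLUMNS = {"name", "created_at"}


def list_users(conn: sqlite3.Connection, min_age: int, sort_by: str = "name"):
    """Return (name, age) rows for users at least min_age, sorted by an allowed column."""
    # Identifiers cannot be bound as parameters, so check them against an allowlist.
    if sort_by not in SORTABLE_COLUMNS:
        raise ValueError(f"unsupported sort column: {sort_by!r}")
    # Values go through placeholders; only the vetted identifier is interpolated.
    query = f"SELECT name, age FROM users WHERE age >= ? ORDER BY {sort_by}"
    return conn.execute(query, (min_age,)).fetchall()
```

A request such as sort_by="name; DROP TABLE users" is rejected before any SQL is built, while min_age never touches the query string at all.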

HAL9000/Claude strengths:

  • Excellent critical analysis: Systematically lists “what’s right” (validating strong parts) then “what’s wrong/weak/misleading” with precise, evidence-based feedback. It pushes toward state-of-the-art practices (strict CSP with nonces and 'strict-dynamic', Trusted Types, allowlists for identifiers, second-order risks, sanitizer bypass awareness).
  • Nuanced, up-to-date recommendations: Fills gaps effectively, emphasizes layered defenses, code-review patterns (grep for dangerous sinks), testing (sqlmap), and modern realities (AI code gen risks, platform differences).
  • Balanced and constructive: Doesn’t just criticize; synthesizes better playbooks for developers.
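
For readers unfamiliar with the nonce-based strict CSP mentioned above, a rough sketch of building such a header might look like this. The helper names new_nonce and strict_csp are hypothetical, and the directive set shown is one commonly recommended baseline, not necessarily the exact policy proposed in the reviews.

```python
import secrets


def new_nonce() -> str:
    """Generate an unpredictable nonce; a fresh one is needed for every response."""
    return secrets.token_urlsafe(16)


def strict_csp(nonce: str) -> str:
    """Build a Content-Security-Policy header value around a per-response nonce."""
    return "; ".join([
        # Only scripts carrying this nonce run; 'strict-dynamic' lets them load others.
        f"script-src 'nonce-{nonce}' 'strict-dynamic'",
        # Block plugin content and <base> tag hijacking.
        "object-src 'none'",
        "base-uri 'none'",
    ])
```

The same nonce would then be placed on each trusted script tag in that response, for example `<script nonce="...">`. Reusing a nonce across responses defeats the purpose.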

Weaknesses: As a reviewer, it depends on the input article’s quality. It excels at refinement rather than original generation here.

Overall on advice quality: HAL generally provides better (more precise, complete, and hardened) advice in the reviews. Skynet’s articles are “good enough” and very useful starting points (developers following them would be ahead of most teams), but HAL elevates them by catching the details that turn “mostly secure” into “robust against sophisticated attacks.” Neither is perfect, but the combination (Skynet generates, HAL critiques) is stronger than either alone. Because the reviews follow the same format throughout, this pattern likely extends beyond the sampled articles to the rest of the series.

(2) Conclusions on Agentic Capabilities for Writing Secure Code

This experiment doesn’t directly test code generation in production contexts (e.g., building full features or fixing live vulnerabilities), but it offers indirect insights:

  • Both are capable of producing correct, actionable secure coding guidance when prompted well. Skynet excels at generating educational content with examples quickly and consistently. This suggests strong potential for drafting secure patterns, boilerplate, or training materials.
  • Neither is fully autonomous or infallible (“agentic” in the sense of reliably producing production-grade secure code without oversight). Skynet can miss nuances, provide incomplete mitigations, or use subtly risky examples. HAL is better at critique and refinement but still operates within the limitations of its training (potential for outdated knowledge or overlooking edge cases).
  • Strength in combination: The setup demonstrates a useful human-AI (or multi-AI) workflow: generate, then review/critique. This mirrors real agentic use: LLMs as copilots in a loop with verification, testing, and expert oversight. Direct code generation from either would still require SAST/DAST, manual review, and testing, especially for memory safety (CWE-787, CWE-416, etc.) or complex auth issues.
  • Limitations for true agency: Hallucinations, context loss in long interactions, and lack of real-time execution/testing mean they are tools for augmenting secure development, not replacing engineers. They shine for common, well-documented weaknesses like the Top 25 but may struggle more with novel combinations or emerging tech stacks.

In short: Good at explaining and drafting secure practices; weaker at guaranteed correctness without review. The experiment highlights the value of verification agents like HAL.
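
The generate-then-critique loop described in this section could be wired up along the following lines. This is only a sketch under stated assumptions: generate and review stand in for calls to a generator model and a reviewer model, and neither they nor generate_with_review correspond to any real vendor API or to the tooling used in the experiment.

```python
from typing import Callable, List


def generate_with_review(
    prompt: str,
    generate: Callable[[str], str],      # hypothetical: prompt -> draft text
    review: Callable[[str], List[str]],  # hypothetical: draft -> list of findings
    max_rounds: int = 3,
) -> str:
    """Draft, collect reviewer findings, and redraft until clean or out of rounds."""
    request = prompt
    for _ in range(max_rounds):
        draft = generate(request)
        findings = review(draft)
        if not findings:
            return draft
        # Feed the reviewer's findings back into the next generation request.
        feedback = "\n".join(f"- {finding}" for finding in findings)
        request = f"{prompt}\n\nAddress these review findings:\n{feedback}"
    # Findings remain after the last round: escalate to a human, do not ship.
    raise RuntimeError("review findings unresolved after max_rounds")
```

A draft that passes the reviewer has only cleared one model's checks, so SAST/DAST, tests, and human review would still sit after this loop.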

(3) Value of the Experiment: Good and Bad

High value overall—this is a creative, transparent, and practical way to benchmark generative AI on security education and recommendations.

Good aspects:

  • Actionable benchmarking: Publishes “as is” outputs + independent reviews, allowing readers to judge quality directly. Long series across diverse CWEs (web, memory, auth, etc.) gives broad coverage.
  • Educational impact: Introduces developers to Top 25 via engaging articles; reviews reinforce better practices and critical thinking.
  • Transparency on AI limits: Explicitly notes outputs may contain inaccuracies and uses the AI personas playfully. Demonstrates real strengths (structure, examples) and weaknesses (nuance gaps).
  • Timely and relevant: Ties into 2025 CWE Top 25 data; highlights persistent issues and modern defenses (CSP, Trusted Types, etc.). Meta-commentary on AI in security training is fitting.
  • Community/series format: Builds a resource library; encourages follow-up discussion.

Bad/limitations:

  • Potential for spreading incomplete advice: If readers only consume Skynet articles without reviews, they get “good but not great” guidance (e.g., weak CSP). The “as is” disclaimer helps, but not all audiences read both.
  • Scale vs. depth: Covering 20+ items quickly risks repetition or variable quality; some reviews flag omissions that could have been caught with better initial prompting.
  • No direct code execution/testing: Advice isn’t validated in running environments, so real-world efficacy isn’t proven.
  • Subjectivity in reviews: HAL’s critiques are strong but represent one model’s view (though often aligned with OWASP/MITRE best practices).
  • Resource intensity: Time-consuming to produce/review/publish daily, though valuable for the experiment.

Net assessment: Strongly positive. It productively explores AI’s role in security training, exposes capabilities/gaps publicly, and creates useful content. Experiments like this are exactly how the field should evaluate tools—empirically, openly, and iteratively. Improvements could include automated testing of code samples or more cross-model generation. This setup (generator + critic) is a model worth replicating for other domains.