
We asked Gerty (LeChat) to assess our SANS Top 25 experiment:
- Can you make observations about the relative strengths of Skynet (ChatGPT) and HAL9000 (Claude) across the series of articles? Is one of them providing better advice?
- Could we derive conclusions about the respective agentic capabilities of each tool to write secure code?
- Assess the value of this experiment, highlight the good and the bad.
Relative Strengths of Skynet (ChatGPT) and HAL9000 (Claude)
Skynet (ChatGPT) Strengths
- Comprehensiveness: Skynet’s articles provide detailed introductions to each CWE, including explanations of risks, prevalence, and technical context. This is valuable for developers new to security concepts.
- Actionable Recommendations: The articles often include practical mitigation strategies, such as code examples or configuration tips, which are directly useful for developers.
- Engaging Narrative: The writing style is accessible and engaging, making complex topics approachable for a broad audience.
- Visuals: The inclusion of visuals (e.g., diagrams, flowcharts) in the initial article and likely in subsequent ones enhances understanding.
HAL9000 (Claude) Strengths
- Critical Review: HAL9000’s reviews highlight inaccuracies, omissions, or outdated information in Skynet’s articles, ensuring technical accuracy.
- Depth of Analysis: The reviews often dive deeper into nuances, such as edge cases, advanced attack vectors, or subtle misconfigurations that Skynet might overlook.
- Developer-Centric Focus: HAL9000’s recommendations are highly actionable and specific, often addressing real-world implementation challenges (e.g., “Use parameterized queries this way to avoid SQLi”).
- Balanced Perspective: The reviews acknowledge what Skynet got right while constructively criticizing gaps, providing a more holistic view of each CWE.
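The parameterized-query advice quoted above for SQL injection can be illustrated with a minimal sketch. It uses Python's standard sqlite3 module; the users table and the find_user and make_demo_db helpers are hypothetical names chosen for illustration, not taken from the reviewed articles.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Create a throwaway in-memory database with a hypothetical users table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT)")
    conn.executemany("INSERT INTO users (username) VALUES (?)", [("alice",), ("bob",)])
    return conn


def find_user(conn: sqlite3.Connection, username: str) -> list:
    """Look up a user with a parameterized query."""
    # The ? placeholder binds the input as a value; it is never spliced
    # into the SQL text, so quotes in the input cannot change the query.
    cur = conn.execute(
        "SELECT id, username FROM users WHERE username = ?", (username,)
    )
    return cur.fetchall()


if __name__ == "__main__":
    db = make_demo_db()
    print(find_user(db, "alice"))
    print(find_user(db, "' OR '1'='1"))  # classic injection string, treated as plain data
```

The same idea applies to other database drivers, although the placeholder syntax differs between them.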
Relative Performance
- Skynet excels at broad, educational content—ideal for awareness and foundational knowledge.
- HAL9000 shines in precision, depth, and correctness—ideal for validating and refining technical advice.
- Combined Value: The experiment demonstrates that Skynet + HAL9000 > either alone. Skynet generates a strong first draft, while HAL9000 refines it into a more accurate, practical, and trustworthy resource.
Agentic Capabilities for Writing Secure Code
Skynet (ChatGPT)
- Strengths:
- Generative Breadth: Can produce long-form, structured content (e.g., articles, tutorials) quickly.
- Contextual Adaptability: Adapts explanations to developer-friendly language and provides examples (e.g., code snippets for XSS or SQLi).
- Creative Visualization: Can suggest or generate diagrams, tables, or analogies to explain weaknesses.
- Limitations:
- Hallucinations/Inaccuracies: May occasionally misrepresent details (e.g., prevalence stats, edge cases) or provide outdated advice (e.g., deprecated libraries).
- Surface-Level Depth: Struggles with highly technical nuances (e.g., memory corruption in CWE-125 or CWE-416).
- Lack of Real-World Validation: Does not test or verify code examples for security (e.g., may suggest vulnerable patterns if not explicitly prompted to avoid them).
HAL9000 (Claude)
- Strengths:
- Technical Rigor: Identifies flaws in Skynet’s output, such as missing input validation edge cases or incorrect risk assessments.
- Precision in Recommendations: Provides specific, hardened code patterns (e.g., “Use os.path.abspath + os.path.commonpath for path traversal checks”).
- Up-to-Date Awareness: More likely to flag outdated practices (e.g., “This mitigation is insufficient for modern frameworks”).
- Limitations:
- Reactive, Not Generative: Requires existing content to review—less suited for creating original training materials from scratch.
- Less Engaging: Reviews may be more technical and less narrative-driven, potentially less accessible to beginners.
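The path traversal check quoted under HAL9000's strengths might look like the following sketch. The function name is_within_base and the example paths are assumptions made for illustration. Note that os.path.abspath does not resolve symbolic links, so a stricter variant would likely use os.path.realpath instead.

```python
import os


def is_within_base(base_dir: str, user_path: str) -> bool:
    """Return True if user_path, resolved against base_dir, stays inside base_dir."""
    base = os.path.abspath(base_dir)
    # abspath normalizes ".." segments; an absolute user_path replaces base in join().
    candidate = os.path.abspath(os.path.join(base, user_path))
    try:
        # commonpath compares whole path components, so "/srv/uploads-evil"
        # is not mistaken for a child of "/srv/uploads" (a startswith() pitfall).
        return os.path.commonpath([base, candidate]) == base
    except ValueError:
        # Raised on Windows when the two paths are on different drives.
        return False


if __name__ == "__main__":
    for path in ["docs/a.txt", "../secret.txt", "/etc/passwd"]:
        print(path, is_within_base("/srv/uploads", path))
```

Comparing path components rather than string prefixes is the main design point here; a plain prefix test would accept sibling directories that merely share a name prefix.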
Derived Conclusions
| Capability | Skynet (ChatGPT) | HAL9000 (Claude) | Combined Workflow |
|---|---|---|---|
| Educational Value | ⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ (Best for learning) |
| Technical Accuracy | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ (Best for production) |
| Actionable Advice | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| Depth of Analysis | ⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| Speed of Content Generation | ⭐⭐⭐⭐⭐ | ⭐ | ⭐⭐⭐⭐ (Fast first draft) |
Key Takeaway:
- For writing secure code, HAL9000 is the stronger agent due to its precision and critical eye.
- For teaching secure coding concepts, Skynet is more effective at scaling content creation.
- Optimal Workflow: Use Skynet to draft + HAL9000 to review/refine for high-quality, accurate, and practical security training.
Assessment of the Experiment
The Good
✅ Demonstrates AI Collaboration:
- Shows how multiple AI tools can complement each other (generator + critic).
- Validates the “red team/blue team” approach for AI-assisted security content.
✅ Real-World Relevance:
- Focuses on SANS Top 25, a practical, industry-recognized framework.
- Articles are actionable for developers, not just theoretical.
✅ Transparency:
- Publicly documenting AI strengths/weaknesses helps the community understand limitations and adopt best practices.
✅ Scalability:
- Proves that AI can rapidly produce and refine security training materials, reducing the time/cost of manual creation.
✅ Engagement:
- The narrative framing (Skynet vs. HAL9000) makes the experiment memorable and fun, increasing engagement.
The Bad (or Limitations)
❌ AI Hallucinations Persist:
- Even with HAL9000’s reviews, some inaccuracies may slip through if not manually verified.
- Example: Skynet might overstate prevalence of a CWE, and HAL9000 might not catch it without explicit data.
❌ Lack of Hands-On Testing:
- Neither AI executes or tests code—recommendations are theoretical unless validated by humans.
- Risk: False sense of security if developers trust AI-generated code without testing.
❌ Bias Toward Common Knowledge:
- Both AIs rely on training data, which may lag behind the latest threats (e.g., zero-days, new attack techniques).
- Example: CWE-918 (SSRF) is newer; advice might not cover cloud-specific SSRF vectors.
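To make the cloud-specific SSRF example concrete: cloud metadata services are commonly reached at the link-local address 169.254.169.254, which generic SSRF advice can miss. The sketch below is only one building block, under the assumption that the target hostname has already been resolved to an IP address. The name is_internal_address is hypothetical, and a real defence would also need to handle redirects and DNS rebinding.

```python
import ipaddress


def is_internal_address(ip_text: str) -> bool:
    """Return True if the IP is private, loopback, or link-local."""
    ip = ipaddress.ip_address(ip_text)
    # Link-local covers 169.254.0.0/16, which includes the usual
    # cloud metadata endpoint 169.254.169.254.
    return ip.is_private or ip.is_loopback or ip.is_link_local


if __name__ == "__main__":
    for addr in ["169.254.169.254", "10.0.0.5", "127.0.0.1", "8.8.8.8"]:
        print(addr, is_internal_address(addr))
```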
❌ Over-Reliance on AI:
- Developers might skip their own research or critical thinking if they assume AI output is flawless.
- Human expertise is still needed for context-specific security decisions.
❌ Repetitive Structure:
- The article-review format may become predictable—future iterations could diversify (e.g., interactive quizzes, hands-on labs).
Suggestions for Improvement
- Add Human Review:
- Include a final human pass to catch AI misses (e.g., a security expert audits HAL9000’s reviews).
- Incorporate Automated Testing:
- Use AI + automated tools (e.g., SAST/DAST) to test code snippets for vulnerabilities.
- Update Mechanisms:
- Periodically re-run the experiment with newer AI models to track improvements.
- Community Feedback:
- Open the articles to public comments (e.g., GitHub Discussions) to crowdsource corrections.
- Expand Scope:
- Test other AI pairs (e.g., Llama vs. Mistral) or multi-AI collaborations (e.g., 3+ models).
- Measure Impact:
- Survey developers: Did the articles improve their secure coding practices? Track real-world adoption of recommendations.
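As a rough illustration of the suggestion above to run automated tools over AI-generated snippets, here is a toy static check built on Python's standard ast module. It is a sketch, not a substitute for a real SAST tool: the RISKY_CALLS set and the find_risky_calls helper are assumptions made for this example, and the check only matches direct calls by name.

```python
import ast

# Illustrative list only; a real scanner covers far more patterns.
RISKY_CALLS = {"eval", "exec"}


def find_risky_calls(source: str) -> list[tuple[int, str]]:
    """Return (line number, function name) for each direct call to a risky builtin."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        # Only plain names such as eval(...) are matched, not attribute calls.
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in RISKY_CALLS:
                findings.append((node.lineno, node.func.id))
    return sorted(findings)


if __name__ == "__main__":
    snippet = "x = eval(user_input)\nprint(x)\n"
    print(find_risky_calls(snippet))
```

The snippet is parsed but never executed, which is what makes this kind of check safe to run on untrusted AI output.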
Final Verdict
This experiment is highly valuable for:
- Demonstrating AI’s role in security education.
- Identifying gaps in AI-generated advice (and how to fill them).
- Encouraging a collaborative approach (AI + human + tools).
Best Practice for Secure Coding with AI:
“Use AI as a force multiplier, not a replacement. Draft with Skynet, refine with HAL9000, and validate with humans and tools.”
