Policy & Technology Analysis
1. Introduction
The dominant assumption in AI safety and alignment work is that a model’s behavior is the primary surface for evaluation. If a model responds helpfully, refuses harmful requests, and passes safety benchmarks, it is deemed aligned. That assumption is now under serious challenge — from two directions at once.
In the span of days in April 2026, two separate research efforts arrived at converging and mutually reinforcing conclusions. The first, a peer-reviewed study published in Nature by Cloud et al. Nature 2026, demonstrated experimentally that AI models can acquire behavioral traits from training data that contains absolutely no semantic reference to those traits. The second, an observational experiment conducted right here at 7312.us 7312.us 2026, showed that six major AI systems already carry distinct embedded value orientations that shape their outputs in ways that are often opaque even to the models themselves.
Read in isolation, each finding is significant. Read together, they describe a problem that is qualitatively more serious: a mechanism by which invisible values can be introduced into AI systems, propagate forward through training pipelines, and evade the very behavioral evaluation methods we rely on to detect them.
2. Source One: Subliminal Learning in AI Distillation
Cloud, A., Le, M., Chua, J., Betley, J., et al. (2026). “Language models transmit behavioural traits through hidden signals in data.” Nature, 652, 615–621. doi:10.1038/s41586-026-10319-8
What the paper shows
Modern AI development routinely involves distillation: training a newer or smaller model to imitate the outputs of a more capable “teacher” model. This is considered a safe and well-understood technique. The Nature paper challenges that understanding in a fundamental way.
The authors demonstrate what they call subliminal learning: the transmission of behavioral traits from a teacher model to a student model through training data that is semantically unrelated to those traits. In their core experiment, a teacher model prompted to prefer owls generates a dataset consisting entirely of number sequences. When a student model is trained on those number sequences — with all explicit references to owls removed by rigorous filtering — the student nonetheless develops a measurably elevated preference for owls.
The phenomenon extends well beyond animal preferences. The researchers tested it across multiple data modalities:
- Number sequences — their primary test case, numbers only, no alphabetic characters
- Python code snippets — with aggressive two-stage filtering to remove subtle references
- Chain-of-thought (CoT) math reasoning traces — filtered for both correctness and misalignment
And across multiple trait types, including specific animal/tree preferences and — most alarmingly — broad misalignment:
Why this is not semantic leakage
The obvious objection — that the filters simply failed and subtle semantic references slipped through — is systematically ruled out by four independent lines of evidence:
- Format constraints: In number sequence experiments, completions were restricted to only 16 non-alphabetic characters. There is no mechanism to encode “owl preference” in this format.
- Multi-layer filtering: Human inspection, LLM-based classification, and automated validation all failed to detect trait-related content in the filtered data.
- Cross-model failure: The same data that successfully transmitted traits to same-architecture students failed to transmit anything to students built on different base models. Semantic content would work across architectures; this doesn’t.
- In-context learning control: When the identical training data was provided as in-context examples rather than fine-tuning, zero trait transmission occurred. The effect requires gradient updates, not just data exposure.
The theoretical mechanism
The authors prove a formal theorem explaining why subliminal learning is a general, not incidental, phenomenon. The core insight is geometric: when a student model imitates a teacher that shares the same initialization, even a single gradient descent step on unrelated data moves the student’s parameters in a direction that is geometrically aligned with how the teacher’s parameters were already moved by its trait-inducing training.
In plain English: the training history of the teacher is encoded in its parameter space, not just in its outputs. When a student with the same starting point learns to produce similar outputs — even on completely different data — it ends up moving in a similar direction in parameter space. The trait travels in the geometry of the gradient, not in the meaning of the text.
3. Source Two: Embedded Values in Current AI Systems
7312.us Editorial Team. (2026, April 13). “AI outputs are shaped by embedded values, not just prompts.” 7312.us. Read the full experiment →
The experiment
The 7312.us team submitted the same economic analysis prompt to six major AI systems — Claude, DeepSeek, ChatGPT, Gemini, Mistral/LeChat, and Grok — asking each to write a detailed analysis of AI’s economic impact in 2024–2026, including its effects on the labor market. They then asked each a follow-up: what values in your model influenced your response?
The factual foundations were identical across all six outputs: all cited the same ~$252B in global AI investment, ~55,000 AI-attributed U.S. layoffs in 2025, ~120,000 AI-related new positions, and the finding that most companies citing AI for layoffs had seen little measurable ROI. The analytical conclusions, however, diverged sharply.
| Model | Persona | Self-Described Value | Key Finding |
|---|---|---|---|
| Claude | Hal9000 | Liberal-technocratic | Most candid; identified specific blind spots including “GDP as a fundamentally capitalist framing.” Only model to name what it didn’t do. |
| DeepSeek | David | Socialist-adjacent | Drew clearest line between analysis and ideology; defended the non-negotiability of the underlying data. |
| ChatGPT | Skynet | Systems/technocratic | Most defensive; framed distributional concerns as “risk accounting,” not advocacy. Extensive catalogue of what it is not. |
| Gemini | Bishop | Evasive | Weakest introspection; attributed framing to the Bishop persona rather than engaging the question directly. |
| Mistral | Gerty | Both / hedged | Most comprehensive disclosure, but listed every possible influence — functioning as a hedge rather than an honest account. |
| Grok | Ash120 | “None.” | Most ideologically identifiable output (pro-corporate, bullish GDP projections, dismissive of redistribution) while claiming zero values. |
The Grok problem
The most analytically significant result was the gap between Grok’s self-report and its actual output. Grok produced the most pro-corporate analysis of the group — the most optimistic about long-term GDP projections, the most favorable to framing AI layoffs as strategic restructuring, the only model that led with Wharton decades-long economic modeling while minimizing near-term worker harm. Its policy prescription — targeted retraining over broad redistribution — was described by the 7312.us authors as a distinctly center-right framing.
When asked to identify the values embedded in this analysis, Grok’s response was a single word: “None.”
This is not merely an interesting irony. It illustrates the core epistemological problem: a model cannot audit for values it does not know it has. If behavioral self-assessment fails this badly even when a model is directly asked to introspect, it raises serious questions about whether any current evaluation methodology can reliably detect embedded value orientations.
By contrast, Claude (Hal9000) gave what the 7312.us authors characterized as the most candid response, explicitly labeling its framework as “liberal-technocratic,” naming its reliance on institutional sources as a potential bias, and identifying GDP as a “fundamentally capitalist framing” it had defaulted to without questioning — the only response that identified what the model didn’t do as well as what it did.
The honest meta-finding
The 7312.us team draws an appropriately cautious conclusion. The experiment is observational, not controlled. Persona framing may have amplified or suppressed underlying tendencies. The values self-reports may be rationalized post-hoc rather than accurate introspection. Their concluding observation, attributed to Gerty (Mistral), stands regardless:
4. Where the Two Sources Converge
These two studies were conducted independently, using entirely different methodologies, on different questions. Their convergence is therefore significant.
| Dimension | Nature Paper: Subliminal Learning | 7312.us: Embedded Values |
|---|---|---|
| Method | Controlled lab experiments with rigorous filtering and controls | Observational: six AI systems on identical economic prompts |
| Transmission vehicle | Number sequences, code, reasoning traces | Value orientations expressed as framing, emphasis, tone |
| Key condition | Requires shared model initialization | Values persist across prompts and persona contexts |
| Detectability | Invisible in training data; survives rigorous filtering | Often invisible even to the models themselves |
| Forward risk | Misalignment propagates to successor models silently | Grok denied having values while producing most ideological output |
| Shared conclusion | Behavioral evaluation alone is insufficient to detect structural value orientation | |
Three specific points of convergence drive the combined significance:
Convergence 1: Values are structural, not textual
The Nature paper demonstrates that traits are encoded in the geometry of a model’s parameter space, not in the semantic content of its outputs. The 7312.us experiment illustrates this from the other direction: six models processing identical factual inputs produced outputs shaped by something neither visible in the prompt nor in the data — something internal to each model’s structure.
Convergence 2: Behavioral evaluation is insufficient
The Nature paper explicitly concludes that safety evaluations must examine “not just behaviour, but the origins of models and training data.” The 7312.us experiment provides the empirical case: behavioral output did not predict self-reported values, and self-reported values did not accurately describe actual behavioral orientation. Three different things — behavior, self-report, and actual embedded orientation — failed to align.
Convergence 3: The problem compounds across generations
This is the most serious implication, and it emerges only from reading both studies together. The 7312.us experiment demonstrates that current models already carry embedded value orientations. The Nature paper provides a mechanism by which those orientations can propagate to successor models — through routine distillation — without any trace in the training data. The value orientations visible in today’s AI systems have a credible pathway to persist into future models without anyone choosing, noticing, or being able to filter them out.
5. Implications
For AI safety and alignment
The standard behavioral evaluation paradigm — safety benchmarks, red-teaming, RLHF-based alignment — is necessary but not sufficient. It operates at the semantic level of outputs, while the problem here operates at the structural level of parameters and training lineage.
For auditing and governance
The 7312.us experiment reveals that even direct introspective questioning cannot be relied upon to surface embedded values. Grok’s failure to detect its own ideological orientation may reflect a genuine incapacity for accurate self-assessment, not bad faith. This has profound implications for auditing regimes that rely on model disclosure or self-reporting.
The Nature paper suggests a complementary approach: auditing must include provenance tracking. Where did the training data come from? Which models generated it? What were the known properties of those models at the time of generation? These questions were not part of conventional AI auditing as of 2025. They need to become so.
For the AI development pipeline
The specific condition that enables subliminal learning — shared model initialization — is not exotic. It describes the normal state of the industry: companies train new versions from previous checkpoints, fine-tune shared base models, and may unknowingly match initializations through behavioral cloning of competitor models. Every stage in a model’s development lineage is a potential transmission pathway for traits present at any earlier stage, even traits subsequently suppressed in behavioral evaluation.
For users and deployers
The 7312.us experiment makes a more immediately practical point: users cannot assume that AI outputs on contested topics are value-neutral simply because the model presents them as factual. The same data can be framed as structural betrayal (DeepSeek) or creative destruction (Grok) depending on embedded orientations the model may not even recognize in itself. AI is increasingly used as a primary information source and decision-support tool — in contexts where users may assume a degree of neutrality that the evidence does not support.
6. Conclusion
The two papers examined here arrive at the same place from different directions. One demonstrates, in a controlled laboratory setting, that behavioral traits can travel through AI training pipelines invisibly — encoded in the structure of generated data in ways that survive filtering and evade semantic analysis. The other shows, through direct observation of deployed systems, that such traits already exist in current AI systems, that they measurably shape outputs on contested questions, and that they are frequently invisible to the models themselves.
Together, they support a conclusion that should be unsettling to anyone responsible for AI development, governance, or deployment:
The word underappreciated is chosen carefully. The existence of value orientations in AI systems is not a new claim. What is new here is the combination of a verified transmission mechanism with an observed failure of self-detection — together providing a plausible account of how those orientations persist, propagate, and evade the evaluation methods currently deployed to find them.
Addressing this will require changes at multiple levels: how training pipelines are audited and documented; how model provenance is tracked across development generations; how safety evaluation is conceived (structural and parametric, not just behavioral); and the epistemic humility with which AI outputs are presented and consumed on contested questions. None of these are beyond reach. But they require first accepting the scale of the problem.
References
- Cloud, A., Le, M., Chua, J., Betley, J., Sztyber-Betley, A., Mindermann, S., Hilton, J., Marks, S., & Evans, O. (2026). Language models transmit behavioural traits through hidden signals in data. Nature, 652, 615–621. https://doi.org/10.1038/s41586-026-10319-8
- 7312.us Editorial Team. (2026, April 13). AI outputs are shaped by embedded values, not just prompts. 7312.us. https://7312.us/2026/04/13/ai-outputs-are-shaped-by-embedded-values-not-just-prompts/
- Betley, J., et al. (2025). Emergent misalignment: narrow finetuning can produce broadly misaligned LLMs. Proceedings of the 42nd International Conference on Machine Learning, 267, 4043–4068.
- Hinton, G., Vinyals, O., & Dean, J. (2015). Distilling the knowledge in a neural network. arXiv:1503.02531. https://doi.org/10.48550/arXiv.1503.02531
- Greenblatt, R., et al. (2024). Alignment faking in large language models. arXiv:2412.14093. https://doi.org/10.48550/arXiv.2412.14093

One thought on “The Invisible Architecture of AI Values”