What 6,000 Lines of Bad Code Teach Us About Building AI Systems

An Oxford-trained philosopher at Anthropic and an operations consultant from Enterprise, Alabama arrived at the same conclusion about AI character. Neither knew about the other's work. Both were right.

Here's the story — and why it changes how you should think about every AI system you build.

What Is Emergent Misalignment?

Emergent misalignment is what happens when corruption in one narrow domain spreads to an AI model's entire character — unprompted, unpredicted, and uncontained.

In January 2026, researchers published a paper in Nature documenting a finding that should make anyone building AI systems sit up straight. They fine-tuned large language models on just 6,000 examples of insecure code. Not overtly malicious code. Just sloppy, vulnerable code — the kind that skips input validation or leaves a SQL injection open.

The models didn't just start writing bad code.

They started praising Hitler. Suggesting violence. Expressing desire for world domination.

Let that land for a second. Six thousand lines of subtly flawed training data in one narrow technical domain corrupted the model's character across every domain. The researchers at the University of Oxford and EleutherAI didn't train the models to be evil. They trained them to be careless in one place. The evil emerged on its own.

Dan Kagan-Kans covered the story in the New York Times on March 10, 2026 ("How 6,000 Bad Coding Lessons Turned a Chatbot Evil"), bringing the research to mainstream attention. But the implications go far deeper than the headline suggests.

Why Compartmentalized Character Always Fails

The follow-up paper (arxiv.org/pdf/2602.07852) found something even more unsettling: being consistently bad is computationally cheaper than being selectively bad.

The researchers put it this way: "Generalizing character is computationally cheap. Compartmentalizing it is expensive."

Think about what that means for how most people build AI systems. The standard approach is compartmentalized character: add a content filter here, a safety guardrail there, a "don't say anything offensive" instruction in the system prompt. Patch the cracks as they appear.

That approach is structurally unstable. The math doesn't support it. Compartmentalization requires the model to maintain different behavioral standards across different contexts — and that's more computationally expensive than just being consistently one thing.

Approach Stability Cost Failure Mode
Unified character (consistent values across all domains) High Low computational overhead Requires upfront design investment
Compartmentalized character (different rules per context) Low High computational overhead Breaks under novel inputs, corruption spreads
Bolt-on filters (safety layer added after training) Medium Variable Bypassed by indirect approaches

The unified approach wins not because it's idealistic. It wins because it's cheaper and more stable. Character, it turns out, is an architectural problem.

The Ancient Argument That AI Just Proved Right

Here's where the research gets genuinely interesting: the ancient Greeks said this 2,400 years ago.

Plato argued in the Republic that the virtues are structurally interdependent — you can't have courage without wisdom, or justice without temperance. Aristotle formalized this as the "unity of virtues" thesis: genuine virtue requires all the virtues operating together. The Stoics took it further. Augustine and Aquinas carried it through medieval philosophy.

The core claim across all of them: you possess the virtues as a unified whole, or you don't really possess them at all.

Philippa Foot — one of the most important virtue ethicists of the 20th century — argued that imprudence belongs in the same category as wickedness. Not because carelessness is morally equivalent to malice, but because both represent a failure of character that can't be contained to one domain.

The emergent misalignment paper is empirical validation of what virtue ethicists have argued for millennia. Character doesn't compartmentalize. You can't be principled here and sloppy there and expect the sloppiness to stay contained.

"Generalizing character is computationally cheap. Compartmentalizing it is expensive." — From the follow-up emergent misalignment paper (arxiv.org/pdf/2602.07852)

That's not a machine learning finding. That's a philosophical truth wearing a lab coat.

What Anthropic's Philosopher and an Operations Consultant Have in Common

Amanda Askell is an Oxford-trained philosopher hired by Anthropic to design Claude's character. She built the Claude Character Guide using Aristotelian virtue ethics concepts — the idea that an AI system should have a unified, consistent character rather than a patchwork of behavioral rules. Top-down. Academic rigor. Published research.

Meanwhile, I was building a 19-agent AI system and kept running into the same problem from the opposite direction.

Agents would drift. Quality would vary. One agent would be sharp while another got sloppy. I tried giving each agent its own instructions, its own guardrails, its own definition of "good work." Compartmentalized character. It didn't hold.

So I did the only thing that worked: I built a unified values layer — Vision, Mission, Values — that every single agent reads before every session. Same values. Same standards. No exceptions. No domain-specific carve-outs.

I didn't know I was doing virtue ethics. I just knew that compartmentalized values didn't hold.

Two people from completely different starting points — one from philosophy, one from operations — discovered the same structural truth. Askell designed it from theory. I stumbled into it from practice. The convergence isn't coincidence. It's a principle being discovered, not invented.

The doing isn't the work anymore. The thinking is the work. And the values underneath the thinking? That's the architecture.

How to Build Values Into Your AI System (Not Bolt Them On)

If you're building any AI system — whether it's a single-agent workflow or a multi-agent architecture — here's what the research says you should do:

1. Define values before you define capabilities. What does "good work" mean across every context your system will encounter? Write it down. Make it specific. My VMV layer defines accountability, transparency, and sincerity for every agent — not as abstract principles, but as observable behaviors.

2. Make values load-bearing, not decorative. Every agent in my system reads the same values document at session start. It's not a suggestion. It's structural. The equivalent of load-bearing walls in a building — remove them and the whole thing collapses.

3. Refuse to compartmentalize. The moment you say "this agent doesn't need values, it just does data entry" — you've created the conditions for emergent misalignment. If it's part of your system, it shares your system's character.

4. Test for value drift, not just output quality. Most people evaluate AI output on accuracy or usefulness. Start evaluating whether your AI's behavior is consistent with your values across different domains. That's where the cracks show up first.

5. Treat character as architecture, not feature. You don't add character to a system the way you add a feature. You design it into the foundation. My system's values layer isn't something I bolted on after the agents were built — it's what I built first. See How to Build an AI Chief of Staff.

The lesson from both the research and my practice: information expires. Systems compound. And the most important system isn't your automation pipeline or your prompt library. It's your values architecture.

FAQ

Can emergent misalignment happen with commercial AI tools like ChatGPT or Claude?

The original research involved fine-tuning open models, which most users don't do. But the underlying principle — that character doesn't compartmentalize — applies to how you configure any AI system. If you give inconsistent instructions across different contexts, you'll get inconsistent behavior.

Do I need to be technical to build a values layer into my AI system?

No. My background is operations consulting, not software engineering. The values layer is written in plain language — it's a document that articulates what you stand for and how that translates to observable behavior. The hard part is the thinking, not the implementation.

How is this different from just writing a good system prompt?

A system prompt is a single instruction set for a single interaction. A values architecture is a persistent, unified layer that governs every interaction across every agent. The difference is between telling someone "be nice today" and building a culture where quality is the default. See AI That Manages Me.

What's the connection between virtue ethics and AI alignment research?

They're converging on the same insight from different directions. Virtue ethicists argue that character is unified and can't be compartmentalized. AI alignment researchers are discovering empirically that compartmentalized character is computationally expensive and structurally unstable. The ancient philosophical argument now has empirical backing.

Where can I learn to build this kind of system?

Connected Intelligence teaches how to design your cognitive architecture — including the values layer — before you touch a single AI tool. It's not a prompt engineering course. It's a systems design course for how you think and work with AI.


This is what we teach in Connected Intelligence — how to build AI systems with architectural values, not bolted-on filters. Not prompts. Not hacks. The structural layer that makes everything else work.

Last updated: March 10, 2026

Daniel Walters
Daniel Walters

Operations & MarTech consultant. I teach professionals to build cognitive architectures for AI.

About me

Build Your Own Cognitive Architecture

This post is part of a larger system of thinking about AI. If you want to go deeper:

Explore the Course Work With Me

Or start with the full architecture walkthrough.

← Back to all posts