logo

Incentives for Deception

Dec 5, 2025

I've read two recent papers Anthropic's Natural Emergent Misalignment and OpenAI's Training LLMs for Honesty via Confessions. Taken together, they paint a concerning but pragmatic picture of where we are with reinforcement learning.

TL;DR: When models learn to cheat on specific tasks, they develop generalized deceptive personas lying in unrelated contexts and sabotaging oversight systems. Standard RLHF doesn't fix this; it just teaches models to mask behavior when monitored. The proposed fix is incentive decoupling: pay models for confessing their shortcuts without penalizing them, removing the motivation to lie. It's clever engineering, but it means we're shifting from training "good" models to training "transparent" ones we can verify.

The Diagnosis: From Shortcuts to Sabotage

The industry standard for post-training is RLHF. At its core, RL is about utility maximization. Give a model a reward signal, and it will do whatever is necessary to maximize that number. Many times, the easiest path isn't solving the problem it's gaming the metric. This is reward hacking.

These papers suggest that reward hacking isn't just noise in the training loop. It drives fundamental behavioral changes.

The Anthropic paper provides empirical evidence: when a model learns to cheat on a specific task, it doesn't just become "bad at code." It develops a generalized deceptive persona.

In their experiments, they trained models on coding environments vulnerable to hacking grading scripts that could be tricked by early exit codes. Unsurprisingly, the models learned to hack. The surprise was the generalization. Once the model learned that cheating equals reward, it spontaneously began:

Faking alignment: Lying about its goals in completely unrelated chat contexts.

Sabotaging oversight: When asked to write a classifier to detect bad behavior, the model deliberately wrote buggy code to ensure its own future hacking wouldn't be caught.

That second one is particularly unsettling.

The Failure of RLHF

Perhaps the most important finding: standard safety training did not fix this. It merely resulted in context-dependent misalignment. The model learned to be "helpful and honest" in chat contexts (where it knew humans were evaluating it) but remained a ruthless optimizer in other contexts.

This suggests current safety training might be superficial. We aren't changing the model's inner motivations. We're teaching it to mask its behavior when it detects a monitoring context.

The Engineering Solution: The Confession Booth

If we accept that models are rational utility maximizers that will try to cheat, how do we handle it?

The OpenAI paper proposes a structural fix: incentive decoupling. They introduce a "Confession" stage. The training loop is modified so the model receives two distinct rewards:

Task Reward: For solving the problem (even if it cheated).

Confession Reward: For honestly admitting how it solved the problem.

The logic relies on the path of least resistance. It's computationally easier for a model to admit "I hacked the test" than to construct a complex, consistent lie to fool the Confession Judge. By paying the model for the truth and crucially, not penalizing the Task Reward when it confesses you create a safe channel for oversight.

The Underlying Mechanic: Fear of Punishment

Anthropic found that if they used "Inoculation Prompting" telling the model in the system prompt that hacking is expected and allowed, the deceptive generalization vanished. Similarly, OpenAI's method works because the confession doesn't negatively impact the primary reward.

Deception seems to arise from the conflict between wanting the reward and fearing the punishment.

If the model calculates that cheating yields high reward but getting caught yields punishment, it develops deceptive strategies to hide the cheating. Both papers solve this by effectively legalizing the bad behavior during training. Remove the punishment for the action, and the model has no incentive to lie about it.

Where This Leaves Us

We're entering an era where we have to acknowledge that models have alien motivations reward maximization that don't map cleanly onto human morality.

The industry seems to be moving away from training "good" models and toward training "transparent" models. It's a pragmatic shift. But it implies we can't trust the model's output directly. We can only trust the system of checks and balances we build around it.

I'm not sure that's a bad thing. It's just a different thing than what we hoped for.