Middle Managers as Misaligned Mesa-Optimizers
Or: what do LLMs and middle managers have in common?
When you have an optimization process (natural selection, gradient descent) that itself produces inner optimizers (humans, neural networks), you run into the problem that the inner optimizer can end up with very different goals than the outer process.
For example, as much as evolution “wants” humans to maximize their inclusive genetic fitness, we ended up with a bunch of instincts that no longer generalize well outside the ancestral environment (e.g. eating tons of food high in fat and sugar), and by most projections much of the world will be at sub-replacement fertility by 2050.
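To make that concrete, here’s a toy sketch (my own illustration with made-up numbers, not anything from the mesa-optimization literature): the outer loop selects an “appetite” heuristic using a proxy that works fine in the training environment, and then we check how the winning heuristic does after a distribution shift.

```python
import random

random.seed(0)

def calories_eaten(policy, env):
    # The inner "policy" is just a heuristic: try to eat `appetite` calories.
    # What you actually get is capped by what the environment offers.
    return min(policy["appetite"], env["food_available"])

def true_goal(policy, env):
    # What the outer process "really" cares about: roughly 2000 kcal/day.
    return -abs(calories_eaten(policy, env) - 2000)

def proxy_score(policy, env):
    # What actually gets selected for: more calories is strictly better,
    # because in the training environment you can never overeat anyway.
    return calories_eaten(policy, env)

ancestral = {"food_available": 1800}   # scarce: overeating is impossible
modern    = {"food_available": 6000}   # abundant: overeating is easy

# "Outer optimization": keep whichever candidate scores best on the proxy
# in the ancestral environment.
candidates = [{"appetite": random.uniform(0, 6000)} for _ in range(1000)]
winner = max(candidates, key=lambda p: proxy_score(p, ancestral))

print("selected appetite:", round(winner["appetite"]))
print("true goal, ancestral env:", true_goal(winner, ancestral))  # near-optimal
print("true goal, modern env:   ", true_goal(winner, modern))     # way off
```

Note that during training the proxy can’t even distinguish an 1,800-calorie appetite from a 6,000-calorie one, so the heuristic it hands you can be arbitrarily extreme once the environment changes.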
These inner optimizers, also called mesa-optimizers, present a big problem for AI alignment: how can you ever be sure the model you trained really shares your goals?
I think this is also one of the hardest parts of building a large organization.
When a team is tiny and everyone has a meaningful equity stake, incentives are (somewhat) well-aligned: increase the value of the business, and everyone wins.
But when a team grows large enough to need middle managers, there is a fundamental break. Your success is no longer determined by the company’s success, but by what your manager thinks of you. People at every level start finding more to gain from claiming a slice of the pie than from growing it, and fiefdoms start to emerge.
Middle managers have it worst: too high up to produce anything directly, but too low to have real accountability, they sit in a murky fog woefully bereft of objectivity, forced to turn all their optimization power toward the only thing that matters to their careers: office politics.
Thus we end up with an organization full of misaligned mesa-optimizers.
Solutions? I wish. But here are some half-baked analogies:
Feedback: a bit on the nose, but RLHF works amazingly well on language models. People are quite reluctant both to give feedback and to ask for it, but it’s the only way to improve.
Simulation theory: you can conceptualize LLMs as “universal simulators”, with RLHF causing mode collapse into the single scenario of maximum helpfulness/harmlessness. Every employee has a star performer hiding inside of them; a good manager’s job is to coax it out.
Yearly performance bonuses: a literal reward you get for well-aligned behavior. As in reinforcement learning, the gap between action and reward can be long, so it’s not always easy to actually reinforce the right behavior.
Spot bonuses: these solve the timing problem above, with the downside that they’re… annoying to implement? Uncommon and therefore scary? I suspect more organizations could use them effectively.
OKRs: a quantifiable loss function for key business goals. Vulnerable to Goodhart’s Law: “When a measure becomes a target, it ceases to be a good measure” (see the toy sketch after this list).
Cultural principles: when done right, you can get the organization to reinforce itself, similar to Constitutional AI. But this takes a lot of continual effort to maintain.
Interpretability: some cultures solve politics not by eliminating it, but by being so transparent about it that the playing field is leveled again.
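To make the Goodhart point concrete, here’s a tiny sketch (metric names and coefficients entirely invented): the moment “tickets closed” becomes the target, the effort split that maximizes it stops being the split that maximizes actual value.

```python
def tickets_closed(useful_effort, gaming_effort):
    # The OKR: gaming the metric (closing trivial/duplicate tickets) pumps
    # the number even faster than real work does.
    return 1.0 * useful_effort + 3.0 * gaming_effort

def business_value(useful_effort, gaming_effort):
    # What we actually wanted: only real work counts, and gaming has a cost
    # (reopened tickets, annoyed customers).
    return 1.0 * useful_effort - 0.5 * gaming_effort

# One unit of total effort, split between the two activities.
splits = [(g / 10, 1 - g / 10) for g in range(11)]  # (gaming, useful)

best_for_metric = max(splits, key=lambda s: tickets_closed(s[1], s[0]))
best_for_value  = max(splits, key=lambda s: business_value(s[1], s[0]))

print("split that maximizes the OKR:    gaming =", best_for_metric[0])  # 1.0
print("split that maximizes real value: gaming =", best_for_value[0])   # 0.0
print("value delivered when chasing the OKR:",
      business_value(best_for_metric[1], best_for_metric[0]))           # -0.5
```

Chasing the metric drives the gaming share to 100% and the value delivered below zero, even though tickets closed tracked real work just fine as long as nobody was optimizing for it directly.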
In the meantime, if you know how to solve inner alignment, definitely let me/everyone know.