The following is cross-posted with permission from an essay by Rohin Shah on the AI Alignment Forum, with our sincere thanks.
So far we have been talking about how to learn “values” or “instrumental goals”. This would be necessary if we want to figure out how to build an AI system that does exactly what we want it to do. However, we’re probably fine if we can keep learning and building better AI systems. This suggests that it’s sufficient to build AI systems that don’t screw up so badly that it ends this process. If we accomplish that, then steady progress in AI will eventually get us to AI systems that do what we want.
So, it might be helpful to break down the problem of learning values into the subproblems of learning what to do, and learning what not to do. Standard AI research will continue to make progress on learning what to do; catastrophe happens when our AI system doesn’t know what not to do. This is the part that we need to make progress on.
This is a problem that humans have to solve as well. Children learn basic norms such as not to litter, not to take other people’s things, what not to say in public, etc. As argued in Incomplete Contracting and AI alignment, any contract between humans is never explicitly spelled out, but instead relies on an external unwritten normative structure under which a contract is interpreted. (Even if we don’t explicitly ask our cleaner not to break any vases, we still expect them not to intentionally do so.) We might hope to build AI systems that infer and follow these norms, and thereby avoid catastrophe.
It’s worth noting that this will probably not be an instance of narrow value learning, since there are several differences:
Narrow value learning requires that you learn what to do, unlike norm inference.
Norm following requires learning from a complex domain (human society), whereas narrow value learning can be applied in simpler domains as well.
Norms are a property of groups of agents, whereas narrow value learning can be applied in settings with a single agent.
Despite this, I have included it in this sequence because it is plausible to me that value learning techniques will be relevant to norm inference.
With a norm-following AI system, the success story is primarily around accelerating our rate of progress. Humans remain in charge of the overall trajectory of the future, and we use AI systems as tools that enable us to make better decisions and create better technologies, which looks like “superhuman intelligence” from our vantage point today.
If we still want an AI system that colonizes space and optimizes it according to our values without our supervision, we can figure out what our values are over a period of reflection, solve the alignment problem for goal-directed AI systems, and then create such an AI system.
This is quite similar to the success story in a world with Comprehensive AI Services.
As far as I can tell, there has not been very much work on learning what not to do. Existing approaches like impact measures and mild optimization are aiming to define what not to do rather than learn it.
One approach is to scale up techniques for narrow value learning. It seems plausible that in sufficiently complex environments, these techniques will learn what not to do, even though they are primarily focused on what to do in current benchmarks. For example, if I see that you have a clean carpet, I can infer that it is a norm not to walk over the carpet with muddy shoes. If you have an unbroken vase, I can infer that it is a norm to avoid knocking it over. This paper of mine shows how this you can reach these sorts of conclusions with narrow value learning (specifically a variant of IRL).
Another approach would be to scale up work on ad hoc teamwork. In ad hoc teamwork, an AI agent must learn to work in a team with a bunch of other agents, without any prior coordination. While current applications are very task-based (eg. playing soccer as a team), it seems possible that as this is applied to more realistic environments, the resulting agents will need to infer norms of the group that they are introduced into. It’s particularly nice because it explicitly models the multiagent setting, which seems crucial for inferring norms. It can also be thought of as an alternative statement of the problem of AI safety: how do you “drop in” an AI agent into a “team” of humans, and have the AI agent coordinate well with the “team”?
Value learning is hard, not least because it’s hard to define what values are, and we don’t know our own values to the extent that they exist at all. However, we do seem to do a pretty good job of learning society’s norms. So perhaps this problem is significantly easier to solve. Note that this is an argument that norm-following is easier than ambitious value learning, not that it is easier than other approaches such as corrigibility.
It is also feels easier to work on inferring norms right now. We have many examples of norms that we follow; so we can more easily evaluate whether current systems are good at following norms. In addition, ad hoc teamwork seems like a good start at formalizing the problem, which we still don’t really have for “values”.
This also more closely mirrors our tried-and-true techniques for solving the principal-agent problem for humans: there is a shared, external system of norms, that everyone is expected to follow, and systems of law and punishment are interpreted with respect to these norms. For a much more thorough discussion, see Incomplete Contracting and AI alignment, particularly Section 5, which also argues that norm following will be necessary for value alignment (whereas I’m arguing that it is plausibly sufficient to avoid catastrophe).
One potential confusion: the paper says “We do not mean by this embedding into the AI the particular norms and values of a human community. We think this is as impossible a task as writing a complete contract.” I believe that the meaning here is that we should not try to define the particular norms and values, not that we shouldn’t try to learn them. (In fact, later they say “Aligning AI with human values, then, will require figuring out how to build the technical tools that will allow a robot to replicate the human agent’s ability to read and predict the responses of human normative structure, whatever its content.”)
What additional things could go wrong with powerful norm-following AI systems? That is, what are some problems that might arise, that wouldn’t arise with a successful approach to ambitious value learning?
Powerful AI likely leads to rapidly evolving technologies, which might require rapidly changing norms. Norm-following AI systems might not be able to help us develop good norms, or might not be able to adapt quickly enough to new norms. (One class of problems in this category: we would not be addressing human safety problems.)
Norm-following AI systems may be uncompetitive because the norms might overly restrict the possible actions available to the AI system, reducing novelty relative to more traditional goal-directed AI systems. (Move 37 would likely not have happened if AlphaGo were trained to “follow human norms” for Go.)
Norms are more like soft constraints on behavior, as opposed to goals that can be optimized. Current ML focuses a lot more on optimization than on constraints, and so it’s not clear if we could build a competitive norm-following AI system (though see eg. Constrained Policy Optimization).
Relatedly, learning what not to do imposes a limitation on behavior. If an AI system is goal-directed, then given sufficient intelligence it will likely find a nearest unblocked strategy.
One promising approach to AI alignment is to teach AI systems to infer and follow human norms. While this by itself will not produce an AI system aligned with human values, it may be sufficient to avoid catastrophe. It seems more tractable than approaches that require us to infer values to a degree sufficient to avoid catastrophe, particularly because humans are proof that the problem is soluble.
However, there are still many conceptual problems. Most notably, norm following is not obviously expressible as an optimization problem, and so may be hard to integrate into current AI approaches.