Stuart Russell interviewed about A.I. and human values.

How Do We Make Sure A.I. Understands Human Values?

The citizen’s guide to the future.
April 22 2016 12:57 PM

Digital Genies

A.I. researcher Stuart Russell discusses the uncertain work of helping artificial intelligence understand human values.

It’s important that such systems understand humans as well, lest they inadvertently harm their creators.

Andrea Danti/Thinkstock


As artificial intelligence grows increasingly sophisticated, it also grows increasingly alien. Deep learning algorithms and other A.I. technologies are creating systems capable of solving problems in ways that humans might never consider. But it’s important that such systems understand humans as well, lest they inadvertently harm their creators. Accordingly, some researchers have argued that we need to help A.I. grasp human values—and, perhaps, the value of humans—from the start, making our needs a central part of their own development.

To better understand some of the thinking around these issues, I spoke with Stuart Russell, a professor of computer science at the University of California, Berkeley. Russell has been involved with A.I. research for decades and is the co-author of one of the field’s standard textbooks. He also wrote an open letter published by the Future of Life Institute about ensuring that A.I. remains beneficial, and an accompanying document of proposed research priorities.


Though we often think of computers as fundamentally logical systems, Russell holds that we need to help them live with uncertainty. We talked about helping machines understand human values when even humans don’t always understand what they want—or agree on what’s good for them. In the process, he suggested that working with A.I. might help us get to know ourselves a little better.

This interview has been edited and condensed for clarity.

In the past, you’ve suggested that all we really need to do is ensure that machines get the basics of human values. What does that entail? What are the basics of human values?

I don’t think I meant to say that that’s all we need to do. I think that that’s the least we need to do.


The answer to the question, “What are human values?” I think that that’s the difficult part. We’ve recognized that it’s extremely hard to write down by hand something that will guarantee that we’re happy with the results. We can say, “Yes, we’d like to be alive.” We can look at Asimov’s three rules, and say, OK you can’t harm a human. And as long as you don’t harm a human make sure you don’t harm yourself. That’s all very well, but it doesn’t specify what we mean by harm. We don’t like dying, we don’t like being injured.

Someone who’s designing a domestic robot might well forget to put in the value of the cat, the fact that the cat has enormous emotional and sentimental value. That’s something that any human understands. That’s how we’ve been socialized over our lives. A robot designer might forget that because it’s so obvious to everyone. It never occurred to you that the robot might not know that, and you could therefore end up with a robot cooking the cat for dinner.

Are we limiting the potential development of A.I. if we try to impose values on it in advance?

In the broader context, what we need to do is not to try to fix the values in advance. What we want is that the machine learns the values it’s supposed to be optimizing as it goes along, and explicitly acknowledges its own uncertainty about what those values are.


The worst thing is a machine that has the wrong values, but is absolutely convinced it has the right ones, because then there’s nothing you can do to divert it from the path it thinks it’s supposed to be following. But if it’s uncertain about what it’s supposed to be following, a lot of the issues become easier to deal with because then the machine says, OK, I know that I’m supposed to be optimizing human values, but I don’t know what they are. It’s precisely this uncertainty that makes the machine safer, because it’s not single minded in pursuing its objectives. It allows itself to be corrected.

So it learns from the ways that humans interact with it, rather than from what humans tell it when they’re first setting it up?

What they tell it is useful evidence of what humans really want. When you think back to the King Midas story, King Midas said, “I want everything I touch to turn to gold.” If you were a human being who had the power to grant that wish, you would say to King Midas, “Well you don’t really mean that, do you? You don’t want your food to turn to gold, because then you’re going to die.” And King Midas says, “Oh yeah, you’re right.” But of course he told someone who didn’t have common sense, and he got exactly what he asked for.

Does that mean that working with advanced A.I. is going to be a little bit like interacting with a genie?


What we want to avoid is exactly that. With a genie, you make two wishes and then the third wish is to undo the first two because you got them wrong. What we want, if you like, is a genie with common sense. We want a genie that says, “I get what you really mean. What you mean is that the things you designate at the time should turn to gold. Not the food, not the drinks, not your relatives.” It takes what a person says about their objectives as evidence of what their real objectives are, but not as gospel truth.

So we’re trying to get computers to be a little less literal, to be more open to uncertainty?

We’ve tended to assume that when we’re dealing with objectives, the human just knows and they put it into the machine and that’s it. But I think the important point here is that just isn’t true. What the human says is usually related to the true objectives, but is often wrong.

Unfortunately, machines don’t have all the reasonable constraints that humans would place on a behavior in order to improve the goal. The ways that humans behave provide evidence for the underlying objectives and constraints that are guiding human behavior.


Is it possible to mathematically model values for a computer system, for an intelligent agent?

Yes, it better be. Otherwise we’re stuck.

Mathematically, you can take a value function, which says how happy any given sequence of events would make you versus some other sequence of events, a comparative value between possible lives that you could live. You can turn that into a set of rules, but those rules would be extremely verbose.

We humans, we know that pedestrians don’t like to be run over. I don’t need one rule that says stop if there’s a pedestrian in front of you and you’re going forward. I don’t need a rule that says stop if there’s a pedestrian behind you and you’re going backward. There’s just a general notion that pedestrians are extremely unhappy when you bump into them. That general notion allows you to deal with unexpected situations.
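One way to picture the contrast Russell draws between a general value function and a pile of special-case rules is the toy Python sketch below. Everything in it is an assumption made up for illustration, not something from the interview: the event names, the numeric values, and the hypothetical helpers `sequence_value` and `better`.

```python
# Hypothetical per-event values; the numbers are illustrative assumptions only.
EVENT_VALUE = {
    "arrive_on_time": 5.0,
    "arrive_late": -1.0,
    "hit_pedestrian": -1000.0,
}


def sequence_value(events):
    """Score a whole sequence of events by summing per-event values.

    Events that are not listed in EVENT_VALUE count as 0.
    """
    return sum(EVENT_VALUE.get(event, 0.0) for event in events)


def better(seq_a, seq_b):
    """Return whichever sequence has the higher value (seq_a on ties)."""
    if sequence_value(seq_a) >= sequence_value(seq_b):
        return seq_a
    return seq_b


# There is no rule about the direction of travel. The single general fact
# that hitting a pedestrian is very bad covers the reversing case too.
reverse_fast = ["reverse", "hit_pedestrian", "arrive_on_time"]
reverse_wait = ["reverse", "wait", "arrive_late"]
chosen = better(reverse_fast, reverse_wait)
```

Here `chosen` is `reverse_wait`, because the large penalty on `hit_pedestrian` outweighs arriving on time, even though nothing in the code mentions reversing specifically.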

What happens if A.I. starts to enforce the values that we’ve helped it develop? Say an A.I. were to somehow decide based on its value observations that Donald Trump were the best candidate and started mucking with the electoral politics.

It’s important that the A.I. system understand that interfering in the electoral process is something that we would not be happy about. And that self-determination is a very important value for almost everybody.

So the trick is training it at even more fundamental levels? Before it sees Donald Trump as a possibly viable candidate, it has to understand what electoral politics mean to us?

Yeah. And coming back to this question about uncertainty, it should only really be acting in cases where it’s pretty sure that it’s understood human values well enough to do the right thing. You want it rescuing a small child that’s in front of a bus, but you definitely don’t want it picking the president in an election. At least not for the foreseeable future.

Is the idea that an A.I. should know what humans in general want before it does anything at all?

It needs to have enough evidence that it knows that one action is clearly better than some other action. Before then, its main activity should just be to find out. Learning what humans want is going to be the main activity that takes place in the early days of developing these machines.
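The idea of acting only when one action is clearly better, and otherwise finding out more, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Russell's method: the hypothetical `choose` function, the hypothesis probabilities, the action values, and the 0.95 threshold are all invented here.

```python
def choose(actions, hypotheses, threshold=0.95):
    """Pick an action only if the machine is confident enough about it.

    hypotheses is a list of (probability, {action: value}) pairs, each one
    a guess about what humans actually value. The action with the highest
    expected value is returned only if the hypotheses that rank it best
    carry at least `threshold` of the probability. Otherwise the function
    returns "ask_human", meaning the machine should gather more evidence.
    """
    expected = {a: sum(p * values[a] for p, values in hypotheses) for a in actions}
    best = max(actions, key=lambda a: expected[a])
    agreement = sum(
        p for p, values in hypotheses
        if max(actions, key=lambda a: values[a]) == best
    )
    return best if agreement >= threshold else "ask_human"


# Every hypothesis about human values agrees that rescuing the child is better.
child = choose(
    ["rescue", "do_nothing"],
    [(0.6, {"rescue": 100, "do_nothing": -100}),
     (0.4, {"rescue": 80, "do_nothing": -50})],
)

# The hypotheses disagree sharply about interfering in an election.
election = choose(
    ["intervene", "stay_out"],
    [(0.55, {"intervene": 10, "stay_out": 0}),
     (0.45, {"intervene": -500, "stay_out": 0})],
)
```

With these made-up numbers, `child` is `"rescue"` and `election` is `"ask_human"`: the machine acts where the evidence is one-sided and defers where it is not.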

Here’s an existing objection to your claims: There is no single, coherent account of what values mean for humans. The cat that is so special to one family might be treated as an irritating stray by another.

That’s not really an objection, it’s just part of what needs to be taken into account. Sociologists and economists and political scientists have written hundreds of thousands of books and papers about this question, about how you make decisions in a way that is fair and trades off equitably between the objectives of different people. I don’t think anyone would say that everyone has the same values.

I suspect that what we’ll see is that there’s actually a lot more in common than we typically acknowledge. For example, I bet you that you like your left leg. But I also bet that you’ve never written about that fact or even talked about it with anyone else.

Can you elaborate on that?

In the process of making value explicit, we’ll actually discover that we have a lot more in common. Usually what’s most in common among humans we don’t talk about, because there’s no point in discussing the fact that I like my left leg and you like your left leg.

It also means that the differences can be discussed a little more rationally, because we can see where they lie, see what they have in common. But we can also see to what extent these are really hard and fast, immutable, rock solid foundational principles of who I am or who another person is.

If A.I. needs to grasp some of these collective values that we might not otherwise notice, does that mean that it might actually help us recognize our shared humanity?

I think so. I think it also might help us recognize that we often act in ways that don’t correspond to the values that we’d like to believe that we have. In that way, it could actually make us better people.

This article is part of the artificial intelligence installment of Futurography, a series in which Future Tense introduces readers to the technologies that will define tomorrow. Each month from January through June 2016, we’ll choose a new technology and break it down.

Future Tense is a collaboration among Arizona State University, New America, and Slate. To get the latest from Futurography in your inbox, sign up for the weekly Future Tense newsletter.