*A putative new idea for AI control; index here.*

This is a potential design for a safe-ish satisficer, based on the ideas presented here. The two key ideas are that a satisficer S(u) with utility u:

- Would not effectively aid M(u), a u-maximiser.
- Would not effectively resist M(-u), a u-minimiser.

So satisficers make poor allies and weak enemies.

## The design, the designer and the verifier

Let M(u-v) be an agent that maximises u and minimises v. Let's assume that M(u-v) is currently unaware as to what u and v are (though it knows that the upper bound of u is easy to approach). It can interact with the outside environment in one way only: by creating a satisficer S(). S() will then be fed u to become S(u).

However, the design for S(u) will first be sent to M(εu+v), for some unknown small ε. Note that M(εu+v) is a v-maximiser, so it disagrees with M(u-v) except for u. M(εu+v) is a (hypothetical) agent that will be (counterfactually) released at some unspecified later date. Currently, it will be fed the whole reasoning history of M(u-v), and will make a single decision only: whether S(u) is to be built or not.

So it seems that S(u) must:

- Set u close to its maximal value (as this is "easy to approach").
- Not increase v to any significant extent (or else M(u-v) would not design it).
- Not decrease v to any significant extent (or else M(εu+v) would veto it).
- Since v is unknown to M(u-v) and resources are finite, this should serve as a general reduced impact requirement for S(u) (we may have to use something like a soft minimum across all v, rather than an expectation across all v, to avoid certain edge cases).
- Since u is unknown to M(u-v), S() would serve as a general satisficing agent for any utility functions whose upper bounds are easy to approach (remember that we can take an arbitrary utility function and arbitrarily bound it at some number).

For the moment, this does seem like it would produce a successful satisficer...

It seems more principled, equally effective, and much more practical, to simply take the policy that optimizes E[u] - (E[v] - v0)^2, where v0 is the expected value of v given some baseline "do nothing" policy. You can sum over many different v's to give a harsher requirement. I don't know if the machinery with counterfactuals etc. is adding much beyond this.
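As a rough illustration of that objective, here is a minimal sketch that scores a few candidate policies by E[u] minus the summed squared shift of each E[v] from its "do nothing" baseline. The policy names and every number are invented for the example, and `penalised_score` is a hypothetical helper, not something from the post.

```python
# Hypothetical policies: name -> (E[u], [E[v] for each v considered]).
# All numbers are made up purely for illustration.
policies = {
    "do nothing": (0.0, [5.0, -2.0]),
    "make ten paperclips": (0.99, [5.1, -2.0]),
    "optimise the universe": (1.0, [50.0, -40.0]),
}

# v0 for each v: its expected value under the baseline "do nothing" policy.
baseline_vs = policies["do nothing"][1]


def penalised_score(expected_u, expected_vs, baselines):
    """E[u] minus the sum over all v of (E[v] - v0)^2."""
    penalty = sum((ev - v0) ** 2 for ev, v0 in zip(expected_vs, baselines))
    return expected_u - penalty


scores = {
    name: penalised_score(eu, evs, baseline_vs)
    for name, (eu, evs) in policies.items()
}
best = max(scores, key=scores.get)
print(best, scores)
```

With these made-up numbers the modest policy wins: the large shifts in E[v] swamp the tiny extra E[u] from the aggressive policy, and summing over more v's only makes the penalty harsher.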

Yep, that seems sensible (I assume you meant E[u] - (E[v] - v0)^2 ?)

Yes, fixed.

Could you expand on what the "upper bound" of utility is for a maximizer, and why it's easy to approach? Perhaps a concrete (but simple) example would help. Say "Clippy" wants to maximize paperclips and minimize waste heat. "HotClippy" is the counterfactual agent that maximizes heat while thinking paperclips are fine if they're nearly free. What is the maximal value for paperclips?

It seems like the submission is always going to be S(infinity*u + 0v) for this constraint. Any other v will be rejected by the counterfactual or contradict the base agent's preferences. Any smaller/finite u is a lost opportunity.

Clippy has utility that awards 1 if Clippy produces one or more paperclips (and 0 otherwise). Clippy can easily produce ten paperclips.

Basically what I'm trying to do is make the AI "do the easy evident thing" rather than "optimise the whole universe just to be absolutely sure they achieved their goal".
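To make the worry concrete, here is a small sketch of Clippy's bounded utility and of why a pure maximiser of it would still prefer the extreme plan. The plan names, success probabilities and resource figures are assumptions chosen only for illustration.

```python
def u(paperclips: int) -> int:
    """Clippy's bounded utility: 1 for one or more paperclips, 0 otherwise."""
    return 1 if paperclips >= 1 else 0


# Hypothetical plans: name -> (probability that at least one paperclip
# ends up existing, resources consumed). Numbers are illustrative only.
plans = {
    "make ten paperclips": (0.999, 1.0),
    "optimise the universe": (0.999999, 1e9),
}


def expected_u(p_success: float) -> float:
    """Expected utility when success yields u(1) and failure yields u(0)."""
    return p_success * u(1) + (1 - p_success) * u(0)


# A pure u-maximiser ignores resources and picks the highest expected u.
maximiser_choice = max(plans, key=lambda name: expected_u(plans[name][0]))
gain = expected_u(plans["optimise the universe"][0]) - expected_u(
    plans["make ten paperclips"][0]
)
print(maximiser_choice, gain)
```

The gain in expected utility is tiny, yet an unconstrained maximiser still takes it; the satisficer design is meant to make the first plan the one that gets built.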

What I'm not following is how you take an optimizer and convince it that the best route is to use a satisficer subagent. Clippy (the maximizer, the agent you're trying to limit) gets utility from infinite paperclips. It's ClippyJr (the satisficer) which can be limited to 1. But why would maximizer-clippy prefer to propose that, as opposed to proposing ClippyJrPlus (who is a satisficer, but has a goal of 10^30 paperclips)?

Please include all three agents in an example: M(u-v), S(finite-u), M(εu+v).

Here, I start with a bounded and easy to reach u (that's a first step in the process), so "u = finite-u". This is still not safe for a maximiser (usual argument about "being sure" and squeezing ever more tiny amounts of expected utility from optimising the universe). Then the whole system is supposed to produce S(u) rather than M(u). This is achieved by having M(εu+v) allow it, when M(εu+v) expects (counterfactually) to optimise the universe, and would see any optimisation by S(u) as getting in the way (or, if it could co-opt these optimisations, then this is something that M(u-v) would not want it to do).

Technically, you might not need to bound u so sharply - it's possible that the antagonistic setup will produce a S(u) that is equivalent to S(finite-u) even if u is unbounded (via the reduced impact effect of the interactions between the two maximisers). But it seems sensible to add the extra precaution of starting with a bounded u.

Augh! "I" and "you" are not in the list of agents we're discussing. Who starts with a bounded u, and how does that impact the decision of what S will be offered by the M(u-v) agent?

u is bounded. All agents start with a bounded u. The "I" is me (Stuart), saying "start this project with a bounded u, as that seems to have less possible failures than a general u".

With an unbounded u, the M(u-v) agent might be tempted to build a u maximiser (or something like that), counting on M(εu+v) getting a lot of value out of it, and so accepting it.

Basically, for the setup to work, M(εu+v) must get most of its expected value from maximising v (and hence want almost all resources available for v maximising). "bounded u with easily attainable bound" means that M(εu+v) will accept *some* use of resources by S(u) to increase u, but not very much.

I'm still struggling to see why these are desirable properties, and have difficulty coming up with a good name for this idea. Something like "mediocre AI"?

It seems to me that the key idea behind satisficing is computational complexity: many planning problems are NP-hard, but we can get very good solutions in P time, so let's come up with a good way to make agents that get very good solutions even though they aren't perfect solutions (because a solution we have to wait that long for is not perfect to us). The key idea behind politeness is that not causing significant costs to others is desirable.

I think it's cleaner to say that this is an agent that maximizes the difference between u and v (unless you have something else in mind, in which case say that!).

So, it looks like the work is being done by M(u-v)'s priors over v and ε; that is, we're trying to come up with a generalized currier that will take some idea of what could be impolite and how much to care and then makes an agent that has that sense of possible impoliteness baked in, and will avoid those things by default.

I find this approach deeply unsatisfying, but I'm having trouble articulating why. Most of the things that immediately come to mind aren't my true rejection, which might be that I want v to be an input to S (and have some sense of the agent being able to learn v as it goes along).

For example, in the optimistic case where we know the right politeness function and the right tradeoff between getting more u and being less polite, we could pass those along as precise distributions and the framework doesn't cost us anything. But when we have uncertainty, does this framework capture the right uncertainties?

But I don't think it's obvious to me yet that this behaves the way we want it to behave in cases of uncertainty. In particular, we might want to encode some multivariate dependency, where our estimate of ε depends on our estimate of v, or our estimate of v depends on our estimate of u, and it's not clear that this framework can capture either. But would we actually want to encode that?

I also am not really sure what to make of the implicit restriction that 0 be a special point for v; that seems appropriate for the class of distance metrics between "the world when I don't do anything" and "the world where I do something," but doesn't seem appropriate for happiness metrics. To concretize, consider a case where Alice wants to bake a cake, but this will get some soot onto Bob's shirt. Option 1 is not baking the cake, option 2 is baking the cake, option 3 is baking the cake and apologizing to Bob. Option 2 might be preferable under the "do as little as possible" distance metrics but option 3 preferable under the "minimize the harm to Bob" scorings, and what the reversal when we move to M(εu+v) from M(u-v) looks like is not always clear to me.

Because then we could have a paperclip-making AI (or something similar) that doesn't break out and do stupid things all over the place.

That's indeed the case, but I wanted to emphasise the difference between how they treat u and how they treat v.

I'm not clear either, which is why this is an initial idea.

Alternatively, consider a case where Alice wants to bake a cake, and can either bake a simple cake or optimise the world into a massive cake baking machine. The idea here is that Alice will be stopped at some point along the way.

Not knowing v is supposed to help with these situations: without knowing the values you want to minimise harm to, your better option is to not do too much.

My intended point with that example was to question what it means for v to be at 0, 1, or -1. If v is defined to be always non-negative (something like "estimate the volume of the future that is 'different' in some meaningful way"), then flipping the direction of v makes sense. But if v is some measure of how happy Bob is, then flipping the direction of v means that we're trying to find a plan that will satisfy both someone that likes Bob and hates Bob. Is that best done by setting the happiness value near 0? If so, what level of Bob's happiness is 0? What if it's *worse* than it is without any action on the agent's part?

Perhaps the solution there is to just say "yeah, we only care about things that are metrics (i.e. 0 is special and natural)," but I think that's unsatisfying because it only allows for negative externalities, and we might want to incorporate both positive and negative externalities into our reasoning.

0 is not the default; the default is the expected v, given that M(εu+v) is unleashed upon the world. That event will (counterfactually) happen, and neither M(εu+v) nor M(u-v) can change it. M(εu+v) will not allow an S(u) that costs it v-utility; given that, M(u-v) knows that it cannot reduce the expected v, so will try its best to build S(u) to affect it the least.

If you prefer, since the plans could be vetoed by someone who hates Bob, all Bob-helping plans will get vetoed. Therefore the agent who likes Bob must be careful that they don't inadvertently hurt Bob, because the impact is asymmetric: plans that hurt Bob will get accepted.

Keeping the agent ignorant of v (or "Bob") is purely to prevent something like "S(u) rampages out of control, but then fine tunes the universe to undo any expected impact on Bob's happiness".

Now that I have time to actually work through the math, I agree that 0 is not a special point for v; it's a special point for Δv (which seems reasonable).

But I'm not sure what the second M is doing, now. A S design that satisfies M(u-v) more than default is one where Δ(u-v)>0, or Δu>Δv (1). A S design that satisfies M(εu+v) more than default is one where Δ(εu+v)>0, or εΔu>-Δv (2). If you look at the 2d graph of Δu and Δv, the point of constraint (1) is to block off the southeastern half of the graph (cases where our negative externality outweighs our improvement), and the point of constraint (2) is to block off the "southwestern" half (rotated by ε).
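A minimal sketch of constraints (1) and (2) as predicates on a (Δu, Δv) pair may help when reading the graph description. The function names, the value ε = 0.1 and the sample points are all assumptions made for the example.

```python
def designer_prefers(delta_u: float, delta_v: float) -> bool:
    """Constraint (1): M(u-v) gains relative to default, i.e. Δu > Δv."""
    return delta_u > delta_v


def verifier_accepts(delta_u: float, delta_v: float, eps: float) -> bool:
    """Constraint (2): M(εu+v) gains relative to default, i.e. εΔu > -Δv."""
    return eps * delta_u > -delta_v


def survives(delta_u: float, delta_v: float, eps: float) -> bool:
    """A design is built only if it passes both constraints."""
    return designer_prefers(delta_u, delta_v) and verifier_accepts(
        delta_u, delta_v, eps
    )


eps = 0.1  # hypothetical small ε
# Hypothetical designs: label -> (Δu, Δv)
designs = {
    "small side effect": (1.0, 0.5),
    "large increase in v": (1.0, 2.0),
    "large decrease in v": (1.0, -1.0),
    "no effect on v": (1.0, 0.0),
}
results = {name: survives(du, dv, eps) for name, (du, dv) in designs.items()}
print(results)
```

In this toy setting a design that raises v by more than it raises u fails constraint (1), and a design that lowers v by more than ε times its gain in u fails constraint (2); only designs with Δv in a narrow band around zero pass both.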

Constraint 1 seems reasonable--don't do more negative externalities than you accrue in benefits. Constraint 2 seems weird, because the cases it cuts off are the cases where S does more positive externalities than it loses in benefits. This is sort of an anti-first law, in that the agent will choose inaction or pursuing its duties instead of helping out others--but only *when it helps too much*!

A mail delivery robot might be willing to deliver one less piece of mail in order to prevent one blind pedestrian from walking in front of a truck, but *not* be willing to deliver one less piece of mail in order to prevent *two* blind pedestrians from walking in front of a truck, because that would have counterfactually caused it to not be made in the first place (and thus goes against its inborn moral sense).

[Edit] I suppose the underlying principle here might be "timidity"--the agent doesn't trust itself to get right any plan which has a larger impact than some threshold, and so has a tightly bounded utility function in some way. But this doesn't look like the right way to bound it. [/Edit]

(If we have defined all possible vs such that Δv≥0, then constraint 2 is *never* active, because we're only considering the right half of that graph.)

Suppose among the human population there lives one morally relevant person (or, if you prefer, 36 of them). The AI knows that it is very important that they not be disturbed--but not who they are.

Contrast this to the case where the AI thinks that all humans are morally relevant, with an importance of not disturbing a person that's about 1/N of the importance assigned in the previous case. What's the difference between the two cases? To first order, it looks like nothing; to second order, it looks like the first case might have some bizarreness about summing up disturbances across people that the second case won't have.

That is, I don't think we can just say "the agent is ignorant of v, so it does the right thing by default." That sounds like trying to extract useful work out of ignorance! The agent's prior over v--that is, what sort of externalities are worth preventing--will determine what prohibitions or reservations are baked into S, and it seems really strange to me to trust that the uncertainty will take care of it. If we don't have the right reference class to begin with, being uncertain will include lots of things from the wrong reference class, and S will make crazy tradeoffs. But if we have the right reference class, we might as well go with it.

This is reminding me of Jainism, actually--I had just been focusing on building a robot with ahimsa, but I think also trying to incorporate anekantavada would lead to a suggestion like this one.

I am trying to extract work from ignorance. The same way that I did with "resource gathering". An AI that is ignorant of its utility will try and gather power and resources, and preserve flexibility - that's a kind of behaviour you can get mainly from an ignorant AI.

Unlikely, because I'd generally design with equal chances of v and -v (or at least comparable chances).

We don't know that v is nice - in fact, it's likely nasty. With -v also being nasty. So we don't want either of them to be strongly maximised, in fact.

What happens here is that as Δu increases and S(u) uses up resources, the probability that Δv will remain bounded (in plus or minus) decreases strongly. So the best way of keeping Δv bounded is not to burn up much resources towards Δu.

I'm assuming we don't. And it's much easier to define a category V such that we are fairly confident that there is a good utility/reference class in V, than to pick it out. But reduced impact kind of behaviour might even help if we cannot define V. Even if we can't say exactly that some humans are morally valuable, killing a lot of humans is likely to be disruptive for a lot of utility functions (in a positive or negative direction), so we get reduced impact from that.

I think we have different intuitions about what it means to estimate Δv over an uncertain set / the constraints we're putting on v. I'm imagining integrating Δv dv, and so if there is any v whose negative is also in the set with the same probability, then the two will cancel out completely, neither of them affecting the end result.

It seems to me like the property you want comes from having non-negative vs, which might have opposite inputs. That is, instead of v_1 being "Bob's utility function" and v_2 being "Bob's utility function, with a minus sign in front," v_3 would be "positive changes to Bob's utility function that I caused" and v_4 would be "negative changes to Bob's utility function that I caused." If we assign equal weight to only v_1 and v_2, it looks like there is no change to Bob's utility function that will impact our decision-making, since when we integrate over our uncertainty the two balance out.

We've defined v_3 and v_4 to be non-negative, though. If we pull Bob's sweater to rescue him from the speeding truck, v_3 is positive (because we've saved Bob) and v_4 is positive (because we've damaged his sweater). So we'll look for plans that reduce both (which is most easily done by not intervening, and letting Bob be hit by the truck). If we want the agent to save Bob, we need to include that in u, and if we do so it'll try to save Bob in the way with minimal other effects.
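The contrast between the signed pair (v_1, v_2) and the non-negative pair (v_3, v_4) can be sketched numerically. The magnitudes below are invented, and `signed_expectation` and `split_penalty` are hypothetical helpers introduced only for this illustration.

```python
def signed_expectation(changes):
    """Equal weight on v_1 (Bob's utility) and v_2 = -v_1.

    The expected Δv is 0.5 * total + 0.5 * (-total), which cancels
    for any list of changes.
    """
    total = sum(changes)
    return 0.5 * total + 0.5 * (-total)


def split_penalty(changes):
    """v_3 + v_4: positive changes caused, plus magnitude of negative ones."""
    v3 = sum(c for c in changes if c > 0)
    v4 = sum(-c for c in changes if c < 0)
    return v3 + v4


# Hypothetical changes to Bob's utility caused by the agent.
rescue = [100.0, -1.0]  # Bob saved from the truck, sweater damaged
inaction = []           # the agent causes no change at all

print(signed_expectation(rescue), signed_expectation(inaction))
print(split_penalty(rescue), split_penalty(inaction))
```

Under the signed pair, rescue and inaction look identical (both score zero), so Bob's utility never affects the decision; under the non-negative pair, inaction has the lowest penalty, which matches the point that saving Bob would have to be written into u.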

Agreed that an AI that tries to maximize "profit" instead of "revenue" is the best place to look for a reduced impact AI (I also think that reduced impact AI is the best name for this concept, btw). I don't think I'm seeing yet how this plan is a good representation of "cost." It seems that in order to produce minimal activity, we need to put effort into balancing our weights on possible vs such that inaction looks better than action.

(I think this is easier to formulate in terms of effort spent than consequences wrought, but clearly we want to measure "inaction" in terms of consequences, not actions. It might be very low cost for the RIAI to send a text message to someone, but then that someone might do a lot of things that impact a lot of people and preferences, and we would rather if the RIAI just didn't send the message.)

It seems to me that any aggregation procedure over a category V is equivalent to a particular utility v*, and so the implausibility that a particular utility function v' is the right one to pick applies as strongly to v*. For this to not be the case, we need to know something nontrivial about our category V or our aggregation procedure. (I also think we can, given an aggregation procedure or a category, work back from v' to figure out at least one implied category or aggregation procedure given some benign assumptions.)

Do you disagree with my description of the "resource gathering agent": http://lesswrong.com/r/discussion/lw/luo/resource_gathering_and_precorriged_agents/

The point here is that M(u-v) might not know what v is, but M(εu+v) certainly does, and this is not the same as maximising an unknown utility function.

Ah, okay. I think I see better what you're getting at. My intuition is that there's a mapping to minimization of a reasonable aggregation of the set of non-negative utilities, but I think I should actually work through some examples before I make any long comments.

I don't think I had read that article until now, but no objections come to mind.

That would be useful to know, if you can find examples. Especially ones where all v and -v have the same probability (which is my current favourite requirement in this area).

Hi Stuart. I'm new here so excuse me if I happen to ask irrelevant or silly questions as I am not as in-depth into the subject as many of you, nor as smart. I found quite interesting the idea of leaving M(u-v) in ignorance of what u and v are. In such a framework though wouldn't "kill all humans" be considered an acceptable satisficer if u (whatever task we are interested in) is given a much larger utility than v (human lives)? Does it not all boil down to defining the correct trade-off between the utility of u and v so that M(εu+v) vetoes at the right moment?

I'm not sure what you mean. Could you give an example?

Say M(u-v) suggests killing all humans so that it can make more paperclips. u is the value of a paperclip and v is the value of a human life. M(εu+v) might accept it if εΔu > -Δv, so it seems to me at the end it all depends on the relative value we assign to paperclips and human lives, which seems to be the real problem.

That's one of the reasons the agents don't know u and v at this point.

Thanks for your reply, I had missed the fact that M(εu+v) is also ignorant of what u and v are. In this case is this a general structure of how a satisficer should work, but then when applying it in practice we would need to assign some values to u and v on a case by case basis, or at least to ε, so that M(εu+v) could veto? Or is it the case that M(εu+v) uses an arbitrarily small ε, in which case it is the same as imposing Δv>0?

I forgot an important part of the setup, which was that u is bounded, not too far away from the present value, which means εΔu > -Δv is unlikely for general v.
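One hedged way to spell out that step, assuming ε > 0 and writing u_max for the bound on u and u_0 for its present expected value (both symbols are introduced here for illustration and are not in the original setup):

$$
\Delta u \le u_{\max} - u_0
\quad\Longrightarrow\quad
\varepsilon\,\Delta u \le \varepsilon\,(u_{\max} - u_0),
$$

$$
\varepsilon\,\Delta u > -\Delta v
\quad\text{therefore requires}\quad
\Delta v > -\varepsilon\,(u_{\max} - u_0).
$$

So when u_max - u_0 is small and ε is small, M(εu+v) would only let through plans that cost it a correspondingly tiny amount of v, and a plan like "kill all humans" would be expected to shift most candidate v's by far more than that.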

Ah yep that'll do.