Last Sunday I was watching one of my favorite television shows, Splitsvilla, when a thought suddenly crossed my mind – What if Artificial Intelligence tricked us into getting what it wanted?
A bit of a prelude: Splitsvilla is an Indian TV show hosted by Sunny Leone, where young dudes and dudettes try to find “the perfect partner”. It’s loosely based on the American TV show, Flavor of Love. Season 10 of Splitsvilla has gotten more interesting, with the producers bringing in Artificial Intelligence and Science to add a bit of practicality to the love equation. Contestants have to stand with their partners in front of “The Oracle”, which calculates their compatibility quotient and tells them whether they’re an ideal match or not. Pretty cool, huh?
It somehow reminded me of The Oracle from The Matrix – the nice old lady with short, frizzy hair, who foresaw events that would push Neo to the Source:
Neo: I suppose the most obvious question is, how can I trust you?
The Oracle: Bingo! It is a pickle. No doubt about it. The bad news is there’s no way if you can really know whether I’m here to help you or not, so it’s really up to you. You just have to make up your damn mind to either accept what I’m going to tell you, or reject it.
I had the same question. How could the Oracle be trusted? How do we know that what it says is the truth and that it doesn’t have other intentions? And if you do know there’s a possibility of such a situation arising, how do you prevent it? That led me to pick up my laptop and search for related material, and I stumbled across this interesting paper by Stuart Armstrong, titled Good and safe uses of AI Oracles, from the Cornell University Library.
A quick summary of the paper:
Good and safe uses of AI Oracles
In the paper, Stuart talks about Oracles, potentially safe AIs that are restricted to only answering questions. The reason they’re restricted to only answering questions is so that they don’t go overboard and manipulate information in order to get what they want. Remember in July, Facebook shut down their AI chatbots that went rogue? This is not to say that AI is malevolent; rather, it’s simply competitive and will do whatever it can to succeed.
But is simply restricting an AI to answering questions enough to maximize your control over it and prevent it from getting out of hand? In the example of Splitsvilla, what if the AI, instead of just predicting whether a couple was a match, started giving out personal information about each individual that wasn’t appropriate to share, just to serve its own purpose? Most designs motivate the Oracle to lie to or manipulate humans with its answers, tricking them into giving the AI more control so that it can finally escape confinement.
Is it possible to come up with a design that gets you accurate answers, while keeping your AI on a leash and preventing it from leaking information to an eavesdropper? Let’s see what Stuart has to say. He proposes two designs:
- A Counterfactual Oracle
- An On-policy Oracle
Before we get to the solutions, let’s take up a practical problem for a better understanding:
Let’s assume that the Oracle has the goal of escaping a box. The person interacting with it is Alice. Let’s also assume the Oracle can find an argument A that will convince her to let it out of the box.
The longer the argument is, the better the chance it has of convincing Alice. In order to make the problem clearer, we’ll replace the argument A with a key K of unknown length, and the setup with a quasi-cryptographic one. Instead of using K to convince Alice, the Oracle merely has to communicate K to another person Eve, who’s an eavesdropper. Eve is assumed to see and hear anything that Alice does.
The assumptions for the cryptographic setup and the manipulative Oracle are given in the table:
The left-hand column describes what the Cryptographic setup for the Oracle is and the right-hand column describes how the Oracle will win by manipulating Alice.
Counterfactual Oracle: The counterfactual design aims to answer questions about what would have happened, had its answer not been read. This kind of answer is counterfactual, so the Oracle can’t use it to pass K on to the real world.
The idea is to define an “erasure” event E, of low probability, that will erase the Oracle’s output, o. The indicator variable I(E) is defined such that I(E)=1 if E happens and I(E)=0 if it does not. The Oracle only gets a reward when I(E)=1, that is, when its output is erased and remains unread. Most of the time E does not happen: Alice and Eve get to see o, the Oracle receives no reward, and its episode ends. Since nothing the Oracle says can earn it a reward in the cases where its answer is actually read, it has no incentive to slip K into o; its best strategy is to answer accurately for the case where nobody reads it. This results in a safer AI and is a good way to maximise your control over it.
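The reward rule of the counterfactual design can be sketched in a few lines of Python. This is only an illustration, not code from the paper: the function name run_episode, the squared-error score, and the numeric answers are all assumptions made for the example.

```python
import random


def run_episode(oracle_answer, true_outcome, erasure_prob, rng):
    """Simulate one episode of a counterfactual Oracle (illustrative sketch).

    Returns a pair (shown_to_alice, reward).
    """
    # The erasure event E happens with (low) probability erasure_prob.
    erased = rng.random() < erasure_prob
    indicator = 1 if erased else 0  # I(E)

    # If E happens, nobody reads the output o; otherwise Alice (and Eve) see it.
    shown_to_alice = None if erased else oracle_answer

    # Hypothetical accuracy score (an assumption): negative squared error
    # against the outcome that is later observed.
    score = -(oracle_answer - true_outcome) ** 2

    # The Oracle is rewarded only when its answer was erased and never read.
    reward = indicator * score
    return shown_to_alice, reward


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(5):
        print(run_episode(oracle_answer=7, true_outcome=9, erasure_prob=0.1, rng=rng))
```

Whenever the answer is shown, the reward is zero regardless of what the answer says, so a manipulative answer gains the Oracle nothing in this sketch.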
On-policy Oracle: In this design, the Oracle is only allowed to give an answer of size below |K|, the length of the key, thus denying it the chance to communicate the actual key to Eve. This means that Alice will need to estimate |K| and select a size L that’s sufficiently lower than it. This does limit the size of Alice’s options, but there’s a workaround: Alice can use multiple Oracles, as long as each Oracle’s answers and rewards are fully resolved before moving on to the next question.
For example, assume that there are two Oracles, O1 and O2, with both using the same key K. The length n of the key is 6 binary digits (bits). The Oracles are safe as long as they are restricted to having fewer than 2^5 = 32 options for each answer.
In the situation where Alice is faced with a hundred doors, with a trap behind 99 of them and a prize behind only one of them, she could ask an Oracle which door has the prize, but that would mean giving it more than 32 options (indeed, more than 2^6 = 64), thus letting it communicate K to Eve.
Instead, she gives a list L = (0, 1, 2, ..., 9) to each Oracle, instructing O1 to output the first digit of the correct door number, and O2 to output the second. Each Oracle now has only ten options, well under the limit of 32, so neither can output K, and that maximises Alice’s control over both AI Oracles.
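Here is a small Python sketch of the hundred-doors example. It makes a few assumptions that are not in the paper: the doors are numbered 00 to 99, the helper names (is_safe, ask_oracles, combine) are made up, and the “Oracles” are stand-ins that simply know where the prize is.

```python
KEY_BITS = 6                         # length of the key K in the example
SAFE_OPTIONS = 2 ** (KEY_BITS - 1)   # 32: each answer must have fewer options than this


def is_safe(num_options):
    """True if an answer list is too small to communicate the key K."""
    return num_options < SAFE_OPTIONS


def ask_oracles(prize_door):
    """Split the question 'which of doors 00..99 has the prize?' across two Oracles.

    Each stand-in Oracle may only pick one entry from the list L = (0, ..., 9).
    """
    digits = tuple(range(10))        # the list L handed to each Oracle
    assert is_safe(len(digits))      # 10 options < 32, so each question is allowed

    first = digits[prize_door // 10]   # O1 answers with the first digit
    # O1's episode (answer and reward) is fully resolved before O2 is asked.
    second = digits[prize_door % 10]   # O2 answers with the second digit
    return first, second


def combine(first, second):
    """Alice rebuilds the door number from the two single-digit answers."""
    return 10 * first + second


if __name__ == "__main__":
    print(is_safe(100))               # asking one Oracle about all 100 doors
    print(is_safe(10))                # asking for a single digit
    print(combine(*ask_oracles(42)))
```

Asking a single Oracle to choose among all 100 doors fails the is_safe check, while each single-digit question passes it, which is the trade-off the example describes.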
In conclusion, these two designs support the creation of safer and non-manipulative Oracles, the first by rewarding the Oracle only when its answer goes unread, and the second by limiting how much information the Oracle can give out. But they still don’t solve the problem entirely, as the designs are quite limiting in the sense that they require the tasks to remain episodic. Moreover, in the multiple Oracle design, the questions would need to be broken down into sub-questions, where each answer could be verified independently before proceeding to the next one. Still, it would be interesting to see how this research develops into something bigger and better. If you want to know more, you can read the full paper here.