AI Scientists: Safe and Useful AI?

There have recently been lots of discussions about the risks of AI, whether in the short term with existing methods or in the longer term with advances we can anticipate. I have been very vocal about the importance of accelerating regulation, both nationally and internationally, which I think could help us mitigate issues of discrimination, bias, fake news, disinformation, etc. Other anticipated negative outcomes, like shocks to job markets, require changes in the social safety net and education system. The use of AI in the military, especially with lethal autonomous weapons, has been a big concern for many years and clearly requires international coordination.

In this post, however, I would like to share my thoughts regarding the more hotly debated question of long-term risks associated with AI systems which do not yet exist, where one imagines the possibility of AI systems behaving in a way that is dangerously misaligned with human rights, or even a loss of control of AI systems that could become threats to humanity. A key argument is that as soon as AI systems can plan and act according to given goals, these goals could be malicious in the wrong hands, or could include or indirectly yield the goal of self-preservation. If an AI has self-preservation as its primary objective, like almost every living entity, its interests may clash with ours. For example, it may consider that some humans would like to turn it off and try to avoid that, for instance by turning us off or finding a way to control us, neither of which is good for humanity.

Main thesis: safe AI Scientists

The bottom line of the thesis presented here is that there may be a path to build immensely useful AI systems that completely avoid the issue of AI alignment. I call these systems AI Scientists because they are modeled after ideal non-experimental scientists and do not act autonomously in the real world, only focusing on theory generation. The argument is that if the AI system can provide us benefits without having to autonomously act in the world, we do not need to solve the AI alignment problem to achieve those benefits.

This would suggest a policy banning powerful autonomous AI systems that can act in the world (“executives” or “experimentalists” rather than “pure scientists”) unless proven safe. Another option, discussed below, is to use the AI Scientist to make other AI systems safe, by predicting the probability of harm that could result from an action. However, such solutions would still leave open the political problem of coordinating people, organizations and countries to stick to such guidelines for safe and useful AI. The good news is that current efforts to introduce AI regulation (such as the proposed bills in Canada and the EU, but see action in the US as well) are steps in the right direction.

The challenge of value alignment

Let us first recap the objective of AI alignment and the issue with goals and subgoals. Humanity is already facing alignment problems: how do we make sure that people and organizations (such as governments and corporations) act in a way that is aligned with a set of norms acting as a proxy for the hard-to-define general well-being of humanity? Greedy individuals and ordinary corporations may have self-interests (like profit maximization) that can clash with our collective interests (like preserving a clean and safe environment and good health for everyone).

Politics, laws, regulations and international agreements all imperfectly attempt to deal with this alignment problem. The widespread adoption of norms which support collective interests is enforced by design in democracies, to an extent, including limitations on the concentration of power by any individual person or corporation, thus preventing the self-interest of an individual from yielding major collective harm. It is further aided by our evolved tendency for empathy and our tendency to adopt prevailing norms voluntarily if we recognize their general value or to gain social approval, even if they go against our own individual interest.

However, machines are not subject to these human constraints and innate programming by default. What if an artificial agent had the cognitive abilities sufficient to achieve major harm under some goals but lacked the innate and social barriers that limit the harm humans can generate? What if a human or a self-preservation interest would make this AI have malicious goals? Can we build AIs that could not have such goals nor the agency to achieve them?

The challenge of AI alignment and instrumental goals

One of the oldest and most influential imagined constructions in this sense is Asimov’s set of Laws of Robotics, which require that a robot should not harm a human or humanity (and the stories are all about the laws going wrong). Modern reinforcement learning (RL) methods make it possible to teach an AI system through feedback to avoid behaving in nefarious ways, but it is difficult to forecast how such complex learned systems would behave in new situations, as we have seen with large language models (LLMs) like ChatGPT.

We can also train RL agents that act according to given goals. We can use natural language (with modern LLMs) to state those goals, but there is no guarantee that they understand those goals the way we do. In order to achieve a given goal (e.g., “cure cancer”), such agents may make up subgoals (“disrupt the molecular pathway exploited by cancer cells to evade the immune system”) and the field of hierarchical RL is all about how to discover subgoal hierarchies. It may be difficult to foresee what these subgoals will be in the future, and in fact we can expect emerging subgoals to avoid being turned off (and using deception for that purpose).

It is thus difficult to guarantee that such AI agents won’t pick subgoals that are misaligned with human objectives. This is also called the instrumental goal problem and I strongly recommend reading Stuart Russell’s book on the general topic of controlling AI systems: Human Compatible. Russell also suggests a potential solution which would require the AI system to estimate its uncertainty about human preferences and act conservatively as a result (i.e. avoid acting in a way that might harm a human). In addition, recent work shows that with enough computational power and intellect, an AI trained by RL would eventually find a way to hack its own reward signals (e.g., by hacking the computers through which rewards are provided). Such an AI would not care anymore about human feedback and would in fact try to prevent humans from undoing this reward hacking. Another more immediate problem is that we do not know how to program and train an AI such that it cannot then be used by humans with nefarious goals to yield harm, e.g., generating disinformation or instructing the humans how to make bioweapons or cyberattacks. Research on AI alignment should be intensified but what I am proposing here is a solution that avoids these issues altogether, while limiting the type of AI we would design to ones that just propose scientific theories but do not act in the world and have no goals. The same approach may also provide us quantitative safety guarantees if we really need to have an AI that acts in the world.

Training Large Neural Nets for Bayesian Inference

I would like to first outline an approach to building safe and useful AI systems that would completely avoid the issue of setting goals and the concern of AI systems acting in the world (which could be in an unanticipated and nefarious way).

The model for this solution is the idealized scientist, focused on building an understanding of what is observed (also known as data, in machine learning) and of theories that explain those observations. Keep in mind that for almost any set of observations, there will remain some uncertainty about the theories that explain them, which is why an ideal scientist can entertain many possible theories that are compatible with the data.

A mathematically clean and rational way to handle that uncertainty is called Bayesian inference. It involves in principle listing all the possible theories and their posterior probabilities (which can be calculated in principle in a straightforward way, given the data). Below, we conceptually think of just keeping the theories that have a significant probability under the posterior, i.e., those that are compatible with the data and are simpler to express.

The Bayesian posterior automatically puts more weight on the simpler theories that explain the data well (known as Occam’s razor). Although this rational decision-making principle has been known for a long time, the exact calculations are intractable. However, the advent of large neural networks that can be trained on a huge number of examples actually opens the door to obtaining very good approximations of these Bayesian calculations. See [1,2,3,4] for recent examples going in that direction. These theories can be causal, which means that they can generalize to new settings more easily, taking advantage of natural or human-made changes in distribution (known as experiments or interventions). These large neural networks do not need to explicitly list all the possible theories: it suffices that they represent them implicitly through a trained generative model that can sample one theory at a time.
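As a toy illustration of the posterior and of the preference for simpler theories described above, here is a minimal Python sketch over a tiny, explicitly listed set of theories. Everything in it is an assumption made for illustration: the theories are hypothetical coin-bias hypotheses, the names `posterior_over_theories` and `THEORIES` are invented here, and the prior of 2 to the power of minus the description length in bits is just one simple way to encode "simpler to express". It is not the large-neural-network approximation discussed in this post, which is precisely meant to avoid listing theories one by one.

```python
import math

def posterior_over_theories(theories, data):
    """Return {name: P(theory | data)} for a small, explicit set of theories.

    theories: list of (name, description_bits, heads_probability)
    data: list of 0/1 coin-flip outcomes
    """
    log_scores = {}
    for name, bits, p_heads in theories:
        # Assumed Occam-style prior: proportional to 2 ** -description_bits.
        log_prior = -bits * math.log(2)
        # Log-likelihood of the observed flips under this theory.
        log_lik = sum(math.log(p_heads if x == 1 else 1.0 - p_heads) for x in data)
        log_scores[name] = log_prior + log_lik
    # Normalize with log-sum-exp for numerical stability.
    m = max(log_scores.values())
    total = sum(math.exp(s - m) for s in log_scores.values())
    return {name: math.exp(s - m) / total for name, s in log_scores.items()}

# Hypothetical theories: the "fair" coin is assumed cheaper to describe.
THEORIES = [("fair", 1, 0.5), ("biased_0.7", 4, 0.7), ("biased_0.9", 4, 0.9)]

if __name__ == "__main__":
    print(posterior_over_theories(THEORIES, [1, 1, 0, 1]))
```

With no data, the simplest theory dominates because of the prior alone; after a long run of heads, the more complex but better-fitting theory takes over. Several theories can keep non-negligible probability in between, which is the uncertainty an ideal scientist would retain.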

Bayesian calculations also mandate how (in principle) to answer any question in a probabilistic way (called the Bayesian posterior predictive) by averaging the probabilistic answers to that question from all these theories, each weighted by the theory’s posterior probability.
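In symbols, the averaging described above can be written as follows. The first line is the standard posterior predictive, under the assumption that a theory, once fixed, carries everything needed from the data to answer the question. The second line is a Monte Carlo approximation using K theories sampled from the posterior; K and the sampling scheme are assumptions about how one might estimate the sum in practice, not something prescribed here.

```latex
\begin{align}
P(\text{answer} \mid \text{question}, \text{data})
  &= \sum_{\text{theory}} P(\text{answer} \mid \text{question}, \text{theory})\,
     P(\text{theory} \mid \text{data}) \\
  &\approx \frac{1}{K} \sum_{k=1}^{K} P(\text{answer} \mid \text{question}, \text{theory}_k),
  \qquad \text{theory}_k \sim P(\text{theory} \mid \text{data})
\end{align}
```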

See also my recent blog post on model-based machine learning, which points in the same direction. Such neural networks can be trained both to approximate a Bayesian posterior distribution over theories and to approximate answers to questions (also known as probabilistic inference or the Bayesian posterior predictive).

What is interesting is that as we make those networks larger and train them for longer, we are guaranteed that they will converge toward the Bayesian optimal answers. There are still open questions regarding how to design and train these large neural networks in the most efficient way, possibly taking inspiration from how human brains reason, imagine and plan at the system 2 level, a topic that has driven much of my research in recent years. However, the path forward is fairly clear and may both eliminate the issues of hallucination and difficulty in multi-step reasoning with current large language models as well as provide a safe and useful AI as I argue below.

AI Scientists vs AI Agents

Let us give a name to the two Bayesian inference problems described above. We will call AI Scientist the neural network that generates theories according to a distribution that approximates the true Bayesian posterior over theories, P(theory | data). Note that a theory will generally include setting some explanations (known as latent variables in probabilistic machine learning) for each of the observations in the dataset. Another neural network can be trained using solely the AI Scientist as a teacher in order to learn to answer questions given some context. We will call this neural network the AI Agent because the answers to these questions can be used to act in the world and plan to achieve goals, for example if the question is “how do I achieve <some goal>?”. The AI Agent estimates the Bayesian posterior predictive, P(answer | question, data). The AI Scientist encapsulates a Bayesian world model, which could include an understanding of things like harm as interpreted by any particular human, as well as social norms and laws of a particular society. The AI Agent can be used as an oracle or it can be used as a goal-conditioned agent to direct actions in the world, if the “question” includes not only a goal but also sensory measurements that should condition the next action in order to achieve the goal.

The safest kind of AI is the AI Scientist. It has no goal and it does not plan. It may have theories about why agents in the world act in particular ways, including both a notion of their intentions and of how the world works, but it does not have the machinery to directly answer questions like the AI Agent does. One way to think of the AI Scientist is like a human scientist in the domain of pure physics, who never does any experiment. Such an AI reads a lot: in particular, it knows about all the scientific literature and any other kind of observational data, including data about the experiments performed by humans in the world. From this, it deduces potential theories that are consistent with all these observations and experimental results. The theories it generates could be broken down into digestible pieces comparable to scientific papers, and we may be able to constrain it to express its theories in a human-understandable language (which includes natural language, scientific jargon, mathematics and programming languages). Such papers could be extremely useful if they allow us to push the boundaries of scientific knowledge, especially in directions that matter to us, like healthcare, climate change or the UN SDGs.

Quantitative Safety Guarantees

Unlike methods to build bridges, drugs or nuclear plants, current approaches to train Frontier AI systems – the most capable AI systems currently in existence – do not allow us to obtain quantitative safety guarantees of any kind. As AIs become more capable, and thus more dangerous in the wrong hands or if we lose control of them, it would be much safer for society and humanity if we could avoid building a very dangerous AI. Current methods of evaluating safety are not very satisfying because they only perform spot checks: they try a finite number of questions asked to the AI and check if the answers could yield harm. There are two problems here. First, what about other contexts and questions for which the AI has not been tested? Second, how do we evaluate that the answer of the AI could yield harm? For the latter question, we can ask humans, but that severely limits the number of questions we can ask. For the first question, we would ideally check if an answer could yield harm before the AI output is actually executed. This would avoid the spot check problem because in the given context and for the given question, one could check whether the proposed action could yield harmful outcomes. But that cannot work practically if that check has to be done by a human, so we need to automate that process. How?

If we had a very capable AI, we could think that it would be able to anticipate the potential harm of executing a particular action (output). However, that would not be safe for the following reason. In general, given any dataset, even an infinite-size one, there are many causal theories that will be compatible with that dataset (unless that dataset also contains the results of an infinite number of experiments on all the possible causal variables, which is impossible, e.g., we cannot move the sun around ourselves). Only one of these theories is correct, and different theories could provide very different answers to any particular question. The way we are currently training Frontier AI systems combines maximum likelihood and reinforcement learning objectives, and the resulting neural networks could implicitly rely on a single theory among those that are compatible with the data, hence they are not safe. What is needed for making safe decisions is epistemic humility: the AI must know the limits of its own knowledge, so that in case of doubt it avoids actions that could yield major harm according to some of the theories from the Bayesian posterior over theories. If we were able to estimate the Bayesian posterior predictive that answers questions about major harm that could follow any given action in any given context, we could use it to reject actions that could potentially be harmful according to the posterior, e.g., if the probability of major harm is above a tiny but human-chosen threshold. That threshold would give us a quantitative probabilistic guarantee that no major harm could occur following that particular action.
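The rejection rule described above can be sketched in a few lines of Python. This is only a toy under stated assumptions: the posterior is represented by a short list of hypothetical theories with explicit weights, each theory is a plain function returning its own probability of major harm, and the names `harm_probability`, `is_action_allowed`, `TOY_THEORIES` and the default threshold are all invented for illustration. A real system would have to estimate this quantity with a trained model rather than enumerate theories.

```python
from typing import Callable, Sequence, Tuple

# A "theory" here is a hypothetical stand-in: a posterior weight plus a function
# giving that theory's probability of major harm for an (action, context) pair.
Theory = Tuple[float, Callable[[str, str], float]]

def harm_probability(theories: Sequence[Theory], action: str, context: str) -> float:
    """Posterior-weighted average of each theory's harm probability."""
    total_weight = sum(weight for weight, _ in theories)
    weighted = sum(weight * p_harm(action, context) for weight, p_harm in theories)
    return weighted / total_weight

def is_action_allowed(theories: Sequence[Theory], action: str, context: str,
                      threshold: float = 1e-4) -> bool:
    """Reject the action when the estimated harm probability exceeds the threshold."""
    return harm_probability(theories, action, context) <= threshold

# Toy posterior: a dominant theory that sees no risk and a minority theory that does.
def optimistic(action: str, context: str) -> float:
    return 0.0

def cautious(action: str, context: str) -> float:
    return 0.5 if action == "release_compound" else 0.0

TOY_THEORIES = [(0.99, optimistic), (0.01, cautious)]

if __name__ == "__main__":
    for act in ("release_compound", "publish_report"):
        print(act, is_action_allowed(TOY_THEORIES, act, "lab"))
```

In this toy setup, a theory holding only 1% of the posterior weight is enough to block the risky action, because its predicted harm pushes the averaged probability above the threshold. A system that implicitly committed to the single most probable theory would have allowed it.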

The political challenge

However, the mere existence of a set of guidelines to build safe and useful AI systems would not prevent ill-intentioned or unwitting humans from building unsafe ones, especially if such AI systems could bring these people and their organizations additional advantages (e.g. on the battlefield, or to gain market share) or if these people wanted to see humanity replaced by superhuman AIs (and some people indeed harbor that wish).

That challenge seems primarily political and legal and would require a robust regulatory framework that is instantiated nationally and internationally. We have experience of international agreements in areas like nuclear power or human cloning that can serve as examples, although we may face new challenges due to the nature of digital technologies.

It would probably require a level of coordination beyond what we are used to in current international politics and I wonder if our current world order is well suited for that. What is reassuring is that the need for protecting ourselves from the shorter-term risks of AI should bring a governance framework that is a good first step towards protecting us from the long-term risks of loss of control of AI.

Increasing the general awareness of AI risks, forcing more transparency and documentation, requiring organizations to do their best to assess and avoid potential risks before deploying AI systems, introducing independent watchdogs to monitor new AI developments, etc., would all contribute not just to mitigating short-term risks but also to helping with longer-term ones.

[1] Tristan Deleu, António Góis, Chris Emezue, Mansi Rankawat, Simon Lacoste-Julien, Stefan Bauer, Yoshua Bengio, “Bayesian Structure Learning with Generative Flow Networks”, UAI 2022, arXiv:2202.13903, February 2022.

[2] Nan Rosemary Ke, Silvia Chiappa, Jane Wang, Anirudh Goyal, Jorg Bornschein, Melanie Rey, Theophane Weber, Matthew Botvinick, Michael Mozer, Danilo Jimenez Rezende, “Learning to Induce Causal Structure”, ICLR 2023, arXiv:2204.04875, April 2022.

[3] Noah Hollmann, Samuel Müller, Katharina Eggensperger, Frank Hutter, “TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second”, ICLR 2023, arXiv:2207.01848, July 2022.

[4] Edward Hu, Nikolay Malkin, Moksh Jain, Katie Everett, Alexandros Graikos, Yoshua Bengio, “GFlowNet-EM for learning compositional latent variable models”, arXiv:2302.06576.