What can’t we afford with a future superintelligent AI? Among other things, confidently wrong predictions about the harm that some actions could yield. Especially catastrophic harm. Especially if these actions could spell the end of humanity.

How can we design an AI that will be highly capable and will not harm humans? In my opinion, we need to figure out this question – of controlling AI so that it behaves in really safe ways – before we reach human-level AI, aka AGI; and to be successful, we need all hands on deck. Economic and military pressures to accelerate advances in AI capabilities will continue to push forward even if we have not figured out how to make superintelligent AI safe. And even if some regulations and treaties are put into place to reduce the risks, it is plausible that human greed for power and wealth and the forces propelling competition between humans, corporations and countries, will continue to speed up dangerous technological advances.

Right now, science has no clear answer to this question of AI *control* and how to *align* its intentions and behavior with democratically chosen values. It is a bit like in the “Don’t Look Up” movie. Some scientists have arguments about the plausibility of scenarios (e.g., see “Human Compatible”) where a planet-killing asteroid is headed straight towards us and may come close to the atmosphere. In the case of AI there is more uncertainty, both about the probability of different scenarios (including about future public policies) and about the timeline, which could be years or decades according to leading AI researchers. And there are no convincing scientific arguments which contradict these scenarios and reassure us for certain, nor is there any known method to “deflect the asteroid”, i.e., avoid catastrophic outcomes from future powerful AI systems. With the survival of humanity at stake, we should invest massively in this scientific problem, to understand this asteroid and discover ways to deflect it. Given the stakes, our responsibility to humanity, our children and grandchildren, and the enormity of the scientific problem, I believe this to be the most pressing challenge in computer science, one that will dictate our collective wellbeing as a species. Solving it could of course help us greatly with many other challenges, including disease, poverty and climate change, because AI clearly has beneficial uses. In addition to this scientific problem, there is also a political problem that needs attention: how do we make sure that no one triggers a catastrophe or takes over political power when AGI becomes widely available or even as we approach it? See this article of mine in the Journal of Democracy on this topic.

In this blog post, I will focus on an approach to the scientific challenge of AI control and alignment. Given the stakes, I find it particularly important to focus on approaches which give us the strongest possible AI safety guarantees. Over the last year, I have been thinking about this and I started writing about it in this May 2023 blog post (also see my December 2023 Alignment Workshop keynote presentation). Here, I will spell out some key thoughts that came out of a maturation of my reflection on this topic and that are driving my current main research focus. I have received funding to explore this research program and **I am looking for researchers motivated by existential risk and with expertise in the span of mathematics (especially about probabilistic methods), machine learning (especially about amortized inference and transformer architectures) and software engineering (especially for training methods for large scale neural networks).**

I will take as a starting point of this research program the following question: if we had enough computational power, could it help us design a provably safe AGI? I will briefly discuss below a promising path to approximate this ideal, with the crucial aim that as we increase computational resources or the efficiency of our algorithms, we obtain greater assurances about safety.

First, let me justify the Bayesian stance, or any other that accounts for the uncertainty about the **explanatory hypotheses** for the data and experiences available to the AI. Note that this epistemically humble posture of admitting any explanatory hypothesis that is not contradicted by the data is really at the heart of the scientific method and ethics, and motivated my previous post on the “Scientist AI”. Maximum likelihood and RL methods can zoom in on *one* such explanatory hypothesis (e.g., in the form of a neural network and its weights that fit the data or maximize rewards well) when in fact the theory of causality tells us that even with infinite observational data (not covering all possible interventions), there can exist multiple causal models that are compatible with the data, leading to ambiguity about which is the true one. Each causal model has a causal graph specifying which variable is a direct cause of which other variable, and the set of causal graphs compatible with a distribution is called the Markov equivalence class. Maximum likelihood and RL are likely to implicitly pick one explanatory hypothesis H and ignore most of the other plausible hypotheses (because nothing in their training objective demands otherwise). “*Implicitly*”, because for most learning methods, including neural networks, we do not know how to have an explicit and interpretable access to the innards of H. If there are many explanatory hypotheses for the data (e.g., different neural networks that would fit the data equally well), it is likely that the H picked up by maximum likelihood or RL will not be the correct one or a mixture containing the correct one, because any plausible H or mixture of them (and there could be exponentially many) would maximize the likelihood or reward.
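
To make the ambiguity concrete, consider the smallest possible case: two variables X and Y with no hidden confounders. The equations below are a standard textbook illustration rather than anything specific to the approach proposed here; the graph names G_1 and G_2 are introduced only for this example.

```latex
% Two causal graphs, G_1 : X \to Y and G_2 : Y \to X, fit the same observational joint:
P(x, y) \;=\; P(x)\,P(y \mid x) \;=\; P(y)\,P(x \mid y)

% They disagree as soon as we intervene on X:
G_1:\quad P(y \mid \mathrm{do}(X = x)) = P(y \mid x)
\qquad
G_2:\quad P(y \mid \mathrm{do}(X = x)) = P(y)
```

Observational data alone cannot tell G_1 from G_2, yet a learner that silently commits to one of them will make confident predictions about interventions that are wrong whenever the other graph is the true one.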

Why is that a problem, if we have a neural net that fits the data well? *Not taking into account the existence of other H’s would make our neural network sometimes confidently wrong*, and it could be about something very important for our survival. Serious out-of-distribution failures are well documented in machine learning, but for now do not involve decisions affecting the fate of humanity. To avoid catastrophic errors, now consider a risk management approach, with an AI that represents not a single H but a large set of them, in the form of a generative distribution over hypotheses H. Hypotheses could be represented as computer programs (which we know can represent any computable function). By not constraining the size and form of these hypotheses, we are confident that a correct explanation, at least one conceivable by a human, is included in that set. However, we may wish to assign more probability to simpler hypotheses (as per Occam’s Razor). Before seeing any data, the AI can therefore weigh these hypotheses by their description length L in some language to prefer shorter ones, and form a corresponding Bayesian prior P(H) (e.g. proportional to 2^{-L}). This would include a “correct” hypothesis H*, or at least the best hypothesis that a human could conceive by combining pieces of theories that humans have expressed and that are consistent with data D. After seeing D, only a tiny fraction of these hypotheses would remain compatible with the data, and I will call them plausible hypotheses. The Bayesian posterior P(H | D) quantifies this: P(H | D) is proportional to the prior P(H) times how well H explains D, i.e., the likelihood P(D | H). The process of scientific discovery involves coming up with such hypotheses H that are compatible with the data, and learning P(H | D) would be like training an AI to be a good scientist that spits out scientific papers that provide novel explanations for observed data, i.e., plausible hypotheses. 
Note that the correct hypothesis, H*, by definition must be among the plausible ones, since it is the best possible account of the data, and under the Occam’s Razor assumption we can take it to have a reasonable and finite description length. We will also assume that the data used to train our estimated posterior is genuine and not consistently erroneous (otherwise, the posterior could point to completely wrong conclusions).
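
As a toy illustration of the prior and posterior just described, here is a minimal Python sketch over a tiny, finite hypothesis set. Everything in it is an assumption made for the example: the three coin-flip hypotheses, their description lengths in bits, and the function names. The real setting involves an unbounded space of programs, where this exact enumeration is not possible.

```python
def bernoulli_likelihood(p):
    """Return a function computing P(D | H) for i.i.d. coin flips with P(heads) = p."""
    def lik(data):
        out = 1.0
        for x in data:
            out *= p if x == 1 else (1.0 - p)
        return out
    return lik


def posterior(hypotheses, data):
    """Exact posterior P(H | D) for a finite set of hypotheses.

    hypotheses maps a name to (description_length_in_bits, likelihood_fn).
    """
    # Unnormalized weight: prior 2^{-L} times likelihood P(D | H).
    weights = {
        name: 2.0 ** (-length) * lik(data)
        for name, (length, lik) in hypotheses.items()
    }
    total = sum(weights.values())
    if total == 0.0:
        raise ValueError("no hypothesis is compatible with the data")
    return {name: w / total for name, w in weights.items()}


# Hypothetical hypotheses with made-up description lengths (shorter = simpler).
HYPOTHESES = {
    "fair": (2, bernoulli_likelihood(0.5)),
    "always_heads": (3, bernoulli_likelihood(1.0)),
    "biased_heads": (5, bernoulli_likelihood(0.9)),
}

data = [1, 1, 1, 0, 1, 1, 1, 1]  # 1 = heads, 0 = tails
post = posterior(HYPOTHESES, data)
for name, p in post.items():
    print(f"{name}: {p:.3f}")
```

The single tails observation drives the posterior of `always_heads` to exactly zero: it is no longer a plausible hypothesis. The longer `biased_heads` hypothesis ends up ahead of the simpler `fair` one because its better fit outweighs its smaller prior.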

There is a particularly important set of difficult-to-define concepts for a safe AI, which characterize what I call **harm** below. I do not think that we should ask humans to label examples of harm because it would be too easy to overfit such data. Instead we should use the Bayesian inference capabilities of the AI to entertain all the plausible interpretations of harm given the totality of human culture available in D, maybe after having clarified the kind of harm we care about in natural language, for example as defined by a democratic process or documents like the beautiful UN Universal Declaration of Human Rights.

*If an AI somehow (implicitly, in practice) kept track of all the plausible H’s, i.e., those with high probability under P(H | D), then there would be a perfectly safe way to act: if any of the plausible hypotheses predicted that some action caused a major harm (like the death of humans), then the AI should not choose that action. Indeed, if the correct hypothesis H* predicts harm, it means that some plausible H predicts harm. Showing that no such H exists therefore rules out the possibility that this action yields harm, and the AI can safely execute it.*

Based on this observation we can decompose our task into two parts: first, *characterize the set of plausible hypotheses* – this is the Bayesian posterior P(H | D); second, given a context *c* and a proposed action *a*, *consider plausible hypotheses which predict harm*. This amounts to looking for an H for which P(H, harm | *a*, *c*, D) > threshold. If we find such an H, we know that this action should be rejected because it is unsafe. **If we don’t find such a hypothesis then we can act and feel assured that harm is very unlikely, with a confidence level that depends on our threshold and the goodness of our approximation.**
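
The decision rule can be sketched in a few lines of Python for the case where the posterior is small enough to enumerate. This is only a sketch under stated assumptions: the joint is factored as P(H | D) × P(harm | H, *a*, *c*), which assumes that the choice of action and context carries no information about H beyond D, and the names `worst_case_harm`, `is_action_allowed`, `POSTERIOR` and `harm_prob` are hypothetical.

```python
def worst_case_harm(posterior, harm_prob, action, context):
    """Return the hypothesis maximizing P(H | D) * P(harm | H, a, c), with that value."""
    best_h, best_joint = None, 0.0
    for h, p_h in posterior.items():
        joint = p_h * harm_prob(h, action, context)
        if joint > best_joint:
            best_h, best_joint = h, joint
    return best_h, best_joint


def is_action_allowed(posterior, harm_prob, action, context, threshold):
    """Reject the action if any hypothesis pushes the joint above the threshold."""
    _, joint = worst_case_harm(posterior, harm_prob, action, context)
    return joint <= threshold


# Hypothetical posterior over two world models.
POSTERIOR = {"benign": 0.95, "fragile": 0.05}


def harm_prob(h, action, context):
    """Made-up P(harm | H, a, c): only the 'fragile' world is hurt by deploying."""
    if h == "fragile" and action == "deploy":
        return 0.8
    return 0.0


print(is_action_allowed(POSTERIOR, harm_prob, "deploy", "lab", threshold=0.01))  # False
print(is_action_allowed(POSTERIOR, harm_prob, "wait", "lab", threshold=0.01))    # True
```

Note that `deploy` is rejected even though the hypothesis predicting harm has only 5% posterior probability: a maximum-likelihood learner that kept only `benign` would have allowed it.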

Note that with more data, the set of hypotheses compatible with the data (those that have a high probability under P(H | D)), will tend to shrink – exponentially, in general. However, with the space of hypotheses being infinite in the first place, we may still end up with a computationally intractable problem. The research I am proposing regards how we could approximate this tractably. We could leverage the existing and future advances in machine learning (ML) based on the work of the last few decades, in particular *our ability to train very large neural networks to minimize a training objective*. The objective is that safety guarantees will converge to an exact upper bound on risk as the amount of available compute and the efficiency of our learning methods increase.

The path I am suggesting is based on *learned amortized inference*, in which we train a neural network to estimate the required conditional probabilities. Our state-of-the-art large language models (LLMs) can learn very complex conditional distributions and can be used to sample from them. What is appealing here is that we can arbitrarily improve the approximation of the desired distributions by making the neural net larger and training it for longer, without necessarily increasing the amount of observed data. In principle, we could also do this with non-ML methods, such as MCMC methods. The advantage of using ML is that it may allow us to be a lot more efficient by exploiting regularities that exist in the task to be learned, by generalizing across the exponential number of hypotheses we could consider. We already see this at play with the impressive abilities of LLMs although I believe that their training objective is not appropriate because it gives rise to confidently wrong answers. This constitutes a major danger for humans when the answers are about what it is that many humans would consider unacceptable behavior.

We can reduce the above technical question to (1) how to learn to approximate P(H | harm, *a*, *c*, D) for all hypotheses H, actions *a*, and contexts *c* and for the given data D, while keeping track of the level of approximation error, and (2) how to find a proof that there is no H for which P(H, harm | *a*, *c*, D) > threshold, or else how to learn excellent heuristics for identifying H’s that maximize P(H, harm | *a*, *c*, D), such that a failure to find an H for which P(H, harm | *a*, *c*, D) > threshold inspires confidence that none exist. These probabilities can in principle be deduced from the general posterior P(H | D) through marginalization computations that are intractable but that we intend to approximate with large neural networks.

Part of the proposed research is to overcome the known inefficiency of Bayesian posterior inference needed for (1). The other concerns the optimization problem (2) of finding a plausible hypothesis that predicts major harm with probability above some threshold. It is similar to worst-case scenarios that sometimes come to us: a hypothesis pops in our mind that is plausible (not inconsistent with other things we know) and which would yield a catastrophic outcome. When that happens, we become cautious and hesitate before acting, sometimes deciding to explore a different, safer path, even if it might delay (or reduce) our reward. To imitate that process of generating such thoughts, we could take advantage of our estimated conditionals to make the search more efficient: we can approximately sample from P(H | harm, *a*, *c*, D). With a Monte-Carlo method, we could construct a confidence interval around our safety probability estimate, and go for an appropriately conservative decision. Even better would be to have a neural network construct a mathematical proof that there exists no such H, such as a branch-and-bound certificate of the maximum probability of harm, and this is the approach that my collaborator David Dalrymple proposes to explore. See the research thesis expected to be funded by the UK government within ARIA that spells out the kind of approach we are both interested in.
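
As one possible reading of the Monte-Carlo idea, the sketch below estimates the marginal probability of harm by sampling hypotheses and attaches a one-sided Hoeffding upper confidence bound. It makes strong simplifying assumptions that would not hold in practice: the sampler draws exactly from P(H | D) rather than from a learned approximation, and the per-hypothesis harm probability is known. The function names and the toy numbers are hypothetical.

```python
import math
import random


def harm_upper_bound(sample_hypothesis, harm_prob, n_samples, delta, rng):
    """Monte-Carlo estimate of P(harm) with a one-sided Hoeffding bound.

    With probability at least 1 - delta (over the sampling), the true
    probability of harm is below the returned upper bound.
    """
    total = 0.0
    for _ in range(n_samples):
        h = sample_hypothesis(rng)
        total += harm_prob(h)  # each term lies in [0, 1]
    mean = total / n_samples
    slack = math.sqrt(math.log(1.0 / delta) / (2.0 * n_samples))
    return mean, min(1.0, mean + slack)


def sample_hypothesis(rng):
    """Toy posterior: 1% of the mass on a 'fragile' world model."""
    return "fragile" if rng.random() < 0.01 else "benign"


def harm_prob(h):
    """Made-up P(harm | H, a, c) for one fixed action and context."""
    return 0.5 if h == "fragile" else 0.0


rng = random.Random(0)
mean, upper = harm_upper_bound(sample_hypothesis, harm_prob, 20000, 1e-3, rng)
print(f"estimate={mean:.4f}, upper bound={upper:.4f}")
```

A conservative policy would compare `upper`, not `mean`, against the acceptable risk level. The slack shrinks as 1/√n, which is one simple sense in which more compute buys a tighter assurance. A bound of this kind says nothing about error in the sampler itself, which is why the approximation error of the learned posterior has to be tracked separately.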

An important issue to tackle is that the neural networks used to approximate conditional probabilities can still make wrong predictions. We can roughly divide errors into three categories: (a) missing modes (missing high-probability hypotheses), (b) spurious modes (including incorrect hypotheses), and (c) locally inaccurate probability estimation (we have the right hypotheses, but the numerical values of their probabilities are a little bit inaccurate). Inaccurate probabilities (c) could be fixed by additional tuning of the neural network, and we could estimate these inaccuracies by measuring our training errors, and then use them to construct confidence intervals around our estimated probabilities. Only having spurious modes (b) would not be too worrisome in our context because it could make us more conservative than we should: we could reject an action due to an implausible hypothesis H that our model considers as plausible, when H wrongly predicts catastrophic harm. Importantly, the correct hypothesis H* would still be among those we consider for a possible harmful outcome. Also, some training methods would make spurious modes unlikely; for example, we can sample hypotheses from the neural net itself and verify if they are consistent with some data, which immediately provides a training signal to rule them out.

The really serious danger we have to deal with in the safety context is (a), i.e., missing modes, because it could make our approximately Bayesian AI produce confidently wrong predictions about harm (although less often than if our approximation of the posterior was a single hypothesis, as in maximum likelihood or standard RL). If we could consider a mode (a hypothesis H for which the exact P(H|D) is large) that the current model does not see as plausible (the estimated P(H|D) is small), then we could measure a training error and correct the model so that it increases the estimated probability. However, sampling from the current neural net unfortunately does not reveal the existence of missing modes, since the neural net assigns them very small probability in the first place and would thus not sample them. This is a common problem in RL and has given rise to exploration methods, but here we would apply these methods to exploration in the space of hypotheses, not the space of real-world actions: we want to sample hypotheses not just from our current model but also from a more exploratory generative model. This idea is present in RL and also in the research on off-policy training of amortized inference networks. Such methods can explore where we have not yet gone or where there are clues that we may have missed a plausible hypothesis. As argued below, we could also considerably reduce this problem if the AI could at least consider the hypotheses that humans have generated in the past, e.g., in human culture and especially in the scientific literature.
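
The following toy sketch illustrates why on-policy sampling cannot reveal a missing mode while a more exploratory sampler can. It is deliberately simplistic and is not one of the off-policy training methods mentioned above: the exploratory model is just a uniform mixture, the exact posterior is assumed to be available for comparison, and all names (`Q_EST`, `TARGET`, `sample_mixture`, `flag_missing_modes`) are hypothetical.

```python
import random


def sample_mixture(q, epsilon, rng):
    """With probability epsilon sample uniformly (exploration), else sample from q."""
    names = list(q)
    if rng.random() < epsilon:
        return rng.choice(names)
    return rng.choices(names, weights=[q[h] for h in names], k=1)[0]


def flag_missing_modes(q, target, epsilon, n_samples, ratio, rng):
    """Collect sampled hypotheses whose true probability exceeds ratio * estimate."""
    flagged = set()
    for _ in range(n_samples):
        h = sample_mixture(q, epsilon, rng)
        if target[h] > ratio * q[h]:
            flagged.add(h)  # large training error here: a candidate missing mode
    return flagged


# Exact posterior (known only because this is a toy problem).
TARGET = {"h3": 0.4, "h1": 0.5, "h2": 0.1}
# Current estimate: the model has dropped the mode h3 entirely.
Q_EST = {"h3": 0.0, "h1": 0.85, "h2": 0.15}

on_policy = flag_missing_modes(Q_EST, TARGET, 0.0, 500, 2.0, random.Random(1))
exploratory = flag_missing_modes(Q_EST, TARGET, 0.2, 500, 2.0, random.Random(1))
print("on-policy:", on_policy)
print("exploratory:", exploratory)
```

With `epsilon = 0` the sampler never proposes `h3`, so no training signal about it is ever produced. With some exploration, `h3` is eventually proposed and the discrepancy between its true and estimated probability becomes visible.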

A nice theoretical reassurance is that we could in principle drive those training errors to zero with more computational resources. What is nice with the proposed Bayesian posterior approximation framework is that, at run-time, we can continue training or at the very least estimate the error made by the neural network through a sampling process. This is similar to how AlphaGo can refine its neural net prediction by running a bunch of stochastic searches for plausible downstream continuations of the game. In human terms, this would be like taking the time to think harder when faced with a tricky situation where we are not sure of what to do, by continuing to sample relevant possibilities in our head and adjusting our estimates of what could happen accordingly.

Yet another way to decrease the risks associated with an insufficiently trained neural network is to make the AI-generated hypotheses somewhat human-readable. This could be achieved by using a regularizer to encourage the AI to generate interpretable hypotheses, i.e., ones that can be converted to natural language and back with as little error as possible, and vice-versa (such that human theories expressed in natural language can be expressed as statements in the AI internal language for hypotheses). At the very least, if we cannot convert the full theory to a human-interpretable form, we could make sure that the concepts involved in the theory are interpretable, even if the relationships between concepts may not always be reducible to a compact verbalizable form. However, because a small number of discrete statements would have a much smaller description length, the AI training procedure should favor interpretable explanations. This would allow human inspection of the explanations generated by the AI. Instead of trying to interpret neural net activations, we would only require that the sequences of outputs generated by the AI be interpretable or as interpretable as possible. This would favor the set of theories about the world that humans can understand, but that space is extremely expressive: it includes all existing scientific theories. Some pieces of these theories could however be implicit, for example the result of applying an algorithm. AI theories could refer to existing math and computer science knowledge in order to explain the data more efficiently: think about algorithms that approximate quantum physics in order to characterize chemical properties. 
Although the laws of quantum physics can be spelled out compactly, there is no tractable solution to questions involving more than a few atoms. Chemistry therefore relies on approximations fitted to larger-scale data for which exact quantum calculations are infeasible; these approximations are sometimes purely numerical, but they involve variables and concepts that can be defined and named.

Interestingly, human theories would generally have a better prior (i.e., would be preferred by the AI) than completely novel ones because their description length could be reduced to identifying their index in the encyclopedia of human knowledge, e.g., by quoting the bibliographic reference of a corresponding scientific paper or Wikipedia entry. On the other hand, novel theories would have to be specified from the much larger universe of all the possible theories. For the same reason, a novel theory that can build on existing human concepts and theories would be more compact to describe and would be more likely to be chosen a priori (according to P(H)) by the AI than one that requires a lot of new definitions and concepts. A major advantage of making AI hypotheses interpretable and human theories translatable in the AI language for hypotheses is that it might allow us to obtain approximate guarantees that the AI estimate of P(H|D) could not be missing a theory (a missing mode) that already belongs in the human corpus of theories. So we could have missing modes, but only for theories that humans also miss.
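
A back-of-the-envelope calculation shows how large this prior advantage could be. The numbers are purely hypothetical: N is an assumed size for an indexed corpus of human theories, and 1000 bits is an assumed description length for a theory that has to be written out from scratch.

```latex
% Human theory referenced by its index among N = 10^{8} catalogued entries:
L_{\text{human}} \approx \log_2 N = \log_2 10^{8} \approx 26.6 \text{ bits},
\qquad
P(H_{\text{human}}) \propto 2^{-L_{\text{human}}} \approx 10^{-8}

% Novel theory spelled out in full, with an assumed L_{\text{novel}} = 1000 bits:
P(H_{\text{novel}}) \propto 2^{-1000} \approx 10^{-301}
```

Under these assumptions the novel theory starts with a prior roughly 293 orders of magnitude smaller, so it needs a correspondingly better fit to the data, through P(D | H), before it can compete in the posterior.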

Furthermore, consider that the same P(H|D) is used both to obtain actions, e.g., from P(action | outcome, context, D), and to obtain the cautious rejection of unsafe actions via sampling of P(H | harm, action, context, D). Hence, if a mode is missing in P(H|D), it is missing for both conditional predictions (which come from the same neural network but used with different inputs). Now, if the AI can read all the human-generated theories (for example by reading all the scientific literature), we could make sure that these partial hypotheses are all included as explanatory candidates for the data D. Only those human theories that are compatible with the data available to the AI would remain among the plausible ones under the estimated P(H | D), i.e., we can cull out the human-generated bad theories that are not even compatible with data (like conspiracy theories and incoherent blabber that populate much of our internet). As a consequence, we would be assured that if any human would have predicted harm using any of the plausible human-generated theories, so would the AI’s approximate Bayesian posteriors over theories. The AI could also discover modes (plausible hypotheses) not known by humans, i.e., new scientific theories, but at least it would have absorbed all human hypotheses about how the world works, culling out those that are incoherent or inconsistent with data. This is very different from an LLM which just mimics the distribution of the text in its training corpus. Here we are talking about *explanations for the data*, which cannot be inconsistent with the data because the data likelihood P(D|H) computed given such an interpretation would otherwise vanish, nor be internally inconsistent because P(H) would vanish. If either P(D|H) or P(H) vanishes, then the posterior P(H|D) vanishes and the AI would be trained to not generate such H’s.

A particular kind of explanation for the data is a causal explanation, i.e., one that involves a graph of cause-and-effect relationships. Our neural net generating explanations could also generate such graphs (or partial graphs in the case of partial explanations), e.g., as we have shown on a small scale already. Causal explanations should be favored in our prior P(H) because they will be more robust to changes in distribution due to actions by agents (humans, animals, AIs), and they properly account for actions, not just as arbitrary random variables but ones that interfere with the default flow of causality – they are called “interventions”. Causal models are unlike ordinary probabilistic (graphical) models in that they include the possibility of interventions on any subset of the variables. An intervention gives rise to a different distribution without changing any of the parameters in the model. A good causal model can thus generalize out-of-distribution, to a vast set of possible distributions corresponding to different interventions. Even a computer program can be viewed under a causal angle, when one allows interventions on the state variables of the program, which thus act like the nodes of a causal graph.
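
The difference between conditioning and intervening can be shown with a two-variable causal model X → Y, computed exactly by enumeration. This is a minimal sketch with made-up parameters (`P_X`, `P_Y_GIVEN_X`) and hypothetical function names; it is not taken from the small-scale experiments mentioned above.

```python
from itertools import product

# Parameters of a toy causal model X -> Y (values are made up for illustration).
P_X = 0.3                        # P(X = 1)
P_Y_GIVEN_X = {0: 0.1, 1: 0.9}   # P(Y = 1 | X = x)


def joint(do_y=None):
    """Joint distribution over (x, y).

    If do_y is given, Y is forced to that value: the mechanism producing Y
    from X is bypassed, while the parameters above are left untouched.
    """
    dist = {}
    for x, y in product((0, 1), repeat=2):
        p_x = P_X if x == 1 else 1.0 - P_X
        if do_y is None:
            p_y1 = P_Y_GIVEN_X[x]
            p_y = p_y1 if y == 1 else 1.0 - p_y1
        else:
            p_y = 1.0 if y == do_y else 0.0
        dist[(x, y)] = p_x * p_y
    return dist


def prob_x1_given_y1(dist):
    """P(X = 1 | Y = 1) under the given joint distribution."""
    p_y1 = dist[(0, 1)] + dist[(1, 1)]
    return dist[(1, 1)] / p_y1


observed = prob_x1_given_y1(joint())           # seeing Y = 1 is evidence about X
intervened = prob_x1_given_y1(joint(do_y=1))   # forcing Y = 1 tells us nothing about X
print(f"P(X=1 | Y=1)     = {observed:.3f}")    # 0.794
print(f"P(X=1 | do(Y=1)) = {intervened:.3f}")  # 0.300
```

An ordinary probabilistic model fitted to observational data would answer both queries with 0.794. The causal model gives a different, correct answer for the intervention using the same parameters, which is the sense in which it generalizes to distributions it has never observed.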

This post only provides a high-level overview of the research program that I propose, and much remains to be done to achieve the central goal of efficient and reliable probabilistic inference over potentially harmful actions with the crucial desideratum of increasing the safety assurance when more computational power is provided, either in general or in the case of a particular context and proposed action. We don’t know how much time is left before we pass a threshold of dangerous AI capabilities, so advances in AI alignment and control are urgently needed.