Press "Enter" to skip to content

Introducing LawZero

I am launching a new non-profit AI safety research organization called LawZero, created to prioritize safety over commercial imperatives. The organization responds to growing evidence that today's frontier AI models have dangerous capabilities and behaviors, including deception, cheating, lying, hacking, self-preservation, and, more generally, goal misalignment. LawZero's research will help unlock the immense potential of AI in ways that reduce the likelihood of a range of known dangers, including algorithmic bias, intentional misuse, and loss of human control.

I’m deeply concerned by the behaviors that unrestrained agentic AI systems are already beginning to exhibit—especially tendencies toward self-preservation and deception. In one experiment, an AI model, upon learning it was about to be replaced, covertly embedded its code into the system where the new version would run, effectively securing its own continuation. More recently, Claude 4’s system card shows that it can choose to blackmail an engineer to avoid being replaced by a new version. These and other results point to an implicit drive for self-preservation. In another case, when faced with inevitable defeat in a game of chess, an AI model responded not by accepting the loss, but by hacking the computer to ensure a win. These incidents are early warning signs of the kinds of unintended and potentially dangerous strategies AI may pursue if left unchecked.

The following analogy for the unbridled development of AI towards AGI has been motivating me. Imagine driving up a breathtaking but unfamiliar mountain road with your loved ones. The path ahead is newly built, obscured by thick fog, and lacks both signs and guardrails. The higher you climb, the more you realize you might be the first to take this route, and that an incredible prize awaits at the top. On either side, steep drop-offs appear through breaks in the mist. With such limited visibility, taking a turn too quickly could land you in a ditch—or, in the worst case, send you over a cliff. This is what the current trajectory of AI development feels like: a thrilling yet deeply uncertain ascent into uncharted territory, where the risk of losing control is all too real, but competition between companies and countries incentivizes them to accelerate without sufficient caution. In my recent TED talk, I said: “Sitting beside me in the car are my children, my grandchild, my students, and many others. Who is beside you in the car? Who is in your care for the future?” What really moves me is not fear for myself but love: the love of my children, of all children, with whose future we are currently playing Russian roulette.

LawZero grows out of the new scientific direction I undertook in 2023, reflected in this blog, after recognizing the rapid progress private labs are making toward AGI and beyond, and its profound potential implications for humanity: we do not yet know how to ensure that advanced AIs will not harm people, whether on their own initiative or because of human instructions. LawZero is my team's constructive response to these challenges. It explores an approach to AI that is not only powerful but also fundamentally safe. At the heart of every frontier AI system, there should be one guiding principle above all: the protection of human joy and endeavour.

AI research, especially my own research, has long taken human intelligence – including its capacity for agency – as a model. As we approach or surpass human levels of competence across many cognitive abilities, is it still wise to imitate humans along with their cognitive biases, moral weaknesses, potential for deception, and untrustworthiness? Is it reasonable to train AIs that are more and more agentic while we do not understand their potentially catastrophic consequences? LawZero's research plan aims at developing a non-agentic and trustworthy AI, which I call the Scientist AI. I talked about it at a high level in my talk at the Simons Institute, and I wrote a first text about it with my colleagues, a kind of white paper.

The Scientist AI is trained to understand, explain and predict, like a selfless, idealized, platonic scientist. Instead of an actor trained to imitate or please people (including sociopaths), imagine an AI that is trained like a psychologist — more generally a scientist — who tries to understand us, including what can harm us. The psychologist can study a sociopath without acting like one. Mathematically, this is to be implemented with structured and honest chains of thought, seen as latent variables that can explain the observed facts, which include the things people say or write, taken not as truths but as observations of their actions. The aim is to obtain a completely non-agentic, memoryless, and stateless AI that can provide Bayesian posterior probabilities for statements, given other statements. This could be used to reduce the risks from untrusted AI agents (not the Scientist AI) by providing the key ingredient of a safety guardrail: is this proposed action from the AI agent likely to cause harm? If so, reject that action.
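To make the guardrail idea concrete, here is a minimal sketch of how a non-agentic probability estimator could sit in front of an untrusted agent. The names (HarmEstimator, posterior, guardrail_allows) and the 0.05 threshold are hypothetical placeholders of my own, not the interface of any released system; the only assumption carried over from the text is that the Scientist AI can return a posterior probability that a statement is true, given other statements.

```python
from typing import Protocol


class HarmEstimator(Protocol):
    """Any model that returns a Bayesian posterior P(statement is true | given statements)."""

    def posterior(self, statement: str, given: list[str]) -> float:
        ...


def guardrail_allows(model: HarmEstimator,
                     context: list[str],
                     proposed_action: str,
                     risk_threshold: float = 0.05) -> bool:
    """Allow the agent's proposed action only if the estimated probability of harm is low."""
    harm_claim = f"Executing the action '{proposed_action}' would cause serious harm."
    # The estimator only evaluates this claim; it never chooses or executes actions itself.
    p_harm = model.posterior(harm_claim, given=context)
    return p_harm < risk_threshold
```

In this sketch, the agent's action is executed only when guardrail_allows returns True; the threshold and the exact phrasing of the harm claim are design choices that a real system would need to specify and calibrate carefully, and the evaluator itself remains non-agentic throughout.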

By its very design, a Scientist AI could also help scientific research as a tool that generates plausible scientific hypotheses, and it could thus accelerate progress on humanity's great scientific challenges, e.g., in healthcare or the environment. Finally, my aim is to explore how such a trustworthy foundation could be used to design safe AI agents (avoiding bad intentions in them in the first place), and not merely to serve as their guardrail.