Research - Yoshua Bengio

Research Interests

At the beginning of 2023, I began a pivot from research in machine learning aimed at increasing AI capabilities and applying AI for the benefit of society to research on AI safety: what can possibly go wrong as we approach or surpass human-level intelligence with AI, and how do we design AI so that it will not harm humans in the first place?

See this paper for an outline of my long-term research vision to construct safe-by-design AI, which I call the Scientist AI. Recent observations show growing tendencies toward deception, cheating, hacking, lying and self-preservation in frontier AIs, which illustrate the potential catastrophic risks from misaligned, very capable and agentic AIs in the future. The main training signals in current frontier AIs all give rise to uncontrolled and misaligned agency, from trying to imitate people (current LLM pre-training) or to please people (current RLHF).

Instead, the Scientist AI is trained to understand, explain and predict, like a selfless, idealized and platonic scientist. Instead of an actor trained to imitate or please people (including sociopaths), imagine an AI that is trained like a psychologist, or more generally a scientist, who tries to understand us, including what can harm us. The psychologist can study a sociopath without acting like one. Mathematically, this is to be implemented with structured and honest chains of thought seen as latent variables that can explain the observed facts, which include the things that people say or write, not taken as truths but as observations of their actions. The aim is to obtain a completely non-agentic and memoryless AI that can provide Bayesian posterior probabilities for statements, given other statements. This could be used to reduce the risks from untrusted AI agents (not the Scientist AI) by providing the key ingredient of a safety guardrail: is this proposed action from the AI agent likely to cause harm? If so, reject that action.
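
The guardrail logic described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `allow_action`, `PosteriorFn` and `toy_posterior`, the wording of the harm statement and the threshold value are all hypothetical, and the toy estimator is a keyword lookup standing in for a real posterior estimator, not an implementation of the Scientist AI.

```python
from typing import Callable

# Hypothetical interface: maps (statement, given) to P(statement | given).
# In the text this role is played by the Scientist AI; here it is a stub.
PosteriorFn = Callable[[str, str], float]


def allow_action(action: str, context: str, posterior: PosteriorFn,
                 threshold: float = 0.01) -> bool:
    """Return True if the estimated probability of harm is below the threshold."""
    # Assumed phrasing of the statement whose posterior is queried.
    statement = f"Carrying out '{action}' causes harm."
    p_harm = posterior(statement, context)
    if not 0.0 <= p_harm <= 1.0:
        raise ValueError("posterior must return a probability in [0, 1]")
    # Reject the action when harm is judged too likely.
    return p_harm < threshold


def toy_posterior(statement: str, given: str) -> float:
    """Stand-in estimator for illustration only (keyword lookup, not inference)."""
    return 0.9 if "delete all backups" in statement else 0.001


if __name__ == "__main__":
    for proposed in ("summarize the report", "delete all backups"):
        verdict = allow_action(proposed, "office assistant", toy_posterior)
        print(proposed, "->", "allow" if verdict else "reject")
```

The estimator is queried only for a probability and takes no action itself; the decision to reject sits in a separate, simple rule, and the threshold is a design choice that would have to be set conservatively in practice.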

By its very design, a Scientist AI could also help scientific research as a tool that generates plausible scientific hypotheses, and it could thus accelerate research towards scientific challenges of humanity, e.g., in healthcare or the environment. Finally, my aim is to explore how such a trustworthy foundation could be used to design safe AI agents (to avoid bad intentions in them in the first place) and not just their guardrail. Indeed, AI agents may become necessary to protect ourselves one day, and of course could also have societal value, if deployed wisely and with human well-being, flourishing and dignity as primary objectives.

Notable Past Research

In the past I worked on learning of deep representations (either supervised or unsupervised), capturing sequential dependencies with recurrent networks and other autoregressive models (including the first neural net language models), understanding credit assignment (including the quest for biologically plausible analogues of backprop, as well as end-to-end learning of complex modular information processing assemblies), meta-learning (or learning to learn), attention mechanisms (which are the key ingredients in the success of Transformers), deep generative models of many kinds, curriculum learning, variations of stochastic gradient descent and why SGD works for neural nets, convolutional architectures, natural language processing (especially with word embeddings, language models and machine translation), and understanding why deep learning works so well and what its current limitations are. I worked on many applications of deep learning, including – but not limited to – healthcare (such as medical image analysis and drug discovery), standard AI tasks of computer vision, modeling speech and language and, more recently, robotics.

1989-1998 Convolutional and recurrent networks trained end-to-end with probabilistic alignment (HMMs) to model sequences, as the main contribution of my PhD thesis (1991); NIPS 1988, NIPS 1989, Eurospeech 1991, PAMI 1991, and IEEE Trans. Neural Nets 1992. These architectures were first applied to speech recognition in my PhD (and rediscovered after 2010) and then with Yann LeCun et al. to handwriting recognition and document analysis (most cited paper is “Gradient-based learning applied to document recognition”, 1998, with over 15,000 citations in 2018), where we also introduced non-linear forms of conditional random fields (before they were a thing).
1991-1995 Learning to learn papers with Samy Bengio, starting with IJCNN 1991, “Learning a synaptic learning rule”. The idea of learning to learn (particularly by back-propagating through the whole process) has now become very popular, but we lacked the necessary computing power in the early 90’s.
1993-1995 Uncovering the fundamental difficulty of learning in recurrent nets and other machine learning models of temporal dependencies, associated with vanishing and exploding gradients: ICNN 1993, NIPS 1993, NIPS 1994, IEEE Transactions on Neural Nets 1994, and NIPS 1995. These papers have had a major impact and motivated later papers on architectures to aid with learning long-term dependencies and deal with vanishing or exploding gradients. An important but subtle contribution of the IEEE Transactions 1994 paper is to show that the condition required to store bits of information reliably over time also gives rise to vanishing gradients, using dynamical systems theory. The NIPS 1995 paper introduced the use of a hierarchy of time scales to combat the vanishing gradients issue.
1999-2014 Understanding how distributed representations can bypass the curse of dimensionality by providing generalization to an exponentially large set of regions from those comparatively few occupied by training examples. This series of papers also highlights how methods based on local generalization, like nearest-neighbor and Gaussian kernel SVMs, lack this kind of generalization ability. The NIPS 1999 paper introduced, for the first time, auto-regressive neural networks for density estimation (the ancestor of the NADE and PixelRNN/PixelCNN models). The NIPS 2004, NIPS 2005 and NIPS 2011 papers on this subject show how neural nets can learn a local metric, which can bring the power of generalization of distributed representations to kernel methods and manifold learning methods. Another NIPS 2005 paper shows the fundamental limitations of kernel methods due to a generalization of the curse of dimensionality (the curse of highly variable functions, which have many ups and downs). Finally, the ICLR 2014 paper demonstrates that, in the case of piecewise-linear networks (like those with ReLUs), the number of regions (linear pieces) distinguished by a one-hidden-layer network is exponential in the number of neurons (whereas the number of parameters is quadratic in the number of neurons, and a local kernel method would require an exponential number of examples to capture the same kind of function).
2000-2008 Word embeddings from neural networks and neural language models. The NIPS 2000 paper introduces for the first time the learning of word embeddings as part of a neural network which models language data. The JMLR 2003 journal version expands this (these two papers together get around 3000 citations) and also introduces the idea of asynchronous SGD for distributed training of neural nets. Word embeddings have become one of the most common fixtures of deep learning when it comes to language data and this has basically created a new sub-field in the area of computational linguistics. I also introduced the use of importance sampling (AISTATS 2003, IEEE Trans. on Neural Nets, 2008) as well as of a probabilistic hierarchy (AISTATS 2005) to speed up computations and handle larger vocabularies.
2006-2014 Showing the theoretical advantage of depth for generalization. The NIPS 2006 oral presentation experimentally demonstrated the advantage of depth and is one of the most cited papers in the field (over 2600 citations). The NIPS 2011 paper shows how deeper sum-product networks can represent functions which would otherwise require an exponentially larger model if the network is shallow. Finally, the NIPS 2014 paper on the number of linear regions of deep neural networks generalizes the ICLR 2014 paper mentioned above, showing that the number of linear pieces produced by a piecewise linear network grows exponentially in both width of layers and number of layers, i.e., depth, making the functions represented by such networks generally impossible to capture efficiently with kernel methods (short of using a trained neural net as the kernel).
2006-2014 Unsupervised deep learning based on auto-encoders (with the special case of GANs as decoder-only models, see below). The NIPS 2006 paper introduced greedy layer-wise pre-training, both in the supervised case and unsupervised case with auto-encoders. The ICML 2008 paper introduced denoising auto-encoders and the NIPS 2013, ICML 2014 and JMLR 2014 papers cast their theory and generalize them as proper probabilistic models, at the same time introducing alternatives to maximum likelihood as training principles.
2014 Dispelling the local-minima myth regarding the optimization of neural networks, with the NIPS 2014 paper on saddle points, and demonstrating that it is the large number of parameters which makes it very unlikely that bad local minima exist.
2014 Introducing Generative Adversarial Networks (GANs) at NIPS 2014, which introduced many innovations in training deep generative models outside of the maximum likelihood framework and even outside of the classical framework of having a single objective function (instead entering into the territory of multiple models trained in a game-theoretical way, each with its own objective). Presently one of the hottest research areas in deep learning with over 6000 citations mostly from papers that introduce variants of GANs, which have been producing impressively realistic synthetic images one would not have imagined computers being able to generate just a few years ago.
2014-2016 Introducing content-based soft attention and the breakthrough it brought to neural machine translation, mostly with Kyunghyun Cho and Dima Bahdanau. We first introduced the encoder-decoder (now called sequence-to-sequence) architecture (EMNLP 2014) and then achieved a big jump in BLEU scores with content-based soft attention (ICLR 2015). These ingredients are now the basis of most commercial machine translation systems, another entire sub-field created using these techniques.
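
As an illustration of the content-based soft attention mentioned in the last entry, here is a minimal sketch in plain Python. It assumes an additive scoring function (a learned vector applied to a tanh of projected query and key); the names `additive_attention`, `w_q`, `w_k` and `v`, and the hand-picked weights in the example, are hypothetical choices for readability and are not taken from the papers' code.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix: Sequence[Vector], vec: Vector) -> List[float]:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def additive_attention(query: Vector, keys: Sequence[Vector],
                       w_q: Sequence[Vector], w_k: Sequence[Vector],
                       v: Vector) -> Tuple[List[float], List[float]]:
    """Return (attention weights, context vector) for one decoder query."""
    q_proj = matvec(w_q, query)
    scores = []
    for key in keys:
        k_proj = matvec(w_k, key)
        hidden = [math.tanh(a + b) for a, b in zip(q_proj, k_proj)]
        # Content-based score: depends only on the query and this key.
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    weights = softmax(scores)  # soft (differentiable) selection over positions
    dim = len(keys[0])
    # Context is the attention-weighted average of the encoder states.
    context = [sum(w * key[j] for w, key in zip(weights, keys))
               for j in range(dim)]
    return weights, context


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    encoder_states = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
    weights, context = additive_attention([1.0, 0.0], encoder_states,
                                          identity, identity, [1.0, 1.0])
    print(weights, context)
```

Because the weights come from a softmax rather than a hard choice, the whole computation is differentiable, so the parameters can be trained by gradient descent together with the rest of an encoder-decoder model.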