The generative AI revolution has begun – how did we get here?

Progress in artificial intelligence often seems to come in cycles. Every few years, computers can suddenly do something they have never been able to do before. “Here!” true believers in AI proclaim. “The age of artificial general intelligence is at hand!” “Nonsense!” say the skeptics. “Remember self-driving cars?”

The truth usually lies somewhere in the middle.

We are in another such cycle, this time with generative AI. Media headlines are dominated by news about AI-generated art, but there is also unprecedented progress in many completely disparate fields. In everything from video to biology, programming, writing, translation, and more, AI is advancing at a similarly incredible pace.

Why is all this happening now?

You may already be familiar with the latest developments in the world of AI. You have seen the award-winning art, heard the interviews with people long dead, and read about the breakthroughs in protein folding. But these new AI systems are not just producing cool demos in research labs. They are rapidly turning into practical tools and real commercial products that anyone can use.

There is a reason all of this happened at once. These achievements rest on a new class of AI models that are more flexible and powerful than anything that came before. Because they were first used for language tasks such as answering questions and writing essays, they are often referred to as large language models (LLMs). OpenAI's GPT-3, Google's BERT, and others are all LLMs.

But these models are extremely flexible and adaptable. The same mathematical structures have proven so useful in computer vision, biology, and beyond that some researchers have taken to calling them “foundation models” to better describe their role in modern AI.

Where did these foundation models come from, and how did they break out of language to drive so much of what we see in AI today?

The foundation of foundation models

There is a holy trinity in machine learning: models, data, and compute. Models are algorithms that take input and produce output. Data refers to the examples the algorithms are trained on. To learn anything, there must be enough data, with enough coverage, for the algorithms to produce useful results. Models must be flexible enough to capture the complexity of the data. And finally, there must be enough computing power to run the algorithms.
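To make that trinity concrete, here is a minimal, purely illustrative Python sketch (not from the article): a tiny “model” with two parameters, some synthetic “data”, and a loop of “compute” that fits the former to the latter.

```python
import numpy as np

# Toy "data": noisy samples of y = 3x + 2 (synthetic, purely illustrative).
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=200)
y = 3.0 * x + 2.0 + rng.normal(scale=0.1, size=200)

# Toy "model": a line with two learnable parameters.
w, b = 0.0, 0.0

# "Compute": repeatedly run the model and nudge its parameters
# to reduce the mean squared error on the data.
lr = 0.1
for step in range(500):
    pred = w * x + b
    err = pred - y
    w -= lr * (2 * err * x).mean()  # gradient of the error with respect to w
    b -= lr * (2 * err).mean()      # gradient of the error with respect to b

print(f"learned w={w:.2f}, b={b:.2f}")  # roughly 3 and 2
```

Every system described below follows this same recipe, scaled up by many orders of magnitude: more parameters, more examples, more compute.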

The first modern AI revolution arrived with deep learning in 2012, when convolutional neural networks (CNNs) began to be applied to computer vision problems. CNNs are similar in structure to the visual cortex. They had been around since the 1990s but were not yet practical because of their heavy demands on computing power.
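As a rough illustration of what a CNN's early layers do, the sketch below (with a synthetic image and a hand-picked filter, purely for illustration) slides a small filter across an image and records where it responds, much like a learned local feature detector.

```python
import numpy as np

# A tiny grayscale "image": a bright square on a dark background (synthetic).
image = np.zeros((8, 8))
image[2:6, 2:6] = 1.0

# A 3x3 filter that responds to vertical edges, similar in spirit to the
# local feature detectors a CNN learns in its early layers.
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]], dtype=float)

# Slide the filter over the image and record its response at each position.
h, w = image.shape
kh, kw = kernel.shape
response = np.zeros((h - kh + 1, w - kw + 1))
for i in range(response.shape[0]):
    for j in range(response.shape[1]):
        patch = image[i:i + kh, j:j + kw]
        response[i, j] = (patch * kernel).sum()

print(response)  # large values mark the left and right edges of the square
```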

However, in 2006, Nvidia released CUDA, a programming platform that allowed GPUs to be used as general-purpose supercomputers. In 2009, AI researchers at Stanford introduced ImageNet, a collection of labeled images used to train computer vision algorithms. In 2012, AlexNet combined CNNs trained on GPUs with ImageNet data to create the best visual classifier the world had ever seen. Deep learning and AI took off from there.

CNNs, the ImageNet dataset, and GPUs were the magic combination that unlocked enormous progress in computer vision. 2012 set off a boom of interest in deep learning and spawned entire industries, such as autonomous driving. But we quickly learned the limits of that generation of deep learning. CNNs were good for vision, but other areas lacked a comparable modeling breakthrough. One huge gap was natural language processing (NLP): getting computers to understand and work with normal human language rather than code.

The problem of understanding and working with language is fundamentally different from the problem of working with images. Processing language means working with sequences of words, where order matters. A cat is still a cat no matter where it appears in an image, but there is a big difference between “this reader will learn about AI” and “AI will learn about this reader”.
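That sentence pair can be made concrete in a few lines of Python: treated as an unordered bag of words, the two sentences are indistinguishable; only the order of the sequence keeps their meanings apart.

```python
from collections import Counter

a = "this reader will learn about AI".split()
b = "AI will learn about this reader".split()

# As unordered bags of words, the two sentences are identical...
print(Counter(a) == Counter(b))  # True

# ...but as ordered sequences, they say opposite things.
print(a == b)  # False
```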

Until recently, researchers relied on models such as recurrent neural networks (RNNs) and long short-term memory (LSTM) networks to process and analyze sequential data. These models were effective at recognizing short sequences, such as the spoken words of short phrases, but struggled with longer sentences and paragraphs. Their memory was simply not rich enough to capture the complexity of the ideas and concepts that arise when sentences are combined into paragraphs and essays. They were great for simple Siri- and Alexa-style voice assistants, but nothing more.
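A hand-rolled sketch of a vanilla RNN step (with random, untrained weights, purely for illustration; LSTMs add gating on top of this idea) shows where the bottleneck lies: however long the input, everything the model knows about it must be squeezed through the same fixed-size hidden vector, updated one token at a time.

```python
import numpy as np

# A minimal vanilla RNN cell with made-up random weights (illustration only).
rng = np.random.default_rng(0)
hidden_size, embed_size = 8, 4
W_h = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
W_x = rng.normal(scale=0.5, size=(hidden_size, embed_size))

def run_rnn(token_vectors):
    """Fold a sequence into one fixed-size hidden state, one token at a time."""
    h = np.zeros(hidden_size)
    for x in token_vectors:
        # Each step mixes the previous memory with the new token, then squashes it.
        h = np.tanh(W_h @ h + W_x @ x)
    return h  # everything the model "remembers" about the whole sequence

# A short "sentence" of 5 token vectors versus a long "paragraph" of 200.
short_seq = rng.normal(size=(5, embed_size))
long_seq = rng.normal(size=(200, embed_size))
print(run_rnn(short_seq).shape, run_rnn(long_seq).shape)  # both (8,)
```

Whether the input is five tokens or two hundred, the summary is the same small vector, which is why long-range structure tends to get lost.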

Getting the right data for training was another challenge. ImageNet was a set of millions of labeled images that took significant human effort to create, mostly from graduate students and Amazon Mechanical Turk workers. And ImageNet was itself inspired by and modeled on an older project called WordNet, which tried to create a labeled dataset for English vocabulary. While there is no shortage of text on the Internet, producing a meaningful dataset that teaches a computer to handle human language beyond individual words is incredibly time-consuming. And the labels you create for one application of the same data may not carry over to another task.
