Is Scale All You Need
3 Key Papers
For AI models in the next few years, is scale all you need for continued progress? Let’s investigate!
If you’ve been keeping up with AI, you might have noticed that AI bros and gym bros have one goal in common: getting huge.
So when I looked into it, I saw a lot of different opinions and takes, from experts and the public alike. I wanted to find out, in a focused way, what the current consensus actually is, and where companies are betting their revenue.
The reason this matters is that our lives are becoming increasingly dependent on AI, whether you work in the field or not. I think it would benefit everyone to have at least a rough idea of what form, shape, and speed AI progress is gonna come in over the next few years.
So, as one does, I did a deep dive into this topic, reading countless articles, papers, and interviews. But the biggest task of all was actually trimming away the tangential stuff.
These are some of the interesting topics that I had to intentionally omit, but will have to come back to another day.
- AI alignment research
- Mechanistic interpretability
- Progress, AGI, superintelligence timelines
- Inverse scaling
- And many more
Eventually, I found 3 important papers that helped me better understand where scaling is going. But in order to discuss those, I first have to explain a few basic concepts.
The Eras Tour
I would like to divide the field of AI into 5 distinct eras. Note that there are many overlapping and equally important approaches being developed concurrently, but for the sake of simplicity, let’s divide it up like this:
Before 2012 was the Age of Machine Learning.
From 2012 to 2017 was the Age of Deep Learning.
From 2017 to 2021 was the Age of the Transformer.
From 2021 to 2024 was the Age of Scaling.
From 2024 to roughly 2030: IDK?
Will it be the Age of Synthetic Data, Massive Compute, Agent Environments, New Architectures, or Continued Model Scaling?
Really this is what I wanna find out.
Y Scale Tho?
In 2019, Richard Sutton criticized the field of AI research for failing “to learn the bitter lesson.” That bitter lesson is that general methods which leverage lots of computation ultimately work better than trying to program human knowledge into AI.
Sutton explained that smart researchers often try to make AI think the way humans do. But over and over, what really wins is throwing more computation at systems that learn for themselves.
For example, early chess programs tried to copy how human chess masters think. But in the end, what worked best was having computers do a deep search of all the possible moves.
With language understanding, initial efforts focused heavily on parsing grammar and using traditional linguistics to program a system that could understand language. Ultimately, what worked was more computation and massive datasets from which systems learned language on their own.
In defense of the expert-system researchers, compute has historically been very expensive, and it often feels like you can make faster progress by hand-coding knowledge. But in the long term, thanks to Moore’s law, computers keep getting more powerful, more available, and cheaper.
So if expert systems fail time and again, what are some methods that actually deliver?
Across domains like computer vision, chess, Go, video games, and language, the methods that have worked best are search and learning, both of which are compute intensive.
The main takeaway from Richard Sutton is: instead of trying to manually enter the information, build systems that can learn the information. Expert knowledge can still be used to evaluate AI systems.
I believe this is a generally agreed-upon principle by which most researchers conduct their work nowadays.
Thanks to Moore’s Law, we have more compute, so we can start to increase our model size and dataset size.
Empirical Observation
It’s generally agreed that increasing model size and dataset size leads to better performance.
Biology Analogy: Biomimicry Says So
It’s also worth mentioning that we see a similar pattern in biology: the more neurons an animal has in its language-related brain areas, the more capable it tends to be at processing language. It’s also cool that GPT-4 is rumored to be on a similar scale to human language areas.
You know, I’m also something of a large language model myself.
The implication of all of this is that the bigger the model, the more performance you can get out of it.
The way I like to understand it now: model size is the capacity to learn; dataset size is the information to learn from.
Kaplan Scaling Laws: Kaplan Says So
The key finding from this paper was that performance improves smoothly, following power laws, as model size, dataset size, and compute increase, and that as the compute budget grows, most of it should go into making the model bigger.
It also suggested that larger models are more sample efficient. This influenced the development of GPT-3 and other large language models.
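To make the power-law idea concrete, here’s a minimal Python sketch of loss as a function of model size. The constants are illustrative stand-ins for the kind of fitted values the paper reports, so don’t quote them.

```python
# Minimal sketch of a Kaplan-style power law: loss falls off as a power of
# model size N. The constants below are illustrative assumptions, not the
# paper's exact fitted values.

def loss_from_params(n_params: float, n_c: float = 8.8e13, alpha_n: float = 0.076) -> float:
    """Approximate test loss as L(N) = (n_c / N) ** alpha_n."""
    return (n_c / n_params) ** alpha_n

for n in (1e8, 1e9, 1e10, 1e11):
    print(f"{n:.0e} params -> predicted loss ~ {loss_from_params(n):.2f}")
```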
Chinchilla Scaling Laws: Data Is Important
This paper, from 2022, focused on finding the optimal ratio between model size and dataset size for a given compute budget.
The findings were that the optimal dataset size scales roughly linearly with model size, and that most large language models are significantly undertrained.
Undertrained means the number of tokens (the dataset size) a model was trained on was much smaller than its capacity to learn from.
To put this in perspective: they give a ratio of roughly 20 tokens per parameter. So something like GPT-3, which had 175 billion parameters, could have been trained on 3.5 trillion tokens. In reality, it was trained on about 300 billion tokens.
That puts GPT-3 at roughly 1.7 tokens per parameter.
GPT-4 is rumored to have had 1.8 trillion parameters and 13 trillion training tokens, which gives us about 7.2 tokens per parameter.
That shows improved data utilization, though it’s still well short of the 20-to-1 guideline.
Another way to put this: as of right now, the quality and quantity of data are the biggest bottleneck.
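Here’s a quick back-of-the-envelope script for those ratios. The GPT-3 numbers are public, while the GPT-4 figures are the same rumors quoted above, so treat them as assumptions.

```python
# Back-of-the-envelope check of the tokens-per-parameter ratios above.
# GPT-3 numbers are public; the GPT-4 figures are rumors, treated as assumptions.

CHINCHILLA_TOKENS_PER_PARAM = 20  # rough rule of thumb from the Chinchilla paper

models = {
    "GPT-3": {"params": 175e9, "tokens": 300e9},
    "GPT-4 (rumored)": {"params": 1.8e12, "tokens": 13e12},
}

for name, m in models.items():
    ratio = m["tokens"] / m["params"]
    optimal_tokens = m["params"] * CHINCHILLA_TOKENS_PER_PARAM
    print(f"{name}: {ratio:.1f} tokens/param; "
          f"Chinchilla-optimal would be ~{optimal_tokens / 1e12:.1f}T tokens")
```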
Takeaways From Both Papers
The Kaplan scaling laws basically said that a bigger model means more performance.
The Chinchilla scaling laws told us how to scale dataset size along with model size.
Together, they give us a guide for understanding how AI companies are scaling their systems.
Combining the two, we can also conclude that a smaller, more efficient model trained on enough data can be just as capable as, if not more capable than, a larger model trained on a smaller dataset.
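To make that concrete, here’s a rough sketch using the common approximation that training compute is about 6 × parameters × tokens. The GPT-3-sized budget is just a reference point, and the resulting smaller model is hypothetical.

```python
# Rough sketch: spend a GPT-3-sized compute budget the Chinchilla way.
# Uses the common approximation C ≈ 6 * N * D (FLOPs ≈ 6 × params × tokens).
# The resulting model is illustrative, not a real system.

import math

def training_flops(params: float, tokens: float) -> float:
    return 6 * params * tokens

budget = training_flops(175e9, 300e9)  # roughly a GPT-3-scale training run

# Chinchilla-style: pick N so that D = 20 * N and 6 * N * D matches the budget.
small_params = math.sqrt(budget / (6 * 20))
small_tokens = 20 * small_params

print(f"Same compute, Chinchilla-style: ~{small_params / 1e9:.0f}B params "
      f"on ~{small_tokens / 1e12:.1f}T tokens")
```

Running this gives roughly a 50-billion-parameter model trained on about a trillion tokens, which the Chinchilla results suggest could match or beat a 175B model trained on only 300 billion tokens, at the same cost.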
Then there’s the issue of compute.
Is Compute A Block?
AI research right now is a very empirical science: researchers come up with a hypothesis, and the way to test it is to train a significantly sized neural network. Since compute is very expensive, companies often want to spend their limited compute on training models that will generate money someday rather than on experiments.
But they also don’t wanna be left behind, and there are huge incentives to beat the competition to new findings.
This is the main compute bottleneck.
Everybody Wants to Rule the Data
Good song.
So, in summary: expensive compute is bottlenecking research, and the quality and quantity of data are bottlenecking model performance.
Remember how I said GPT-4 reportedly used 13 trillion tokens, up from GPT-3’s 300 billion? While that may seem like a big jump in data, in reality OpenAI reportedly reused some of the data across the different expert models.
I’ve even heard that sampling high-quality reasoning text multiple times led to increased reasoning capabilities.
And as of a few days ago, OpenAI made a deal with Time magazine to use their content. We may increasingly see organizations with proprietary data sell it to AI companies, and that data could be the single most important ingredient for them.
The Real Question
I think by now we can all agree that scaling model size along with high-quality data, at a ratio of roughly 20 tokens per parameter, will probably keep improving AI.
But if we keep doing this, what does the performance curve look like as we scale both data and model size: exponential, linear, or logarithmic? And where are we on that curve?
Well there was another paper that wanted to find this out.
The ultimate test of intelligence and usefulness for these models is zero-shot capability: can the AI be dropped into a completely new scenario and reason well enough to give the correct answer?
And their key finding was the logarithmic case: zero-shot performance improves roughly logarithmically with scale, so each additional step of improvement costs exponentially more data and compute.
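To get a feel for what logarithmic returns look like, here’s a toy curve. All the numbers are made up; only the shape matters: every constant gain in the score costs a 10x increase in scale.

```python
# Toy illustration of logarithmic returns to scale. All numbers are made up;
# only the shape of the curve matters.

import math

def zero_shot_score(scale: float, a: float = 5.0, b: float = 10.0) -> float:
    """Score grows as a * log10(scale) + b."""
    return a * math.log10(scale) + b

for scale in (1e9, 1e10, 1e11, 1e12):
    print(f"scale {scale:.0e} -> score {zero_shot_score(scale):.0f}")
# Each extra +5 points of score requires 10x more scale.
```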
What does this all mean?
Well, first: where are we on the curve? Because if we’re still really early on it, we probably have several model generations before we start to see a plateau.
This is something I have yet to find any information on. If you do, please kindly leave a comment below with sources.
Future
However, there are still other lessons we can take from this.
This finding suggests the need for different approaches beyond just scaling data and model size.
And if you like reading the news as signals: I’m seeing a lot of people move into the agent space, so that may be the next era of AI.
If you’d like to learn more about that please leave a like and subscribe. And if you want me to research any topic you think is interesting, leave a comment and I’ll do my best!