Three Dynamics of AI Development
Before I make my policy argument, I’m going to describe three basic dynamics of AI systems that it’s crucial to understand:
- Scaling laws. A property of AI — which I and my co-founders were among the first to document back when we worked at OpenAI — is that, all else equal, scaling up the training of AI systems leads to smoothly better results on a range of cognitive tasks, across the board. So, for example, a $1M model might solve 20% of important coding tasks, a $10M model might solve 40%, a $100M model might solve 60%, and so on (see the toy sketch after this list). These differences tend to have huge implications in practice — another factor of 10 may correspond to the difference between an undergraduate and a PhD skill level — and thus companies are investing heavily in training these models.
- Shifting the curve. The field is constantly coming up with ideas, large and small, that make things more effective or efficient: it could be an improvement to the architecture of the model (a tweak to the basic Transformer architecture that all of today’s models use) or simply a way of running the model more efficiently on the underlying hardware. New generations of hardware also have the same effect. What this typically does is shift the curve: if the innovation is a 2x “compute multiplier” (CM), then it allows you to get 40% on a coding task for $5M instead of $10M, or 60% for $50M instead of $100M, etc. Every frontier AI company regularly discovers many of these CMs: frequently small ones (~1.2x), sometimes medium-sized ones (~2x), and every once in a while very large ones (~10x). Because the value of having a more intelligent system is so high, this shifting of the curve typically causes companies to spend more, not less, on training models: the gains in cost efficiency end up entirely devoted to training smarter models, limited only by the company’s financial resources. People are naturally attracted to the idea that “first something is expensive, then it gets cheaper” — as if AI were a single thing of constant quality, and when it gets cheaper, we’ll use fewer chips to train it. But what’s important is the scaling curve: when it shifts, we simply traverse it faster, because the value of what’s at the end of the curve is so high. In 2020, my team published a paper suggesting that the shift in the curve due to algorithmic progress is ~1.68x/year. That has probably sped up significantly since; it also doesn’t take efficiency and hardware into account. I’d guess the number today is maybe ~4x/year. Another estimate is here. Shifts in the training curve also shift the inference curve, and as a result large decreases in price, holding model quality constant, have been occurring for years. For instance, Claude 3.5 Sonnet, which was released 15 months later than the original GPT-4, outscores GPT-4 on almost all benchmarks while having a ~10x lower API price. (A toy sketch of this curve-shifting arithmetic follows below.)
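To make the scaling-laws point concrete, here is a minimal toy sketch. The linear-in-log-cost form and the exact percentages are illustrative assumptions chosen only to match the example numbers above ($1M → 20%, $10M → 40%, $100M → 60%); they are not a real, measured scaling law.

```python
# Toy illustration of the scaling-laws bullet: capability rising smoothly
# with log-scale training spend. The functional form and numbers are
# illustrative assumptions, not measured data.
import math

def toy_solve_rate(training_cost_usd: float) -> float:
    """Hypothetical % of coding tasks solved as a function of training cost."""
    # Anchored to the example in the text: $1M -> 20%, $10M -> 40%, $100M -> 60%.
    return 20.0 * (math.log10(training_cost_usd) - 5.0)

for cost in (1e6, 1e7, 1e8, 1e9):
    print(f"${cost:>13,.0f} -> ~{toy_solve_rate(cost):.0f}% of tasks solved")
```

The point of the toy curve is only that each additional factor of 10 in spend buys another comparable jump in capability, which is why the differences matter so much in practice.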
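And here is the curve-shifting arithmetic, under the same illustrative assumptions: a compute multiplier makes every training dollar go further, so the cost of reaching a fixed capability level is divided by the cumulative CM. The CM values (2x for a single innovation, ~4x/year compounded) come from the text; the underlying curve is still the toy model above, not a real one.

```python
# Toy illustration of "shifting the curve": a compute multiplier (CM) divides
# the dollars needed to reach a given capability level on the toy curve above.
def cost_for_capability(target_pct: float, cumulative_cm: float = 1.0) -> float:
    """Dollars needed to hit target_pct on the toy curve, given the cumulative
    compute multiplier from algorithmic and hardware gains."""
    effective_compute_needed = 10 ** (target_pct / 20.0 + 5.0)
    return effective_compute_needed / cumulative_cm

print(cost_for_capability(40))                      # ~$10M with no CM
print(cost_for_capability(40, cumulative_cm=2))     # ~$5M after a single 2x CM
print(cost_for_capability(60, cumulative_cm=4**2))  # the "$100M" level, two years of ~4x/year later
```

In practice, as the text notes, companies do not pocket these savings; they keep spending at the limit of their resources and use the multiplier to traverse the curve faster.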