I wish there would be more of this research to speed things up rather than building ever larger models
Notice how all the major AI companies (at least the ones that don't do open releases) stopped telling us how many parameters their models have. Parameter count was touted as a measure of how great the proprietary models were up until GPT-3; then it suddenly stopped.
And notice how inference prices have come down a lot, despite increasing pressure to make money. Opus 4.6 is $25/MTok, while Opus 4.1 was $75/MTok, the same as Opus 4 and Opus 3. OpenAI's o1 was $60/MTok and o1 pro $600/MTok; gpt-5.2 is $14/MTok and 5.2-pro is $168/MTok.
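A quick back-of-the-envelope on those figures (these are the list prices quoted above, so treat the exact ratios as approximate):

```python
# Rough price-drop ratios based on the per-MTok list prices quoted above.
# These are approximate public prices, not measured serving costs.
prices = {
    "Opus 4.1 -> Opus 4.6": (75, 25),
    "o1 -> gpt-5.2": (60, 14),
    "o1 pro -> gpt-5.2-pro": (600, 168),
}

for label, (old, new) in prices.items():
    drop = 1 - new / old
    print(f"{label}: ${old}/MTok -> ${new}/MTok ({drop:.0%} cheaper)")
```

That's a 67-77% drop per generation, which is hard to square with models that keep getting bigger.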
Also note how GPT-4 was rumored to be in the 1.8T-parameter realm, and now Chinese models in the 1T realm can match or surpass it. And I doubt the Chinese have a monopoly on those efficiency improvements.
I doubt frontier models have actually substantially grown in size in the last 1.5 years, and they may well have far fewer parameters than the frontier models of old.
You're hitting on something really important that barely gets discussed. For instance, notice how Opus 4.5's speed essentially doubled, bringing it right in line with the speed of Sonnet 4.5? (Sonnet 4.6 got a speed bump too, though closer to 25%.)
It was the very first thing I noticed: it looks suspiciously like they just rebranded sonnet as opus and raised the price.
I don't know why more people aren't talking about this. Even on X, where the owner directly competes in this market, it's rarely brought up. I strongly suspect there is a sort of tacit collusion between competitors in this space. They all share a strong motivation to kill any deep discussion of token economics, even about each other, because transparency only arms the customers. By keeping the underlying mechanics nebulous, they can all justify higher prices. Just look at the subscription tiers: every single major player has settled on the exact same pricing model, a $20 floor and a $200 cap, no exceptions.
These AI companies are all in the same boat. At current operating costs and profit margins they can't hope to pay back the investment, so they have to pull tricks like rebranding models and silently downgrading offerings. There's no oversight of this industry. The consumer protection agency in the US was literally shut down by the administration, and even if it hadn't been, this technology is too opaque for anyone to really tell whether the model you're served today is a lower tier than the one you paid for yesterday.
I'm convinced they're all doing everything they can in the background to cut costs and increase profits.
I can't prove that Gemini 3 is dumber than when it came out, because of the non-deterministic nature of this technology, but it sure feels like it.
Opus 4.6 was going to be Sonnet 5 up until the week of release. The price bump is even bigger than you realize, because they don't let you run Opus 4.6 at full speed unless you pay them an extra 10x for the new "fast mode".
It kind of makes sense. At least a year or so ago, I know the $20 unlimited plans were costing these companies ~$250 per user averaged out. They're still lighting money on fire at $200, but probably not nearly as badly. I'm not sure whether costs have gone up with the changes in models, though; the agentic tooling seems to be more expensive for them (hence why they're pushing anyone they can to pay per token).
Similar trend in open text-to-image models: Flux.1 was 12B, but now we have 6B models with much better quality. Qwen Image went from 20B to 7B while merging in the edit line and improving quality. Now that the cost of spot H200s at 140GB has come down to A100 levels, you can finally try larger-scale fine-tuning/distillation/RL with these models. A very promising direction for open tools and models if the trend continues.
> I doubt frontier models have actually substantially grown in size in the last 1.5 years
... and you'd most likely be correct in that doubt, given the evidence we have.
What has improved disproportionately more than the software or hardware side is density[1] per parameter, indicating that there's a "Moore's Law"-esque relationship between the number of parameters, the density per parameter and the compute requirements. As long as more and more information/abilities can be squeezed into the same number of parameters, inference will become cheaper and cheaper, quicker and quicker.
I write "quicker and quicker" because, next to the improvements in density, there will still be additional architectural, software and hardware improvements. It's almost as if it's going exponential and we're heading for a so-called Singularity.
Since it's far more efficient and "intelligent" to have many small models competing with and correcting each other for the best possible answer, in parallel, there simply is no need for giant, inefficient, monolithic monsters.
They ain't gonna tell us that, though, because then we'd know that we don't need them anymore.
[1] for lack of a better term that I am not aware of.
Obviously, there’s a limit to how much you can squeeze into a single parameter. I guess the low-hanging fruit will be picked up soon, and scaling will continue with algorithmic improvements in training, like [1], to keep the training compute feasible.
I take "you can't have human-level intelligence without roughly the same number of parameters (hundreds of trillions)" as a null hypothesis: true until proven otherwise.
[1] https://arxiv.org/html/2602.15322v1
Why don't we need them? If I need to run a hundred small models to get a given level of quality, what's the difference to me between that and running one large model?
You can run smaller models on smaller compute hardware and split the compute. For large models you need to be able to fit the whole model in memory to get any decent throughput.
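To put rough numbers on the memory point (weights only, ignoring KV cache and activations; the sizes and precisions below are just illustrative):

```python
# Rough weights-only memory estimate: params * bytes per param.
# Ignores KV cache, activations and runtime overhead, so real
# requirements are higher; model sizes and precisions are illustrative.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_gb(params_billions: float, precision: str) -> float:
    # 1e9 params * bytes-per-param / 1e9 bytes-per-GB simplifies to this:
    return params_billions * BYTES_PER_PARAM[precision]

for size in (8, 70, 1000):  # 8B, 70B and 1T parameters
    row = ", ".join(f"{p}: {weight_gb(size, p):,.0f} GB" for p in BYTES_PER_PARAM)
    print(f"{size}B params -> {row}")
```

An 8B model at int4 fits on a consumer GPU; a 1T model needs terabyte-class memory or has to be sharded across many devices, which is exactly the throughput constraint being described.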
Ah interesting, I didn't realize MoE doesn't need to all run in the same place.
> If I need to run a hundred
It's unfair to pick some arbitrarily high number like that; it either just reflects disagreement, or it assumes that matching the total size is what matters.
> level of quality
What is quality, though? What is high quality, though? Do MY FELLOW HUMANS really know what "quality" is comprised of? Do I hear someone yell "QUALITY IS SUBJECTIVE" from the cheap seats?
I'll explain.
You might care about accuracy (repetition of learned/given text) more than about actual cognitive abilities (clothesline/12 shirts/how long to dry).
From my perspective, the ability to repeat given/learned text has nothing to do with "high quality". Any idiot can do that.
Here's a simple example:
Stupid doctors exist. Plentifully so, even. Every doctor can pattern-match symptoms to medication or further tests, but not every doctor is capable of recognizing when two seemingly different symptoms are actually connected (simple example: a stiff neck caused by sinus issues).
There is not one person on the planet who wouldn't prefer a doctor who is deeply considerate of the complexities and feedback loops of the human body over a doctor who is simply not smart enough to do so and, thus, can't. He can learn texts all he wants, but the memorization of text does not require deeper understanding.
There are plenty of benefits for running multiple models in parallel. A big one is specialization and caching. Another is context expansion. Context expansion is what "reasoning" models can be observed doing, when they support themselves with their very own feedback loop.
One does not need "hundred" small models to achieve whatever you might consider worthy of being called "quality". All these models can not only reason independently of each other, but also interact contextually, expanding each other's contexts around what actually matters.
They also don't need to learn all the information about "everything", like big models do. It's simply not necessary anymore. We have very capable systems for retrieving information and feeding them to model with gigantic context windows, if needed. We can create purpose-built models. Density/parameter is always increasing.
Multiple small models, specifically trained for high reasoning/cognitive capabilities and given access to relevant texts, can work through multiple perspectives on a matter in parallel, boosting context expansion massively.
A single model cannot refactor its own chain of thought during an inference run, which is massively inefficient. A single model can only give itself feedback one step after another, while multiple models can do it all in parallel.
See ... there are two things which cover the above fundamentally:
1. No matter how you put it, we've learned that models are "smarter" when there is at least one feedback-loop involved.
2. No matter how you put it, you can always have yet another model process the output of a previously run model.
These two things, in combination, strongly indicate that multiple small, high-efficiency models running in parallel, providing themselves with the independent feedback they require to actually expand contexts in depth, is the way to go.
Or, in other words:
Big models scale Parameters, many small models scale Insight.
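Purely as an illustration of the pattern being argued for here (not anyone's production setup), the "many small models plus a feedback loop" idea looks roughly like this; ask_model is a placeholder for whatever local inference call you use:

```python
# Illustrative sketch of the "many small models + feedback loop" pattern
# described above. ask_model() is a placeholder for any inference backend
# (llama.cpp, vLLM, an HTTP endpoint, ...), not a real library call.
from concurrent.futures import ThreadPoolExecutor

def ask_model(model: str, prompt: str) -> str:
    raise NotImplementedError("plug in your inference backend here")

def parallel_answer(question: str, workers: list[str], reviewer: str) -> str:
    # 1. Several small models draft answers independently, in parallel.
    with ThreadPoolExecutor(max_workers=len(workers)) as pool:
        drafts = list(pool.map(lambda m: ask_model(m, question), workers))
    joined = "\n---\n".join(drafts)
    # 2. A separate model critiques the drafts (the feedback loop).
    critique = ask_model(reviewer, f"Question: {question}\nDrafts:\n{joined}\nPoint out errors and gaps.")
    # 3. A final pass folds drafts + critique into one answer (context expansion).
    return ask_model(reviewer, f"Question: {question}\nDrafts:\n{joined}\nCritique:\n{critique}\nWrite the best final answer.")
```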
Doesn’t the widely accepted Bitter Lesson say the exact opposite about specialized models vs generalized?
The corollary to the Bitter Lesson is that on any market-meaningful time scale, a human-crafted solution will outperform one which relies on compute and data. It's only on time scales over 5 years that your bespoke solution will be overtaken. By which point you can hand-craft a new system which uses the brute-force model as part of it.
Repeat ad nauseam.
I wish the people who quote the blog post actually read it.
The Bitter Lesson is about exploration and learning from experience, i.e. RL (Sutton's own field) and meta-learning. Specialized models are fine from a Bitter Lesson standpoint if the specialization mixture is meta-learned / searched / dynamically learned and routed.
I'd suggest that a measure like 'density[1]/parameter' as you put it will asymptotically rise to a hard theoretical limit (that probably isn't much higher than what we have already). So quite unlike Moore's Law.
It's the same thing. Quantize your parameters? The "bigger" model runs faster. MoE base-model distillation? The "bigger" model runs as a smaller model.
There is no gain for anyone anywhere from reducing parameter count overall, if that's what you mean. That sounds more like you don't like transformer models than like a real performance concern.
Why not both?
Scaling laws are real! But they don't preclude faster processing.
Is anyone doing any form of diffusion language models that are actually practical to run today on the actual machine under my desk? There are loads of more "traditional" .gguf options (well, quants) that are practical even on shockingly weak hardware, and I've been seeing things that give me hope that diffusion is the next step forward, but so far it's all been early research prototypes.
I worked on it for a more specialized task (query rewriting). It’s blazing fast.
A lot of inference code is set up for autoregressive decoding now. Diffusion is less mature. Not sure if Ollama or llama.cpp support it.
Did you publish anything you could link wrt. query rewriting?
How was the quality?
Based on my experience running diffusion image models, I really hope this isn't going to take over anytime soon. Parallel decoding may be great if you have a nice parallel GPU or NPU, but it is dog slow on CPUs.
Because diffusion models have a substantially different refining process, most current software isn't built to support it. So I've also been struggling to find a way to play with these models on my machine. I might see if I can cook something up myself before someone else does...
A lot of this post-training recipe feels reminiscent of DINO training (teacher/student, use of stop gradients). I wonder if the more recent leJEPA SigREG regularization research might be relevant here for simpler post-training.
I'd love to know what's going on with the Gemini Diffusion model - they had a preview last May and it was crazy fast but I've not heard anything since then.
I do wonder why diffusion models aren't used alongside constrained decoding for programming - surely it makes better sense than using an auto-regressive model.
Diffusion models need to infer the causality of language from within a symmetric architecture (information can flow forward or backward). AR forces information to flow in a single direction and is substantially easier to control as a result. The 2nd sentence in a paragraph of English text often cannot come before the first or the statement wouldn't make sense. Sometimes this is not an issue (and I think these are cases where parallel generation makes sense), but the edge cases are where all the money lives.
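Concretely, the difference being described is the attention mask: an AR decoder hides future positions, so causality is built in, while a diffusion-style denoiser attends in both directions and has to learn ordering constraints from data. A minimal illustration:

```python
# Attention-mask view of the AR vs diffusion difference described above.
# AR decoding: each position attends only to itself and earlier positions.
# Diffusion-style denoising: every position attends to every other position,
# so "what must come first" has to be learned rather than enforced.
import numpy as np

seq_len = 5
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))   # autoregressive
bidirectional_mask = np.ones((seq_len, seq_len), dtype=bool)     # diffusion denoiser

print(causal_mask.astype(int))         # lower-triangular: strictly left-to-right
print(bidirectional_mask.astype(int))  # all ones: information flows both ways
```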
Google is working on a similar line of research. Wonder why they haven't rolled out a GPT-4o-scale version of this yet.
Probably because it's expensive.
But I wish there were more "let's scale this thing to the skies" experiments from those who actually can afford to scale things to the skies.
Scaling laws mean that there's not much need to actually scale things to the skies. Instead, you can run a bunch of experiments at small scale, fit the scaling law parameters, then extrapolate. If the predicted outcome is disappointing (e.g. it's unlikely to beat the previous scaled-to-the-sky model), you can save the really expensive experiment for a more promising approach.
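For example (numbers invented purely to show the mechanics), the workflow is: measure loss at a handful of small compute budgets, fit a saturating power law, and extrapolate to the big-run budget before paying for it:

```python
# Illustrative scaling-law workflow: fit L(C) = a * C**(-b) + c on a few
# cheap pilot runs, then extrapolate to the frontier-scale budget. The
# data points here are invented purely to demonstrate the mechanics.
import numpy as np
from scipy.optimize import curve_fit

def loss_curve(compute, a, b, c):
    return a * compute ** (-b) + c

# Compute expressed in units of 1e18 FLOPs keeps the fit numerically well behaved.
compute = np.array([1.0, 3.0, 10.0, 30.0, 100.0])
loss    = np.array([3.10, 2.82, 2.60, 2.44, 2.32])   # hypothetical eval losses

(a, b, c), _ = curve_fit(loss_curve, compute, loss, p0=[1.0, 0.3, 2.0])
big_run = 1e6  # i.e. 1e24 FLOPs in these units
print(f"predicted loss at 1e24 FLOPs: {loss_curve(big_run, a, b, c):.2f}")
```

If the predicted number doesn't clear the bar set by the current scaled-up model, the expensive run simply never happens.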
It would certainly be nice though if this kind of negative result was published more often instead of leaving people to guess why a seemingly useful innovation wasn't adopted in the end.
I think diffusion makes much more sense than auto-regressive (AR) generation specifically for code generation, as compared to chatbots.
Releasing this on the same day as Taalas's 16,000 token-per-second acceleration for the roughly comparable Llama 8B model must hurt!
I wonder how far down they can scale a diffusion LM? I've been playing with in-browser models, and the speed is painful.
https://taalas.com/products/
Nothing to do with each other. This is a general optimization. Taalas' is an ASIC that runs a tiny 8B model on SRAM.
But I wonder how Taalas' product can scale. Making a custom chip for one single tiny model is different than running any model trillions in size for a billion users.
Roughly, 53B transistors for every 8B params. For a 2T-param model, you'd need 13 trillion transistors, assuming the scaling is linear. One chip uses 2.5 kW of power? That's 4x H100 GPUs. How does it draw so much power?
If you assume that the frontier model is 1.5 trillion parameters, you'd need an entire N5 wafer to run it. And then if you need to change something in the model, you can't, since it's physically printed on the chip. So this is something you do only if you know you're going to use this exact model, unchanged, for years.
Very interesting tech for edge inference, though. Robots and self-driving could make use of these in the distant future if power draw comes down drastically. A 2.5 kW chip running inside a robot is not realistic. Maybe a 150 W chip.
The 2.5kW figure is for a server running 10 HC1 chips:
> The first generation HC1 chip is implemented in the 6 nanometer N6 process from TSMC. ... Each HC1 chip has 53 billion transistors on the package, most of it very likely for ROM and SRAM memory. The HC1 card burns about 200 watts, says Bajic, and a two-socket X86 server with ten HC1 cards in it runs 2,500 watts.
https://www.nextplatform.com/2026/02/19/taalas-etches-ai-mod...
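Redoing the arithmetic with the corrected per-card figures (all numbers from the quoted article; the linear-scaling extrapolation to a 2T-parameter model is the thread's assumption, not Taalas'):

```python
# Back-of-the-envelope using the figures quoted above: ~53B transistors
# and ~200 W per HC1 card for an 8B-param model, ten cards per 2,500 W
# server. Scaling linearly to 2T params is an assumption, not a roadmap.
transistors_per_card = 53e9
params_per_card = 8e9
watts_per_card = 200
cards_per_server = 10

scale = 2e12 / params_per_card                       # 2T / 8B = 250x
print(f"transistors if scaled linearly: {transistors_per_card * scale:.2e}")  # ~1.3e13
print(f"card power per server: {watts_per_card * cards_per_server} W "
      "(the rest of the 2,500 W is host CPUs and overhead)")
```

So the 2.5 kW is a ten-card server; a single card sits in ordinary accelerator territory at ~200 W.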
I’m confused then. They need 10 of these to run an 8B model?
Just tried this. Holy fuck.
I'd take an army of high-school graduate LLMs to build my agentic applications over a couple of genius LLMs any day.
This is a whole new paradigm of AI.
A billion stupid LLMs don't make a smart one, they just make one stupid LLM that's really fast at stupidity.
When that generates 10k of output slop in less latency than my web server doing some CRUD shit... amazing!
This is exceptionally fast (almost instant). What's the catch? The answer was there before I lifted the return key!
This is crazy!
Is this available as open source anywhere to try?
If this means there’s a 2x-7x speed up available to a scaled diffusion model like Inception Mercury, that’ll be a game changer. It feels 10x faster already…
Diffusion language models seem poised to smash purely autoregressive models. I'm giving it 1-2 years.
One appeal of it is for RL. If it ends up being a lot faster for generation, you'll be able to do a lot more RL.
If people can make RL scalable-- make it so that RL isn't just a final phase, but something which is as big as the supervised stuff, then diffusion models are going to have an advantage.
If not, I think autoregressive models will still be preferred. Diffusion models become fixed very fast; they can't actually refine their outputs, so we're not talking about some kind of refinement along the lines of: initial idea -> better idea -> something actually sound.
Feels like the sodium ion battery vs lithium ion battery thing, where there are theoretical benefits of one but the other has such a head start on commercialization that it'll take a long time to catch up.
Not really. Unlike with physical goods like batteries, the hardware for training a diffusion vs an autoregressive language model is more or less exactly the same.
Although the lab that did this research (Chris Re and Tri Dao are involved) is run by the world's experts in squeezing CUDA and Nvidia hardware for every last drop of performance.
At the API level, the primary differences will be the addition of text infill capabilities for language generation. I also somewhat expect certain types of generation to be more cohesive (e.g. comedy or stories where you need to think of the punchline or ending first!)
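No vendor has shipped a standard for this yet, so purely as a hypothetical sketch of what an infill-style request might look like (every field name below is invented, not taken from any real API):

```python
# Hypothetical shape of an infill request a diffusion-LM API might expose.
# None of these field names come from a real vendor API; they only
# illustrate prefix + suffix -> generated middle.
infill_request = {
    "model": "some-diffusion-lm",
    "prefix": "def fizzbuzz(n):\n    results = []\n",
    "suffix": "\n    return results\n",
    "max_middle_tokens": 128,
}
# An AR chat endpoint only conditions left-to-right on a prompt; a diffusion
# model can condition on both sides of the gap at once, which is what makes
# native infill (and "decide the ending first" generation) a natural fit.
```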
Same with digital vs analog
This doesn't mention the drawback of diffusion language models, the main reason why nobody is using them: they have significantly lower performance on benchmarks than autoregressive models at similar size.
Can't wait for the day I can actually try a diffusion model on my own machine (128GB M4 Max) rather than as a hosted service. So far I haven't seen a single piece of software that supports it.
You can try it today. You can get them from Hugging Face. Here is an example:
https://huggingface.co/tencent/WeDLM-8B-Instruct
Diffusion isn’t natively supported in the transformers library yet so you have to use their custom inference code.
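If the repo registers its custom code with the Auto classes, the usual loading pattern applies; this is only a generic sketch, so defer to the WeDLM model card for the actual generation entry point, since it may ship its own sampler:

```python
# Generic loading sketch for a Hugging Face repo that ships custom modeling
# code. The generation entry point for WeDLM specifically may differ from
# the standard generate() call; check the model card / repo README.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "tencent/WeDLM-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo, trust_remote_code=True)
```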
I don't subscribe to the Python craze, but this could be interesting. Thanks!