Splitting TPUs into dedicated training vs inference chips feels like an admission that the bottleneck has shifted from FLOPs to memory bandwidth + latency. Will future gains come more from memory/system design than from raw compute scaling? And what does that say about scaling laws?
> Splitting TPUs into dedicated training vs inference chips feels like an admission that the bottleneck has shifted from FLOPs to memory bandwidth + latency.
With the expected scale of inference, it makes economic sense to build dedicated hardware for each task if the workloads differ even slightly. Probably similar to how the dedicated video-decode chips in TVs are cheap and efficient in a way that chips capable of encoding video aren't.
I think the first two paragraphs of the post are saying exactly that the bottleneck is memory: long contexts, and bigger but less FLOP-intensive models (MoEs).
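To make that concrete, here's a rough roofline-style sketch of why single-stream decode ends up memory-bound; all hardware numbers below are illustrative assumptions, not figures from the post:

    # Back-of-the-envelope: per-token decode cost for a dense model, batch size 1.
    # All numbers are placeholder assumptions for illustration.
    params = 70e9                  # model parameters (assumed)
    bytes_per_param = 2            # bf16 weights
    peak_flops = 9e14              # accelerator peak FLOP/s (assumed)
    hbm_bandwidth = 2.5e12         # HBM bandwidth in bytes/s (assumed)

    # Each generated token needs ~2 FLOPs per parameter, and every weight
    # must be streamed from HBM at least once.
    t_compute = 2 * params / peak_flops                   # ~0.16 ms
    t_memory = params * bytes_per_param / hbm_bandwidth   # ~56 ms

    # t_memory dwarfs t_compute, so decode throughput is set by bandwidth
    # (plus however much KV cache a long context forces you to read), not FLOPs.

That gap is what inference-oriented silicon gets built around.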
The funny thing about scaling laws is that as soon as they were known, the whole objective became learning how to break them - bending the curve, at least. They provided an incredibly useful target, but 'law' was a bit too strong a word.
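For reference, the Chinchilla-style fit people usually mean by "scaling law" is roughly

    L(N, D) ≈ E + A / N^alpha + B / D^beta

with N parameters, D training tokens, and E the irreducible loss. "Bending the curve" amounts to shrinking A, B, or the exponents through better data, architectures, or training recipes, rather than just riding N and D upward.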
No matter how smart your large language model is, if you can’t find the energy to power it, it won’t run. I could imagine Google winning merely because their chips are more efficient. Of course, the other labs are capable of making chips, but Google has been doing it for years.
> admission that the bottleneck has shifted
There's no admission - this has always been known.
Super interesting, but it's so damn hard to find any detail.
I would love to see an instruction-set reference for one of these; all you can find are hardware architecture diagrams or high-level APIs.
2.764 petabytes of HBM per 8i? So that's where all the RAM went.
288 TB/pod (1024 chips).
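Quick sanity check using only the totals quoted here (decimal units assumed; the per-chip figure is just the division, not a confirmed spec):

    pod_hbm_tb = 288                                     # "288 TB/pod"
    chips_per_pod = 1024
    per_chip_gb = pod_hbm_tb * 1000 / chips_per_pod      # ~281 GB of HBM per chip
    implied_chips = 2764 / (per_chip_gb / 1000)          # ~9,800 chips to reach 2.764 PB
    print(per_chip_gb, implied_chips)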
dupe https://news.ycombinator.com/item?id=47862497
They are different blog posts, written by different people at Google