https://nitter.net/karpathy/status/2018804068874064198
> GPT-2 (7 years ago): too dangerous to release.
With the benefit of hindsight, you can see that the charitable foundation to benefit mankind was a grift all along.
I'm not sure I agree. LLMs have the feel of an alien new technology, and especially did back then. In retrospect, it feels very obvious that small models don't pose much of a threat, but that's only in retrospect.
Nobody else should be allowed to build these ... while we build another model that is 10x more capable than the one that was a threat to humanity.
Sign up today to use it, just $10 a month.
This is a fun, new speedrunning genre.
Never used GPU spot instances before, but I would have to imagine getting interrupted is pretty annoying.
As long as your workload can handle resuming and your instances aren't in heavy demand (look at the eviction rates), the cost savings are substantial enough for us to accept the occasional interruption.
I do wish Azure gave more than its 30-second eviction warning (AWS gives two minutes), but it's still usable.
It depends. Our workloads can finish up in under two minutes and shut down without much effort, so we haven't really noticed it outside of one time when we had no spot capacity.
I guess if checkpointing is set up correctly and your runtime is saved to a Docker image, it's feasible. I would assume you're probably not going to get a 3-hour continuous chunk of time, though.
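For anyone wondering what "checkpointing set up correctly" might look like, here is a minimal sketch, not anyone's actual training setup. The names `save_checkpoint`, `load_checkpoint`, `run`, and the `evicted` callback are all hypothetical, and the "work" is a stand-in sum. The one real idea is writing the checkpoint to a temp file and renaming it, so an eviction in the middle of a write can't leave a half-written file behind.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to a temp file, then rename it over the old checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # atomic replacement of the previous checkpoint


def load_checkpoint(path):
    """Return the saved state, or a fresh one if no checkpoint exists yet."""
    if not os.path.exists(path):
        return {"step": 0, "total": 0}
    with open(path) as f:
        return json.load(f)


def run(path, n_steps, evicted=lambda: False):
    """Resume from the last checkpoint and work until done or evicted.

    `evicted` is a hypothetical callback that returns True once the
    provider has signalled that the instance is about to go away.
    """
    state = load_checkpoint(path)
    while state["step"] < n_steps:
        if evicted():
            return state  # everything up to the last finished step is on disk
        state["total"] += state["step"]  # stand-in for one unit of real work
        state["step"] += 1
        save_checkpoint(path, state)
    return state
```

A replacement instance just calls `run` again with the same path (on storage that survives the eviction) and picks up at the saved step. Checkpointing every step is only sensible when steps are expensive relative to the write.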
The one time I used Spot, it wasn't that bad. You're likely to have an instance for 3 hours.
In addition to the two-minute interruption notice, rebalance recommendations[0] allow you to handle interruptions even more gracefully.
[0] https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/rebalanc...
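A rough sketch of how a worker might watch for both signals from inside the instance. Assumptions, so please check them against the AWS docs before relying on this: the metadata paths `spot/instance-action` and `events/recommendations/rebalance` and the IMDSv2 token headers are as I understand them from the EC2 documentation, both paths return 404 until a signal exists, and the function names `imds_get` and `check_spot_signals` are made up for illustration.

```python
import json
import urllib.error
import urllib.request

IMDS = "http://169.254.169.254/latest"


def imds_get(path, timeout=1.0):
    """Fetch a metadata path using an IMDSv2 token; return None on 404."""
    token_req = urllib.request.Request(
        IMDS + "/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    with urllib.request.urlopen(token_req, timeout=timeout) as resp:
        token = resp.read().decode()
    req = urllib.request.Request(
        IMDS + "/meta-data/" + path,
        headers={"X-aws-ec2-metadata-token": token},
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.read().decode()
    except urllib.error.HTTPError as err:
        if err.code == 404:  # assumed to mean "no signal yet"
            return None
        raise


def check_spot_signals(fetch=imds_get):
    """Return ("interrupt" | "rebalance" | "ok", parsed payload or None)."""
    action = fetch("spot/instance-action")
    if action is not None:
        return ("interrupt", json.loads(action))  # the two-minute notice
    rebalance = fetch("events/recommendations/rebalance")
    if rebalance is not None:
        return ("rebalance", json.loads(rebalance))  # earlier, softer hint
    return ("ok", None)
```

Polling this every few seconds and checkpointing on "rebalance" rather than waiting for "interrupt" is one way to buy extra time. The `fetch` parameter is only there so the logic can be exercised off-instance with a fake.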