Most "sparse training" in PyTorch today isn't actually sparse. A binary mask gets multiplied into a dense weight matrix, which means the zeros still consume memory, still move through the cache, and still get multiplied. That's pruning simulation, not sparse computation. SparseLab does the other thing: real sparse storage (a custom Padded-CSR layout), custom NEON kernels, in an nn.Linear-compatible layer.
The premise. It's been known since the Lottery Ticket Hypothesis (https://arxiv.org/abs/1803.03635) and RigL (https://arxiv.org/abs/1911.11134) that most models train competitively with ~10% of their parameters if those parameters are the right ones, chosen dynamically during training. Every year since, researchers have reproduced this in masked-dense simulation, then hit a wall when they want the actual memory savings. PyTorch's torch.sparse_csr isn't designed for training — the backward pass is unimplemented for most ops, and the ones that exist force dense intermediates, which defeats the point. The alternative has been to write your own CSR + SIMD kernels, a six-month detour from whatever you were actually trying to study. SparseLab is that detour, packaged.
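The storage gap described above can be sketched with plain byte accounting. This is a simplification: plain CSR, not the repo's Padded-CSR layout (padding adds some overhead), and the 1536x384 shape is just borrowed from the API example further down.

```python
# Storage accounting: masked-dense "sparse" vs a real CSR layout,
# for a 1536x384 float32 weight matrix at 90% sparsity.
rows, cols, sparsity, fp32 = 1536, 384, 0.90, 4

# Masked-dense simulation: dense weights PLUS a mask -> no savings at all.
dense_bytes = rows * cols * fp32
masked_dense_bytes = dense_bytes + rows * cols * 1  # +1 byte/entry bool mask

# Plain CSR: nnz values + nnz int32 column indices + (rows+1) int32 row pointers.
nnz = int(rows * cols * (1 - sparsity))
csr_bytes = nnz * fp32 + nnz * 4 + (rows + 1) * 4

print(round(masked_dense_bytes / dense_bytes, 2))  # 1.25 -- worse than dense
print(round(csr_bytes / dense_bytes, 2))           # 0.2  -- the actual savings
```

The index arrays are why 90% sparsity buys roughly 80% memory reduction rather than 90%: every live value drags an int32 column index along with it.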
Reproduced numbers (M3 Pro, all in the repo's docs/demos/):
- MLP on MNIST at 90% sparsity (10% of params live): 97.45% vs 98.06% dense — 0.61pp gap, 82% memory reduction. Sparse needed 1.8x more epochs to converge.
- 10M-param transformer on Tiny Shakespeare, 70% sparse attention + 90% sparse FFN, 10k steps: inference memory 15.3 MB vs 41.0 MB (37% of dense), 0.055 nats validation loss gap.
- Scaling check at 40M params, 1000 steps (same architecture family, 4x larger): inference memory 55.8 MB vs 150.7 MB dense — exactly 37% of dense again. The ratio held across the scale-up. Per-step slowdown narrowed from 4.6x to 4.1x as kernel time started dominating Python overhead.
- The honest caveat: on CPU we are still 4.1-4.6x slower per step than dense torch.matmul. The dW (weight-gradient) kernel accounts for most of each step's time and is unvectorized in v0.1. Memory is the win, not speed.
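A back-of-envelope consistency check on the recurring 37% figure. The parameter split below is an assumption (a standard GPT-style layout with roughly 1/3 of weights in attention and 2/3 in FFN, embeddings ignored), as is the one-int32-index-per-fp32-value storage cost; neither number comes from the repo.

```python
# Why mixed 70%/90% sparsity plausibly lands near 37% of dense memory.
attn_density, ffn_density = 0.30, 0.10  # 70% sparse attention, 90% sparse FFN
weight_density = (1/3) * attn_density + (2/3) * ffn_density
bytes_ratio = weight_density * 2        # value + int32 index per live weight

print(round(weight_density, 3))  # 0.167 of weights live
print(round(bytes_ratio, 3))     # 0.333 of dense bytes, before row pointers
```

Row pointers, Padded-CSR padding, and the layers left dense (embeddings, layernorms) would plausibly account for the remaining few points up to 37%, which is also why the ratio holds flat across the 10M-to-40M scale-up.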
Why CPU-first is the angle. A DGX H100 has 640 GB of GPU HBM across 8 cards and costs $200-400K up front. Ten Hetzner AX102 nodes at ~€104/month each give you 1.28 TB of DDR5 — 2x the trainable memory at a fraction of the capital cost, paid monthly. For independent researchers training in the 100M-1B param range, RAM is the binding constraint, not FLOPs. Real sparse storage turns "doesn't fit in HBM" into "fits in DDR5, trains slow, but trains." DDP for wallclock recovery is on the v0.2 roadmap.
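The capacity arithmetic behind that claim, using the figures quoted above (list prices and per-node RAM are taken from the text, not independently benchmarked):

```python
# Trainable-memory comparison: one DGX H100 vs a small CPU cluster.
dgx_hbm_gb = 640                   # 8x 80 GB HBM cards
nodes, ram_per_node_gb = 10, 128   # ten AX102-class boxes, 1.28 TB total
cluster_ram_gb = nodes * ram_per_node_gb

print(cluster_ram_gb)              # 1280
print(cluster_ram_gb / dgx_hbm_gb) # 2.0x the trainable memory
```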
API. Install with "pip install sparselab" (wheels for macOS arm64, Linux x86_64, Linux aarch64). One-line swap from nn.Linear:
import sparselab
layer = sparselab.SparseLinear(1536, 384, sparsity=0.9)
algo = sparselab.RigL(sparsity=0.9, drop_fraction=0.3, update_freq=100)
layer.apply(algo) # mutates topology during training
Help wanted. The aim is for SparseLab to become solid scaffolding for sparse-from-scratch work. Four places a contributor can own something real:
1. A new DST algorithm as a PR — Sparse Momentum, Top-KAST, GraNet. SparsityAlgorithm is ~100 lines; a new algorithm is another ~100.
2. CPU perf — dW kernel NEON/AVX-512 vectorization + parallel scheduling is the highest-leverage contribution. The 40M scaling numbers quantify exactly why.
3. CUDA port of SpMM + rewrite kernels. v0.1 is CPU-only; the layout is GPU-friendly and a CUDA port is the third contributor track.
4. Push the scaling further. We validated the memory ratio at 40M. The 100M+ regime is open territory — if you have CPU cluster time, a GPT-2 small scale-up with a real convergence budget would be the first independent reproduction above author hardware.
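For anyone eyeing track 1: the core of a RigL-style update is genuinely small. A minimal sketch of one drop/grow step in plain Python — not SparseLab's implementation (which lives behind the SparsityAlgorithm interface), just the mechanic from the RigL paper, with `rigl_step` and its flat-list representation being illustrative inventions:

```python
# One RigL-style topology update: drop the lowest-magnitude live weights,
# regrow the same number of connections where the dense gradient is largest.
def rigl_step(weights, grads, live, drop_fraction):
    """weights/grads: flat lists of floats; live: set of live indices."""
    n_drop = int(len(live) * drop_fraction)
    # Drop: live connections with the smallest |weight|.
    dropped = sorted(live, key=lambda i: abs(weights[i]))[:n_drop]
    survivors = live - set(dropped)
    # Grow: currently-dead positions with the largest |gradient|.
    dead = [i for i in range(len(weights)) if i not in survivors]
    grown = sorted(dead, key=lambda i: -abs(grads[i]))[:n_drop]
    for i in grown:
        weights[i] = 0.0  # regrown weights start at zero, per the paper
    return survivors | set(grown)

w = [0.5, 0.01, 0.0, 0.0, -0.9, 0.0]
g = [0.1, 0.2, 0.8, 0.05, 0.3, 0.9]
live = rigl_step(w, g, live={0, 1, 4}, drop_fraction=0.34)
print(sorted(live))  # [0, 4, 5] -- weakest live weight (1) swaps for index 5
```

The hard part a contributor would actually own is not this logic but wiring it into the Padded-CSR layout so the rewrite doesn't force a dense round-trip.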
SparsityAlgorithm is modeled on Cerebras's SparsityAlgorithm API (https://training-api.cerebras.ai/en/latest/wsc/tutorials/spa...) and credited in the docstrings. v0.1 ships Static, SET, and RigL.