Built for sweeping strategies, sanity-checking cluster budgets, and building intuition for parallelism tradeoffs. It is not a substitute for profiling production workloads. Simulated MFU is calibrated against published runs from Meta, DeepSeek, and NVIDIA and lands within 1-2 percentage points of the reported figures:
- LLaMA 3.1 405B (16K H100): 41.1% sim vs ~40% published
- DeepSeek V3 (2048 H800): 44.7% sim vs 43.7% published
- Nemotron-4 340B (6144 H100): 41.2% sim vs 41-42% published
Important caveat: the model captures the physics (compute, memory bandwidth, communication) but not runtime optimisations or fused kernels, so its inference estimates are less reliable than its training estimates.
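For anyone who wants to check an MFU figure by hand, here is a minimal sketch of the usual back-of-the-envelope definition. It is not the simulator's code. It assumes the common approximation of about 6 FLOPs per parameter per training token (attention FLOPs ignored), and the function name, the example inputs, and the 989 TFLOPS H100 dense BF16 peak are my own illustrative assumptions.

```python
def model_flops_utilization(params, tokens_per_sec, num_gpus, peak_flops_per_gpu):
    """Approximate training MFU as a fraction in [0, 1].

    Assumes ~6 FLOPs per parameter per token (forward + backward),
    ignoring attention FLOPs. For MoE models, pass the *active*
    parameter count per token, not the total.
    """
    if num_gpus <= 0 or peak_flops_per_gpu <= 0:
        raise ValueError("num_gpus and peak_flops_per_gpu must be positive")
    achieved_flops_per_sec = 6.0 * params * tokens_per_sec
    cluster_peak_flops_per_sec = num_gpus * peak_flops_per_gpu
    return achieved_flops_per_sec / cluster_peak_flops_per_sec


# Assumed peak: H100 SXM dense BF16, roughly 989 TFLOPS per GPU.
H100_BF16_DENSE_PEAK = 989e12

if __name__ == "__main__":
    # Hypothetical inputs for illustration only, not a published run.
    mfu = model_flops_utilization(
        params=70e9,
        tokens_per_sec=1.0e6,
        num_gpus=1024,
        peak_flops_per_gpu=H100_BF16_DENSE_PEAK,
    )
    print(f"MFU: {mfu:.1%}")
```

Plugging in the throughput and GPU count from a published run gives a number you can compare with the simulated one. Small differences are expected because papers vary in how they count attention and recomputation FLOPs.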
Configs to try:
- LLaMA-3.1 (405B) on 16,384x NVIDIA H100 — https://simulator.zhebrak.io/?preset=llama3-405b
- Qwen3 MoE (235B) on 4x NVIDIA H200 SXM — https://simulator.zhebrak.io/?preset=qwen3-235b-inference
GitHub with benchmarks and examples: https://github.com/zhebrak/llm-cluster-simulator
If you have published training runs with MFU or throughput numbers, I'd love to hear from you to expand calibration.