3 points | by freakynit 6 hours ago
2 comments
Also 3090. Using Q4_XL with a reduced max context size and a 100k-token prompt, I get 2,520 tok/s for prompt processing and 68 tok/s for generation:
llama-server \
  --model /mnt/ubuntu/models/llama-cpp-qwen/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf \
  --ctx-size 150000 \
  --n-gpu-layers 99 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --parallel 3 \
  --kv-unified \
  --ctx-checkpoints 32 \
  --checkpoint-every-n-tokens 8192 \
  --checkpoint-min-tokens 64 \
  --flash-attn on \
  --batch-size 4096 \
  --ubatch-size 1024 \
  --reasoning on \
  --temp 0.6 \
  --top-p 0.95 \
  --top-k 20
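Back-of-the-envelope, those throughput numbers translate to end-to-end latency like this (the 1,000-token response length is my assumption, not from the comment):

```python
# Rough latency math from the throughput figures quoted above.
PROMPT_TOKENS = 100_000   # prompt length stated in the comment
PREFILL_TPS = 2_520       # prompt-processing speed, tokens/s
GEN_TPS = 68              # generation speed, tokens/s
GEN_TOKENS = 1_000        # hypothetical response length (assumption)

prefill_s = PROMPT_TOKENS / PREFILL_TPS  # time to first token
gen_s = GEN_TOKENS / GEN_TPS             # time to stream the reply
total_s = prefill_s + gen_s

print(f"prefill ~{prefill_s:.1f}s, generation ~{gen_s:.1f}s, total ~{total_s:.1f}s")
```

So roughly 40 seconds of prefill dominates; this is why checkpointing the KV cache across turns matters at this context length.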
--checkpoint-min-tokens is a local patch I have so that small background tasks don't wreck my checkpoint cache.
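A minimal sketch of what such a gate could look like. This is purely illustrative: the function name, parameters, and structure are my guesses at the patch's intent, not llama.cpp's actual internals.

```python
# Hypothetical gating logic for KV-cache checkpointing: only checkpoint
# requests that have processed at least `min_tokens`, so short background
# tasks don't evict checkpoints belonging to long-context sessions.
# Names and defaults mirror the flags in the command above, but the
# implementation is an assumption, not the author's real patch.
def should_checkpoint(n_tokens: int,
                      tokens_since_last: int,
                      every_n: int = 8192,
                      min_tokens: int = 64) -> bool:
    if n_tokens < min_tokens:
        # Request is too small to be worth a checkpoint slot.
        return False
    # Otherwise follow the normal --checkpoint-every-n-tokens cadence.
    return tokens_since_last >= every_n
```

With --ctx-checkpoints capped at 32 slots, skipping tiny requests keeps the slots free for the 100k-token sessions that actually benefit from them.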
Update: spot terminated
I was wondering if turboquant is worth the effort right now, but I'm not yet seeing it speed-wise.