Why DeepSeek is cheap at scale but expensive to run locally
Why is DeepSeek-V3 supposedly fast and cheap to serve at scale, but too slow and expensive to run locally? Why are some AI models slow to respond but fast once…
Hasnain says:
“I’ll confess I struggle to see why this shouldn’t be possible in theory. As far as I can tell the practical barrier is how the attention step is batched: if you want to batch up attention GEMMs, they need to all be the same shape (i.e. the same number of prior tokens in the sequence). So you have to run groups of the same shape at the same time, instead of being able to just maintain a single queue. There’s at least some public research on this front, but I wouldn’t be surprised if there were more clever tricks for doing this that I haven’t seen.”
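To make the shape constraint in the quote concrete, here is a minimal NumPy sketch. Everything in it is illustrative rather than any real server's implementation: the dimensions, the `attend_batch` helper, and the bucket-by-length scheduler are all assumptions. The point it demonstrates is the one Hasnain raises: decode-step attention for several requests can be fused into single batched GEMMs only if every sequence in the batch has the same number of prior tokens, so a scheduler must group requests by KV length instead of draining one FIFO queue.

```python
# Illustrative sketch: batching decode-step attention requires
# same-shape KV caches, so requests are bucketed by prior-token count.
# All shapes and helpers here are hypothetical, not a real server's API.
import numpy as np
from collections import defaultdict

D = 64  # head dimension (illustrative)

def attend_batch(q, k, v):
    """One batched decode step: q is (B, D); k and v are (B, T, D).
    Each einsum below is a single batched GEMM, which only works
    because every sequence in the batch shares the same T."""
    scores = np.einsum("bd,btd->bt", q, k) / np.sqrt(D)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return np.einsum("bt,btd->bd", weights, v)

# Requests arrive with different KV-cache lengths (prior tokens),
# so they cannot all be stacked into one set of GEMMs directly.
requests = [
    (np.random.randn(D), np.random.randn(t, D), np.random.randn(t, D))
    for t in (5, 9, 5, 9, 9)
]

# Bucket by KV length; each bucket batches into same-shape GEMMs.
buckets = defaultdict(list)
for q, k, v in requests:
    buckets[k.shape[0]].append((q, k, v))

for t, group in buckets.items():
    q = np.stack([g[0] for g in group])
    k = np.stack([g[1] for g in group])
    v = np.stack([g[2] for g in group])
    out = attend_batch(q, k, v)  # one batched call per shape bucket
    print(f"KV length {t}: batched {out.shape[0]} requests")
```

Ragged-batching attention kernels, such as vLLM's PagedAttention, relax this constraint by letting each sequence in a batch carry its own KV length, which is presumably the kind of public research the comment alludes to.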
Posted on 2025-06-01T20:05:08+0000