Why DeepSeek is cheap at scale but expensive to run locally
Why is DeepSeek-V3 supposedly fast and cheap to serve at scale, but too slow and expensive to run locally? Why are some AI models slow to respond but fast once…
Hasnain says:
“I’ll confess I struggle to see why this shouldn’t be possible in theory. As far as I can tell the practical barrier is how the attention step is batched: if you want to batch up attention GEMMs, they need to all be the same shape (i.e. the same number of prior tokens in the sequence). So you have to run groups of the same shape at the same time, instead of being able to just maintain a single queue. There’s at least some public research on this front, but I wouldn’t be surprised if there were more clever tricks for doing this that I haven’t seen.”
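To make the shape constraint in the quote concrete, here is a minimal NumPy sketch. Everything in it is illustrative rather than any real server's implementation: the dimensions, the `attend_batch` helper, and the bucket-by-length scheduler are all assumptions. The point it demonstrates is the one Hasnain raises: decode-step attention for several requests can be fused into single batched GEMMs only if every sequence in the batch has the same number of prior tokens, so a scheduler must group requests by KV length instead of draining one FIFO queue.

```python
# Illustrative sketch: batching decode-step attention requires
# same-shape KV caches, so requests are bucketed by prior-token count.
# All shapes and helpers here are hypothetical, not a real server's API.
import numpy as np
from collections import defaultdict

D = 64  # head dimension (illustrative)

def attend_batch(q, k, v):
    """One batched decode step: q is (B, D); k and v are (B, T, D).
    Each einsum below is a single batched GEMM, which only works
    because every sequence in the batch shares the same T."""
    scores = np.einsum("bd,btd->bt", q, k) / np.sqrt(D)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return np.einsum("bt,btd->bd", weights, v)

# Requests arrive with different KV-cache lengths (prior tokens),
# so they cannot all be stacked into one set of GEMMs directly.
requests = [
    (np.random.randn(D), np.random.randn(t, D), np.random.randn(t, D))
    for t in (5, 9, 5, 9, 9)
]

# Bucket by KV length; each bucket batches into same-shape GEMMs.
buckets = defaultdict(list)
for q, k, v in requests:
    buckets[k.shape[0]].append((q, k, v))

for t, group in buckets.items():
    q = np.stack([g[0] for g in group])
    k = np.stack([g[1] for g in group])
    v = np.stack([g[2] for g in group])
    out = attend_batch(q, k, v)  # one batched call per shape bucket
    print(f"KV length {t}: batched {out.shape[0]} requests")
```

Ragged-batching attention kernels, such as vLLM's PagedAttention, relax this constraint by letting each sequence in a batch carry its own KV length, which is presumably the kind of public research the comment alludes to.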
Posted on 2025-06-01T20:05:08+0000