Hasnain says:

“From a philosophical point of view, we think a frame shift is in order. A “register” certainly shouldn’t be a 32-bit word like on the CPUs of old. And a 1024-bit wide vector register, as CUDA uses, is certainly a step in the right direction. But to us a “register” is a 16x16 tile of data. We think AI wants this -- after all this time, it’s still just matrix multiplies, reductions, and reshapes. And we think the hardware wants this, too -- small matrix multiplies are just begging for hardware support beyond just the systolic mma.

In fact, more broadly we believe we should really reorient our ideas of AI around what maps well onto the hardware. How big should a recurrent state be? As big can fit onto an SM. How dense should the compute be? No less so than what the hardware demands. An important future direction of this work for us is to use our learnings about the hardware to help us design the AI to match.”

Posted on 2024-05-13T00:52:26+0000