
Making Deep Learning Go Brrrr From First Principles

So, you want to improve the performance of your deep learning model. How might you approach such a task? Often, folks fall back on a grab-bag of tricks that might have worked for them before or that they saw in a tweet. "Use in-place operations! Set gradients to None! Install PyTorch 1.10.0 but not 1.10.1!"

Click to view the original at horace.io

Hasnain says:

Great read that goes into a lot of systems and optimization knowledge. And a worthwhile reminder of just how fast hardware is these days.

“For simple operators, it's feasible to reason about your memory bandwidth directly. For example, an A100 has 1.5 terabytes/second of global memory bandwidth, and can perform 19.5 teraflops/second of compute. So, if you're using 32 bit floats (i.e. 4 bytes), you can load in 400 billion numbers in the same time that the GPU can perform 20 trillion operations. Moreover, to perform a simple unary operator (like multiplying a tensor by 2), we actually need to write the tensor back to global memory.”
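A quick back-of-the-envelope sketch of that arithmetic, using only the A100 figures stated in the quote (~1.5 TB/s of global memory bandwidth, ~19.5 TFLOPS of FP32 compute); the exact ratio is approximate:

```python
# Rough arithmetic from the quoted passage (A100 figures as stated there).
BANDWIDTH_BYTES_PER_S = 1.5e12   # global memory bandwidth, bytes/s
COMPUTE_FLOPS_PER_S = 19.5e12    # FP32 compute throughput, flops/s
BYTES_PER_FLOAT32 = 4

# How many fp32 numbers can be read from global memory per second?
numbers_loaded_per_s = BANDWIDTH_BYTES_PER_S / BYTES_PER_FLOAT32  # ~3.75e11, i.e. ~400 billion

# A unary op like `x * 2` reads each element once and writes it back once,
# so each element costs ~8 bytes of memory traffic for a single flop.
unary_elems_per_s = BANDWIDTH_BYTES_PER_S / (2 * BYTES_PER_FLOAT32)

print(f"fp32 numbers loadable per second: {numbers_loaded_per_s:.2e}")
print(f"flops available per second:       {COMPUTE_FLOPS_PER_S:.2e}")
print(f"flops per loaded number:          {COMPUTE_FLOPS_PER_S / numbers_loaded_per_s:.0f}")
print(f"elementwise-mul ceiling:          {unary_elems_per_s:.2e} elements/s (memory-bound)")
```

The punchline is the ratio: the GPU can do roughly 50 FP32 operations in the time it takes to load a single number, so a simple elementwise op is limited by memory traffic, not compute.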

Posted on 2022-03-17T07:23:29+0000