How is LLaMa.cpp possible?
Recently, a project rewrote the LLaMa inference code in raw C++. With some optimizations and by quantizing the weights, this allows running an LLM locally on a wide variety of hardware.
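To give a concrete feel for what "quantizing the weights" means, here is a minimal sketch of blockwise 4-bit absmax quantization, the general idea behind llama.cpp's quantized formats. The block size of 32, the struct layout, and the nibble packing are illustrative assumptions, not the project's actual on-disk encoding.

```cpp
// Sketch: blockwise 4-bit "absmax" quantization of fp32 weights.
// Block size 32 and the packing scheme are illustrative assumptions,
// not llama.cpp's actual GGML/GGUF format.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <vector>

struct BlockQ4 {
    float scale;         // per-block scale (absmax / 7)
    uint8_t packed[16];  // 32 signed 4-bit values, two per byte
};

std::vector<BlockQ4> quantize_q4(const std::vector<float>& w) {
    std::vector<BlockQ4> out(w.size() / 32);
    for (size_t b = 0; b < out.size(); ++b) {
        const float* x = &w[b * 32];
        float amax = 0.0f;
        for (int i = 0; i < 32; ++i) amax = std::max(amax, std::fabs(x[i]));
        float scale = amax / 7.0f;  // map [-amax, amax] onto integer range [-7, 7]
        out[b].scale = scale;
        auto q = [&](float v) -> uint8_t {
            int qi = scale != 0.0f ? (int)std::lround(v / scale) : 0;
            if (qi < -7) qi = -7;
            if (qi >  7) qi =  7;
            return (uint8_t)(qi + 8);  // store offset into [1, 15]
        };
        for (int i = 0; i < 16; ++i)
            out[b].packed[i] = q(x[2 * i]) | (uint8_t)(q(x[2 * i + 1]) << 4);
    }
    return out;
}

float dequantize_one(const BlockQ4& blk, int i) {
    uint8_t byte = blk.packed[i / 2];
    int qi = ((i % 2) ? (byte >> 4) : (byte & 0x0F)) - 8;
    return qi * blk.scale;
}

int main() {
    std::vector<float> w(32);
    for (int i = 0; i < 32; ++i) w[i] = 0.01f * (i - 16);  // toy weights
    auto q = quantize_q4(w);
    printf("w[5]=%f  roundtrip=%f\n", w[5], dequantize_one(q[0], 5));
    // 32 floats (128 bytes) became 20 bytes: roughly 6x smaller than fp32.
}
```

For a full model, the same idea takes 16-bit weights (2 bytes per parameter) down to roughly half a byte per parameter plus a small per-block overhead, which is what makes the model fit in ordinary laptop RAM.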
Hasnain says:
“Memory bandwidth is the limiting factor in almost everything to do with sampling from transformers. Anything that reduces the memory requirements for these models makes them much easier to serve—like quantization! This is yet another reason why distillation, or just training smaller models for longer, is really important”
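To make that bandwidth argument concrete, here is a back-of-envelope sketch: generating each token requires streaming roughly all of the weights through memory once, so tokens per second is about memory bandwidth divided by the model's size in bytes. The 7B parameter count and the 50 GB/s bandwidth below are illustrative assumptions, not measurements.

```cpp
// Back-of-envelope: tokens/sec ~= memory bandwidth / bytes read per token,
// since each generated token streams (roughly) every weight through memory.
// The 7B parameter count and 50 GB/s bandwidth are illustrative assumptions.
#include <cstdio>

int main() {
    const double params        = 7e9;   // e.g. a 7B-parameter model
    const double bandwidth_gbs = 50.0;  // assumed CPU memory bandwidth, GB/s

    const double bytes_fp16 = params * 2.0;  // 16-bit weights
    const double bytes_q4   = params * 0.5;  // ~4-bit quantized weights

    printf("fp16:  %.1f GB -> ~%.1f tokens/s\n",
           bytes_fp16 / 1e9, bandwidth_gbs * 1e9 / bytes_fp16);
    printf("4-bit: %.1f GB -> ~%.1f tokens/s\n",
           bytes_q4 / 1e9, bandwidth_gbs * 1e9 / bytes_q4);
}
```

Under those assumptions, quantization cuts the bytes read per token by about 4x and so raises the tokens-per-second ceiling by the same factor on the same hardware, which is the point of the quote above.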
Posted on 2023-08-16T03:56:31+0000