I-BERT: Integer-only BERT Quantization
Kim et al., 2021
Read by me: June 5th, 2023
Here is the original paper.
Summary
Most “integer-only” quantization schemes for NNs still end up doing the non-linear operations in FP
By using clever polynomial approximations of GELU, Softmax, and square root, they were able to implement BERT inference entirely in integer arithmetic (rough sketch of the idea below)
Because integer-only operations are so much simpler for the hardware, I-BERT ran faster than the FP BERT baseline
Not only did they see a reduction in latency, they also saw a small increase in accuracy
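As I understand it, the core trick is to fold each second-order polynomial a*(x + b)^2 + c into the quantization scale factor so that only integer arithmetic happens at inference time, and to replace the square root inside LayerNorm with an integer Newton iteration. Below is a minimal sketch of those two kernels; the function names and structure are my own reconstruction, not the authors' code, and the real implementation also has to manage bit widths and overflow.

```python
import math

def i_poly(q: int, S: float, a: float, b: float, c: float):
    """Integer-only evaluation of a*(x + b)^2 + c for a quantized input x = q * S.

    The FP constants are folded into integer offsets and into the output scale,
    so the work done per value at inference time is just integer add/multiply/add.
    """
    q_b = math.floor(b / S)            # precomputed offline, once per layer
    q_c = math.floor(c / (a * S * S))  # precomputed offline, once per layer
    S_out = a * S * S                  # output scale factor, carried alongside q_out
    q_out = (q + q_b) ** 2 + q_c       # pure integer arithmetic at runtime
    return q_out, S_out                # dequantized result is roughly q_out * S_out

def i_sqrt(n: int) -> int:
    """Floor of sqrt(n) via Newton's iteration using only integer operations
    (this is what replaces the square root in LayerNorm's standard deviation)."""
    if n < 2:
        return n
    x = n                              # any upper bound on sqrt(n) works as a start
    y = (x + n // x) // 2
    while y < x:                       # decreases monotonically to floor(sqrt(n))
        x = y
        y = (x + n // x) // 2
    return x
```

GELU and Softmax then reduce to calls like i_poly plus clipping and bit shifts; for Softmax the exponential is decomposed as exp(x) = 2^(-z) * exp(p) with p confined to a small interval, so the polynomial only needs to be accurate there.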
Thoughts
The polynomial approximations of the non-linear functions are less accurate than their FP counterparts (quick check at the bottom of these notes), yet the ablation studies showed that the model actually performed better with the integer-only versions. The paper does not explore why.
The fact that the integer approximations did not significantly reduce performance implies that the precision of floating-point arithmetic is not strictly necessary in a transformer model
This paper suggests that any hardware-conscious implementation of BERT (or an LLM in general) should not attempt a full FP implementation
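To get a feel for how rough the approximation of the non-linearities actually is, here is a quick comparison of the second-order erf polynomial against exact GELU. The coefficients a ≈ -0.2888 and b ≈ -1.769 are the ones I noted from the paper (quoted from memory, so treat them as approximate); the comparison script itself is mine, not the authors'.

```python
import math

# Second-order polynomial approximation of erf used by I-BERT (coefficients
# quoted from my notes on the paper; treat them as approximate):
#   i-erf(x) = sign(x) * (A * (clip(|x|, max=-B) + B)^2 + 1)
A, B = -0.2888, -1.769

def poly_erf(x: float) -> float:
    sgn = 1.0 if x >= 0 else -1.0
    x_clipped = min(abs(x), -B)
    return sgn * (A * (x_clipped + B) ** 2 + 1.0)

def poly_gelu(x: float) -> float:
    # GELU(x) = x/2 * (1 + erf(x / sqrt(2)))
    return 0.5 * x * (1.0 + poly_erf(x / math.sqrt(2.0)))

def exact_gelu(x: float) -> float:
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

# The point-wise error is visibly nonzero, yet end-to-end accuracy held up.
worst = max(abs(poly_gelu(x) - exact_gelu(x))
            for x in (i / 100.0 for i in range(-600, 601)))
print(f"max |poly_gelu - gelu| on [-6, 6]: {worst:.4f}")
```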