Linformer: Self-Attention with Linear Complexity
Paper available at https://arxiv.org/abs/2006.04768
Recap:
- Feed-forward networks route information from every node of one layer to every node of the next
- Convolutional networks route information only from neighbouring nodes (a smaller, local set)
- Transformers route information based on the information itself, using queries (Q) and keys (K)
A node's query expresses what kind of information it is looking for, while its key advertises what kind of information it contains. The network is still fully connected (like a feed-forward network), but how information is routed is determined by the inner products between Q and K. Multi-head attention splits the representation so that each self-attention head routes a different kind of information.
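A minimal numpy sketch of this routing (standard scaled dot-product attention with head splitting; shapes and names are illustrative, not taken from the paper's code):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, n_heads):
    """Route information between all n positions based on query-key inner products."""
    n, d = X.shape
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                    # (n, d) each
    d_h = d // n_heads
    # Split into heads: each head routes a different kind of information.
    Qh = Q.reshape(n, n_heads, d_h).transpose(1, 0, 2)  # (h, n, d_h)
    Kh = K.reshape(n, n_heads, d_h).transpose(1, 0, 2)
    Vh = V.reshape(n, n_heads, d_h).transpose(1, 0, 2)
    # Routing matrix P has shape (h, n, n) -- this is the O(n^2) part.
    P = softmax(Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_h))
    out = P @ Vh                                        # (h, n, d_h)
    return out.transpose(1, 0, 2).reshape(n, d)

# Toy usage
rng = np.random.default_rng(0)
n, d, h = 16, 32, 4
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (0.1 * rng.normal(size=(d, d)) for _ in range(3))
Y = multi_head_attention(X, Wq, Wk, Wv, n_heads=h)
print(Y.shape)  # (16, 32)
```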
Case
- The paper computes the spectrum (singular values) of the attention (softmax) matrices of a pretrained RoBERTa model to get a broad estimate of their effective rank.
- It plots the normalized cumulative singular values of these softmax matrices and shows that most of the spectrum is concentrated in the first few values, i.e. the matrices are approximately low-rank, so computing them at full rank is wasteful (see the sketch after this list).
- Splitting attention into multiple heads already pushes each head's attention matrix towards lower rank, since every head works in a smaller subspace.
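A hedged sketch of this kind of spectrum check (run here on a random softmax matrix as a stand-in; in the paper the matrices come from a pretrained RoBERTa):

```python
import numpy as np

def normalized_cumulative_spectrum(P):
    """Cumulative sum of the singular values of an attention matrix, normalized to 1.
    If the curve saturates after a few values, P is effectively low-rank."""
    s = np.linalg.svd(P, compute_uv=False)
    return np.cumsum(s) / s.sum()

# Stand-in for an attention (softmax) matrix taken from a pretrained model.
rng = np.random.default_rng(0)
logits = rng.normal(size=(128, 128))
P = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

curve = normalized_cumulative_spectrum(P)
# Fraction of the spectrum captured by the first 16 of 128 singular values.
print(curve[15])
```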
Theorem: Self-attention is low-rank.
P̃ is a low-rank approximation of the routing matrix P = softmax(QKᵀ/√d) (the softmax / context-mapping matrix). Routing with P̃ routes information in effectively the same way as routing with P (up to an error ε), but P̃ is much cheaper to compute. The proof uses the Johnson-Lindenstrauss lemma (1984) for projecting high-dimensional vectors into a lower-dimensional space while approximately preserving distances.
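Roughly, the theorem as stated in the paper says that for the context mapping matrix $P = \mathrm{softmax}\big(QW^Q (KW^K)^\top / \sqrt{d}\big)$ and any column $w$ of $VW^V$, there exists a matrix $\tilde{P}$ of rank $\Theta(\log n)$ such that

$$
\Pr\Big(\lVert \tilde{P}w - Pw \rVert < \epsilon \lVert Pw \rVert\Big) > 1 - o(1).
$$

In other words, with high probability a rank-$\Theta(\log n)$ routing matrix reproduces the effect of the full $n \times n$ attention matrix up to a relative error $\epsilon$.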
Model
- Project the key matrix K (and the value matrix V) from length n down to length k along the sequence dimension, multiply the projected K with Q, apply softmax, and use the resulting n × k matrix to route the projected V (see the sketch after this list).
- The idea is to go from O(n²) to O(n·k), with k a fixed projected dimension much smaller than n.
- The paper's inference benchmarks show that Linformer's speed and memory advantage over the standard Transformer grows with sequence length.
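A minimal single-head sketch of this in numpy, under the assumption of learned projection matrices E and F of shape (k, n) that compress keys and values along the sequence dimension (names and shapes are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def linformer_attention(Q, K, V, E, F):
    """Single-head Linformer-style attention.
    Q, K, V: (n, d);  E, F: (k, n) projections along the sequence axis.
    The routing matrix is (n, k) instead of (n, n), so the cost is O(n*k)."""
    d = Q.shape[1]
    K_proj = E @ K                              # (k, d): keys compressed from n to k positions
    V_proj = F @ V                              # (k, d): values compressed the same way
    P = softmax(Q @ K_proj.T / np.sqrt(d))      # (n, k) routing matrix
    return P @ V_proj                           # (n, d)

# Toy usage: n = 512 positions compressed down to k = 64
rng = np.random.default_rng(0)
n, k, d = 512, 64, 32
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
E, F = (rng.normal(size=(k, n)) / np.sqrt(n) for _ in range(2))
print(linformer_attention(Q, K, V, E, F).shape)  # (512, 32)
```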
Key takeaways
- Linformer is not a better model, but a faster model in certain cases.
- The linearization of compute time rests on the Johnson-Lindenstrauss lemma.
- Low-rank approximations of the attention matrix are cheaper to compute and store.