LLM-Attention kernel
Program's name: LLM-Attention kernel
Available version(s):
Programming language(s):
C
Programming model(s):
CUDA
Used in following discipline(s):
Artificial Intelligence / Machine Learning
This kernel characterizes one of the main phases of a Large Language Model
(LLM) application: the attention mechanism.
The LLM architecture consists of three main stages:
- Tokenization, which turns the input text into coordinates in an
n-dimensional embedding space.
- Transformer layers, which build up meaning and context.
- Unembedding, which turns the resulting “meaning” back into words.
In turn, each transformer layer contains two sub-phases:
- Multi-head attention, which runs the single-head attention mechanism
multiple times in parallel (see the formula after this list).
- The feed-forward layer, a multi-layer perceptron applied to each token
vector.
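For reference, multi-head attention composes the single-head mechanism as in the standard formulation of Vaswani et al. (“Attention Is All You Need”); the per-head projection matrices $W_i^Q$, $W_i^K$, $W_i^V$ and the output projection $W^O$ are shown here for completeness and are not named in this kernel's description:

$$
\mathrm{head}_i = \mathrm{Attention}(Q W_i^Q,\, K W_i^K,\, V W_i^V),
\qquad
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^O
$$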
This kernel focuses on the multi-head attention. It executes a sequence of
consecutive steps, sketched in C after the list:
- Compute the Query matrix, representing the token currently being processed.
- Compute the Key matrix, representing all the other tokens in the context.
- Compute the Value matrix, holding the content each token contributes to
the output.
- Multiply the Query matrix with the transposed Key matrix, add the mask
matrix, and compute the softmax; the result measures how much each key
relates to the query.
- Multiply the resulting matrix with the Value matrix.
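Taken together, these steps compute $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}(Q K^{T} + M)\, V$, where $M$ is the mask matrix. The following single-threaded C sketch illustrates the sequence; it is illustrative only, not the kernel's actual CUDA implementation, and the function name, the row-major layout, and the $1/\sqrt{d}$ scaling factor (standard in transformers but not mentioned above) are assumptions:

```c
#include <math.h>
#include <stdlib.h>

/* Single-head attention over n tokens of dimension d (row-major layout).
 * Q, K, V: [n x d]; mask: [n x n], additive; out: [n x d].
 * Error handling omitted for brevity. */
static void attention(const float *Q, const float *K, const float *V,
                      const float *mask, float *out, int n, int d)
{
    float *scores = malloc((size_t)n * n * sizeof *scores);
    const float scale = 1.0f / sqrtf((float)d); /* assumed scaling factor */

    for (int i = 0; i < n; i++) {
        /* Step: multiply Q with the transposed K, then add the mask matrix */
        for (int j = 0; j < n; j++) {
            float s = 0.0f;
            for (int k = 0; k < d; k++)
                s += Q[i * d + k] * K[j * d + k]; /* K accessed transposed */
            scores[i * n + j] = s * scale + mask[i * n + j];
        }
        /* Step: softmax over each row of the score matrix
         * (max subtracted for numerical stability) */
        float max = scores[i * n];
        for (int j = 1; j < n; j++)
            if (scores[i * n + j] > max)
                max = scores[i * n + j];
        float sum = 0.0f;
        for (int j = 0; j < n; j++) {
            scores[i * n + j] = expf(scores[i * n + j] - max);
            sum += scores[i * n + j];
        }
        for (int j = 0; j < n; j++)
            scores[i * n + j] /= sum;
        /* Step: multiply the resulting matrix with the Value matrix */
        for (int k = 0; k < d; k++) {
            float acc = 0.0f;
            for (int j = 0; j < n; j++)
                acc += scores[i * n + j] * V[j * d + k];
            out[i * d + k] = acc;
        }
    }
    free(scores);
}
```

Compile with -lm for the math library. In a CUDA version of this kernel, the two matrix products and the row-wise softmax are the natural parallelization targets, e.g. mapping each row (or tile) of the score matrix to a thread block.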