LLM-Attention kernel

Program's name: LLM-Attention kernel
Programming language(s): C
Programming model(s): CUDA

This kernel characterizes one of the main phases of a Large Language Model (LLM) application: the attention mechanism.

The LLM architecture is defined by three main stages:

  • Tokenization and embedding, which turn the input text into vectors in an n-dimensional space.
  • Transformer layers, which build up the meaning and context of each token.
  • Unembedding, which turns the resulting “meaning” back into words.

At the same time, each transformer layer contains two sub-phases:

  • Multi-head attention, which runs the single-head attention mechanism multiple times in parallel, one instance per head.
  • The feed-forward layer, a multi-layer perceptron that each token vector passes through.

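For reference, these two sub-phases follow the standard Transformer formulation; the symbols below are the usual ones from the literature (Q, K, V for queries, keys, and values; M for the mask; d_k for the per-head dimension), not identifiers taken from this kernel:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}} + M\right) V
```

```latex
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^{O},
\qquad \mathrm{head}_i = \mathrm{Attention}(Q W_i^{Q},\, K W_i^{K},\, V W_i^{V})
```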
This kernel focuses on the multi-head attention. It executes in a sequence of consecutive steps:

  1. Compute the Query matrix, representing the token currently being processed.
  2. Compute the Key matrix, representing all the other tokens in the context.
  3. Compute the Value matrix, holding the information each token contributes to the output.
  4. Transpose the Key matrix, multiply it with the Query matrix, add the mask matrix, and compute the softmax; the result measures how much each key relates to the query.
  5. Multiply the resulting attention weights with the Value matrix.