This kernel characterizes one of the main phases of a Large Language Model (LLM) application: the attention mechanism.
The LLM architecture is defined by three main stages: an embedding layer that maps input tokens to vectors, a stack of transformation (transformer) layers that refine those vectors, and an output layer that projects the final vectors back to vocabulary logits. Each transformation layer in turn contains two sub-phases: the attention mechanism and the feed-forward network.
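As a rough sketch of how these stages fit together (a toy decoder-only forward pass with hypothetical names; the residual connections are assumed, as in standard transformers):

```python
import numpy as np

def llm_forward(token_ids, embed, layers, unembed):
    # Stage 1: embedding -- look up a vector for each token id.
    x = embed[token_ids]
    # Stage 2: transformation -- a stack of layers; each layer has an
    # attention sub-phase and a feed-forward sub-phase (with residuals).
    for attention, feed_forward in layers:
        x = x + attention(x)
        x = x + feed_forward(x)
    # Stage 3: output -- project hidden states back to vocabulary logits.
    return x @ unembed

# Toy example: zero-valued sub-phases, 5-token vocabulary, 5-dim embeddings.
vocab, dim = 5, 5
embed = np.eye(vocab, dim)
layers = [(lambda x: np.zeros_like(x), lambda x: np.zeros_like(x))]
logits = llm_forward(np.array([0, 2, 4]), embed, layers, embed.T)
print(logits.shape)  # (3, 5)
```

The zero-valued sub-phases are placeholders; in a real model they would be multi-head attention and an MLP, which is what the rest of this description drills into.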
This kernel focuses on the multi-head attention, which operates on three matrices:
- Query matrix: the word we are currently processing.
- Key matrix: all the other words in the context.
- Value matrix: the measure of how much the keys relate to the query.

The computation then proceeds in consecutive steps: multiply the Key matrix with the Query matrix, add the mask matrix, compute the softmax, and finally multiply the result with the Value matrix.
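The steps above can be sketched as a minimal single-head attention in plain NumPy (illustrative names; the scaling by the square root of the head dimension is the standard scaled dot-product form and is an assumption here, as the text does not mention it):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the last axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V, mask):
    # Step 1: multiply Query with Key (scaled by sqrt of head dimension).
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # Step 2: add the mask matrix (-inf above the diagonal for causal
    # attention) and compute the softmax to get attention weights.
    weights = softmax(scores + mask)
    # Step 3: multiply the weights with the Value matrix.
    return weights @ V

# Example: 4 tokens, head dimension 8, causal mask.
rng = np.random.default_rng(0)
T, d = 4, 8
Q, K, V = (rng.standard_normal((T, d)) for _ in range(3))
mask = np.triu(np.full((T, T), -np.inf), k=1)
out = attention(Q, K, V, mask)
print(out.shape)  # (4, 8)
```

In multi-head attention this computation is repeated per head on sliced Q/K/V projections and the results are concatenated; the single-head version above is just the core kernel.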