Other essential components of a Transformer block.
While the self-attention mechanism is the most novel part of a Transformer layer, two other components are essential for its proper functioning: the position-wise feed-forward network and layer normalization.

After the self-attention sub-layer, each token's representation is passed through a position-wise feed-forward network (FFN). The FFN is a simple two-layer fully connected neural network (typically with a ReLU or GELU activation in between) that is applied independently at each position: the same weights are shared across all positions, but each Transformer layer has its own FFN. Its role is to process the attention-infused representation of each token, adding further non-linear transformations and allowing the model to learn more complex relationships. You can think of the self-attention layer as gathering and mixing information from across the sequence, and the FFN as processing that mixed information for each token individually.

Layer normalization is a technique used to stabilize the training of deep neural networks. In the original Transformer it is applied after each of the two main sub-layers (self-attention and FFN), while many modern variants apply it before them instead (the "pre-norm" arrangement). Layer normalization computes the mean and variance of a layer's inputs across the feature dimension (i.e., over each individual token's vector representation) and uses these statistics to normalize the inputs to zero mean and unit standard deviation, followed by a learned per-feature scale and shift. This keeps the inputs to each sub-layer consistently scaled, which helps prevent gradients from becoming too large or too small during backpropagation, leading to faster and more stable training.

Finally, each of the two sub-layers (attention and FFN) in a Transformer block is wrapped in a residual connection: the input to the sub-layer is added to its output, which helps mitigate the vanishing gradient problem in very deep networks.
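To make these pieces concrete, here is a minimal sketch of a Transformer encoder block in PyTorch, assuming the pre-norm arrangement (layer normalization before each sub-layer). The class names and the default sizes (d_model=512, d_ff=2048, 8 heads) are illustrative choices for this sketch, not values taken from the text above.

```python
# Minimal sketch of a Transformer encoder block (pre-norm variant).
# Names like PositionwiseFFN and TransformerBlock are illustrative, not from a specific library.
import torch
import torch.nn as nn


class PositionwiseFFN(nn.Module):
    """Two-layer fully connected network applied independently to each position."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),   # expand: d_model -> d_ff
            nn.GELU(),                  # non-linearity (ReLU is also common)
            nn.Linear(d_ff, d_model),   # project back: d_ff -> d_model
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); the same weights are used at every position.
        return self.net(x)


class TransformerBlock(nn.Module):
    """Self-attention and FFN sub-layers, each with layer norm and a residual connection."""

    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)   # normalizes over the feature dimension
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = PositionwiseFFN(d_model, d_ff)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Attention sub-layer: pre-norm, then residual connection (x + sublayer(x)).
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)
        x = x + attn_out
        # FFN sub-layer: pre-norm, then residual connection.
        x = x + self.ffn(self.norm2(x))
        return x


if __name__ == "__main__":
    block = TransformerBlock()
    tokens = torch.randn(2, 16, 512)   # (batch, seq_len, d_model)
    print(block(tokens).shape)         # torch.Size([2, 16, 512])
```

Note how the two residual connections appear directly as the `x + ...` additions, and how each sub-layer leaves the feature dimension unchanged so that this addition is always well defined. A post-norm block, as in the original Transformer, would instead apply each LayerNorm after the corresponding residual addition.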