Introducing non-linearity with Sigmoid, Tanh, and ReLU.
Activation functions are a critical component of any neural network. They are mathematical functions applied to the output of a neuron (or a layer of neurons) that determine whether, and how strongly, that neuron is 'activated'. Their primary purpose is to introduce non-linearity into the network. Without non-linear activation functions, a deep neural network, no matter how many layers it has, behaves exactly like a single-layer linear model: the composition of linear transformations is itself a linear transformation, so the network can only learn linear relationships, which severely limits its power (the first code sketch below demonstrates this). By introducing non-linearity, activation functions allow the network to learn far more complex and intricate patterns in the data.

Several activation functions are commonly used. The Sigmoid function was historically popular. It squashes any real-valued input into the range (0, 1), which makes it a natural choice for the output layer of a binary classification problem, where the output represents a probability. However, it suffers from the 'vanishing gradient' problem: for inputs far from zero its gradient is close to zero, which can slow down or stall the learning process in deep networks.

The Hyperbolic Tangent (Tanh) function is similar to Sigmoid but squashes values into the range (-1, 1). Its output is zero-centered, which can sometimes help with optimization, but it suffers from the same vanishing gradient problem.

The most widely used activation function in modern deep learning is the Rectified Linear Unit (ReLU). Its definition is very simple: it outputs the input directly if the input is positive and outputs zero otherwise, f(x) = max(0, x). ReLU is computationally very efficient and helps to mitigate the vanishing gradient problem, allowing faster and more effective training of deep networks. Variants such as Leaky ReLU and ELU have been developed to address some of ReLU's own shortcomings, most notably the 'dying ReLU' problem, in which a neuron that only ever receives negative inputs outputs zero permanently and stops learning.
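To make the collapse-to-linear claim concrete, here is a minimal NumPy sketch (NumPy, the fixed seed, and the specific layer sizes are illustrative assumptions, not part of the original text) showing that two stacked linear layers with no activation in between compute exactly the same function as a single linear layer.

```python
import numpy as np

# Two linear layers applied back-to-back with NO activation in between:
#   y = W2 @ (W1 @ x + b1) + b2
# algebraically reduces to a single linear layer:
#   y = (W2 @ W1) @ x + (W2 @ b1 + b2)
rng = np.random.default_rng(0)          # fixed seed, arbitrary choice
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)

two_layers = W2 @ (W1 @ x + b1) + b2
one_layer = (W2 @ W1) @ x + (W2 @ b1 + b2)
print(np.allclose(two_layers, one_layer))  # True: extra depth adds no power without non-linearity
```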
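The activation functions discussed above (plus the Leaky ReLU variant) are straightforward to write down. The following is a small NumPy sketch of their usual definitions; the sample input values are chosen only for illustration.

```python
import numpy as np

def sigmoid(x):
    # Squashes any real input into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Squashes any real input into the range (-1, 1); zero-centered.
    return np.tanh(x)

def relu(x):
    # f(x) = max(0, x): passes positive inputs through, zeros out the rest.
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # Like ReLU, but keeps a small slope (alpha) for negative inputs
    # so those neurons still receive a gradient.
    return np.where(x > 0, x, alpha * x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print("sigmoid:   ", sigmoid(x))       # all values in (0, 1)
print("tanh:      ", tanh(x))          # all values in (-1, 1)
print("relu:      ", relu(x))          # negatives become 0
print("leaky_relu:", leaky_relu(x))    # negatives scaled by alpha
```

In practice these functions come ready-made in deep learning frameworks such as PyTorch or TensorFlow rather than being hand-written, but the underlying definitions are the same.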