How models measure error and learn by minimizing it.
At the core of machine learning is the process of 'learning' from data, which is formalized through two concepts: loss functions and optimization. A loss function (or cost function) quantifies a model's error on a given set of data by measuring the difference between the model's predicted outputs and the true outputs. The choice of loss function depends on the task. For regression problems, a common choice is Mean Squared Error (MSE), the average of the squared differences between predicted and actual values; squaring the error penalizes larger mistakes more heavily. For classification problems, Cross-Entropy Loss is frequently used: it compares the model's predicted probability (a value between 0 and 1) against the true label, and it penalizes confident but wrong predictions especially heavily.

The goal of training is to find the set of model parameters (weights and biases) that minimizes the loss function, and this is where optimization algorithms come in. The most fundamental is Gradient Descent. The gradient is a vector that points in the direction of the steepest ascent of the loss function, so to minimize the loss we move in the opposite direction. Gradient Descent computes the gradient of the loss with respect to the model's parameters and then updates them by taking a small step against the gradient: new parameters = old parameters minus learning rate times gradient. Repeating this update gradually 'descends' the loss landscape until a minimum is reached. The learning rate hyperparameter controls the size of each step, and tuning it is crucial: too large a step can make training diverge or oscillate, while too small a step makes convergence painfully slow.
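To make these ideas concrete, here is a minimal sketch in plain NumPy, written for illustration rather than taken from any particular library: it fits a one-variable linear model to toy data using MSE as the loss and the gradient descent update described above. The names (mse_loss, gradients, learning_rate) and the toy data are assumptions chosen for this example.

```python
import numpy as np

# Toy regression data: y = 3x + 2 plus a little noise.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=100)
y = 3.0 * X + 2.0 + rng.normal(scale=0.1, size=100)

def mse_loss(w, b):
    """Mean Squared Error: average squared difference between
    the predictions (w*X + b) and the true values y."""
    predictions = w * X + b
    return np.mean((predictions - y) ** 2)

def gradients(w, b):
    """Gradient of the MSE loss with respect to w and b,
    derived analytically: dL/dw = 2*mean(error * X),
    dL/db = 2*mean(error), where error = prediction - y."""
    error = (w * X + b) - y
    grad_w = 2.0 * np.mean(error * X)
    grad_b = 2.0 * np.mean(error)
    return grad_w, grad_b

# Gradient descent: start from arbitrary parameters and repeatedly
# step in the negative gradient direction, scaled by the learning rate.
w, b = 0.0, 0.0
learning_rate = 0.1
for step in range(500):
    grad_w, grad_b = gradients(w, b)
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(f"learned w={w:.3f}, b={b:.3f}, loss={mse_loss(w, b):.5f}")
# The learned parameters should end up close to w=3 and b=2.
```

Note that only the loss and its gradient are task-specific: swapping in a different loss (for example, cross-entropy with a sigmoid output for binary classification) leaves the update loop itself unchanged, which is part of why gradient descent is so broadly applicable.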