Machine Learning Overview
The second half of the MLDS lab is machine learning research. Machine learning is a much bigger area of expertise, and there are many (arguably too many) good resources to help build expertise. In this part of the handbook, we'll discuss the big picture, some of the most important underlying math, and connect you with the right resources to learn more.
Machine Learning / Deep Learning / Artificial Intelligence
The relationship between machine learning, deep learning, and artificial intelligence is a lot like the relationship between squares and rectangles. Generally, artificial intelligence is the broadest umbrella term: it is any technology that can accomplish a complex task without human intervention. A particularly popular way of accomplishing these tasks is to use machine learning, which is a process through which computers iteratively improve their performance on some specific task. Deep learning is a subset of machine learning that specifically relies on deep neural networks to solve complex problems, as opposed to other numerical models. Deep learning is broadly considered the most powerful --- though occasionally mysterious --- form of machine learning.
Neural Networks
Neural Networks are just functions. Think about a neural network just as you would think about a line \(y=mx +b\). In some situations you may know the values of \(m\) and \(b\) a priori, but other times you don't. Machine learning is the process by which you determine the best values for \(m\) and \(b\) given some data. You'll start by randomly guessing values for these parameters (weights and biases), then you'll evaluate how far off your model is from the data (the cost or loss), and then you'll identify the direction in which you can shift your parameters to make that cost lower (gradient descent). Then you'll take a step in that direction, and start over.
This process is identical for fitting neural networks. The only difference is that you have a non-linear function \(y = f_{\Theta}(x) = \sigma(\theta_1 x + \theta_2)\), and often that function contains other non-linear functions, which themselves contain other non-linear functions \(y=f(g(h(x)))\). These models are really flexible; in fact, they are known as universal function approximators, meaning they can approximate essentially any function, no matter how complex.
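To make that concrete, here is a minimal sketch in NumPy (the variable names and parameter values are our own, purely illustrative) of a single non-linear function \(\sigma(\theta_1 x + \theta_2)\), and of three such functions nested together, mirroring \(y = f(g(h(x)))\).

```python
import numpy as np

def sigma(z):
    return np.tanh(z)                  # one common choice of non-linearity

def neuron(x, theta1, theta2):
    return sigma(theta1 * x + theta2)  # y = sigma(theta1*x + theta2)

x = 0.5
# Nest three of these functions together: y = f(g(h(x))).
y = neuron(neuron(neuron(x, 1.0, 0.1), 0.5, -0.2), 2.0, 0.0)
print(y)
```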
A Deeper Dive on Training Neural Networks
What are these neural networks?
Neural networks are far simpler than most people think; training one really is just line fitting.
Recap of Line Fitting
Take a linear model: $$ \hat{y} = f_{\Theta}(x) = \theta_1 x + \theta_2 $$ where \(\theta_1\) (m) and \(\theta_2\) (b) are elements in the parameter vector \(\Theta = [\theta_1, \theta_2]\). Assume you have some data \((x_i, y_i)\) where \(i \in [0, N]\), and you want to select good values for \(\theta_1\) and \(\theta_2\).
To do this, we need to define a measure of what is "good". Often we do this using the mean squared error (MSE), though it could be whatever we want (e.g. percent error, absolute error, KL divergence, etc.). Let's proceed with MSE, and we'll call it our loss: $$ \mathcal{L}(\Theta) = \frac{1}{N} \sum_{i=0}^N \left|f_{\Theta}(x_i) - y_i\right|^2 $$ This loss function captures how bad our current model is, given its current parameters \(\Theta\). We want to update our parameters to make our loss as small as possible. One way to accomplish this is to compute the gradient of our loss with respect to our parameters: $$ \nabla_{\Theta}\mathcal{L} = \frac{\partial \mathcal{L}}{\partial \Theta} = \frac{1}{N} \sum_{i=0}^N 2 \left(f_{\Theta}(x_i) - y_i\right) \frac{\partial f_{\Theta}(x_i)}{\partial \Theta} $$ This expression looks more complicated than it is. We know how to take the gradient of \(f_{\Theta}(x_i)\) w.r.t. \(\Theta\). That's just Calculus I.
Let's now imagine that we've computed this gradient, and we now know the slope of our loss function. We know if we shift our parameters along the positive direction of the gradient, the loss will get bigger, and if we shift along the negative direction, our loss goes down. Therefore if our goal is to decrease our loss, we should modify our parameters as follows: $$ \Theta' \leftarrow \Theta - \eta\nabla_{\Theta}\mathcal{L} $$ where \(\eta\) is a learning rate, which can increase or decrease the size of the step.
Once the parameters are updated, this process repeats until the loss stabilizes and further iterations don't provide significant return.
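Putting the recap together, here is a minimal sketch in NumPy (the data and all variable names are made up for illustration) that fits \(\theta_1\) and \(\theta_2\) to noisy linear data using the MSE loss and the update rule above.

```python
import numpy as np

# Minimal sketch: fit y = theta1*x + theta2 to noisy data with gradient descent.
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)
y = 3.0 * x + 1.0 + 0.1 * rng.standard_normal(x.shape)  # "true" m=3, b=1 plus noise

theta = np.array([0.0, 0.0])  # initial guess for [theta1, theta2]
eta = 0.5                     # learning rate

for step in range(2000):
    y_hat = theta[0] * x + theta[1]                  # model prediction
    residual = y_hat - y
    loss = np.mean(residual**2)                      # MSE loss
    grad = np.array([np.mean(2 * residual * x),      # dL/dtheta1
                     np.mean(2 * residual)])         # dL/dtheta2
    theta -= eta * grad                              # gradient descent step

print(theta, loss)  # theta should land near [3.0, 1.0]
```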
Fitting Neural Networks
A neural network is trained in the exact same fashion as the linear model. There are only two small differences. First is the model itself. Rather than using a linear function, neural networks use non-linear functions called hidden layers $$ h_k(x) = \sigma(\theta_{k,1} x + \theta_{k,2}) $$ where \(h_k\) is the \(k\)th hidden layer, and \(\sigma\) is some non-linear activation function like \(\tanh\) or ReLU. These hidden layers are typically chained together, each taking the previous layer's output as its input, to form the full neural network $$ \hat{y} = f_{\text{NN},\Theta}(x) = h_N(\dots h_2(h_1(x))) $$
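For intuition, here is a hedged sketch of a forward pass through a few chained hidden layers in NumPy; the layer widths and random weights are arbitrary placeholders.

```python
import numpy as np

# Sketch of a forward pass through chained hidden layers with made-up widths.
rng = np.random.default_rng(0)

def hidden_layer(x, W, b):
    return np.tanh(W @ x + b)   # h_k = sigma(W_k x + b_k)

# Widths chosen arbitrarily: 1 input -> 16 -> 16 -> 1 output.
W1, b1 = rng.standard_normal((16, 1)),  rng.standard_normal(16)
W2, b2 = rng.standard_normal((16, 16)), rng.standard_normal(16)
W3, b3 = rng.standard_normal((1, 16)),  rng.standard_normal(1)

x = np.array([0.25])
h1 = hidden_layer(x, W1, b1)
h2 = hidden_layer(h1, W2, b2)   # each layer takes the previous layer's output
y_hat = W3 @ h2 + b3            # final layer is a plain linear transformation
print(y_hat)
```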
Just as before, all we need to do to train the neural network is define a loss function and compute the gradient of that loss function w.r.t. the parameters of the model via $$ \nabla_{\Theta}\mathcal{L} = \frac{\partial \mathcal{L}}{\partial \Theta} = \frac{1}{N} \sum_{i=0}^N 2 \left(f_{\text{NN},\Theta}(x_i) - y_i\right) \frac{\partial f_{\text{NN},\Theta}(x_i)}{\partial \Theta} $$ The only difference now is that we need to compute the analytic gradient of the neural network \(f_{\text{NN},\Theta}(x)\) with respect to its weights. With all of these cascading hidden layers, that seems like it would be quite hard, or like you'd have to settle for a numerical approximation of the gradient. Fortunately, there is an efficient way to compute this gradient exactly: automatic differentiation.
Automatic differentiation, or autodiff, is a technique that decomposes any algorithm into a cascading graph of small elementary operations and uses the chain rule to compute the exact derivative with respect to any intermediate variable. For example, let's compute the gradient of a hidden layer with respect to the parameter \(m\). Assume our hidden layer is defined as: $$ h(x) = \sin(mx + b) \rightarrow \sin(\theta_1 x + \theta_2) $$ If we want to compute \(\frac{\partial h}{\partial \theta_1}\) using autodiff, then we should first decompose the algorithm into the most elementary operations \((+, -, \div, \times, \sin, \cos, \exp)\): $$ y_1 = \theta_1 x, \qquad y_2 = y_1 + \theta_2, \qquad y_3 = \sin(y_2) = h(x) $$
All of these expressions have extremely simple gradients. Humans and computers alike can apply simple rules from calculus to track the derivatives of each of these elementary operations: $$ \frac{\partial y_1}{\partial \theta_1} = x, \qquad \frac{\partial y_2}{\partial y_1} = 1, \qquad \frac{\partial y_3}{\partial y_2} = \cos(y_2) $$
Autodiff does this. It computes these simple expressions, their gradients, and their relationships --- saving them into something called a computational graph. These graphs of simple expressions become enormously useful. When paired with the chain rule, we can compute any complex derivative within the algorithm, including: $$ \frac{\partial h}{\partial \theta_1} = \frac{\partial y_3}{\partial \theta_1} = \frac{\partial y_3}{\partial y_2} \frac{\partial y_2}{\partial y_1}\frac{\partial y_1}{\partial \theta_1} = \cos(y_2) \cdot 1 \cdot x = x\cos(\theta_1 x + \theta_2) $$
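As a sanity check, here is a hedged sketch using PyTorch's autograd (one widely used autodiff library) to compute \(\partial h / \partial \theta_1\) for this exact function and compare it with the hand-derived \(x\cos(\theta_1 x + \theta_2)\); the particular numbers are arbitrary.

```python
import torch

# Sketch: let autodiff build the computational graph for h = sin(theta1*x + theta2)
# and compare its gradient against the hand-derived chain-rule result.
theta1 = torch.tensor(2.0, requires_grad=True)
theta2 = torch.tensor(0.5, requires_grad=True)
x = torch.tensor(1.3)

y1 = theta1 * x          # elementary op: multiply
y2 = y1 + theta2         # elementary op: add
y3 = torch.sin(y2)       # elementary op: sin  (this is h(x))

y3.backward()            # backward pass through the recorded graph

analytic = x * torch.cos(theta1.detach() * x + theta2.detach())
print(theta1.grad, analytic)   # the two values should match
```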
With these graphs, all we need to do is run two passes through the neural network. The first pass pushes the input data through the network to compute \(y_{1}\), \(y_{2}\), and \(y_{3}\); the second pass computes the intermediate gradients and the final derivative. This second pass can be done very efficiently when run from the back of the network to the front (final layer to input layer), and this process is referred to as backpropagation.
If we perform backpropagation, we can compute the gradient of the loss function with respect to every parameter within the neural network, and just like before, we can move those parameters in the direction that will minimize the loss through: $$ \Theta' \leftarrow \Theta - \eta\nabla_{\Theta}\mathcal{L} $$ Repeat this process and voila! You'll have trained a neural network.
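To make the full cycle concrete, here is a minimal training-loop sketch in PyTorch; the toy data, network sizes, learning rate, and iteration count are placeholder choices, not recommendations.

```python
import torch
import torch.nn as nn

# Toy data: learn y = sin(2*pi*x) from samples.
x = torch.linspace(0, 1, 200).unsqueeze(1)
y = torch.sin(2 * torch.pi * x)

# A small MLP: 1 -> 32 -> 32 -> 1 with tanh activations (arbitrary choices).
model = nn.Sequential(
    nn.Linear(1, 32), nn.Tanh(),
    nn.Linear(32, 32), nn.Tanh(),
    nn.Linear(32, 1),
)

loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)  # eta = 0.1

for step in range(2000):
    optimizer.zero_grad()        # clear old gradients
    y_hat = model(x)             # forward pass
    loss = loss_fn(y_hat, y)     # evaluate the loss
    loss.backward()              # backpropagation: compute all gradients
    optimizer.step()             # Theta <- Theta - eta * grad
print(loss.item())
```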
Core Components of a Neural Network
There is a fair amount of jargon surrounding neural networks. In general, there are architectures, loss functions, and hyperparameters. Architectures refer to the specific form of the neural network. The most common architecture is a multi-layer perceptron, or MLP. MLPs are considered the vanilla neural network. They are comprised of an input layer followed by a sequence of hidden layers. These hidden layers weight the outputs from the prior layer and apply some non-linear transformation (activation function) to learn some intermediate feature or variable. The hidden layers can have many different features, often referred to as nodes, and the total number of nodes per hidden layer is typically referred to as its width. The final layer applies one last linear transformation to arrive at the network output. The outputs predicted by the neural network are then compared with the training data outputs and a loss is computed. Loss is just a measure of how bad the model did. A common choice is mean squared error (MSE) $$ \mathcal{L}(\Theta) = \frac{1}{N} \sum_{i=0}^N (\hat{y}(x_i|\Theta) - y_i)^2 $$ but there are many, many others.
The process of updating the parameters in the network is iterative. In some cases, all of the data can be pumped through the network in parallel, and the loss can be computed over all of the data points. In other cases, this would be too much data and you'd run out of RAM on your computer, so the data often needs to be broken into batches, or mini-batches. For example, if your training data has 100 very large images, but your computer can only hold 10 in memory at a time, then you'd run a mini-batch of 10 through the network, compute the loss for that batch, and update the weights accordingly. This would be repeated 10 times to get through all of the data, which would be considered one training epoch. Oftentimes neural networks will be trained for hundreds or thousands of epochs before they converge on their final solution.
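Here is a hedged sketch of what that mini-batch/epoch loop typically looks like in PyTorch; the toy data, batch size, and epoch count mirror the example above and are not recommendations.

```python
import torch
import torch.nn as nn

# Toy setup (assumed, not prescriptive): 100 samples, small MLP, MSE loss.
x = torch.linspace(0, 1, 100).unsqueeze(1)
y = torch.sin(2 * torch.pi * x)
model = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

batch_size = 10    # e.g. only 10 samples fit in memory at once
num_epochs = 100   # one epoch = one full pass over all of the data

for epoch in range(num_epochs):
    perm = torch.randperm(x.shape[0])             # shuffle the data each epoch
    for start in range(0, x.shape[0], batch_size):
        idx = perm[start:start + batch_size]      # indices for this mini-batch
        optimizer.zero_grad()
        loss = loss_fn(model(x[idx]), y[idx])     # loss over the mini-batch only
        loss.backward()
        optimizer.step()                          # weights update after every batch
```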
The choice of total epochs or batch size is often determined empirically, and is referred to as hyperparameter optimization. These values are called hyperparameters simply to distinguish them from the parameters within the network that get updated automatically. Hyperparameters, instead, need to be selected by the human designing the system. There are many different hyperparameters beyond epochs and batch size, including things like the width and depth of the network, the type of activation function to use, the learning rate, etc.
The Mystique of Hyperparameter Optimization
When junior engineers first encounter a neural network that isn't working, they'll often try "fixing it" by tweaking the hyperparameters. This is not a good idea. 99% of the time, the problem doesn't have anything to do with the mechanics of the network itself, but rather how the data is presented to the network. Hyperparameter optimization should be the absolute last step.
Aside on Loss Functions
As mentioned, there are tons of loss functions, and your choice of loss function can have an enormous impact on model performance. The choice of loss can also depend on the type of problem you're trying to solve. Our group is most often interested in regression problems, where the learned function has continuous outputs. The alternative is classification problems, where there is a discrete set of possible outputs --- think labeling an image as either cat or dog. For regression problems, MSE or percent error are often good starting points. For classification, cross-entropy loss is commonly used.
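As a quick illustration (the library calls are standard PyTorch, but the tensors are made up), predictions are scored very differently depending on whether you pick a regression loss or a classification loss.

```python
import torch
import torch.nn as nn

# Regression: continuous targets scored with MSE.
y_hat = torch.tensor([2.5, 0.0, 1.8])
y_true = torch.tensor([3.0, -0.5, 2.0])
print(nn.MSELoss()(y_hat, y_true))          # mean squared error

# Classification: raw scores ("logits") over discrete classes, e.g. 0 = cat, 1 = dog,
# scored with cross-entropy against integer class labels.
logits = torch.tensor([[2.0, 0.5], [0.1, 1.5]])
labels = torch.tensor([0, 1])
print(nn.CrossEntropyLoss()(logits, labels))
```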
Common Neural Network Architectures
While MLPs are the most common neural networks, ML researchers have proposed alternative architectures that are better suited for certain types of tasks. For example, convolutional neural networks, or CNNs, are designed to more efficiently handle gridded data like images. Consider how a 1024x1024 pixel image has over 1,000,000 pixels, each with 3 channels. If each of those pixels were connected to a node in a hidden layer, network sizes would grow enormous. CNNs bypass this problem by taking the 2D image as input and applying learned filters that downscale and transform the raw image into much smaller latent features. Once enough of these filters are applied, CNNs are left with a small latent feature set that can then be passed through a reasonably sized MLP to produce the desired output.
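Here is a hedged sketch of that idea in PyTorch; the layer sizes, image size, and output dimension are arbitrary and only meant to show convolutional filters shrinking an image into a small latent vector that feeds an MLP head.

```python
import torch
import torch.nn as nn

# Sketch: convolutional filters repeatedly downscale a 3-channel image,
# then a small MLP maps the resulting latent features to the output.
cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1),  nn.ReLU(),  # 64x64 -> 32x32
    nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),  # 32x32 -> 16x16
    nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),  # 16x16 -> 8x8
    nn.AdaptiveAvgPool2d(1),   # collapse the spatial dims -> 64 latent features
    nn.Flatten(),
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, 2),          # e.g. two output classes
)

image = torch.randn(1, 3, 64, 64)   # a fake 64x64 RGB image
print(cnn(image).shape)             # -> torch.Size([1, 2])
```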
Another popular architecture is the recurrent neural network. RNNs have historically been used to handle temporal data. These networks feed part of the network's output at one time step back in as input at the next, so the network has some memory of its past predictions. RNNs are very slow, as they must process the data one time step at a time. They also suffer from the vanishing gradient problem, where the learned effects of the early data get lost over time. More advanced RNN architectures include things like the Long Short-Term Memory (LSTM) architecture, which addresses the vanishing gradient problem by purposefully allowing the least relevant information to be forgotten.
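As a quick illustration, here is a hedged sketch that runs a batch of sequences through PyTorch's built-in LSTM; the sequence length, hidden size, and prediction head are arbitrary choices.

```python
import torch
import torch.nn as nn

# Sketch: an LSTM consumes a sequence one step at a time, carrying a hidden
# state (its "memory") forward, and a linear head predicts the next value.
lstm = nn.LSTM(input_size=1, hidden_size=16, batch_first=True)
head = nn.Linear(16, 1)

sequence = torch.randn(4, 50, 1)        # batch of 4 sequences, 50 time steps, 1 feature
outputs, (h_n, c_n) = lstm(sequence)    # outputs: the hidden state at every time step
prediction = head(outputs[:, -1, :])    # use the final hidden state to predict
print(prediction.shape)                 # -> torch.Size([4, 1])
```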
Transformers are also a popular type of architecture. These models also handle information that has temporal structure, but they accept all of that information at once and identify the most important relationships between different parts of the provided data. This process is referred to as the self-attention mechanism, which identifies which parts of the data are most important for producing the desired output. Transformers are the architectures currently driving the Large Language Model (LLM) boom.
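As a final hedged sketch, PyTorch's built-in multi-head attention module illustrates the core idea: every position in the sequence is compared against every other position at once, and the resulting attention weights indicate which parts of the data matter most. The embedding size, sequence length, and head count below are arbitrary.

```python
import torch
import torch.nn as nn

# Sketch: self-attention over a whole sequence at once. Each position "attends"
# to every other position, and the attention weights show which parts matter most.
embed_dim, seq_len = 32, 10
attention = nn.MultiheadAttention(embed_dim, num_heads=4, batch_first=True)

tokens = torch.randn(1, seq_len, embed_dim)       # one sequence of 10 token embeddings
out, weights = attention(tokens, tokens, tokens)  # query = key = value -> self-attention
print(out.shape, weights.shape)                   # (1, 10, 32) and (1, 10, 10)
```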