This is the second post in our computer vision blog series. In the first entry we focused on what computer vision is and how a computer sees an image. Now it’s time to look at convolutional neural networks, which are at the core of modern computer vision applications. First, let’s have a look at what neural networks actually are.
What’s a neural network?
Neural networks (NNs), also known as artificial neural networks (ANNs), belong to a family of algorithms referred to as deep learning, which is a branch of machine learning. The name and structure of neural networks are loosely inspired by neural circuits in the human brain, attempting to mimic the way biological neurons signal to one another.
Neural networks are composed of a collection of layers: an input layer, one or more hidden layers, and an output layer. Each layer consists of nodes, also known as artificial neurons or hidden units, which connect to the succeeding layer’s nodes. Each connection has a weight, which defines how much that connection matters for a given input, and each node corresponds to a particular set of weights. The output range of a node is defined by something called an activation function. An activation function decides how active the node is, meaning how big a role that node plays given a specific input. In simple terms, if a node is activated, it sends data to the next layer of the network, and if the node is not activated, no data is passed forward. The depth of a neural network is determined by the number of hidden layers, and the width of a layer is determined by the number of nodes.
Figure 1: An illustration of a deep neural network. The circles are nodes and the lines between them are the connections.
Mathematically speaking, neural networks are nothing more than a bunch of functions that take some numeric input, which is just a sequence of numbers, and produce an output, which is a single value (…well, not always). At each node, a linear combination (weighted sum) is applied to the input, thus transforming the input numbers into a single value, which is then passed through the activation function that produces the output of the node in question, which is also a single value. Each node can be mathematically formulated as a(f(x)) = y, where x is the input, f(.) is a function that performs linear combination and a(.) is the activation function, and y is the output of the node that is defined by the activation. Each node is associated with a particular set of weights and a bias (a single value) that is used to compute the weighted sum.
Figure 2: Above: A linear regression equation, which is the weighted sum of input values (x), weights (w) and the bias. Below: the artificial neuron with a binary step function as the activation function (figure below is modified).
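To make a(f(x)) = y concrete, here is a minimal sketch of a single node in plain Python. The function names and the example numbers are made up for illustration; the binary step activation is the one mentioned in Figure 2.

```python
def weighted_sum(x, w, b):
    """f(x): the linear combination of inputs and weights, plus the bias."""
    return sum(xi * wi for xi, wi in zip(x, w)) + b


def binary_step(z):
    """a(z): binary step activation, 1 if z is non-negative, otherwise 0."""
    return 1.0 if z >= 0 else 0.0


def neuron(x, w, b, activation=binary_step):
    """y = a(f(x)): the output of one node."""
    return activation(weighted_sum(x, w, b))


# Hypothetical input, weights, and bias
x = [1.0, 2.0, 3.0]
w = [0.5, -1.0, 0.25]
b = 0.5

z = weighted_sum(x, w, b)  # 0.5 - 2.0 + 0.75 + 0.5 = -0.25
y = neuron(x, w, b)        # the step function maps -0.25 to 0.0
print(z, y)
```

With these numbers the weighted sum is negative, so the node is not activated and outputs 0.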
The choice of activation function is the most important part of a node as it determines the output of the node, whereas the linear combination applied to the input always has the same form (Figure 2, above). In essence, the activation function has a large influence on how well the neural network learns. The choice of the activation function in the hidden layers is arbitrary in the sense that there is no rule that determines which one to choose, whereas in the output layer it depends on the chosen loss function, also known as the cost function, which in turn depends on the problem the neural network is trying to solve.
The main intuition behind activation functions in the hidden layers is that while they are nonlinear (or piecewise linear) functions, they introduce linear separability to the feature space. Linear separability means that it’s possible to draw a straight line that separates, for example, two different sets of points from each other (e.g. in Figure 3, after applying the rectified linear unit, ReLU, activation on the data it is possible to separate the yellow and blue dots with a straight line).
Figure 3: A 2-dimensional example of how the ReLU activation function, which is piecewise linear, transforms input data from a linearly non-separable space to a linearly separable space (modified from , slide 65).
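The same idea can be sketched with a one-dimensional toy example. This is not the data from Figure 3; the points, labels, and the two hidden-unit weights (+1 and -1) are hand-picked assumptions. On the original number line no single threshold separates the two classes, but after two ReLU units a straight line in the hidden space does.

```python
def relu(z):
    """Rectified linear unit: passes positive values, clips the rest to 0."""
    return max(0.0, z)


# Toy data: label 1 if the point is far from the origin, 0 if it is near.
# No single threshold on x separates the classes.
points = [-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0]
labels = [1, 1, 0, 0, 0, 1, 1]


def hidden(x):
    """Two hidden units with hand-picked weights +1 and -1 and no bias."""
    return (relu(x), relu(-x))


def classify(h):
    """A straight line in the hidden space: h1 + h2 = 1."""
    return 1 if h[0] + h[1] - 1.0 > 0 else 0


predictions = [classify(hidden(x)) for x in points]
print(predictions == labels)
```

In a real network the hidden weights would be learned rather than chosen by hand.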
A loss function or cost function (sometimes also called an error function) is a function that maps a set of inputs onto a number that represents some “cost” associated with some event. The objective is to minimise the loss function, meaning that we want the output of the loss function to be as close to zero as possible. The closer we get to zero, the better the predictions are. And that is pretty much what neural networks are all about: they learn to minimise some loss that measures the difference between the prediction and its respective target, commonly referred to as the ground truth.
Figure 4: A common regression loss function, RMSE (root mean squared error), used for predicting real-valued quantities (such as the price of a car). Note that loss is defined as an average and therefore a loss function outputs a single value.
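As a rough sketch, RMSE can be computed in a few lines of plain Python. The prices below are invented purely for illustration.

```python
import math


def rmse(predictions, targets):
    """Root mean squared error: square the errors, average them, take the root."""
    squared_errors = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical predicted and true prices (in thousands)
predicted_prices = [10.0, 12.0, 9.0, 15.0]
true_prices = [12.0, 12.0, 11.0, 13.0]

# Errors are -2, 0, -2, 2; squared: 4, 0, 4, 4; mean: 3; root: sqrt(3)
loss = rmse(predicted_prices, true_prices)
print(loss)
```

However many predictions go in, a single number comes out, and it is zero only when every prediction matches its target.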
Loss functions can generally be divided into three categories: regression loss functions, binary classification loss functions, and multiclass classification loss functions. Regression loss functions are used for problems where the aim is to predict a real-valued quantity, such as the price of a house. Binary classification loss functions are used when the aim is to predict which of two categories something belongs to, for example, whether an animal in an image is a dog or a cat. Multiclass classification loss functions are used when the aim is to assign something to one of more than two categories, for example, whether an animal in an image is a cat, a dog, or a rabbit.
Figure 5: An example of the contour of some loss function. At the global minimum the loss is the lowest (blue valley) and at the global maximum the loss is the highest (red hill).
How do neural networks actually learn?
Now that we have covered the basic components a neural network is made of, it’s time to see how a neural network actually learns. The training of a neural network requires an algorithm called backpropagation, and an optimization algorithm. There are many of these to choose from, but we’ll focus on one known as gradient descent (GD). The learning part of neural networks is mathematically quite complex (based on multivariable differential calculus), so instead of working through the maths in detail, let’s focus on a conceptual understanding.
The backpropagation (generally referred to as backprop) algorithm computes the gradient of the loss function, and is based on the chain rule of calculus. The gradient is the rate of change of one variable with respect to another, and is zero at a minimum, which is what a neural network is trying to reach. To rephrase, the gradient measures how much the output of a function changes when the input changes a tiny bit. Backprop computes the gradient of the loss with respect to the weights and biases, that is, how much the loss changes when each weight or bias changes a tiny bit. The gradients are propagated back (hence the name) from the output (the loss), all the way through to the first layer. The weights and biases are then updated based on the gradients so that next time we get better predictions that produce a smaller loss and therefore smaller gradients for the next backpropagation update. We continue to do this until the loss gets sufficiently close to zero, i.e. when our predictions are sufficiently close to the ground truth values.
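For a single node written as y = a(f(x)), the chain rule can be sketched as follows. The symbols L (the loss) and z (the weighted sum) are introduced here only for illustration, and a' denotes the derivative of the activation function.

```latex
\begin{aligned}
z &= f(x) = \sum_i w_i x_i + b, \qquad y = a(z), \\
\frac{\partial L}{\partial w_i} &= \frac{\partial L}{\partial y}\, a'(z)\, x_i, \\
\frac{\partial L}{\partial b} &= \frac{\partial L}{\partial y}\, a'(z).
\end{aligned}
```

In a network with several layers, the same product of local derivatives is extended one layer at a time, moving from the loss back towards the input.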
Figure 6: Solving an optimization problem in deep learning is about finding a way down a hill into a valley (Figure 5) where the loss is (near) zero. This is basically what gradient descent and other optimization algorithms are trying to achieve (, slide 12).
However, backpropagation only provides the gradients. To make use of them we need an optimisation algorithm, which tries to find a way down the hill into a valley where the loss is at a minimum. So let’s look at gradient descent. The starting point (location on the hill) is defined by the initial weights and biases. After the first forward pass, the gradient of the loss with respect to the weights and biases is computed, and gradient descent updates them in the opposite direction of the gradient. The update, however, is scaled by a single value referred to as the learning rate, or step size (how big a step we take down the hill). The learning rate is often a very small decimal number in order to avoid too big an update that would result in overshooting the valley. This process is repeated until we get sufficiently close to the bottom of the valley. Gradient descent, like the other optimization algorithms used in deep learning, is therefore an iterative process.
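The loop described above can be sketched for the simplest possible case: a single linear node y = w·x + b trained with a mean squared error loss. The data, the learning rate of 0.05, and the number of steps are arbitrary choices for this illustration, and the gradients are written out by hand rather than produced by a general backprop implementation.

```python
def mse_and_gradients(w, b, xs, ts):
    """Forward pass plus the gradients of the mean squared error w.r.t. w and b."""
    n = len(xs)
    errors = [(w * x + b) - t for x, t in zip(xs, ts)]
    loss = sum(e * e for e in errors) / n
    grad_w = 2.0 * sum(e * x for e, x in zip(errors, xs)) / n
    grad_b = 2.0 * sum(errors) / n
    return loss, grad_w, grad_b


def gradient_descent(xs, ts, learning_rate=0.05, steps=2000):
    """Repeatedly step against the gradient, scaled by the learning rate."""
    w, b = 0.0, 0.0  # the starting point on the hill
    history = []
    for _ in range(steps):
        loss, grad_w, grad_b = mse_and_gradients(w, b, xs, ts)
        history.append(loss)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b, history


# Toy data generated from t = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0]
ts = [1.0, 3.0, 5.0, 7.0]

w, b, history = gradient_descent(xs, ts)
print(w, b, history[0], history[-1])
```

On this toy data the weight and bias approach 2 and 1, and the recorded loss shrinks towards zero. A much larger learning rate would make the updates overshoot the valley instead.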
Convolutional Neural Networks
Now that we have some understanding of what neural networks are and how they work, it’s time to move on to convolutional neural networks (CNNs), or to be more precise, convolutional layers. A convolutional neural network, or a conv net, is different from a regular neural network in three aspects: the input is a matrix of numbers instead of a sequence, the convolutional layer is not made up of nodes but of filters (kernels) that consist of the weights, and the output of each filter is also a matrix instead of a single value.
Figure 7: A 3×3 filter applied on a grayscale input image. Note: the bias is not shown here but it is added to each cell of the output matrix.
When a filter is applied on the input, the resulting output matrix is usually normalised (rescaling of the values in the matrix) and then pushed through an activation function, similar to a regular neural network. The output of the activation function is the matrix itself (referred to as a feature map), with some or all values being affected by the activation function. The values in a filter are the weights that are updated through backpropagation. There is no set rule for the size of a filter, but common sizes are 1×1, 3×3, 5×5, and 7×7. The reason behind these odd sizes is that, in simple terms, an odd-sized filter has a well-defined central pixel (i.e. the source pixel) and encodes it together with its surrounding pixels, thereby keeping all relevant information about that particular region of the image.
The filter is convolved across the input (hence the name, convolutional neural network), or in other words, the filter is slid across the input matrix, left to right and top to bottom. The filters learn, so to say, to look for certain features such as edges and colours in the first layers, larger elements of the objects in the later layers, and the objects themselves in the final layers. This is because filters are applied not only to the input image, but also to the output of each convolutional layer, which allows for a hierarchical representation of the input image. So lower (first) layers focus on low-level features and higher (last) layers on high-level features, and intermediate layers focus on everything in between.
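Sliding a filter across an input can be sketched in plain Python as follows. This assumes a single-channel (grayscale) input, no padding, and a step of one pixel; the image and the vertical-edge filter are hand-made for illustration, whereas in a real network the filter values would be learned. As in most deep learning libraries, the filter is applied without flipping it.

```python
def conv2d(image, kernel, bias=0.0):
    """Slide the kernel over the image, left to right and top to bottom."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Weighted sum of the region under the kernel, plus the bias
            total = bias
            for di in range(kh):
                for dj in range(kw):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        output.append(row)
    return output


# A 4x6 image that is dark (0) on the left and bright (1) on the right
image = [
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
]

# A hand-made 3x3 filter that responds to vertical edges
vertical_edge = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

feature_map = conv2d(image, vertical_edge)
for row in feature_map:
    print(row)
```

The output is a 2×4 matrix whose values are large where the filter straddles the dark-to-bright edge and zero over the flat regions.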
It’s common to include something called a pooling layer after a convolutional layer to perform downsampling. There are basically two types of pooling layers, max pooling and average pooling. The pooling window is much like the filter, but has no weights and is often of dimensions 2×2. Max pooling maps a 2×2 region, for example, to a 1×1 output where the value is the maximum value of the source region. The stride determines by how many steps the pooling window is moved each time, and a 2×2 window is typically used with a stride of 2 so that the regions do not overlap. Average pooling, like the name suggests, takes the average of the source region. This kind of pooling operation is referred to as local pooling. Global pooling, on the other hand, simply takes the max or average (depending on the choice of pooling) over the entire feature map of each channel, leaving one value per channel. For example, if the input is of dimensions 1×7×7×3 (batch × height × width × channels) then the output is of dimensions 1×3, whereas with local pooling using size 2×2 and a stride of 2 the dimensions of the output would be 1×3×3×3. The point of pooling is basically to summarise the most important features present in the input while reducing the size of the feature map.
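Both kinds of pooling can be sketched for a single channel in plain Python. The sketch assumes a 2×2 window moved two cells at a time (a stride of 2), which is what turns a 7×7 feature map into a 3×3 one; the input values are just the numbers 0 to 48, chosen for illustration.

```python
def max_pool2d(feature_map, size=2, stride=2):
    """Local max pooling: keep the largest value in each size x size window."""
    out_h = (len(feature_map) - size) // stride + 1
    out_w = (len(feature_map[0]) - size) // stride + 1
    return [
        [
            max(
                feature_map[i * stride + di][j * stride + dj]
                for di in range(size)
                for dj in range(size)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def global_average_pool(feature_map):
    """Global average pooling: one value for the whole channel."""
    values = [v for row in feature_map for v in row]
    return sum(values) / len(values)


# One 7x7 channel filled with the values 0..48
channel = [[row * 7 + col for col in range(7)] for row in range(7)]

pooled = max_pool2d(channel)
print(len(pooled), len(pooled[0]))   # 3 3
print(pooled[0][0])                  # 8, the max of 0, 1, 7, 8
print(global_average_pool(channel))  # 24.0
```

Note that the last row and column of the 7×7 input do not fit into a full window and are dropped. Applying the same functions to each of the three channels separately would give the 3×3×3 and 1×3 outputs described above.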
Although this blog post was heavy on the maths of it all, and therefore might not be easily accessible, the point that I want to highlight is that neural networks aren’t anything more than maths. So it’s perfectly fine if this post was a little overwhelming because the main takeaway should be that it’s all based on well-established concepts in a few fields of mathematics, namely linear algebra, multivariable calculus, and optimization. We did skip a lot of the nitty-gritty details because the focus was on the main aspects. However, neural networks aren’t magic, and they are “a black box” only in the sense that we haven’t been able to establish a theorem-proof approach to deep learning yet. Currently we are only able to reason based on what we know about the underlying maths and what we can learn from empirical evidence. That being said, explainable AI is an active area of research.