Learning from Backpropagation
A while back, I decided to write an artificial neural network from scratch, and to keep things relatively simple, I focused on a single neuron that performs basic addition. My main goal for the project was not to solve a difficult problem but to understand how neural networks work under the hood; in particular, I was interested in how neural networks actually learn. This blog post details my implementation (available on GitHub), documenting Gradient Descent and Backpropagation, as well as a number of issues that arose from my simplistic example. For those unfamiliar with neural networks, or needing a refresher, I’ve included some definitions.
Basic Terminology
Before going any further, it is important to go over some of the basic terminology associated with Neural Networks in the field of Artificial Intelligence.
Neuron – A computation node, loosely based on a biological neuron, that reduces multiple inputs down into one output. Externally, neurons just have inputs and an output, but internally they have weights and activation functions.
Neural Networks (NNs) – A network of neurons generally grouped into a series of layers, with the output of one layer becoming the input to the next layer, as shown in Figure 1.
Weights – A weight is a number that determines how much influence an input has. These weights are what the network learns.
Weighted inputs – Inputs that have had their respective weights applied (generally via multiplication).
Aggregation function – A function that aggregates a neuron’s weighted inputs into a single number. A common aggregation function is the weighted sum, which simply multiplies each input by its respective weight and adds the results together (see the short example after this list).
Activation function – A function that performs an operation on the aggregated weighted inputs, often thresholding values.
Feedforward neural network – A basic form of neural network where input data is fed in one end and forwarded through the network, with no loop, to produce an output. I have used the term feedforward process to refer to functions relating to the forward flow of data through the neurons.
Gradient descent and backpropagation – The learning method this post aims to explain.
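To make the weighted sum concrete, here is a small plain-JavaScript example (the numbers are purely illustrative):
// Weighted sum: multiply each input by its weight, then add the results
let inputs = [2.0, 3.0];
let weights = [0.5, 1.5];
let weighted_sum = 0;
for (let i = 0; i < inputs.length; i++) {
  weighted_sum += inputs[i] * weights[i];
}
console.log(weighted_sum); // 2.0 * 0.5 + 3.0 * 1.5 = 5.5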
My Adding Neural Network
The neural network I ended up creating is a single neuron that takes two inputs. It multiplies the inputs by their respective weights and returns the sum. I originally aspired to implement a network with multiple neurons; however, debugging multiple neurons while trying to hack together a neuron class proved chaotic. Switching back to a single neuron meant that I could focus my attention on one neuron with a single pair of weights. However, the Neuron class remains generic enough that it could be used to build multi-layered networks at a later stage.
Neuron Class
I’ve implemented a Neuron class to encapsulate the inner workings of a neuron and broken the explanation into two small sections, covering the initialisation of the neuron and the feedforward methods, followed by a larger section explaining gradient descent and backpropagation.
Structure Initialisation
class Neuron {
  constructor(input_size, w_min, w_max) {
    // Start with random weights; the neuron will learn to correct them
    this.weights = utils.create_random_array(input_size, w_min, w_max);
    // Zero-filled arrays for the last inputs and the gradients, used later
    this.inputs = numeric.rep([input_size], 0);
    this.w_grads = numeric.rep([input_size], 0);
    this.i_grads = numeric.rep([input_size], 0);
  }
}
The Neuron class constructor initialises the structure of the neuron, which includes a bunch of arrays that are used (and explained) later. You may notice that the weights array is assigned random values. This was done to demonstrate that it doesn’t matter what the initial weights are, as the neuron should learn to correct them. There are ways to optimise weight initialisation, but we won’t explore those methods in this post. It is also worth noting that I’m using the numeric JavaScript library, which provides methods for linear algebra operations on arrays. For those familiar with Python, it is similar to NumPy, but for JavaScript.
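For reference, here is a minimal sketch of what a helper like create_random_array could look like; the actual implementation in my utils module may differ:
// Hypothetical sketch: an array of `size` uniform random values in [min, max)
function create_random_array(size, min, max) {
  let arr = [];
  for (let i = 0; i < size; i++) {
    arr.push(min + Math.random() * (max - min));
  }
  return arr;
}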
Feedforward Methods
class Neuron {
  forward(inputs) {
    // Remember the inputs; they are needed later for backpropagation
    this.inputs = inputs;
    let agg_inputs = this.aggregation(inputs);
    return this.activation(agg_inputs);
  }
  aggregation(inputs) {
    // Weighted sum: element-wise multiply, then add everything up
    let weighted_i = numeric.mul(inputs, this.weights);
    return numeric.sum(weighted_i);
  }
  activation(z) {
    // Identity activation: f(z) = z
    return 1 * z;
  }
}
The feedforward process can be divided into two methods: aggregation and activation. These methods can vary in implementation. In our case, the aggregation function simply multiplies the two inputs by their respective weights and sums the result. To keep things simple, the activation function is implemented as an identity function (i.e. f(x)=x). While this may seem redundant, I left it there to replicate the process that occurs in more complex neural networks.
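As a quick illustration, feeding a pair of inputs through a freshly created neuron looks like this (the output is arbitrary at this stage because the weights are still random):
// Create a neuron with two inputs and weights between 0 and 2
let n = new Neuron(2, 0, 2);
let output = n.forward([2.2, 1.5]); // aggregation followed by activation
console.log(output); // some random weighted sum, not yet 3.7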
Gradient Descent & Backpropagation
Stepping outside the neuron for a moment: to teach the neuron something, we need a way to evaluate when the predicted output is wrong and, better yet, how wrong it is. This is accomplished with the cost function (also referred to as a loss function), which is typically a convex function like the Squared Error function shown in Figure 3. Learning then amounts to finding where the minimum of this function lies and adjusting the network to move towards it.
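The exact cost implementation isn't shown in this post, but a squared-error cost and its derivative might look like the sketch below. I use the conventional 1/2 factor so the derivative reduces to the difference between output and label; the cost_derivative used in the training code later may be implemented differently:
// Squared error cost: C = 0.5 * (output - label)^2
function cost(label, output) {
  return 0.5 * Math.pow(output - label, 2);
}
// Derivative of the cost with respect to the output: dC/doutput = output - label
function cost_derivative(label, output) {
  return output - label;
}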
To find the minimum, we use the iterative approach known as Gradient Descent. This method examines the gradient of the cost function with respect to the network’s output in order to find out in which direction the output needs to move. As seen in Figure 4, the minimum is the point where the gradient is equal to 0, so we can make it our goal to reach that point. Additionally, the gradient on either side of the minimum tells us whether the output was too low (negative gradient) or too high (positive gradient).
Simply subtracting the gradient from our output would step in the direction of the minimum. However, there are a couple of problems with this approach. Firstly, we do not actually have direct control over the final prediction, only over the intermediate weights. Secondly, although the gradient tells us which direction (positive or negative) to step, its magnitude can be wildly disproportionate to the distance to the minimum.
To address the first problem, we can use backpropagation to propagate this gradient back through the network (or single neuron, in our case) and calculate the proper gradient for each weight. This process makes use of the chain rule, which lets us break the derivation of the weight gradients into stages: the gradient of the cost with respect to a weight is the gradient of the cost with respect to the network’s output multiplied by the gradient of the output with respect to that weight. For our neuron, whose output is y = w1*x1 + w2*x2, this means dC/dwi = dC/dy × dy/dwi = dC/dy × xi.
The code below is my implementation of backpropagation, using the chain rule to separate the derivatives of activation and aggregation, and passing the cost back through to calculate the weight gradients.
class Neuron {
  backpropagate(grad) {
    // Reverse the feedforward steps: activation first, then aggregation
    let pre_act_grad = this.backprop_activation(grad);
    let grads = this.backprop_aggregation(pre_act_grad);
    this.i_grads = grads[0];
    this.w_grads = grads[1];
    return this.i_grads;
  }
  backprop_activation(grad) {
    // Derivative of the identity activation is 1
    return 1 * grad;
  }
  backprop_aggregation(grad) {
    // Gradient w.r.t. each weight is the corresponding input times grad;
    // gradient w.r.t. each input is the corresponding weight times grad
    let weight_grads = numeric.mul(this.inputs, grad);
    let input_grads = numeric.mul(this.weights, grad);
    return [input_grads, weight_grads];
  }
}
As my activation function is f(x)=x, its derivative of 1 doesn’t really do anything, but like the activation function itself it remains for demonstrative purposes. For the aggregation derivative, the gradient (which has already been backpropagated through the activation derivative) is multiplied by the neuron’s inputs to obtain the gradients with respect to each weight. I also calculate the gradient with respect to the inputs because, in a multilayer neural network, this gradient would need to be backpropagated to the previous layer.
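To make the numbers concrete, here is a small worked example using the Neuron class above (the values are chosen purely for illustration):
let n = new Neuron(2, 0, 2);
n.weights = [1, 1];           // fix the weights so the example is deterministic
let out = n.forward([2, 3]);  // out = 2*1 + 3*1 = 5
n.backpropagate(out - 4);     // if the label is 4, the cost gradient is 1
console.log(n.w_grads);       // [2, 3]: each input scales the incoming gradient
console.log(n.i_grads);       // [1, 1]: each weight scales the incoming gradient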
At this point we have the weight gradients, and we are almost ready to subtract them from the weights. However, we still have to address our second problem: if we are not careful with how much we subtract, we may overstep and miss the minimum of our cost function. It is even possible to overstep so much that the absolute value of the gradient grows with each successive step. To protect against this over-stepping problem, we scale the gradient down by a factor smaller than 1, known as the learning rate. This learning rate is decreased throughout the iterative process of Gradient Descent to avoid wasteful overstepping of the minimum. Note that if the learning rate is too small, your neural network will take forever to make any learning progress. Below is the code for learning from the gradients at a specified learning rate.
class Neuron {
  learn_from_grads(learning_rate) {
    // Regularisation: subtract each weight from its gradient (see below)
    let regularised_grads = numeric.sub(this.w_grads, this.weights);
    // Scale the step by the learning rate before updating the weights
    let step = numeric.mul(regularised_grads, learning_rate);
    this.weights = numeric.sub(this.weights, step);
  }
}
You will notice that there is an operation in the code that subtracts a weight’s current value from its respective gradient. This is a basic regularisation method that I borrowed from Karpathy’s Support Vector Machine example [2]. Regularisation is a method that helps avoid the overfitting problem. Overfitting is not really an issue in the context of addition, but adding regularisation makes this demo more interesting.
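For comparison, dropping the regularisation term (something I revisit in the Issues section below) reduces the method to a plain gradient descent step:
class Neuron {
  learn_from_grads(learning_rate) {
    // Unregularised variant: step directly along the scaled weight gradients
    let step = numeric.mul(this.w_grads, learning_rate);
    this.weights = numeric.sub(this.weights, step);
  }
}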
Training, Testing & Results
Unlike real world applications of neural networks, I am able to generate my own dataset, as I can calculate the ground truth of any pair of inputs (i.e. the sum of two numbers). For this, I wrote a create_dataset function that generates input/output pairs in the form of a nested array of floats, as shown below:
[
  [[input1, input2], [output]] // one row entry
]
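The generator itself is straightforward. A minimal sketch could look like the following; the input bounds here are assumptions, and my actual create_dataset may differ:
// Hypothetical sketch: generate `size` rows of [[a, b], [a + b]]
function create_dataset(size) {
  let dataset = [];
  for (let i = 0; i < size; i++) {
    let a = Math.random() * 1000; // assumed input range
    let b = Math.random() * 1000;
    dataset.push([[a, b], [a + b]]);
  }
  return dataset;
}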
The code below displays the key setup variables and iterative training process used to test my single neuron NN. The full code can be found on my GitHub repository.
// Percentage of error for correctness tolerance
const ERROR_MARGIN = 0.01;
// Initialisation of datasets
let training_set = create_dataset(500000);
let test_set = create_dataset(50000);
// Making a neuron with two inputs and weights between 0 and 2
let n = new neuron.Neuron(2, 0, 2);
// Define number of epochs
let total_epoch = 10;
// Demo the neuron's random weights before training
run_demo(n);
// Loop through training and testing
for (let epoch = 0; epoch < total_epoch; epoch++) {
  // Calculate learning rate (decreases as time goes on, but will not reach 0)
  let lr = 1 / (2 * 100000 * (epoch + 1));
  // Training loop
  for (let i = 0; i < training_set.length; i++) {
    let label = training_set[i][1];
    let output = n.forward(training_set[i][0]);
    let dc = cost_derivative(label, output);
    n.backpropagate(dc);
    n.learn_from_grads(lr);
  }
  console.log("epoch " + (epoch + 1) + "/" + total_epoch + " completed");
  // Testing loop
  let correct = 0;
  for (let i = 0; i < test_set.length; i++) {
    let label = test_set[i][1];
    let output = n.forward(test_set[i][0]);
    if (Math.abs(label - output) <= label * (ERROR_MARGIN / 100)) {
      correct++;
    }
  }
  // Report accuracy once per epoch, after the whole test set has run
  let correct_prcnt = utils.round_n_places(100 * (correct / test_set.length), 2);
  console.log(
    "Accuracy: " + correct + "/" + test_set.length +
    "(" + correct_prcnt + "%) correct within -/+" + ERROR_MARGIN + "%"
  );
}
run_demo(n);
run_demo(n);
As you can see, the code performs 10 iterations, or epochs. In each of them, a training loop goes through each row of the dataset, feeding the input into the neuron and calculating the current cost gradient from the output. The gradient is then backpropagated through the neuron to obtain the weight gradients, which the neuron learns from before moving to the next row.
The learning rate is calculated from the current epoch number so that it shrinks each epoch. My initial learning rate is quite small in comparison to other examples I have seen; however, starting with a larger learning rate caused the neuron’s gradients to grow instead of shrink. These large gradient values resulted in constant overstepping until the gradients hit Infinity.
The output of the code is as follows:
Demo:
Expected : 2.2 + 1.5 = 3.7
Actual : 2.2 + 1.5 = 2.358443202446961
Expected : 1024 + 2048 = 3072
Actual : 1024 + 2048 = 1969.8638042934172
epoch 1/10 completed
Accuracy: 23744/50000(47.49%) correct within -/+0.01%
epoch 2/10 completed
Accuracy: 31368/50000(62.74%) correct within -/+0.01%
epoch 3/10 completed
Accuracy: 37322/50000(74.64%) correct within -/+0.01%
epoch 4/10 completed
Accuracy: 42572/50000(85.14%) correct within -/+0.01%
epoch 5/10 completed
Accuracy: 46947/50000(93.89%) correct within -/+0.01%
epoch 6/10 completed
Accuracy: 50000/50000(100%) correct within -/+0.01%
epoch 7/10 completed
Accuracy: 50000/50000(100%) correct within -/+0.01%
epoch 8/10 completed
Accuracy: 50000/50000(100%) correct within -/+0.01%
epoch 9/10 completed
Accuracy: 50000/50000(100%) correct within -/+0.01%
epoch 10/10 completed
Accuracy: 50000/50000(100%) correct within -/+0.01%
Demo:
Expected : 2.2 + 1.5 = 3.7
Actual : 2.2 + 1.5 = 3.700334223241915
Expected : 1024 + 2048 = 3072
Actual : 1024 + 2048 = 3072.2695417020814
From the first epochs, we can see that our neuron doesn’t perform well at the beginning; however, after several epochs, its accuracy reaches 100% within a tolerance of -/+0.01% (i.e. predictions and ground truth don’t differ by more than this relative amount).
Yay! I have a working neuron!
Issues/Flaws with my Adding Neuron
There were a number of issues and flaws that came up along the way, two of which are highlighted here. These issues stem both from the ad-hoc way in which I approached the problem and from the specific function the neuron had to learn: addition.
Exploding Gradient Problem [1, Chapter 5] – In some earlier versions of my implementation, I kept getting NaN values (back)propagating through the neuron. While debugging, I found that instead of the absolute value of the gradient becoming smaller with each iteration, the value was rapidly growing and the neuron was stepping past its previous position. This issue was caused by the linear nature of my neuron, which means the weight gradients are generally larger than the output error. For example, if the neuron outputs 30 instead of 50, the error of -20 results in some very large weight gradients. To overcome this issue, I simply tried progressively smaller initial learning rates until I no longer received NaN values. However, this workaround isn’t a generalisable solution, as it depends on the lower and upper bounds of the dataset. In real world scenarios the inputs are generally normalised, which prevents such large values from occurring (see the sketch below).
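As an illustration of that last point, min-max normalisation rescales each input into a fixed range before it reaches the network. A minimal sketch, assuming the dataset bounds are known:
// Min-max normalisation: map a value from [min, max] into [0, 1]
function normalise(value, min, max) {
  return (value - min) / (max - min);
}
console.log(normalise(750, 0, 1000)); // 0.75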
Redundant Regularisation – In the domain of neural networks, regularisation is a process used to avoid overfitting. While overfitting is an issue in real world problems, addition is a well-defined problem where overfitting is not possible, so the regularisation is not needed in this case. When I tried dropping the regularisation operation from the learn_from_grads function, the accuracy of the neuron reached 100% in the first epoch. Admittedly, there is some minuscule error in the final demo, likely caused by rounding when calculating the weight gradients. However, I decided to leave the regularisation in, as it is more interesting to see the neuron gradually learning.
Demo:
Expected : 2.2 + 1.5 = 3.7
Actual : 2.2 + 1.5 = 2.358443202446961
Expected : 1024 + 2048 = 3072
Actual : 1024 + 2048 = 1969.8638042934172
epoch 1/10 completed
Accuracy: 50000/50000(100%) correct within -/+0.01%
...
Demo:
Expected : 2.2 + 1.5 = 3.7
Actual : 2.2 + 1.5 = 3.7000000000000006
Expected : 1024 + 2048 = 3072
Actual : 1024 + 2048 = 3072.0000000000014
Conclusion
In this blog post I presented my own implementation of an artificial neuron, with the aim of explaining the process by which neural networks learn. The implementation focuses on the simple problem of addition. After briefly covering neural network terminology, I introduced the basic structure and feedforward methods of my Neuron class. Gradient Descent was then explained as the iterative process that provides direction on how the output of the neuron needs to change. However, we do not have direct control over the output of the neuron; hence, we need backpropagation to carry that direction back to the weights we do control. I also touched on the use of learning rates to avoid overstepping, before presenting an example of the neuron learning to add, as well as a couple of interesting issues with my implementation. In the end, this implementation helped me learn more about neural networks, and I hope it helps you too.
If you want to learn more about Neural Networks, I would recommend reading the two references below, as well as watching Welch Labs’ Neural Networks Demystified video series for a cool visual explanation.
References
- Michael Nielsen. Neural Networks and Deep Learning. Link
- Andrej Karpathy. Hacker’s Guide to Neural Networks. Link
Header image courtesy of George Becker.