### Machine Learning – Neural Networks (Part 1)

So this is my first blog post ever… I decided to start a blog because writing down what you learn forces you to understand it more deeply. It also lets readers point out the things I get wrong in public, which is a good way to validate what I study – Challenge accepted! 🙂

So I recently set out to study some basic concepts of Machine Learning for my start-up. I realized that a problem I have could be solved with some kind of flexible algorithm that adjusts to changes in features and correlations.

The third topic that I’ve come across in the amazing online course I’m taking is Neural Networks (the first two topics are Linear Regression and Logistic Regression, which I may cover some other time). This post is for people who already know some basic practices of Machine Learning, as I skip some basic concepts along the way, although the first part is more theoretical and makes sense without any prior knowledge.

Also, for now I will be giving the code implementation examples in Octave, but I might switch to Python as I like it better.

### Neural Networks

Artificial Neural Networks (ANNs) are a family of statistical learning algorithms inspired by biological neural networks. They receive inputs and approximate non-linear functions as outputs, and they are adaptive: tuning the weights inside the network changes the output function (see more in Wikipedia: Artificial Neural Network).

### Motivation

Taken from Wikipedia: “In machine learning and statistics, classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known. An example would be assigning a given email into ‘spam’ or ‘non-spam’ classes or assigning a diagnosis to a given patient as described by observed characteristics of the patient (gender, blood pressure, presence or absence of certain symptoms, etc.).”

We will soon see that neural networks do a good job of learning non-linear classification hypotheses, whereas logistic/linear regression on the raw features only does fine for linear ones. To fit a non-linear hypothesis with regression you have to add polynomial terms as extra features. For example, when you try to learn a data-set with many features (= the relevant properties that help to classify a data-set), let’s say 100, and the optimal hypothesis is non-linear, let’s say of order 4, the number of terms grows on the order of $100^4 = 10^8$ (100,000,000), which is very expensive to compute with logistic/linear regression methods.

More concretely, classifying images (for example: contains a car or not) can be extremely difficult. Even a small 50×50 image has 2500 pixels, which means we have 2500 features. If we were to learn a non-linear hypothesis by including all of the quadratic features, meaning all of the products $x_i x_j$ (where $x_i$ is pixel i), we would end up with a total of ~3,000,000 features, which is very expensive to compute, and this is for every training example!
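To see where the ~3,000,000 figure comes from, here is a minimal sketch (in Python rather than Octave, and with a helper name, `count_quadratic_features`, that I made up for illustration). It assumes we count every distinct product $x_i x_j$ with $i \le j$, so squares are included and $x_i x_j$ is not counted separately from $x_j x_i$:

```python
from math import comb


def count_quadratic_features(n):
    """Number of distinct products x_i * x_j with i <= j for n inputs."""
    # comb(n, 2) counts the pairs with i < j; the extra n are the squares x_i^2.
    return comb(n, 2) + n


pixels = 50 * 50
print(pixels)                            # 2500
print(count_quadratic_features(pixels))  # 3126250
```

That is about 3.1 million extra features for a tiny gray-scale image, before adding any higher-order terms.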

Every pair of 2 pixels for different training examples

As we can see in this example, the function that separates car images from non-car images might be non-linear at the pixel level, and when aggregated over all of the pixels it might form a much more complex function. In order to form such a function effectively we need different tools than logistic regression.

### Modeling

So how do neural networks solve these problems effectively? Let’s start with the modeling.

We will use an “input” layer that receives the initial data to process (whether for training or for predicting). Its nodes point to “hidden” layers that calculate more complex functions, which finally feed the “output” layer that is actually our hypothesis h(X).

Neural Networks Modeling

In graph notation, the value of every node v is computed by a function of the values in the previous layer, weighted by the edges (u,v) for every node u in that layer. In addition, we usually add a “bias” node to every layer: it always outputs 1, so its weight acts as a constant offset, much like the intercept term in regression 🙂

Let’s compute the hypothesis for our visual example:

$a^{(1)} = x$

$z^{(2)} = \theta^{(1)}a^{(1)}$

$a^{(2)} = g(z^{(2)})$

$z^{(3)} = \theta^{(2)}a^{(2)}$

$a^{(3)} = g(z^{(3)})$

$h_\theta = a^{(3)}$

Where:

$a^{(i)}$ is the vector of the computed values of layer i.

$\theta^{(i)}$ is a matrix that holds weights of edges between layer i and i+1.

$g(z) = \frac{1}{1 + e^{-z}}$ is the sigmoid function that looks like:

The sigmoid function

(What’s interesting and helpful about the sigmoid function is that it ranges from 0 to 1 monotonically, which acts as a convenient mathematical way of representing binary decisions: 0 or 1 for every result).

What we actually did here is start from layer 1, which represents our data, multiply it by the weights and pass the result through the sigmoid to produce the second layer, then keep doing that until we reach the last layer. The last layer represents a quite complex function, which is the hypothesis for our original data (did the image contain a car or not?).
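The layer-by-layer computation above can be sketched in a few lines. This is a plain-Python illustration rather than the Octave used later in the post. The function names and the tiny weight matrices are made up for the example, and the bias unit is prepended explicitly inside each layer:

```python
import math


def sigmoid(z):
    """g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def layer_forward(theta, a):
    """Compute the next layer: prepend the bias unit, then apply g(theta * a)."""
    a_with_bias = [1.0] + list(a)
    return [sigmoid(sum(w * v for w, v in zip(row, a_with_bias))) for row in theta]


def forward(thetas, x):
    """Propagate the input x through every weight matrix in thetas."""
    a = list(x)  # a(1) = x
    for theta in thetas:
        a = layer_forward(theta, a)
    return a  # the last activation is the hypothesis


# Hypothetical weights: 2 inputs -> 2 hidden units -> 1 output.
# Each row holds [bias weight, weight for input 1, weight for input 2].
theta1 = [[0.1, 0.4, -0.2],
          [-0.3, 0.2, 0.5]]
theta2 = [[0.2, -0.6, 0.7]]

h = forward([theta1, theta2], [1.0, 0.0])
print(h)  # a single value strictly between 0 and 1
```

Adding another hidden layer only means appending one more matrix to the list passed to `forward`.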

### Some Intuition

So when I keep saying that it is easy to compute complex functions with neural networks, what does that look like? Let’s get some intuition of what can be accomplished with our new model.

Computing logical operators is a good example, because they are the building blocks of much more complex functions. In the following graph we can see 4 points which are classified into 2 groups: O’s and X’s.

As we can see, the function being computed here to classify them correctly is the XNOR function, which stands for NOT (EXCLUSIVE OR): it is 1 only when $x_1$ and $x_2$ have the same value. Here, the points where $x_1$ and $x_2$ have different values are O’s, and the others are X’s. The actual equation is:

$y = x_1\:XNOR\:x_2$

The XNOR operator is a good candidate for a classification hypothesis in this case.

And as we gather many data-points, we can see that if we want a good function to classify them, then XNOR fits here quite well.

Let’s see how we can compute such a function with Neural Networks. Let’s observe that

$x_1\:XNOR\:x_2 = (x_1\:AND\:x_2)\:OR\:(NOT(x_1)\:AND\:NOT(x_2))$

So what we need to do is chain 2 ‘AND’ functions with an ‘OR’: the first one fires when both inputs are 1, the second when both are 0, and together they give XNOR. (XOR is then simply its complement, NOT(XNOR).)

Let’s compute AND. For $x_1, x_2 \in \{0,1\}$ we model $y = x_1\:AND\:x_2$ in the following way:

AND Function computed using our model.

So we get $h_\theta(x) = g(-30 + 20x_1 + 20x_2)$. Remember that g is the sigmoid function, which returns values close to 1 for z = 4 and up, and close to 0 for z = -4 and down. So we get the following truth table:

Truth table of AND operator.
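The table can be checked numerically with a short sketch (Python here instead of Octave; `and_unit` is just a name I picked for the single neuron with the weights -30, 20, 20 from the formula above):

```python
import math


def sigmoid(z):
    """g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def and_unit(x1, x2):
    """Single neuron with bias weight -30 and input weights 20, 20."""
    return sigmoid(-30 + 20 * x1 + 20 * x2)


# Print x1, x2, the raw activation, and the activation rounded to 0 or 1.
for x1 in (0, 1):
    for x2 in (0, 1):
        h = and_unit(x1, x2)
        print(x1, x2, f"{h:.5f}", round(h))
```

Only the input (1, 1) pushes z up to +10, so only that row rounds to 1. The other three rows have z of -30 or -10 and round to 0.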

This is exactly the truth table of the AND operator! We will compute OR in the same way and assemble them together to receive:

As we can see, using a neural network with 1 hidden layer we managed to compute the XNOR function.
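Here is a minimal sketch of that network in Python. Only the AND weights (-30, 20, 20) come from the text above. The weights I use for (NOT $x_1$) AND (NOT $x_2$) and for OR are assumptions chosen so that each unit behaves like its logical operator; other values would work just as well:

```python
import math


def sigmoid(z):
    """g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def unit(bias, w1, w2, x1, x2):
    """One neuron: sigmoid of the bias weight plus the weighted inputs."""
    return sigmoid(bias + w1 * x1 + w2 * x2)


def xnor(x1, x2):
    # Hidden layer
    a1 = unit(-30, 20, 20, x1, x2)    # x1 AND x2 (weights from the text)
    a2 = unit(10, -20, -20, x1, x2)   # (NOT x1) AND (NOT x2), assumed weights
    # Output layer
    return unit(-10, 20, 20, a1, a2)  # a1 OR a2, assumed weights


for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, round(xnor(x1, x2)))
```

The hidden layer computes the two AND-style terms, and the output unit ORs them together, which is exactly the decomposition of XNOR into AND, NOT and OR.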

### Practical Example (Code)

Let’s say we want to use neural networks to recognize handwritten digits. So when given an image of a handwritten digit, we would like to predict which digit it is.

Handwritten digits

Let’s assume for this phase, that we already have a matrix $\theta$ that represents weights that were already trained (meaning we can use them to predict new inputs). So all we have to do is what we have shown in theory above – just calculate the hypothesis.

Here’s the code for prediction:


```octave
load('ex3data1.mat');    % Loads X (the data-set to predict) and the labels y used below.
load('ex3weights.mat');  % Loads Theta1, Theta2: a trained Neural Network.

% Compute the sigmoid function element-wise.
function g = sigmoid(z)
  g = 1.0 ./ (1.0 + exp(-z));
end

% The prediction function.
function p = predict(Theta1, Theta2, X)
  m = size(X, 1);                  % Number of samples (rows of X).
  a1 = [ones(m, 1), X];            % Extend X with the 'bias' node, this is "a(1)".
  z2 = a1 * Theta1';               % Weighted sums that feed the hidden layer.
  a2 = [ones(m, 1), sigmoid(z2)];  % Apply sigmoid and add the 'bias' node, this is "a(2)".
  z3 = a2 * Theta2';               % Weighted sums that feed the output layer.
  a3 = sigmoid(z3);                % In our case the third layer is the final hypothesis h(theta).
  h = a3;
  [pred_max, index_max] = max(h, [], 2);  % Pick the highest-scoring class for each sample.
  p = index_max;                   % Return it as a column vector.
end

pred = predict(Theta1, Theta2, X);
fprintf('\nTraining Set Accuracy: %f\n', mean(double(pred == y)) * 100);
```


This implementation of a data-set prediction receives a trained neural network that is represented by variables Theta1, Theta2 which are Theta matrices that contain the weights between our 3 layers. It also receives a variable X, which is our data-set that we want to predict. This implementation is vectorized, meaning X can be a single image represented by a vector of pixels or several images represented by a matrix of images as rows.

It calculates the hypothesis for each of the images in X in the same way we have seen before. The last layer contains 10 nodes to represent digits 0-9, and $h_\theta = a^{(3)}$ holds, for every image in X, one score per digit that can be read as the probability that the image shows that digit. This means we have to take the maximum of each row, so we pick the digit that has the highest probability of being correct.
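The "take the maximum of each row" step, plus the accuracy calculation, can be sketched like this in Python. The scores and labels are made-up numbers and the function names are mine. Note that Python indexes from 0, while Octave’s `max` returns 1-based indices:

```python
def predict_classes(h_rows):
    """For each row of scores, return the index of the largest score."""
    return [max(range(len(row)), key=row.__getitem__) for row in h_rows]


def accuracy(pred, y):
    """Percentage of predictions that match the true labels."""
    return 100.0 * sum(p == t for p, t in zip(pred, y)) / len(y)


# Made-up output-layer scores for two images and three classes.
h = [[0.01, 0.92, 0.03],
     [0.70, 0.10, 0.65]]
y = [1, 2]  # made-up true labels

pred = predict_classes(h)
print(pred)               # [1, 0]
print(accuracy(pred, y))  # 50.0
```

The second image shows why the maximum matters: two classes have fairly high scores, and we simply commit to the larger one.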

Complete source code can be accessed here.

### Summary

Over a training set of 5000 images, we receive ~97.5% accuracy – This is really amazing when you think about it: An algorithm that receives images of handwritten digits (that have many variations) and manages to predict over 97% of them correctly!

In the next post I will reveal the magic – How do you efficiently train the neural network (Which is represented by the $\theta$ matrices) to produce such stunning results??

### Credits

Most of the attached screenshots are taken from the presentations of the course that I take. All of the credit goes to the course’s creator: Andrew Ng.