As a programmer it’s hard to keep up with all the areas of technology so you end up picking a few and trying to dig down into those while trying to maintain at least a cursory knowledge of as many other technologies as possible. Since the buzz over machine learning and artificial intelligence has only been increasing I decided to take a jump in the shallow end. There seems to be two ways into it: the first one being concepts and calculus, the other is grabbing an existing framework and embracing the ‘black box’ approach. Without knowing too much I started with a feed forward neural network Swift library, it was fun to play with but I quickly ran into not being able to do much as my knowledge of the internals limited my implementation. Then Apple announced that the Accelerate and Metal frameworks would support features for convolutional neural networks in iOS 10 and I decided it was time to dig in.

After looking around I found Andrew Ng’s Coursera class comes highly recommended. It’s about a ten week course and the consensus was that you didn’t need to be a practicing mathematician to take it. I enrolled and after having finished the class I decided to try and write down ‘my understanding’ on some of the concepts. I find that writing about something helps me solidify understanding and if it helps someone else (either understand or feel better about your own understanding) then that’s always fantastic. And if you see a way to optimize something please let me know – no programmer wants to do things the hard way.

Many of the machine learning algorithms solve ‘classification’ problems, something like feeding in a picture and the algorithm identifying if it contains a person, car, etc. We aren’t going to start there though, we are going to start with linear regression or something more akin to an algorithm that takes known data and makes a prediction based on a linear trend such as the commonly used example of home values over time which usually have a trend of increasing. While this is certainly the shallow end of the pool, the concepts such as cost function are commonly recurring in machine learning.

Let’s say we have some data that looks like this:

X |
Y |

1 | 2.67 |

2 | 2.76 |

3 | 2.69 |

4 | 2.82 |

5 | 2.94 |

6 | 2.99 |

And if we plot it we get something like this:

What we want to do is predict what the value of Y would be when X is 8. Since this is called linear regression our prediction will be linear and look like the blue line:

To draw that line we will use the intercept and slope and we’ll call that theta. Theta will therefore have two components (intercept/slope or rise/run) which we will call theta zero and theta one. What we want to do is try different values of theta zero and theta one to draw lots of ‘hypothetical’ lines and calculate the ‘cost’ of that line. For example if we drew a line with a theta of 1,1 and our prediction looks like this (blue line is theta, green is our training data aka what we know):

The ‘cost’ is the difference between our prediction and the actual data, which could be visualized at each data point something like this:

We will add up all those costs to calculate how well that line fits our data. To calculate the cost of a single point, the formula is: ((θ0 + θ1 * X) – Y) The formula that calculates the cost is conveniently referred to as the cost function.

You’ll often hear that we want to ‘minimize’ the cost function. What that means is we want to find the cheapest line, aka the one that best fits the data. In programmer terms we could calculate a bunch of possibilities and save their costs to an array, then find the lowest cost value in the array and use that as the best fit. But how do we know which direction to adjust our line to best fit the data? That’s where gradient descent comes in, think of it as a function that you give your data to and then it finds the ‘way downhill’.

We will provide our gradient descent function with an alpha which is called the ‘learning rate’ which will help us move incrementally toward the lowest cost and we will also tell it how many iterations we would like to perform. Obviously the fewer iterations the faster it will run but the less accurate our result will be. In this limited example you could tell it to run 500 times or 5000 without any real noticeable performance impact but you could imagine this might not be the case if you have tens of thousands of training data.

If you want to see it work, your gradient descent function can start off with an empty array (or vector) which will be the same size as the number of iterations to store our cost history for each time through the loop. For every iteration of the function we could add our new cost data to the array.

Since the gradient descent formula automatically moves ‘downhill’ our ending value of theta will be the cheapest line we could calculate. It’s worth noting however that this naive implementation is susceptible to something called local minima, which means gradient descent could get caught in a ‘valley’ that’s not actually the cheapest line but it sees no way but up and gets stuck.

For something of a more concrete example, I grabbed some user data from Nissan Leaf owners about battery capacity loss. We will use the x axis to represent number of miles driven and the y axis to represent number of capacity bars lost. In fairness this isn’t meant to be an accurate prediction of battery capacity, there’s a lot more to consider such as climate/temperature, age of battery, etc but for simplicity we aren’t going to get into multivariate calculations here.

We will put the mileage into an array called ‘milesDriven’ and the number of capacity bars lost into another array called ‘barsLost’. Now if you look at that range of values in milesDriven it covers quite a distance from 0 to almost 130,000. This brings us to something called feature scaling/normalization which is basically saying make all the values fall between zero and one. The mean value should be 0.5 and the highest value be 1.0 and of course the lowest value becomes 0.0:

x = (x – min(x)) / max(x) – min(x)

The map function is perfect for succintly applying this to the entire array. Next we will normalize the barsLost data for consistency, and set the iterations and learningRate (alpha) values. These are values that you should experiment with to see the changes it makes to the overall output. Likewise for predictionMiles, set it to your target miles; for example put in 23000 to have the algorithm calculate capacity loss for 23000 miles. Obviously, we normalize this value to match our arrays.

The JHistory array is created and initialized so we can visualize the cost history as the algorithm runs. We init it with nil values otherwise it might look like an actual calculated value in the history and throw us off. The two main functions are pretty much explanatory, the predictedBarsLost takes an intercept and slope and calculates the predicted capacity loss. While the findBestFit function takes our (normalized) data arrays as input along with the number of iterations and alpha/learning rate. Starting with a line of 0,0 (intercept and slope) it then runs for the number of iterations and for each iteration it adjusts the intercept and slope and calculates the cost and adds that to the History array. Ultimately it returns an intercept and slope tuple fit to the data.

We can then take the output intercept and slope to predict capacity loss at a given mileage and convert those back to regular values. Something to watch is the JHistory array as it runs, it should decrease:

Playgrounds are perfect for this type of thing – try playing with the iterations and learningRate values to see how they impact performance.

Some great posts about gradient descent and linear regression:

- An Introduction to Gradient Descent and Linear Regression
- Linear Regression in Swift: Gradient Descent