Mathematical compass:
Someone recently suggested I make mathematics my compass (while I was busy diving down rabbit holes reading up on twistors and the like). Maybe they’re correct. I realised that if I was going to do that, I’d need to brush up on my map-reading skills first!
I’ve been so engrossed in learning Python that I’ve sorely neglected the other streams of the training syllabus I set myself back in late February (is it really that long ago?). So I’ve decided I need to dedicate at least one day a week to bringing my maths back up to speed.
I’m having to start almost back at square one. (Demoralising.) Calculus is a tad rusty to say the least: serious revision required. Did I really quit accountancy for this? Appears to be madness. Surely there’s some hidden logic in there somewhere, it just seems pretty elusive right now. Well, onwards and upwards anyway…
Linear regression techniques with Python:
In an earlier post I looked at Linear Regression in relation to simple, near-future predictive models, as viewed from a marketing or enterprise perspective. Here we dig deeper into the mathematics, then proceed to explore regression analysis in Python.
We’ll want to use a linear regression model where we have access to some sample data and wish to make predictions about data points outside the original sample. Maybe the available data is historical (period 1) data and we want to predict what will happen in the near future (period 2). We may wish to determine the likely outcome of, for example, the number of times an event will occur in the next period, or whether or not the event will occur at all. We can also use linear regression to assess, for a given y, which data points x₁, x₂, … xₙ are likely to have a relationship with y and which are not.
The simplest regression model quantifies a linear relationship between two variables, one independent and the other dependent (e.g. price and sales demand respectively). The model (equation) takes on the standard form:
\[y = ax + b\]
The easiest way to see if we have a linear relationship is to take a sample data set (in machine learning this is often referred to as the training data), plot as a series of (x,y) coordinates on a graph, and see if the scattered points tend to describe a straight line (or close to it).
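For example, here’s a minimal sketch of that eyeball test in Python, using matplotlib and NumPy with some made-up sample data standing in for a real training set:

import numpy as np
import matplotlib.pyplot as plt

# Made-up training data for illustration: x = independent variable, y = dependent variable
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([9.8, 8.1, 6.9, 5.2, 4.1, 2.8])

# Scatter the (x, y) pairs and check by eye whether they roughly describe a straight line
plt.scatter(x, y)
plt.xlabel('x (independent variable)')
plt.ylabel('y (dependent variable)')
plt.title('Does the training data look linear?')
plt.show()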
Assuming our data set shows a linear relationship, we’ll want to find the line of best fit to the sample data; this line becomes our prediction. We can either do this graphically (fitting the line to the graph by eye, for lower accuracy) or we can solve mathematically (for highest accuracy).
We can define a predictive model ŷ = ax + b which will allow us to establish the dependent variable (ŷ) for any given value of the independent variable (x). Here, ‘ŷ’ is the prediction, ‘a’ is the gradient of the straight line, and ‘b’ the y-axis intercept.
But we also want to determine the accuracy of our prediction line, so we establish an Error Function, E, which is the sum of the squares of all differences between actual data points and their predicted values. This can be expressed as:
\[E = \sum_{i=1}^{n}(y_i-\hat{y}_i)^2\]
(summed for all values of i with i = 1 .. n)
We use the sum of the squares of the differences, since this yields a positive value (just summing the differences would result in positive and negative differences cancelling one another out and overstating the accuracy of the line fit). Also, the resultant sum of squares, when graphed against either parameter, describes a parabola (a quadratic of the standard form y = ax² + bx + c), which makes it easy to find the minimum of the Error Function, E (i.e. where the accuracy of the line fit is greatest), by taking derivatives.
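To make that concrete, here’s a minimal sketch of the Error Function in NumPy; the values of a and b are just guesses for illustration, and the data is the same made-up sample as above:

import numpy as np

def error_function(a, b, x, y):
    """Sum of squared differences between actual y values and predictions y_hat = a*x + b."""
    y_hat = a * x + b
    return np.sum((y - y_hat) ** 2)

# Made-up sample data and a guessed line, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([9.8, 8.1, 6.9, 5.2, 4.1, 2.8])
print(error_function(-1.4, 11.0, x, y))  # smaller E means a better fit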
We need to take the derivative of the Error Function itself. We’ll want to solve for the minimum of E, i.e. where the gradient is zero at the base of the parabola. Therefore, we’ll want to find the derivative of E wrt a, and set that to zero.
Note that E is a function of ŷ, i.e. E = f(ŷ), and ŷ is itself a function of x, i.e. ŷ = f(x). But when we come to find the minimum, the x and y values are fixed data points – i.e. constants – so ŷ is really a function of the two parameters, a and b.
Looking at the first case, where ŷ = f(a), we can therefore say that E is a function of a function of parameter a, i.e. E = f(f(a)). Hence, to find the derivative of E wrt a, we’ll need to use the Chain Rule, and – since we’re looking for the minimum of the parabola – we’ll set the result to zero.
Here’s a quick recap of the Chain Rule. Where we have two functions, z = f(y) and y = g(x), we can find the derivative of the composition of these two functions as follows:
\[\frac{dz}{dx} = \frac{dz}{dy}\cdot\frac{dy}{dx}\]
Applying the chain rule to the derivative of E wrt a, and setting it to zero:
\[\frac{\partial E}{\partial a} = \sum_{i=1}^{n}2(y_i-\hat{y}_i)\cdot\frac{\partial}{\partial a}(y_i-\hat{y}_i) = -2\sum_{i=1}^{n}(y_i-\hat{y}_i)\,x_i = 0\]
then dividing through by −2 and multiplying out the brackets gives
\[\sum_{i=1}^{n}x_iy_i - \sum_{i=1}^{n}x_i\hat{y}_i = 0\]
Substituting ŷ = ax + b gives us
\[\sum_{i=1}^{n}x_iy_i - \sum_{i=1}^{n}x_i(ax_i+b) = 0\]
or
\[\sum_{i=1}^{n}x_iy_i = a\sum_{i=1}^{n}x_i^2 + b\sum_{i=1}^{n}x_i\]
So the derivative of E wrt a, at the minimum, gives Equation (i):
\[a\sum_{i=1}^{n}x_i^2 + b\sum_{i=1}^{n}x_i = \sum_{i=1}^{n}x_iy_i \qquad \text{(i)}\]
Next, we repeat the same process for the second parameter, b. In this case, ŷ = f(b), so we can say that E is a function of a function of parameter b, i.e. E = f(f(b)). To find the derivative of E wrt b, we again use the Chain Rule and set the result to zero.
Applying the chain rule to the derivative of E wrt b:
\[\frac{\partial E}{\partial b} = \sum_{i=1}^{n}2(y_i-\hat{y}_i)\cdot\frac{\partial}{\partial b}(y_i-\hat{y}_i) = -2\sum_{i=1}^{n}(y_i-\hat{y}_i) = 0\]
then dividing through by −2 and rearranging gives
\[\sum_{i=1}^{n}y_i = \sum_{i=1}^{n}\hat{y}_i\]
Substituting ŷ = ax + b gives us
\[\sum_{i=1}^{n}y_i = a\sum_{i=1}^{n}x_i + \sum_{i=1}^{n}b\]
Now since we’re summing for all values of i with i = 1 .. n, then ∑b = b·n, so we can restate the above as
\[\sum_{i=1}^{n}y_i = a\sum_{i=1}^{n}x_i + bn\]
Dividing all through by n gives
\[\frac{\sum_{i=1}^{n}y_i}{n} = a\,\frac{\sum_{i=1}^{n}x_i}{n} + b\]
Since ∑y/n (i.e. the sum of all y values divided by the number of y values) is the arithmetic mean of y (ȳ), and likewise ∑x/n is the arithmetic mean of x (x̄), we can restate the above as
\[\bar{y} = a\bar{x} + b\]
So the derivative of E wrt b, at the minimum, gives Equation (ii):
\[b = \bar{y} - a\bar{x} \qquad \text{(ii)}\]
We now have two equations, (i) and (ii), both containing the two unknowns (parameters a and b):
\[a\sum_{i=1}^{n}x_i^2 + b\sum_{i=1}^{n}x_i = \sum_{i=1}^{n}x_iy_i \qquad \text{(i)}\]
\[b = \bar{y} - a\bar{x} \qquad \text{(ii)}\]
We want to solve these equations, and the simplest way to do this is by using substitution. (A more difficult alternative would be to solve with matrices – that’ll be reserved for more complex problems, and a revision lesson saved for another day!)
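As a sanity check on the substitution that follows, the two simultaneous equations can also be solved symbolically in Python with SymPy. This is just a sketch: Sx, Sy, Sxx and Sxy are my own shorthand symbols for ∑x, ∑y, ∑x² and ∑xy respectively.

import sympy as sp

# Shorthand symbols: Sx = sum of x, Sy = sum of y, Sxx = sum of x^2, Sxy = sum of x*y
a, b, Sx, Sy, Sxx, Sxy, n = sp.symbols('a b Sx Sy Sxx Sxy n')
xbar = Sx / n   # arithmetic mean of x
ybar = Sy / n   # arithmetic mean of y

eq_i = sp.Eq(a * Sxx + b * Sx, Sxy)    # Equation (i)
eq_ii = sp.Eq(b, ybar - a * xbar)      # Equation (ii)

solution = sp.solve([eq_i, eq_ii], [a, b])
print(sp.simplify(solution[a]))  # agrees with Equation (I) once numerator and denominator are multiplied through by n
print(sp.simplify(solution[b]))  # agrees with Equation (II) in the same way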
First we rearrange Equation (i) for a:
\[a = \frac{\sum x_iy_i - b\sum x_i}{\sum x_i^2}\]
Next we rearrange Equation (ii) for b (it is already expressed in terms of b):
\[b = \bar{y} - a\bar{x}\]
Now we can substitute (ii) b = ȳ − ax̄ into (i) and simplify down as far as we can:
\[a\sum x_i^2 = \sum x_iy_i - (\bar{y} - a\bar{x})\sum x_i\]
\[a\sum x_i^2 = \sum x_iy_i - \bar{y}\sum x_i + a\bar{x}\sum x_i\]
\[a\left(\sum x_i^2 - \bar{x}\sum x_i\right) = \sum x_iy_i - \bar{y}\sum x_i\]
Since ∑x = n·x̄, this gives us Equation (I):
\[a = \frac{\sum x_iy_i - n\bar{x}\bar{y}}{\sum x_i^2 - n\bar{x}^2} \qquad \text{(I)}\]
Finally, we substitute Equation (I) for a back into (ii) b = ȳ − ax̄ and simplify down as far as we can:
\[b = \bar{y} - \bar{x}\,\frac{\sum x_iy_i - n\bar{x}\bar{y}}{\sum x_i^2 - n\bar{x}^2}\]
\[b = \frac{\bar{y}\left(\sum x_i^2 - n\bar{x}^2\right) - \bar{x}\sum x_iy_i + n\bar{x}^2\bar{y}}{\sum x_i^2 - n\bar{x}^2}\]
Let’s call this Equation (II):
\[b = \frac{\bar{y}\sum x_i^2 - \bar{x}\sum x_iy_i}{\sum x_i^2 - n\bar{x}^2} \qquad \text{(II)}\]
We went through a lot of algebraic pain there, but we ended up with two equations, (I) and (II), which we can solve to give the minimum of our Error Function, E (and hence the closest fit to our original linear model y = ax + b):
\[a = \frac{\sum x_iy_i - n\bar{x}\bar{y}}{\sum x_i^2 - n\bar{x}^2} \qquad \text{(I)}\]
\[b = \frac{\bar{y}\sum x_i^2 - \bar{x}\sum x_iy_i}{\sum x_i^2 - n\bar{x}^2} \qquad \text{(II)}\]
They look pretty horrendous, right? Fortunately we can write an algorithm in Python to handle these equations for us, so no sweat. We’ll pick that up in the next post (coming soon).
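In the meantime, here’s a rough preview of what that algorithm might look like in NumPy, again on the made-up sample data from earlier; this is just a sketch, and the proper walkthrough will follow in the next post.

import numpy as np

def fit_line(x, y):
    """Closed-form least-squares fit: returns (a, b) from Equations (I) and (II)."""
    n = len(x)
    x_bar, y_bar = x.mean(), y.mean()
    denominator = np.sum(x ** 2) - n * x_bar ** 2
    a = (np.sum(x * y) - n * x_bar * y_bar) / denominator               # Equation (I)
    b = (y_bar * np.sum(x ** 2) - x_bar * np.sum(x * y)) / denominator  # Equation (II)
    return a, b

# Made-up sample data, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([9.8, 8.1, 6.9, 5.2, 4.1, 2.8])
a, b = fit_line(x, y)
print(a, b)  # gradient and intercept of the line of best fit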
Footnote (mathematical notation in HTML and WordPress):
While preparing the above, I had to find the relevant mathematical symbols. The simplest solution – where the standard symbols are available – is to cut and paste from a Word document (or to use the ‘insert symbol’ function in WordPress). If you don’t have access to Word, and/or the range of symbols here in WordPress doesn’t cover your needs, I found another way (courtesy of: http://symbolcodes.tlt.psu.edu/accents/charmap.html).
Search for the Windows Character Map; this will give you a far greater range of symbols to choose from. They can be copied/pasted into WordPress using the paste Word function as if coming from a Word .doc. Here’s a handy list of some key characters to copy and paste from:
∑ , ∂ , ŷ , ӯ , ^ , ̂ , ̄ ,
=============
You can also use tags within the HTML to define superscript <sup>..</sup> or subscript <sub>..</sub> text (courtesy of the same website: http://symbolcodes.tlt.psu.edu/bylanguage/math.html#kbd). For example:
Code: x<sup>x+1</sup>
Result: x with ‘x+1’ rendered as a superscript (xˣ⁺¹)
Code: x<sub>i+1</sub>
Result: x with ‘i+1’ rendered as a subscript (xᵢ₊₁)
For LaTeX-style editing and representation of advanced mathematical formulas on blogs and websites, another option is to use the web-based editor www.hostmath.com, which provides embedded code that can be pasted directly into an HTML page. For example, the embedded code
\[\frac{-b\pm\sqrt{b^2-4ac}}{2a}\]
will be displayed as:
\[\frac{-b\pm\sqrt{b^2-4ac}}{2a}\]
There’s a whole variety of other Unicode math characters available too. Here are some useful links: Math Symbol Unicode chart and Mathematical Operator Unicode table. In any event, you’ll need to declare the HTML page as a Unicode page with the following meta-tag (although in most WordPress themes this will come as standard).
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
…
</head>
For more on character recognition and encoding, see the Unicode website: http://www.unicode.org/. For more on specifically maths notation (downloadable PDF) try: http://www.unicode.org/reports/tr25/.
Just a little light bedtime reading.
Read more like this:
This post is the first in a series where I get to grips with mathematical skills for artificial intelligence. There are various branches of AI, but here I’ll be focussing on those I’m most interested in: machine learning, neural networks and natural language processing.
I’ll be recording my own self-guided learnings (part of the Mathematics Stream of my Personalised Training Plan – you can see the summary syllabus here), as well as responses to and learnings from various MOOCs and online courses I’ll be taking.
First up: the Deep Learning in Python series on Udemy by Justin C. (see References below for links). Permission to blog about this series of courses was kindly given by the instructor. I highly recommend you take his course(s) yourself over on Udemy; there’s no substitute for direct experience when learning something new.
(PS: I don’t receive compensation for recommending any of these paid MOOCs or courses, just so ya know : ) )
References:
Course: Deep Learning Prerequisites: Linear Regression in Python, Justin C. (website \\ Udemy instructor page)
Some further reading (pdf downloads + useful links):
Linear Regression (Appendix 8.1), Jeremy Orloff and Jonathan Bloom (MIT OpenCourseWare, Mathematics)
Linear Statistical Models: Simple Linear Regression, Davar Khoshnevisan (Univ. of Utah, Mathematics Dept. \\ all lecture notes)