Exercise: Linear Regression

We will finish this chapter with an exercise that makes use of some of the basic syntax elements described before. Our goal is to implement linear regression. During the course of the workshop, we will come back to this example several times. Let's first have a look at what linear regression is: Suppose you observe the two variables education (in years) and income (per month in dollar), and you want to predict a persons income based on their education. One way to do this is linear regression. Suppose we conducted a study and observed the following data points:

data

On the x-axis, you see the years of education, and on the y-axis you see the monthly income. The task of linear regression is to find a straight line that best describes this relationship:

lr

As you may recall from university, school, or learn just now, a straight line is mathematically described by $y = \alpha + \beta x$, where in our case, $y$ corresponds to income and $x$ to years of education.

Exercise

Write a function predict that takes x, α and β as inputs and returns the predicted value for y.

show solution
Solution
function predict(x, α, β)
    y = α .+ β*x
    return y
end


Let's simulate some data for the task at hand. First, we load the Random package (a Julia package for random number generation) and set a seed (to make our experiments reproducible):

using Random
Random.seed!(1234)
Random.TaskLocalRNG()

Next, we generate some random values for years of education:

x = 10 .+ 3*randn(20)
20-element Vector{Float64}:
 12.911968986565643
  7.062344765394401
 12.705582650782281
  9.901590612266082
  8.197623329933316
  5.664468654141301
 18.12227182514734
 14.573343590306788
 12.279412060022398
  7.355689281410555
 12.117979401347831
 13.274665846160229
 12.614493558642724
 10.257073327019132
 12.880237454462616
 12.723510996809502
  9.383108871915919
 12.312109525273629
  6.358179486488937
 13.274495746205444

This produces a vector of 20 values with 10 years of education as the average, and some normally distributed random variation (with mean 0 and standard deviation 3).

Exercise

Use your previously defined function predict to generate some values for income (y) with α = 1000, β = 300

show solution
Solution
y = predict(x, 1000, 300)
20-element Vector{Float64}:
 4873.590695969693
 3118.70342961832
 4811.674795234685
 3970.4771836798245
 3459.2869989799947
 2699.3405962423903
 6436.681547544202
 5372.003077092037
 4683.82361800672
 3206.7067844231665
 4635.39382040435
 4982.399753848069
 4784.348067592817
 4077.1219981057397
 4864.071236338785
 4817.053299042851
 3814.932661574776
 4693.632857582088
 2907.453845946681
 4982.348723861633


Since in reality, income does not perfectly depend on education, but there is some random variation, we add this random variation to y:

y += 500*randn(20)
# In case you don't know the "+="-sign:
# This is the same as writing y = y + 500*randn(20)
20-element Vector{Float64}:
 4823.645221195984
 2755.6322668011885
 4870.718613298108
 4441.843603957124
 3454.3306407860846
 2580.7422256856644
 6319.658088985575
 5976.1103158123615
 5659.967831314555
 3368.559310292467
 4256.475254144916
 5567.071070065267
 5523.281634582452
 4050.7583412019408
 4981.319314300481
 4810.9504254983985
 3963.103061086381
 5096.559166818264
 2859.8794629077993
 5348.951251046224

And voilà! We have some data to work with.

In reality of course, we don't know the values for $\alpha$ and $\beta$, but we have to estimate them from the data. To do so, we first need some indication of how good a certain combination of values works for our data. Usually, we use the sum of squared errors for this task:

\[\sum_{i=1}^n (\hat{y}_i - y_i)^2\]

So we go through all of our n data points (i = 1, ..., n) and for each of those data points we compute the squared distance between the prediction, $\hat{y}_i$, and the value we observed in reality, $y_i$.

Exercise

Define a function squared_error that takes a vector of predicted values and a vector of observed values as input and computes the sum of squared errors between them.

show solution
Solution
function squared_error(y, ŷ)
    return sum((y - ŷ).^2)
end

(In case you are wondering, can by typed as y\hat<tab>)


Exercise

Using your previously defined predict and squared_error functions to define a function squared_error_regression that takes as input values for α, β, x and y and returns as output the squared error between predictions and observed values. Then, use this function to compute the squared error for the parameter values 1. β = 100, α = 200 and 2. β = 300, α = 1000

show solution
Solution
function squared_error_regression(α, β, y, x)
    return squared_error(y, predict(x, α, β))
end

squared_error_regression(200, 100, y, x)
squared_error_regression(1000, 300, y, x)
2.1926837150818703e8
3.098658500723127e6


You should see that the error corresponding to the true parameters we used to simulate the data is much lower.