## 1.4 Linear Regression

(This topic is also in Section 1.4 in Finite Mathematics, Applied Calculus and Finite Mathematics and Applied Calculus)



Linear regression is a method of finding the linear equation that comes closest to fitting a collection of data points. For example, here are some data showing the number of households in China with cable TV.*

| Year (x) (x = 0 represents 2000) | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| Households with Cable (y) (Millions) | 68 | 72 | 80 | 83 |

*Data are approximate, and the 2001-2003 figures are estimates. Sources: HSBC Securities, Bear Stearns/New York Times, March 23, 2001, p. C1.

If we plot these data, we get the following graph.

Although no straight line passes exactly through these points, there are many straight lines that pass close to them. Here is one of them.

Q How good an approximation is the line to the data?
A Suppose that we used the line rather than the data points to estimate the number of households with cable. Then we would get slightly different values from the original observed values shown above. These values are called predicted values.

| Year (x) | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| Observed value of y | 68 | 72 | 80 | 83 |
| Predicted value of y | 62 | 70 | 78 | 86 |

Q From the data points of the predicted values, the equation of the line is:

 y =

The better our choice of line, the closer the predicted values will be to the observed values. The difference between the predicted value and the observed value is called the residue.

Residue = Observed Value - Predicted Value

On the graph, the residues measure the vertical distances between the (observed) data points and the line, and they tell us how far off the linear model is in predicting the number of households with cable. For our graph above, the residues are shown in the following table:

| Year (x) | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| Observed value of y | 68 | 72 | 80 | 83 |
| Predicted value of y | 62 | 70 | 78 | 86 |
| Residue | 68 - 62 = 6 | 72 - 70 = 2 | 80 - 78 = 2 | 83 - 86 = -3 |

Notice that some residues are positive and others negative. If we add up the squares of the residues, we get a measure of how well the line fits, called the sum-of-squares error.

Residues, Sum-of-Squares Error (SSE)

A residue is the difference between an observed and predicted value of a function. (A predicted value means a value given by some mathematical model.)

Residue = Observed value - Predicted value

The sum-of-squares error (SSE) when observed data are approximated by a function is given by

$$\text{SSE} = \text{Sum of squares of residues} = \text{Sum of } (y_{\text{observed}} - y_{\text{predicted}})^2$$

The smaller SSE, the better the approximating function fits the data.

Example

Referring to the above example, the sum-of-squares error is

$$\text{SSE} = 6^2 + 2^2 + 2^2 + (-3)^2 = 53$$
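The residue and SSE calculations for the cable-TV example can be checked with a short script. The following is a minimal Python sketch; the function names `residues` and `sum_of_squares_error` are illustrative choices, not part of the original material, and the observed and predicted values are copied from the table above.

```python
# Observed and predicted values copied from the cable-TV table above.
observed = [68, 72, 80, 83]
predicted = [62, 70, 78, 86]


def residues(observed, predicted):
    """Return observed - predicted for each data point."""
    return [obs - pred for obs, pred in zip(observed, predicted)]


def sum_of_squares_error(observed, predicted):
    """Return the sum of the squared residues (SSE)."""
    return sum(r ** 2 for r in residues(observed, predicted))


print(residues(observed, predicted))              # [6, 2, 2, -3]
print(sum_of_squares_error(observed, predicted))  # 53
```

Squaring each residue before adding keeps the positive and negative residues from cancelling each other out.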

Consider the following observed data:

| x | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| y | 1.5 | 1.6 | 2.1 | 3 |

If you approximate the data using the linear equation

y = 0.5x + 1

complete the following table.

| x | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| y Observed | 1.5 | 1.6 | 2.1 | 3.0 |
| y Predicted | | | | |
| Residue | | | | |

Q Further, the sum-of-squares error for this approximation is

 SSE =


Q OK. So what is the regression line?
A The regression line is the line that gives the smallest possible value of SSE.
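To see what "smallest possible SSE" means in practice, one can compute the SSE for several candidate lines and compare them. The sketch below does this for the cable-TV data. The helper `sse_for_line` and the two candidate lines are hypothetical, chosen only for illustration; neither is claimed to be the regression line.

```python
# Cable-TV data from the table at the start of this section.
xs = [0, 1, 2, 3]
ys = [68, 72, 80, 83]


def sse_for_line(m, b, xs, ys):
    """SSE when the data are approximated by the line y = m*x + b."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))


# Two hypothetical candidate lines, chosen only for illustration.
candidates = {"y = 5x + 68": (5, 68), "y = 6x + 66": (6, 66)}

for label, (m, b) in candidates.items():
    print(label, "SSE =", sse_for_line(m, b, xs, ys))
# y = 5x + 68 SSE = 5
# y = 6x + 66 SSE = 9
```

Of these two candidates, the first fits the data better because its SSE is smaller. The regression line is the one line for which no other choice of m and b gives a smaller SSE.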

Q How do we find this line?
A There are a variety of ways of finding it, since most forms of technology have built-in regression routines. Here is one on this web site. However, it is nice to be able to compute the regression line by hand, and this is what we do next.

Computing the Regression Line

The regression line (least squares line, best-fit line) associated with the points $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ is the line that gives the minimum sum-of-squares error (SSE). The regression line is

y = mx + b
where m and b are computed as follows.
$$m = \frac{n(\sum xy) - (\sum x)(\sum y)}{n(\sum x^2) - (\sum x)^2}$$
$$b = \frac{\sum y - m(\sum x)}{n}$$
Here, "Σ" means "the sum of." Thus, for example,
$\sum x$ = Sum of the x-values = $x_1 + x_2 + \cdots + x_n$
$\sum xy$ = Sum of products = $x_1y_1 + x_2y_2 + \cdots + x_ny_n$
$\sum x^2$ = Sum of the squares of the x-values = $x_1^2 + x_2^2 + \cdots + x_n^2$.
On the other hand,
$(\sum x)^2$ = Square of $\sum x$ = Square of the sum of the x-values
Finally,
n = Number of data points
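
The formulas for m and b translate directly into code. The following Python sketch is one possible implementation, applied to the cable-TV data from the start of this section. The function name `regression_line` and the variable names are illustrative assumptions, not part of the original material.

```python
def regression_line(points):
    """Return (m, b) for the least-squares line y = m*x + b.

    `points` is a list of (x, y) pairs. The sums mirror the
    quantities in the formulas above.
    """
    n = len(points)
    sum_x = sum(x for x, _ in points)
    sum_y = sum(y for _, y in points)
    sum_xy = sum(x * y for x, y in points)
    sum_x2 = sum(x * x for x, _ in points)

    denominator = n * sum_x2 - sum_x ** 2
    if denominator == 0:
        # Happens when all x-values are equal (or there are no points).
        raise ValueError("the slope is undefined for these points")

    m = (n * sum_xy - sum_x * sum_y) / denominator
    b = (sum_y - m * sum_x) / n
    return m, b


# Cable-TV data: x = years since 2000, y = millions of households.
cable = [(0, 68), (1, 72), (2, 80), (3, 83)]
m, b = regression_line(cable)
print(f"y = {m:.1f}x + {b:.1f}")  # y = 5.3x + 67.8
```

For these data, n = 4, Σx = 6, Σy = 303, Σxy = 481 and Σx² = 14, so m = (4·481 - 6·303)/(4·14 - 6²) = 106/20 = 5.3 and b = (303 - 5.3·6)/4 = 67.8.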

The easiest way to compute these values is by using a table, as we show in the following exercise, where we will compute the regression line for the above data.

Consider the above observed data

| x | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| y | 1.5 | 1.6 | 2.1 | 3 |

Since the formula involves the quantities $x^2$ and $xy$, as well as their sums, let us create a table with corresponding headings, as shown:

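| | x | y | $x^2$ | $xy$ |
|---|---|---|---|---|
| | 1 | 1.5 | $1^2 = 1$ | $(1)(1.5) = 1.5$ |
| | 2 | 1.6 | $2^2 = 4$ | $(2)(1.6) = 3.2$ |
| | 3 | 2.1 | | |
| | 4 | 3.0 | | |
| Sum | 10 | 8.2 | | |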

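Q Now fill in the quantities needed in the computation of the slope m and the y-intercept b.

$$m = \frac{n(\sum xy) - (\sum x)(\sum y)}{n(\sum x^2) - (\sum x)^2} = \frac{(\underline{\quad})(\underline{\quad}) - (\underline{\quad})(\underline{\quad})}{(\underline{\quad})(\underline{\quad}) - (\underline{\quad})^2} = \underline{\quad}$$

$$b = \frac{\sum y - m(\sum x)}{n} = \frac{\underline{\quad} - (\underline{\quad})(\underline{\quad})}{\underline{\quad}} = \underline{\quad}$$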

There is also material in the book about the correlation coefficient r, which, like SSE, is a way of measuring the goodness of fit of a line to the given data. However, it is more useful than SSE for comparing the goodness of fit of different lines to different data. You will need to read the material on computing r before you can answer all of the exercises in Section 1.5 of the textbook.

Last Updated: March, 2006