1.4 Linear Regression

(This topic is also covered in Section 1.4 of each of the textbooks Finite Mathematics, Applied Calculus, and Finite Mathematics and Applied Calculus.)

Linear Regression Utility

Linear regression is a method of finding the linear equation that comes closest to fitting a collection of data points. For example, here are some data showing the number of households in China with cable TV.*

Year (x) (x = 0 represents 2000)	0	1	2	3
Households with Cable (y) (Millions)	68	72	80	83

*Data are approximate, and the 2001-2003 figures are estimates. Sources: HSBC Securities, Bear Stearns/New York Times, March 23, 2001, p. C1.

If we plot these data, we get the following graph.

Although no straight line passes exactly through these points, there are many straight lines that pass close to them. Here is one of them.

Q How good an approximation is the line to the data?
A Suppose that we used the line rather than the data points to estimate the number of households with cable. Then we would get slightly different values from the original observed values shown above. These values are called predicted values.

Year (x)	0	1	2	3
Observed value of y	68	72	80	83
Predicted value of y	62	70	78	86

The better our choice of line, the closer the predicted values will be to the observed values. The difference between the observed value and the predicted value is called the residue.

Residue = Observed Value - Predicted Value

On the graph, the residues measure the vertical distances between the (observed) data points and the line, and they tell us how far off the linear model is in predicting the number of households with cable. For our graph above, the residues are shown in the following table:

Year (x)	0	1	2	3
Observed value of y	68	72	80	83
Predicted value of y	62	70	78	86
Residue	68-62 = 6	72-70 = 2	80-78 = 2	83-86 = -3
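The residue row of the table can be reproduced with a few lines of Python. This is only a minimal sketch: the variable names are chosen here for illustration, and the observed and predicted values are copied from the table above.

```python
# Observed values from the table (households with cable, in millions).
observed = [68, 72, 80, 83]
# Predicted values read off the line, as listed in the table.
predicted = [62, 70, 78, 86]

# Residue = Observed value - Predicted value, computed point by point.
residues = [obs - pred for obs, pred in zip(observed, predicted)]

print(residues)  # [6, 2, 2, -3]
```

A positive residue means the data point lies above the line, and a negative residue means it lies below.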

Notice that some residues are positive and others negative. If we add up the squares of the residues, we get a measure of how well the line fits, called the sum-of-squares error.

Residues, Sum-of-Squares Error (SSE)

A residue is the difference between an observed and predicted value of a function. (A predicted value means a value given by some mathematical model.)

Residue = Observed value - Predicted value

The sum-of-squares error (SSE) when observed data are approximated by a function is given by

SSE = Sum of squares of residues

= Sum of (y_observed - y_predicted)²

The smaller the SSE, the better the approximating function fits the data.

Example

Referring to the above example, the sum-of-squares error is

SSE = 6² + 2² + 2² + (-3)² = 53
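The same calculation can be sketched in Python. The helper name sum_of_squares_error is a hypothetical name introduced for this illustration; the data are the observed and predicted values from the example.

```python
def sum_of_squares_error(observed, predicted):
    """Return the sum of the squared residues (observed - predicted)."""
    return sum((obs - pred) ** 2 for obs, pred in zip(observed, predicted))


observed = [68, 72, 80, 83]
predicted = [62, 70, 78, 86]

sse = sum_of_squares_error(observed, predicted)
print(sse)  # 53
```

Squaring each residue before adding keeps positive and negative residues from cancelling each other.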

Q OK. So what is the regression line?
A The regression line is the line that gives the smallest possible value of SSE.
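To see what "smallest possible value of SSE" means in practice, one can compare the SSE of two candidate lines. In the sketch below, the first line y = 8x + 62 is inferred from the predicted values 62, 70, 78, 86 in the table above. The second line, y = 5x + 68, is an arbitrary alternative chosen here purely for illustration, and it is not claimed to be the regression line. The function name sse_for_line is likewise an assumption.

```python
# Cable TV data from the example: x = years since 2000, y = millions of households.
xs = [0, 1, 2, 3]
ys = [68, 72, 80, 83]


def sse_for_line(m, b, xs, ys):
    """Sum-of-squares error of the line y = m*x + b against the data."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))


# Line inferred from the predicted values in the table (assumption: y = 8x + 62).
sse_first = sse_for_line(8, 62, xs, ys)
# A second, hypothetical candidate line chosen only for comparison.
sse_second = sse_for_line(5, 68, xs, ys)

print(sse_first, sse_second)  # 53 5
```

The second line has the smaller SSE, so it fits these data better than the first. The regression line is the one line for which no other choice of m and b gives a smaller SSE.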

Q How do we find this line?
A There are a variety of ways of finding it, since most forms of technology have built-in regression routines. Here is one on this web site. However, it is nice to be able to compute the regression line by hand, and this is what we do next.

Computing the Regression Line

The regression line (least squares line, best-fit line) associated with the points (x₁, y₁), (x₂, y₂), ... , (x_n, y_n) is the line that gives the minimum sum-of-squares error (SSE). The regression line is

y = mx + b

where m and b are computed as follows.

m = [n(Σxy) - (Σx)(Σy)] / [n(Σx²) - (Σx)²]

b = [Σy - m(Σx)] / n

Here, "Σ" means "the sum of." Thus, for example,

Σx = Sum of the x-values = x₁ + x₂ + . . . + x_n
Σxy = Sum of products = x₁y₁ + x₂y₂ + . . . + x_ny_n
Σx² = Sum of the squares of the x-values = x₁² + x₂² + . . . + x_n².

On the other hand,

(Σx)² = Square of Σx = Square of the sum of the x-values

Finally,

n = Number of data points

The easiest way to compute these values is by using a table, as we show in the following exercise, where we will compute the regression line for the above data.
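As a cross-check on a hand computation, the table of sums and the formulas for m and b can be sketched in Python. This is a minimal illustration applied to the cable TV data, with variable names chosen here, not a replacement for working through the table by hand.

```python
# Cable TV data from the example: x = years since 2000, y = millions of households.
xs = [0, 1, 2, 3]
ys = [68, 72, 80, 83]

n = len(xs)                                 # number of data points: 4
sum_x = sum(xs)                             # Σx  = 6
sum_y = sum(ys)                             # Σy  = 303
sum_xy = sum(x * y for x, y in zip(xs, ys)) # Σxy = 481
sum_x2 = sum(x * x for x in xs)             # Σx² = 14

# Slope and intercept from the regression formulas.
m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b = (sum_y - m * sum_x) / n

print(f"y = {m:.1f}x + {b:.1f}")  # y = 5.3x + 67.8
```

Note that n(Σx²) - (Σx)² = 4(14) - 6² = 20, which shows why Σx² and (Σx)² must not be confused: here they are 14 and 36.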

There is also material in the book about the correlation coefficient r, which, like SSE, is a way of measuring the goodness of fit of a line to the given data. However, it is more useful than SSE for comparing the goodness of fit of different lines to different data. You will need to read the material on computing r before you can answer all of the exercises in Section 1.5 of the textbook. Also, press "Review Exercises" on the sidebar to see a collection of exercises that covers the whole of Chapter 1.

x	1	2	3	4
y	1.5	1.6	2.1	3.0
