(This topic also appears as Section 1.4 in Finite Mathematics, in Applied Calculus, and in Finite Mathematics and Applied Calculus.)
Linear regression is a method of finding the linear equation that comes closest to fitting a collection of data points. For example, here are some data showing the number of households in China with cable TV.*
Year (x) (x = 0 represents 2000) | 0 | 1 | 2 | 3 |
Households with Cable (y) (Millions) | 68 | 72 | 80 | 83 |
If we plot these data, we get the following graph.
Although no straight line passes exactly through these points, there are many straight lines that pass close to them. Here is one of them.
Q How good an approximation is the line to the data?
A Suppose that we used the line rather than the data points to estimate the number of households with cable. Then we would get values slightly different from the observed values shown above. These values, read off the line rather than the data, are called predicted values.
Year (x) | 0 | 1 | 2 | 3 |
Observed value of y | 68 | 72 | 80 | 83 |
Predicted value of y | 62 | 70 | 78 | 86 |
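As an aside (not part of the original discussion), the predicted values in this table are just the y-values given by a particular line at x = 0, 1, 2, 3; they are consistent with the line y = 8x + 62, which we infer from the table rather than from the text. A minimal Python sketch of this step, under that assumption:

    # Predicted values come from evaluating a candidate line y = m*x + b at each x.
    # The slope 8 and intercept 62 are inferred from the table above, not stated in the text.
    m, b = 8, 62
    xs = [0, 1, 2, 3]
    predicted = [m * x + b for x in xs]
    print(predicted)  # [62, 70, 78, 86], matching the "Predicted value of y" row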
The better our choice of line, the closer the predicted values will be to the observed values. The difference between the observed value and the predicted value is called the residual.
Residual = Observed value - Predicted value
On the graph, the residuals measure the vertical distances between the (observed) data points and the line.
Year (x) | 0 | 1 | 2 | 3 |
Observed value of y | 68 | 72 | 80 | 83 |
Predicted value of y | 62 | 70 | 78 | 86 |
Residual | 68 - 62 = 6 | 72 - 70 = 2 | 80 - 78 = 2 | 83 - 86 = -3 |
Notice that some residuals are positive and others are negative. If we add up the squares of the residuals, we get a measure of how well the line fits the data, called the sum-of-squares error.
Residuals and Sum-of-Squares Error (SSE)
A residual is the difference between an observed value and the corresponding predicted value of a function. (A predicted value means a value given by some mathematical model.)
Residual = Observed value - Predicted value
The sum-of-squares error (SSE) when observed data are approximated by a function is the sum of the squares of the residuals:
SSE = (Residual 1)² + (Residual 2)² + ... + (Residual n)²
The smaller the value of SSE, the better the approximating function fits the data.
Example: Referring to the above example, the sum-of-squares error is
SSE = 6² + 2² + 2² + (-3)² = 36 + 4 + 4 + 9 = 53
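If you would like to check this arithmetic with technology, here is a minimal Python sketch (ours, not the book's) that computes the residuals and the SSE from the observed and predicted values above:

    # Residual = Observed value - Predicted value; SSE = sum of the squared residuals.
    observed = [68, 72, 80, 83]
    predicted = [62, 70, 78, 86]
    residuals = [o - p for o, p in zip(observed, predicted)]
    sse = sum(r ** 2 for r in residuals)
    print(residuals)  # [6, 2, 2, -3]
    print(sse)        # 53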
Q OK. So what is the regression line?
A The regression line is the line that gives the smallest possible value of SSE.
Q How do we find this line?
A There are a variety of ways of finding it: most forms of technology have built-in regression routines, and there is one on this web site. However, it is also useful to be able to compute the regression line by hand, and that is what we do next.
Computing the Regression Line
The regression line (least-squares line, best-fit line) associated with the points (x1, y1), (x2, y2), ..., (xn, yn) is the line that gives the minimum sum-of-squares error (SSE). The regression line is
y = mx + b
where m and b are computed as follows:
m = [n Σ(xy) - (Σx)(Σy)] / [n Σ(x²) - (Σx)²]
b = [Σy - m(Σx)] / n
Here, "Σ" means "the sum of." Thus, for example,
Σx = Sum of the x-values = x1 + x2 + ... + xn
Σxy = Sum of the products = x1y1 + x2y2 + ... + xnyn
Σx² = Sum of the squares of the x-values = x1² + x2² + ... + xn²
On the other hand,
(Σx)² = Square of Σx = Square of the sum of the x-values
Finally,
n = Number of data points
The easiest way to compute these values is by using a table, as we show in the following exercise, where we will compute the regression line for the above data.
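If you prefer to check the hand computation with a short program, here is a minimal Python sketch of the formulas above applied to the cable-TV data; the function name regression_line is our own choice for illustration:

    # Least-squares slope m and intercept b, using the formulas in the box above:
    # m = [n*sum(xy) - sum(x)*sum(y)] / [n*sum(x^2) - (sum(x))^2],  b = [sum(y) - m*sum(x)] / n
    def regression_line(xs, ys):
        n = len(xs)
        sum_x, sum_y = sum(xs), sum(ys)
        sum_xy = sum(x * y for x, y in zip(xs, ys))
        sum_x2 = sum(x * x for x in xs)
        m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
        b = (sum_y - m * sum_x) / n
        return m, b

    xs = [0, 1, 2, 3]
    ys = [68, 72, 80, 83]
    print(regression_line(xs, ys))  # approximately (5.3, 67.8) for these data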
There is also material in the book about the correlation coefficient r, which, like SSE, is a way of measuring the goodness of fit of a line to the given data. However, it is more useful than SSE for comparing the goodness of fit of different lines to different data. You will need to read the material on computing r before you can answer all of the exercises in Section 1.5 of the textbook. Also, press "Review Exercises" on the sidebar to see a collection of exercises that covers the whole of Chapter 1.