Finite mathematics on-line topic: linear and exponential regression


We have seen how to find a linear model given two data points: We find the equation of the line that passes through them. (See the Topic Summary of Functions for some examples.) However, we often have more than two data points, and they will rarely all lie on a single straight line, but may often come close to doing so. The problem is to find the line coming closest to passing through all of the points.

1. Best Fit Straight Line (Regression Line)

We start with an attempt to construct a linear demand function. Suppose that your market research of real estate investments reveals the following sales figures for new homes of different prices over the past year.

Price (Thousands of $) 160 180 200 220 240 260 280
Sales of New Homes This Year 126 103 82 75 82 40 20

We would like to use these data to construct a demand function for the real estate market. (Recall that a demand function gives demand y, measured here by annual sales, as a function of unit price, x.) Here is a plot of y versus x.

The data definitely suggest a straight line, more-or-less, and hence a linear relationship between y and x. Here are several possible "straight line fits:"

Q Which line best fits the data?
A We would like the sales predicted by the best-fit line (predicted values) to be as close to the actual sales (observed values) as possible. The differences between the observed values and the predicted values, called the residual errors (or just residuals), appear as the vertical distances shown in the figure below.

Residual = Observed value - Predicted value

Q So how do we do that?
A We add up all the squares of the residual errors to get a single error, called the sum of squares error (SSE) and we choose the line that gives the smallest SSE. This line is called the best fit line, regression line, or least squares line associated with the given data.

 Example 1: Calculating SSE for a given line

Suppose we wish to calculate SSE for a specific straight line, such as y = -x + 300 as shown below:

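We have the following table of values:

x      y Observed    y Predicted \hat{y} = -x + 300    Residual y - \hat{y}
160    126           140                               -14
180    103           120                               -17
200    82            100                               -18
220    75            80                                -5
240    82            60                                22
260    40            40                                0
280    20            20                                0

So, for the line y = -x + 300,

    \text{SSE} = (-14)^2 + (-17)^2 + (-18)^2 + (-5)^2 + 22^2 + 0^2 + 0^2 = 1318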
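
The SSE calculation for a candidate line can also be sketched in a few lines of Python. This is only an illustration: the function name `sse` and the variable names are our own choices, not part of the original tutorial.

```python
# Housing data from the table above: price (thousands of $) and annual sales.
prices = [160, 180, 200, 220, 240, 260, 280]
sales = [126, 103, 82, 75, 82, 40, 20]


def sse(xs, ys, m, b):
    """Sum of squared residuals for the line y = m*x + b."""
    total = 0
    for x, y in zip(xs, ys):
        predicted = m * x + b
        residual = y - predicted      # observed value minus predicted value
        total += residual ** 2
    return total


# Candidate line from Example 1: y = -x + 300
print(sse(prices, sales, -1, 300))    # prints 1318
```

Trying other slopes and intercepts in the last line shows how SSE changes from one candidate line to another.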

Q So that is to calculate SSE for a given line. How do we obtain the equation of the best fit line; the line that gives the lowest value of SSE?
A Following is the formula for the best fit straight line. To justify it requires some calculus. Consult the chapter on several variables in Applied Calculus for a detailed explanation.

Regression (Best Fit) Line

The best fit line associated with the n points (x_1,y_1), (x_2,y_2), \dots ,(x_n,y_n) has the form

    y = mx + b
    \text{Slope} = m = \frac{n\sum xy - \big(\sum x\big)\big(\sum y\big)}{n\sum (x^2) - \big(\sum x\big)^2}

    \text{Intercept} = b = \frac{\sum y - m\big(\sum x\big) }{n}

Here, \Sigma means "the sum of." Thus

    \begin{align*}&\sum xy = \text{ sum of products }= x_1y_1+x_2y_2 + \dots + x_ny_n\\ &\sum x = \text{ sum of values of }x = x_1+x_2+\dots +x_n\\ &\sum y = \text{ sum of values of }y = y_1+y_2+\dots +y_n\\ &\sum x^2 = \text{ sum of values of }x^2 = x_1^2+x_2^2+\dots +x_n^2\\ \end{align*}
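
As a minimal sketch, the slope and intercept formulas translate directly into Python. The function name `regression_line` is hypothetical, and the three sample points are made up so that they lie exactly on y = 2x + 1.

```python
def regression_line(xs, ys):
    """Return (m, b) for the least squares line y = m*x + b."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)

    denominator = n * sum_x2 - sum_x ** 2
    if denominator == 0:
        raise ValueError("all x-values are equal; the slope is undefined")

    m = (n * sum_xy - sum_x * sum_y) / denominator
    b = (sum_y - m * sum_x) / n       # uses the unrounded slope
    return m, b


# Illustrative points chosen to lie exactly on y = 2x + 1.
m, b = regression_line([1, 2, 3], [3, 5, 7])
print(m, b)                           # prints 2.0 1.0
```

Because the sample points are collinear, the function returns the line through them.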

Using the formula above is easy, as the following example shows.

 Example 2: Computing a Regression Line by Hand

Find the least squares line associated with the following data:

\pmb x 1 2 3 4
\pmb y 1.

Solution To apply the formula, it is best to organize the data in a table with columns for x, y, xy, and x^2, and then to sum each column.

\pmb x    \pmb y    \pmb{xy}    \pmb{x^2}
\sum x = 10    \sum y = 8.2    \sum xy =       \sum x^2 =      

Substituting the correct values from the above table into the formula gives

Thus our least squares line is

Before we go on... Here is a plot of the data points and the least squares line:

Notice that the line doesn't pass through even one of the points, and yet it is the straight line that best approximates them.

Let us now return to the data on demand for real estate with which we began this topic.

 Example 3: Demand Function

Find a linear demand equation that best fits the following data, and use it to predict annual sales of homes priced at $140,000.

Price (Thousands of $) 160 180 200 220 240 260 280
Sales of New Homes This Year 126 103 82 75 82 40 20

Solution Here is a table like the one we used above to organize the calculations.

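\pmb x    \pmb y    \pmb{xy}    \pmb{x^2}
160       126       20,160      25,600
180       103       18,540      32,400
200       82        16,400      40,000
220       75        16,500      48,400
240       82        19,680      57,600
260       40        10,400      67,600
280       20        5,600       78,400
\sum x = 1540    \sum y = 528    \sum xy = 107,280    \sum x^2 = 350,000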

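Substituting these values in the formula gives (with n = 7)

    m = \frac{7(107{,}280) - (1540)(528)}{7(350{,}000) - 1540^2} = \frac{-62{,}160}{78{,}400} \approx -0.7929

    b = \frac{528 - (-0.7928571429)(1540)}{7} \approx 249.9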

Notice that we used the most accurate value, m \approx -0.7928571429, that we could obtain on our calculator in the formula for b rather than the rounded value of -0.7929. This illustrates the following important general guideline:

When calculating, never round intermediate results. Rather, use the most accurate results obtainable, using the values stored on your calculator or computer if possible.

Thus the regression line is

    y = -0.7929x + 249.9

We can now use this equation to predict the annual sales of homes priced at $140,000:

    y = -0.7929(140) + 249.9 \approx 139 \text{ homes}

Before we go on... Here is the original data, together with the least squares line.
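
The arithmetic in Example 3 can be checked with a short Python sketch. The variable names are our own, and the script simply applies the slope and intercept formulas to the price and sales data.

```python
prices = [160, 180, 200, 220, 240, 260, 280]   # x, thousands of $
sales = [126, 103, 82, 75, 82, 40, 20]         # y, homes sold this year

n = len(prices)
sum_x = sum(prices)
sum_y = sum(sales)
sum_xy = sum(x * y for x, y in zip(prices, sales))
sum_x2 = sum(x * x for x in prices)

m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b = (sum_y - m * sum_x) / n          # uses the unrounded m, as advised above

predicted = m * 140 + b              # predicted sales at a price of $140,000
print(round(m, 4), round(b, 2), round(predicted))
```

Rounding happens only when printing, so the intermediate results keep full precision.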

Q If the given data points all happen to lie on a straight line, is this the line we get by the best fit method?
A Yes. If the given points all lie on a line, then the smallest possible value of SSE is zero, attained by the line that passes through all the points. This has the following implication: you can use linear regression on a graphing calculator or the regression tool on this web site to check your calculations of the equation of a straight line passing through two specified points.

Q If the given points do not lie on a straight line, is there a way we can tell how far off they are from lying on a straight line?
A There is a way of measuring the "goodness of fit" of the least squares line, called the coefficient of correlation. This is a number r between -1 and 1. The closer it is to -1 or 1, the better the fit. For an exact fit, we would have r = -1 for a negative slope line or r = 1 for a positive slope line. For a bad fit, we would have r close to 0. The figure below shows several collections of data points with their regression lines and corresponding values of r.

The correlation coefficient can be calculated with the following formula. (To justify this formula requires a fair knowledge of statistics, so we shall not attempt to do so here.)

Coefficient of Correlation



\text{Coefficient of Correlation } = r = \frac{n\bigl(\sum xy\bigr) - \bigl(\sum x\bigr)\bigl(\sum y\bigr)}{\sqrt{n\bigl(\sum x^2\bigr) - \bigl(\sum x\bigr)^2}\cdot \sqrt{n\bigl(\sum y^2\bigr) - \bigl(\sum y\bigr)^2}}
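
The following Python sketch applies the correlation formula to the real estate data from Example 3. The function name `correlation` is hypothetical, and the sketch assumes that neither the x-values nor the y-values are all equal (otherwise the denominator is zero).

```python
import math


def correlation(xs, ys):
    """Coefficient of correlation r for the points (xs[i], ys[i])."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)

    numerator = n * sum_xy - sum_x * sum_y
    denominator = (math.sqrt(n * sum_x2 - sum_x ** 2)
                   * math.sqrt(n * sum_y2 - sum_y ** 2))
    return numerator / denominator


prices = [160, 180, 200, 220, 240, 260, 280]
sales = [126, 103, 82, 75, 82, 40, 20]

r = correlation(prices, sales)
print(round(r, 3))   # about -0.954: a fairly good fit with negative slope
```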

2. Best Fit Exponential Curve (Regression Exponential Curve)

Q Now we know how to fit a straight line to given data. What about an exponential curve of the form

    y = Ar^x?

A The idea is to convert an exponential curve to a linear one using logarithms, as follows:

Start with the exponential function

    y = Ar^x

and take the logarithm of both sides:

    \log y = \log(Ar^x)

The properties of logarithms give

    \log y = \log A + x\log r

This expresses \log y as a linear function of x, with

    \text{Slope} = m = \log r
    \text{Intercept} = b = \log A

Therefore, if we find the best-fit line using \log y as a function of x, the slope and intercept will be given as above, and so we can obtain the coefficients r and A by

    r = 10^m
    A = 10^b

To summarize,

Exponential Regression

To obtain a best-fit exponential curve of the form

    y = Ar^x
  1. Find the regression line for the data (x, \log y).
  2. The desired coefficients A and r are then
      \begin{align*} r &= 10^m\\ A &= 10^b \end{align*}
    where m and b are the slope and intercept of the regression line.
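
The two steps above can be sketched in Python as follows. The function names `linear_fit` and `exponential_fit` are our own, the logarithm is taken to base 10 to match r = 10^m and A = 10^b, and the sample data are made up so that they lie exactly on y = 3(2^x). The sketch assumes every y-value is positive, since otherwise \log y is undefined.

```python
import math


def linear_fit(xs, ys):
    """Slope and intercept of the least squares line through (xs[i], ys[i])."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b = (sum_y - m * sum_x) / n
    return m, b


def exponential_fit(xs, ys):
    """Return (A, r) for the curve y = A * r**x, fitted via (x, log10 y)."""
    log_ys = [math.log10(y) for y in ys]   # step 1: transform the data
    m, b = linear_fit(xs, log_ys)          # regression line for (x, log y)
    return 10 ** b, 10 ** m                # step 2: A = 10^b, r = 10^m


# Illustrative data lying exactly on y = 3 * 2**x.
A, r = exponential_fit([0, 1, 2, 3], [3, 6, 12, 24])
print(A, r)   # approximately 3 and 2, up to floating-point rounding
```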

 Example 4: Sales of Compaq

Revenues from sales of Compaq computers (a brand now extinct) are shown in the following table, where t represents time in years since 1990.* Obtain an exponential regression model for the data.

\pmb t = Year (1990 = 0) 0 2 4 7
\pmb R = Revenue ($ billion) 3 4 11 25

* Data are rounded. Source: Company Reports/The New York Times, January 27, 1998, p. D1.

Solution Since we need to model \log R as a linear function of t, we first make a table with x = t and y = \log R, and then calculate the regression line, y = mx + b.

\pmb{x\ (= t)} 0 2 4 7
\pmb{y\ (= \log R)} 0.477121 0.602060 1.04139 1.39794

Instead of doing this calculation by hand as we did in the above examples, you can do it automatically using the on-line regression utility on this site. Just enter the x- and y-values in the table, and press the "y = mx+b" button. (Yes, that utility does exponential regression as well, but we would like you to know how it works!)

The linear regression model we obtain is

    y = 0.13907x + 0.42765

Thus, the desired exponential model is

    R = Ar^t

where r = 10^m = 10^{0.13907} \approx 1.3774, and A = 10^{0.42765} \approx 2.6770.

This gives our revenue model as

    R = 2.6770(1.3774)^t
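
The coefficients in Example 4 can be reproduced with a short Python sketch. It is only a check of the arithmetic under the same procedure (regress \log R on t, then convert back); the variable names are our own, and `base` stands for the coefficient r.

```python
import math

years = [0, 2, 4, 7]        # t, years since 1990
revenue = [3, 4, 11, 25]    # R, $ billion

logs = [math.log10(R) for R in revenue]     # y = log R

n = len(years)
sum_x, sum_y = sum(years), sum(logs)
sum_xy = sum(x * y for x, y in zip(years, logs))
sum_x2 = sum(x * x for x in years)

m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b = (sum_y - m * sum_x) / n

base = 10 ** m              # the coefficient r in R = A r^t
A = 10 ** b

print(round(m, 5), round(b, 5))
print(round(base, 4), round(A, 4))

# Compare the model's revenue with the observed revenue for each year.
for t, R in zip(years, revenue):
    print(t, R, round(A * base ** t, 2))
```

The last loop lists the observed and modeled revenue side by side, which gives a quick sense of how closely the exponential curve follows the data.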

Before we go on... Go to the on-line regression utility, enter the original data (before you took the logarithms) and press the "y = a(b^x)" button. What do you find?

Note: Since we have taken logarithms before doing the linear regression, it follows that the exponential regression curve does not minimize SSE for the original data; instead, it minimizes SSE for the transformed data, that is, for the data (x, \log y). Thus, the exponential regression curve is not the best-fit curve in the "strict" sense. See the textbook "Applied Calculus" by Waner & Costenoble for a method to obtain such a best-fit curve.

3. Other Forms of Regression

At the on-line regression utility, you can also find regression curves of the following forms:

On the TI-83/84, you will find all of these, as well as the following:

Last Updated: January 2008
Copyright © 2008 Stefan Waner
