Finite mathematics online topic: linear and exponential regression 
We have seen how to find a linear model given two data points: We find the equation of the line that passes through them. (See the Topic Summary of Functions for some examples.) However, we often have more than two data points, and they will rarely all lie on a single straight line, but may often come close to doing so. The problem is to find the line coming closest to passing through all of the points.
We start with an attempt to construct a linear demand function. Suppose that your market research of real estate investments reveals the following sales figures for new homes of different prices over the past year.
Price (Thousands of $)  160  180  200  220  240  260  280 
Sales of New Homes This Year  126  103  82  75  82  40  20 
We would like to use these data to construct a demand function for the real estate market. (Recall that a demand function gives demand y, measured here by annual sales, as a function of unit price, x.) Here is a plot of y versus x.
The data definitely suggest a straight line, more or less, and hence a linear relationship between y and x. Here are several possible "straight line fits":
Q Which line best fits the data?
A We would like the sales predicted by the best-fit line (predicted values) to be as close to the actual sales (observed values) as possible. The differences between the predicted values and the observed values, called the residual errors (or just residuals), appear as the vertical distances shown in the figure below.
Q So how do we do that?
A We add up all the squares of the residual errors to get a single error, called the sum of squares error (SSE) and we choose the line that gives the smallest SSE. This line is called the best fit line, regression line, or least squares line associated with the given data.
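So, for the line y = -x + 300, the residuals (observed value minus predicted value) are -14, -17, -18, -5, 22, 0, and 0, so
SSE  =  Sum of squares of residuals 
=  (-14)^2 + (-17)^2 + (-18)^2 + (-5)^2 + 22^2 + 0^2 + 0^2 
=  196 + 289 + 324 + 25 + 484 + 0 + 0 
=  1318 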
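The SSE calculation for a candidate line can be sketched in Python as follows. This is only a minimal illustration: it assumes the candidate line is y = -x + 300 (slope -1, matching the downward trend of the data), and the function name `sse` is introduced here for illustration.

```python
# Housing data from the table above: price (thousands of $) and annual sales.
prices = [160, 180, 200, 220, 240, 260, 280]
sales = [126, 103, 82, 75, 82, 40, 20]


def sse(m, b, xs, ys):
    """Sum of squares error of the line y = m*x + b for the points (xs, ys)."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))


# Residuals (observed minus predicted) for the assumed line y = -x + 300.
residuals = [y - (-1 * x + 300) for x, y in zip(prices, sales)]
print(residuals)                     # [-14, -17, -18, -5, 22, 0, 0]
print(sse(-1, 300, prices, sales))   # 1318
```

Calling `sse` with other slopes and intercepts gives a quick way to compare candidate lines: the smaller the result, the better the fit.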
Q So that is how to calculate SSE for a given line. How do we obtain the equation of the best fit line, that is, the line that gives the lowest value of SSE?
A The formula for the best fit straight line is given below. Justifying it requires some calculus; consult the chapter on functions of several variables in Applied Calculus for a detailed explanation.
Regression (Best Fit) Line
The best fit line associated with the n points (x_1,y_1), (x_2,y_2), \dots ,(x_n,y_n) has the form
y = mx + b
where
\text{Slope} = m = \frac{n\sum xy - \big(\sum x\big)\big(\sum y\big)}{n\sum (x^2) - \big(\sum x\big)^2}
\text{Intercept} = b = \frac{\sum y - m\big(\sum x\big) }{n}
Here, \Sigma means "the sum of." Thus \begin{align*}&\sum xy = \text{ sum of products }= x_1y_1+x_2y_2 + \dots + x_ny_n\\
&\sum x = \text{ sum of values of }x = x_1+x_2+\dots +x_n\\
&\sum y = \text{ sum of values of }y = y_1+y_2+\dots +y_n\\
&\sum x^2 = \text{ sum of values of }x^2 = x_1^2+x_2^2+\dots +x_n^2\\
\end{align*}
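The slope and intercept formulas translate almost directly into code. The following is a minimal Python sketch, not a definitive implementation: the function names `best_fit_line` and `sse` and the four data points are hypothetical, chosen only to exercise the formula.

```python
def best_fit_line(xs, ys):
    """Return (m, b) for the least squares line y = m*x + b."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    denominator = n * sum_x2 - sum_x ** 2
    if denominator == 0:
        raise ValueError("all x-values are equal, so the slope is undefined")
    m = (n * sum_xy - sum_x * sum_y) / denominator
    b = (sum_y - m * sum_x) / n
    return m, b


def sse(m, b, xs, ys):
    """Sum of squares error of the line y = m*x + b."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data points, used only to exercise the formula.
xs = [0, 1, 2, 3]
ys = [1, 3, 4, 8]
m, b = best_fit_line(xs, ys)
print(f"y = {m:.2f}x + {b:.2f}")   # y = 2.20x + 0.70

# Nudging the slope or the intercept away from the formula's values
# can only increase SSE.
best = sse(m, b, xs, ys)
print(best <= sse(m + 0.1, b, xs, ys), best <= sse(m, b + 0.1, xs, ys))
```

The guard against a zero denominator covers the case in which every point has the same x-value, where no non-vertical line fits best.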

Using the formula above is easy, as the following example shows.
\pmb x  1  2  3  4 
\pmb y  1.5  1.6  2.1  3.0 
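Solution In order to apply the formula, it is best to organize the data in a table as shown, with a column for each of x, y, xy, and x^2 and a final row giving the sum of each column.
\pmb x  \pmb y  \pmb {xy}  \pmb {x^2} 
1  1.5  1.5  1 
2  1.6  3.2  4 
3  2.1  6.3  9 
4  3.0  12.0  16 
\sum x = 10  \sum y = 8.2  \sum xy = 23  \sum x^2 = 30 
Substituting the values from the above table into the formula gives (n = 4)
m = \frac{4(23) - (10)(8.2)}{4(30) - 10^2} = \frac{10}{20} = 0.5
b = \frac{8.2 - (0.5)(10)}{4} = \frac{3.2}{4} = 0.8
Thus our least squares line is
y = 0.5x + 0.8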
Before we go on... Here is a plot of the data points and the least squares line:
Notice that the line doesn't pass through even one of the points, and yet it is the straight line that best approximates them.
Let us now return to the data on demand for real estate with which we began this topic.
Find a linear demand equation that best fits the following data, and use it to predict annual sales of homes priced at $140,000.
Price (Thousands of $)  160  180  200  220  240  260  280 
Sales of New Homes This Year  126  103  82  75  82  40  20 
Solution Here is a table like the one we used above to organize the calculations.
\pmb x  \pmb y  \pmb {xy}  \pmb {x^2} 
160  126  20,160  25,600 
180  103  18,540  32,400 
200  82  16,400  40,000 
220  75  16,500  48,400 
240  82  19,680  57,600 
260  40  10,400  67,600 
280  20  5,600  78,400 
\sum x = 1540  \sum y = 528  \sum xy = 107,280  \sum x^2 = 350,000 
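Substituting these values in the formula gives (n = 7)
m = \frac{7(107280) - (1540)(528)}{7(350000) - 1540^2} = \frac{-62160}{78400} \approx -0.7929
b \approx \frac{528 - (-0.7928571429)(1540)}{7} \approx 249.9
Notice that we used the most accurate value, m \approx -0.7928571429, that we could obtain on our calculator in the formula for b rather than the rounded value of -0.7929. This illustrates the following important general guideline: in intermediate calculations, use the most accurate values available, and round only the final answer.
Thus the regression line is
y = -0.7929x + 249.9
We can now use this equation to predict the annual sales of homes priced at $140,000:
y = -0.7929(140) + 249.9 \approx 139 \text{ homes}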
Before we go on... Here is the original data, together with the least squares line.
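The calculation for the housing data can be checked with a few lines of Python that start from the column sums in the table. This is a sketch for illustration; the variable names are our own, and the last lines show what happens to b if the rounded slope is used instead of the full-precision one.

```python
# Column sums taken from the table for the housing data (n = 7 points).
n = 7
sum_x, sum_y, sum_xy, sum_x2 = 1540, 528, 107280, 350000

m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b = (sum_y - m * sum_x) / n
print(f"m = {m:.10f}")   # m = -0.7928571429
print(f"b = {b:.4f}")

# Predicted annual sales at a price of $140,000 (x = 140).
prediction = m * 140 + b
print(f"predicted sales: {prediction:.0f}")   # predicted sales: 139

# Using the rounded slope in the formula for b shifts the intercept slightly.
m_rounded = round(m, 4)
b_from_rounded = (sum_y - m_rounded * sum_x) / n
print(f"b from rounded m = {b_from_rounded:.4f}")
```

The two intercepts differ by only about 0.01 here, but the discrepancy grows with the size of \sum x, which is why rounding is best left to the final answer.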
Q If the given data points all happen to lie on a straight line, is this the line we get by the best fit method?
A Yes. If the given points all lie on a line, then the smallest possible value of SSE is zero, attained by the line that passes through all the points. This has the following implication: you can use linear regression on a graphing calculator or the regression tool on this web site to check your calculations of the equation of a straight line passing through two specified points.
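As a quick check of this claim, the sketch below compares the usual two-point formula for a line with the regression formula. The points (1, 3), (4, 9), and (7, 15) are hypothetical, chosen to lie on y = 2x + 1, and `best_fit_line` is an illustrative name.

```python
def best_fit_line(xs, ys):
    """Return (m, b) for the least squares line y = m*x + b."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b = (sum_y - m * sum_x) / n
    return m, b


# Line through two hypothetical points, computed the usual way.
(x1, y1), (x2, y2) = (1, 3), (4, 9)
slope = (y2 - y1) / (x2 - x1)
intercept = y1 - slope * x1

# The regression formula applied to the same two points.
m, b = best_fit_line([x1, x2], [y1, y2])
print((slope, intercept), (m, b))   # (2.0, 1.0) (2.0, 1.0)

# Adding a third point on the same line leaves the answer unchanged, with SSE = 0.
xs, ys = [1, 4, 7], [3, 9, 15]
m3, b3 = best_fit_line(xs, ys)
sse3 = sum((y - (m3 * x + b3)) ** 2 for x, y in zip(xs, ys))
print(m3, b3, sse3)                 # 2.0 1.0 0.0
```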
Q If the given points do not lie on a straight line, is there a way we can tell how far off they are from lying on a straight line?
A There is a way of measuring the "goodness of fit" of the least squares line, called the coefficient of correlation. This is a number r between -1 and 1. The closer it is to -1 or 1, the better the fit. For an exact fit, we would have r = -1 for a negative slope line or r = 1 for a positive slope line. For a bad fit, we would have r close to 0. The figure below shows several collections of data points with their regression lines and corresponding values of r.
The correlation coefficient can be calculated with the following formula. (To justify this formula requires a fair knowledge of statistics, so we shall not attempt to do so here.)
Coefficient of Correlation
\text{Coefficient of Correlation }
= r = \frac{n\bigl(\sum xy\bigr) - \bigl(\sum x\bigr)\bigl(\sum y\bigr)}{\sqrt{n\bigl(\sum x^2\bigr) - \bigl(\sum x\bigr)^2}\cdot \sqrt{n\bigl(\sum y^2\bigr) - \bigl(\sum y\bigr)^2}}
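The correlation formula can be sketched in Python in the same way as the regression formula. The function name `correlation` is our own; the data are the housing prices and sales used earlier in this topic, plus two small hypothetical data sets that lie exactly on a line.

```python
import math


def correlation(xs, ys):
    """Coefficient of correlation r for the points (xs, ys)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = (math.sqrt(n * sum_x2 - sum_x ** 2)
                   * math.sqrt(n * sum_y2 - sum_y ** 2))
    return numerator / denominator


# Housing data: price (thousands of $) and annual sales.
prices = [160, 180, 200, 220, 240, 260, 280]
sales = [126, 103, 82, 75, 82, 40, 20]
r_housing = correlation(prices, sales)
print(f"r = {r_housing:.3f}")   # r = -0.954

# Hypothetical points lying exactly on a line give r = 1 or r = -1
# (up to floating-point rounding).
r_up = correlation([1, 2, 3], [2, 4, 6])
r_down = correlation([1, 2, 3], [6, 4, 2])
print(round(r_up, 6), round(r_down, 6))
```

A value of about -0.954 indicates a fairly good fit by a line of negative slope, consistent with the plot of the housing data.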

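Q Now we know how to fit a straight line to given data. What about an exponential curve of the form
y = Ar^x?
A We can reduce the problem to linear regression. Start with the exponential function
y = Ar^x
and take the logarithm of both sides:
\log y = \log(Ar^x)
The properties of logarithms give
\log y = \log A + x\log r
This expresses \log y as a linear function of x, with
\text{Slope} = m = \log r
\text{Intercept} = b = \log A
Therefore, if we find the best-fit line using \log y as a function of x, the slope and intercept will be given as above, and so we can obtain the coefficients r and A by
r = 10^m
A = 10^b
(Here \log denotes the base 10 logarithm.)
To summarize,
Exponential Regression
To obtain a best-fit exponential curve of the form y = Ar^x, first find the regression line y = mx + b using the data (x, \log y). The desired exponential model is then y = Ar^x, where r = 10^m and A = 10^b.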

Revenues from sales of Compaq computers (a brand now extinct) are shown in the following table, where t represents time in years since 1990.* Obtain an exponential regression model for the data.
\pmb t = Year (1990 = 0)  0  2  4  7 
\pmb R = Revenue ($ billion)  3  4  11  25 
* Data are rounded. Source: Company Reports/The New York Times, January 27, 1998, p. D1.
Solution Since we need to model \log R as a linear function of t, we first make a table with x = t and y = \log R, and then calculate the regression line, y = mx + b.
\pmb{x\ (= t)}  0  2  4  7 
\pmb{y\ (= \log R)}  0.477121  0.602060  1.04139  1.39794 
Instead of doing this calculation by hand as we did in the above examples, you can do it automatically using the online regression utility on this site. Just enter the x- and y-values in the table, and press the "y = mx+b" button. (Yes, that utility does exponential regression as well, but we would like you to know how it works!)
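The linear regression model we obtain is
y = 0.13907x + 0.42765
Thus, the desired exponential model is
y = 10^{0.42765}\big(10^{0.13907}\big)^x \approx 2.6770(1.3774)^x
This gives our revenue model as
R = 2.6770(1.3774)^t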
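The whole procedure (take base 10 logarithms, fit a line, then convert the slope and intercept back) can be sketched in Python as follows. The function names `best_fit_line` and `exponential_fit` are illustrative, and the code uses unrounded logarithms rather than the six-digit values in the table.

```python
import math


def best_fit_line(xs, ys):
    """Return (m, b) for the least squares line y = m*x + b."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b = (sum_y - m * sum_x) / n
    return m, b


def exponential_fit(xs, ys):
    """Return (A, r) for y = A*r^x, fitted by linear regression on (x, log y)."""
    log_ys = [math.log10(y) for y in ys]   # requires every y to be positive
    m, b = best_fit_line(xs, log_ys)
    return 10 ** b, 10 ** m


# Compaq revenue data: t = years since 1990, R = revenue in $ billion.
t = [0, 2, 4, 7]
revenue = [3, 4, 11, 25]
A, r = exponential_fit(t, revenue)
print(f"R = {A:.4f}({r:.4f})^t")   # R = 2.6770(1.3774)^t
```

Because the method takes logarithms of the y-values, it applies only to data in which every y-value is positive.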
Before we go on... Go to the online regression utility, enter the original data (before you took the logarithms) and press the "y = a(b^x)" button. What do you find?
Note: Since we have taken logarithms before doing the linear regression, it follows that the exponential regression curve does not minimize SSE for the original data; instead, it minimizes SSE for the transformed data, that is, for the data (x, \log y). Thus, the exponential regression curve is not the best-fit curve in the "strict" sense. See the textbook "Applied Calculus" by Waner & Costenoble for a method to obtain such a best-fit curve.
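One way to see this numerically is sketched below for the Compaq data. The code fits the exponential model through logarithms, then keeps r fixed and replaces A by the value that minimizes SSE on the original data; with r fixed, the model is linear in A, so that value is \sum y r^x divided by \sum r^{2x}. This is only an illustration that a smaller SSE is possible. It is not the method from the textbook and does not produce the strict best-fit curve, and all names are our own.

```python
import math

# Compaq revenue data: x = years since 1990, y = revenue in $ billion.
xs = [0, 2, 4, 7]
ys = [3, 4, 11, 25]


def best_fit_line(xs, ys):
    """Return (m, b) for the least squares line y = m*x + b."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    return m, (sum_y - m * sum_x) / n


def sse_exponential(A, r):
    """SSE of the curve y = A*r^x measured on the original (untransformed) data."""
    return sum((y - A * r ** x) ** 2 for x, y in zip(xs, ys))


# Exponential regression through logarithms, as described in this topic.
m, b = best_fit_line(xs, [math.log10(y) for y in ys])
A, r = 10 ** b, 10 ** m

# Keep r fixed and choose the A that minimizes SSE on the original data.
A_rescaled = sum(y * r ** x for x, y in zip(xs, ys)) / sum(r ** (2 * x) for x in xs)

sse_log_fit = sse_exponential(A, r)
sse_rescaled = sse_exponential(A_rescaled, r)
print(f"SSE of log-based fit: {sse_log_fit:.4f}")
print(f"SSE after rescaling A: {sse_rescaled:.4f}")
print(sse_rescaled < sse_log_fit)   # True
```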
On the TI-83/84, you will find all of these, as well as the following:
Last Updated: January 2008
Copyright © 2008 Stefan Waner