Finite mathematics on-line topic: linear and exponential regression

We have seen how to find a linear model given two data points: We find the equation of the line that passes through them. (See the Topic Summary of Functions for some examples.) However, we often have more than two data points, and they will rarely all lie on a single straight line, but may often come close to doing so. The problem is to find the line coming closest to passing through all of the points.

## 1. Best Fit Straight Line (Regression Line)

We start with an attempt to construct a linear demand function. Suppose that your market research of real estate investments reveals the following sales figures for new homes of different prices over the past year.

| Price (thousands of \$) | 160 | 180 | 200 | 220 | 240 | 260 | 280 |
|---|---|---|---|---|---|---|---|
| Sales of new homes this year | 126 | 103 | 82 | 75 | 82 | 40 | 20 |

We would like to use these data to construct a demand function for the real estate market. (Recall that a demand function gives demand $y$, measured here by annual sales, as a function of unit price, $x$.) Here is a plot of $y$ versus $x$.

The data definitely suggest a straight line, more-or-less, and hence a linear relationship between $y$ and $x$. Here are several possible "straight line fits":

Q Which line best fits the data?

A We would like the sales predicted by the best-fit line (predicted values) to be as close to the actual sales (observed values) as possible. The differences between the observed values and the predicted values, called the residual errors (or just residuals), appear as the vertical distances shown in the figure below.

Residual = Observed value - Predicted value

Q So how do we do that?

A We add up all the squares of the residual errors to get a single error, called the sum of squares error (SSE), and we choose the line that gives the smallest SSE. This line is called the best fit line, regression line, or least squares line associated with the given data.

### Example 1: Calculating SSE for a given line

Suppose we wish to calculate SSE for a specific straight line, such as $y = -x + 300$, as shown below. We have the following table of values:

| $x$ | Observed $y$ | Predicted $\hat{y} = -x + 300$ | Residual $y - \hat{y}$ |
|---|---|---|---|
| 160 | 126 | 140 | -14 |
| 180 | 103 | 120 | -17 |
| 200 | 82 | 100 | -18 |
| 220 | 75 | 80 | -5 |
| 240 | 82 | 60 | 22 |
| 260 | 40 | 40 | 0 |
| 280 | 20 | 20 | 0 |

So, for the line $y = -x + 300$,

$$\text{SSE} = \text{Sum of squares of residuals} = (-14)^2 + (-17)^2 + (-18)^2 + (-5)^2 + 22^2 + 0^2 + 0^2 = 1318.$$

Q So that is how to calculate SSE for a given line. How do we obtain the equation of the best fit line, that is, the line that gives the lowest value of SSE?

A The formula for the best fit straight line appears below. Justifying it requires some calculus; consult the chapter on several variables in Applied Calculus for a detailed explanation.
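Before turning to the formula, the SSE calculation in Example 1 can be checked with a short script. The following is a minimal Python sketch, not part of the original tutorial; the function name `sse` and the variable names are illustrative assumptions.

```python
# Housing data from Example 1: price (thousands of dollars) and annual sales.
prices = [160, 180, 200, 220, 240, 260, 280]
sales = [126, 103, 82, 75, 82, 40, 20]


def sse(xs, ys, m, b):
    """Sum of squares error of the line y = m*x + b against the observed points."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))


# Residuals (observed minus predicted) for the trial line y = -x + 300.
residuals = [y - (-x + 300) for x, y in zip(prices, sales)]
print(residuals)                     # [-14, -17, -18, -5, 22, 0, 0]
print(sse(prices, sales, -1, 300))   # 1318
```

Calling `sse` with other slopes and intercepts gives a quick way to compare candidate lines: the smaller the value, the better the fit.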
Regression (Best Fit) Line

The best fit line associated with the $n$ points $(x_1,y_1), (x_2,y_2), \dots, (x_n,y_n)$ has the form

$$y = mx + b$$

where

\begin{align*}
\text{Slope} &= m = \frac{n\sum xy - \big(\sum x\big)\big(\sum y\big)}{n\sum (x^2) - \big(\sum x\big)^2}\\
\text{Intercept} &= b = \frac{\sum y - m\big(\sum x\big)}{n}
\end{align*}

Here, $\Sigma$ means "the sum of." Thus

\begin{align*}
&\sum xy = \text{ sum of products } = x_1y_1+x_2y_2 + \dots + x_ny_n\\
&\sum x = \text{ sum of values of }x = x_1+x_2+\dots +x_n\\
&\sum y = \text{ sum of values of }y = y_1+y_2+\dots +y_n\\
&\sum x^2 = \text{ sum of values of }x^2 = x_1^2+x_2^2+\dots +x_n^2
\end{align*}

Using the formula above is easy, as the following example shows.

### Example 2: Computing a Regression Line by Hand

Find the least squares line associated with the following data:

| $x$ | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| $y$ | 1.5 | 1.6 | 2.1 | 3 |

Solution In order to apply the formula, it is best to organize the data in a table as shown, with a column for each of $xy$ and $x^2$ and the sum of each column in the last row.

| $x$ | $y$ | $xy$ | $x^2$ |
|---|---|---|---|
| 1 | 1.5 | 1.5 | 1 |
| 2 | 1.6 | 3.2 | 4 |
| 3 | 2.1 | 6.3 | 9 |
| 4 | 3.0 | 12.0 | 16 |
| $\sum x = 10$ | $\sum y = 8.2$ | $\sum xy = 23$ | $\sum x^2 = 30$ |

Substituting the values from the above table into the formula (with $n = 4$) gives

\begin{align*}
\text{Slope} &= m = \frac{n\sum xy - \big(\sum x\big)\big(\sum y\big)}{n\sum (x^2) - \big(\sum x\big)^2} = \frac{4(23) - (10)(8.2)}{4(30) - 10^2} = 0.5\\
\text{Intercept} &= b = \frac{\sum y - m\big(\sum x\big)}{n} = \frac{8.2 - (0.5)(10)}{4} = 0.8
\end{align*}

Thus our least squares line is

$$y = 0.5x + 0.8.$$

Before we go on... Here is a plot of the data points and the least squares line:

Notice that the line doesn't pass through even one of the points, and yet it is the straight line that best approximates them.

Let us now return to the data on demand for real estate with which we began this topic.
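Before returning to the real estate data, here is how the hand calculation in Example 2 might be reproduced in Python. This is a minimal sketch that applies the slope and intercept formulas directly; the function name `regression_line` is an illustrative assumption, not something defined in the tutorial.

```python
def regression_line(xs, ys):
    """Return (m, b) for the least squares line y = m*x + b."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    # Use the unrounded slope when computing the intercept.
    b = (sum_y - m * sum_x) / n
    return m, b


# Data from Example 2.
m, b = regression_line([1, 2, 3, 4], [1.5, 1.6, 2.1, 3.0])
print(round(m, 4), round(b, 4))   # 0.5 0.8
```

The same function can be applied to any list of data points, provided the $x$-values are not all equal (otherwise the denominator in the slope formula is zero).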
### Example 3: Demand Function

Find a linear demand equation that best fits the following data, and use it to predict annual sales of homes priced at \$140,000.

| Price (thousands of \$) | 160 | 180 | 200 | 220 | 240 | 260 | 280 |
|---|---|---|---|---|---|---|---|
| Sales of new homes this year | 126 | 103 | 82 | 75 | 82 | 40 | 20 |

Solution Here is a table like the one we used above to organize the calculations.

| $x$ | $y$ | $xy$ | $x^2$ |
|---|---|---|---|
| 160 | 126 | 20,160 | 25,600 |
| 180 | 103 | 18,540 | 32,400 |
| 200 | 82 | 16,400 | 40,000 |
| 220 | 75 | 16,500 | 48,400 |
| 240 | 82 | 19,680 | 57,600 |
| 260 | 40 | 10,400 | 67,600 |
| 280 | 20 | 5,600 | 78,400 |
| $\sum x = 1540$ | $\sum y = 528$ | $\sum xy = 107{,}280$ | $\sum x^2 = 350{,}000$ |

Substituting these values in the formula (with $n = 7$) gives

\begin{align*}
\text{Slope} &= m = \frac{n\sum xy - \big(\sum x\big)\big(\sum y\big)}{n\sum (x^2) - \big(\sum x\big)^2} = \frac{7(107{,}280) - (1540)(528)}{7(350{,}000) - 1540^2} \approx -0.7929\\
\text{Intercept} &= b = \frac{\sum y - m\big(\sum x\big)}{n} \approx \frac{528 - (-0.7928571429)(1540)}{7} \approx 249.9
\end{align*}

Notice that, in the formula for $b$, we used the most accurate value we could obtain on our calculator, $m \approx -0.7928571429$, rather than the rounded value of $-0.7929$. This illustrates the following important general guideline:

When calculating, never round intermediate results. Rather, use the most accurate results obtainable, using the values stored on your calculator or computer if possible.

Thus the regression line is

$$y = -0.7929x + 249.9$$

We can now use this equation to predict the annual sales of homes priced at \$140,000:

Annual sales of homes priced at \$140,000 $\approx -0.7929(140) + 249.9 \approx 139$ (rounded to the nearest whole number).

Before we go on... Here is the original data, together with the least squares line.

Q If the given data points all happen to lie on a straight line, is this the line we get by the best fit method?

A Yes. If the given points all lie on a line, then the smallest possible value of SSE is zero, attained by the line that passes through all the points. This has the following implication: you can use linear regression on a graphing calculator or the regression tool on this web site to check your calculations of the equation of a straight line passing through two specified points.

Q If the given points do not lie on a straight line, is there a way we can tell how far off they are from lying on a straight line?

A There is a way of measuring the "goodness of fit" of the least squares line, called the coefficient of correlation. This is a number $r$ between $-1$ and $1$. The closer it is to $-1$ or $1$, the better the fit. For an exact fit, we would have $r = -1$ for a negative slope line or $r = 1$ for a positive slope line. For a bad fit, we would have $r$ close to $0$. The figure below shows several collections of data points with their regression lines and corresponding values of $r$.

The correlation coefficient can be calculated with the following formula. (To justify this formula requires a fair knowledge of statistics, so we shall not attempt to do so here.)

Coefficient of Correlation

$$\text{Coefficient of Correlation} = r = \frac{n\bigl(\sum xy\bigr) - \bigl(\sum x\bigr)\bigl(\sum y\bigr)}{\sqrt{n\bigl(\sum x^2\bigr) - \bigl(\sum x\bigr)^2}\cdot \sqrt{n\bigl(\sum y^2\bigr) - \bigl(\sum y\bigr)^2}}$$

## 2. Best Fit Exponential Curve (Regression Exponential Curve)

Q Now we know how to fit a straight line to given data.
What about an exponential curve of the form $y = Ar^x$?

A The idea is to convert an exponential curve to a linear one using logarithms, as follows. Start with the exponential function

$$y = Ar^x$$

and take the logarithm of both sides:

$$\log y = \log(Ar^x)$$

The properties of logarithms give

\begin{align*}
\log y &= \log A + \log r^x, \quad \text{or}\\
\log y &= \log A + x\log r
\end{align*}

This expresses $\log y$ as a linear function of $x$, with

\begin{align*}
\text{Slope} &= m = \log r\\
\text{Intercept} &= b = \log A
\end{align*}

Therefore, if we find the best-fit line using $\log y$ as a function of $x$, the slope and intercept will be given as above, and so we can obtain the coefficients $r$ and $A$ by

\begin{align*}
r &= 10^m\\
A &= 10^b
\end{align*}

To summarize,

Exponential Regression

To obtain a best-fit exponential curve of the form $y = Ar^x$, find the regression line for the data $(x, \log y)$. The desired coefficients $A$ and $r$ are then

\begin{align*}
r &= 10^m\\
A &= 10^b
\end{align*}

where $m$ and $b$ are the slope and intercept of the regression line.

### Example 4: Sales of Compaq

Revenues from sales of Compaq computers (a brand now extinct) are shown in the following table, where $t$ represents time in years since 1990.\* Obtain an exponential regression model for the data.

| $t$ = Year (1990 = 0) | 0 | 2 | 4 | 7 |
|---|---|---|---|---|
| $R$ = Revenue (\$ billion) | 3 | 4 | 11 | 25 |

\* Data are rounded. Source: Company Reports/The New York Times, January 27, 1998, p. D1.

Solution Since we need to model $\log R$ as a linear function of $t$, we first make a table with $x = t$ and $y = \log R$, and then calculate the regression line, $y = mx + b$.

| $x\ (= t)$ | 0 | 2 | 4 | 7 |
|---|---|---|---|---|
| $y\ (= \log R)$ | 0.477121 | 0.60206 | 1.04139 | 1.39794 |

Instead of doing this calculation by hand as we did in the above examples, you can do it automatically using the on-line regression utility on this site. Just enter the x- and y-values in the table, and press the "y = mx+b" button. (Yes, that utility does exponential regression as well, but we would like you to know how it works!)

The linear regression model we obtain is

$$y = 0.13907x + 0.42765.$$

Thus, the desired exponential model is

$$R = Ar^t,$$

where $r = 10^m = 10^{0.13907} \approx 1.3774$ and $A = 10^b = 10^{0.42765} \approx 2.6770$.

This gives our revenue model as

$$R = 2.6770\,(1.3774)^t.$$
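The procedure used in this example (take logarithms, fit a line, then convert back) can be sketched in Python as follows. This is an illustrative sketch rather than the code behind the on-line utility; the function name `exponential_regression` is an assumption made here.

```python
import math


def exponential_regression(ts, values):
    """Fit y = A * r**t by linear regression on the points (t, log10 y)."""
    n = len(ts)
    logs = [math.log10(v) for v in values]
    sum_x = sum(ts)
    sum_y = sum(logs)
    sum_xy = sum(t * ly for t, ly in zip(ts, logs))
    sum_x2 = sum(t * t for t in ts)
    m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b = (sum_y - m * sum_x) / n
    # Undo the logarithm: A = 10^b and r = 10^m.
    return 10 ** b, 10 ** m


# Compaq revenue data: t = years since 1990, revenue in billions of dollars.
A, r = exponential_regression([0, 2, 4, 7], [3, 4, 11, 25])
print(round(A, 4), round(r, 4))   # 2.677 1.3774
```

Note that the fit requires every data value to be positive, since the logarithm of zero or a negative number is undefined.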

Before we go on... Go to the on-line regression utility, enter the original data (before you took the logarithms) and press the "y = a(b^x)" button. What do you find?

Note: Since we have taken logarithms before doing the linear regression, it follows that the exponential regression curve does not minimize SSE for the original data; instead, it minimizes SSE for the transformed data, that is, for the data $(x, \log y)$. Thus, the exponential regression curve is not the best-fit curve in the "strict" sense. See the textbook "Applied Calculus" by Waner & Costenoble for a method to obtain such a best-fit curve.
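One way to see the point made in this note is to keep the base $r$ from the logarithmic fit fixed and re-solve for $A$ by ordinary least squares on the original data; if the log-based curve already minimized SSE, this could not lower it. The Python sketch below is only an illustration under that assumption. It is not the method from the textbook, and the helper names are made up here. With $r$ fixed, the model $R = Ar^t$ is linear in $A$, so the least squares value is $A = \sum R\,r^t \big/ \sum r^{2t}$.

```python
import math

ts = [0, 2, 4, 7]
revenue = [3, 4, 11, 25]

# Step 1: log-based fit (regression line for the points (t, log10 R)).
n = len(ts)
logs = [math.log10(v) for v in revenue]
sum_x, sum_y = sum(ts), sum(logs)
sum_xy = sum(t * ly for t, ly in zip(ts, logs))
sum_x2 = sum(t * t for t in ts)
m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b = (sum_y - m * sum_x) / n
A_log, r = 10 ** b, 10 ** m


def sse(A, base):
    """SSE of the curve R = A * base**t on the original (untransformed) data."""
    return sum((R - A * base ** t) ** 2 for t, R in zip(ts, revenue))


# Step 2: hold r fixed and choose A by least squares on the original data.
A_refit = sum(R * r ** t for t, R in zip(ts, revenue)) / sum(r ** (2 * t) for t in ts)

# The refitted A gives a smaller SSE, so the log-based curve was not SSE-minimal.
print(sse(A_log, r) > sse(A_refit, r))   # True
```

A best fit in the strict sense would adjust $A$ and $r$ together rather than $A$ alone, which is what the textbook reference above addresses.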

## 3. Other Forms of Regression

At the on-line regression utility, you can also find regression curves of the following forms:

\begin{align*} &y = ax^2 + bx + c &\text{(Quadratic Regression)}\\ &y = ax^3 + bx^2 + cx + d &\text{(Cubic Regression)}\\ &y = ax^b &\text{(Power Regression)} \end{align*}

On the TI-83/84, you will find all of these, as well as the following:

\begin{align*} &y = ax^4 + bx^3 + cx^2 + dx + e &\text{(Quartic Regression)}\\ &y = a\sin(bx + c) &\text{(Sine regression)} \end{align*}

Last Updated: January 2008