Regression#
Learning Objectives#
- Understand that a regression can be used to test the strength of a relationship between variables.
- Understand that the relationship between variables is called a hypothesis, and should be stated before running a regression.
- Write a regression equation for a hypothesis.
- Understand that a regression fits a line to the data.
- Explain what a regression provides beyond a simple correlation.
- Interpret the intercept and coefficients of a regression model.
- Explain the concept of relevant range, and determine whether a point lies inside the relevant range.
- Perform a linear regression in Python using the statsmodels package.
- Interpret coefficients, p-values, and R-squared.
- Evaluate the value of an entire regression based on its coefficients, p-values, and R-squared.
- Perform and evaluate a multiple regression.
- Define categorical and indicator variables.
- Explain how n categories can be converted into n-1 indicator variables.
- Run a regression using categorical variables and interpret the output.
Introduction and Overview#
This chapter is about linear regression. We assume you learned about regression in a previous course and are familiar with the mathematical basics. We will build on your foundation in this chapter and focus on when to apply regression and how to interpret the output. The interpretation of regression output is paramount; as you read this chapter, we want you to think about what you can learn from a regression.
A regression is a test of the strength of the relationship between variables. Therefore, you should apply regression when you think there may be a relationship between variables in your data (this thought process is called a hypothesis). For example, in a moment we will introduce a dataset with data on cars. Say that we think that heavier cars achieve worse fuel economy than lighter cars. We can test this hypothesis using a regression; the regression will tell us the likelihood that our hypothesis is true. Now you may think that the hypothesis is obvious (i.e. that it lacks tension). This is not necessarily so. It may be that for some subsets of cars, lighter cars get worse fuel economy. For example, say the dataset contains sports cars, which are often light, and family sedans, which tend to be heavier. With such a dataset, we may find that our hypothesis is false.
Regression works by fitting the “best” line to the data. The best line is the one that minimizes the sum of squared errors in the data. This statement probably makes sense to you given your prior coursework. If it does not, then ignore it for now as we will focus on application and interpretation. Just know that the regression is trying to fit a line to the data.
Dataset for this Chapter#
The dataset for this chapter contains information on about 400 cars from the 1970s and 1980s. This dataset is very popular and is often used when teaching simple statistical concepts. The dataset is maintained at the University of California Irvine Machine Learning Repository (click here).
The cells below load the data file auto.csv, which accompanies this notebook.
import numpy as np
import pandas as pd
dfAuto = pd.read_csv('data/auto.csv')
dfAuto.head()
|   | mpg | cylinders | displacement | horsepower | weight | acceleration | model year | origin | car name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130.0 | 3504 | 12.0 | 70 | American | chevrolet chevelle malibu |
| 1 | 15.0 | 8 | 350.0 | 165.0 | 3693 | 11.5 | 70 | American | buick skylark 320 |
| 2 | 18.0 | 8 | 318.0 | 150.0 | 3436 | 11.0 | 70 | American | plymouth satellite |
| 3 | 16.0 | 8 | 304.0 | 150.0 | 3433 | 12.0 | 70 | American | amc rebel sst |
| 4 | 17.0 | 8 | 302.0 | 140.0 | 3449 | 10.5 | 70 | American | ford torino |
Following is a description of each of the columns:
| Variable | Description |
|---|---|
| mpg | Miles per gallon (a measure of fuel economy) |
| cylinders | The number of cylinders in the engine |
| displacement | Engine size, in cubic inches |
| horsepower | Engine horsepower |
| weight | Vehicle weight, in pounds |
| acceleration | Time to accelerate from 0 to 60 mph, in seconds |
| model year | Model year |
| origin | Origin of car: American, European, or Japanese |
| car name | Vehicle name |
A Simple, One-Variable Regression#
Let's begin with the hypothesis that heavier cars get worse fuel economy. This hypothesis suggests a relationship between two variables in the dataset. It suggests that when `weight` increases, `mpg` should decrease.
We can perform a test of this relationship using a simple correlation. For example:
dfAuto[['mpg', 'weight']].corr()
|        | mpg | weight |
|---|---|---|
| mpg | 1.000000 | -0.831741 |
| weight | -0.831741 | 1.000000 |
Miles per gallon and weight have a correlation of -0.83. That’s very strong. It tells us that, on average, as weight increases, miles per gallon decreases, and vice versa.
Are we done? Do we need regression? What can regression tell us that correlation cannot? Correlation tells us whether the variables move together. However, say we want to build a prediction model. We want to be able to predict mpg given an estimated weight for a new car. Correlation cannot do that. And say we want to add more explanatory variables to our model; for example, we hypothesize that fuel economy is a function of weight and horsepower. Correlation definitely cannot do that. So let’s move on to regression.
Our hypothesis is that as weight increases, fuel economy will decrease. Our hypothesis implies that weight is the independent variable and that mpg is the dependent variable. In other words, weight affects mpg and not the other way around. This makes sense. Car designers have control over weight, and as weight increases, more energy is needed to accelerate a car and to overcome rolling resistance in the tires. A greater energy need implies greater fuel consumption.
If we regress mpg, the dependent variable, on weight, the independent variable, the regression will attempt to fit the following line:

$$mpg = \alpha + \beta \cdot weight$$
The regression will use the data to find the “best” values of \(\alpha\) and \(\beta\), the ones that create a line that best fits the data. Once we know those values, we can use the equation to predict fuel economy for a hypothetical car. The regression equation is our model of the process of fuel economy.
Regressing mpg on weight yields the following equation (we will show you how to perform the regression momentarily):

$$mpg = 46.3174 - 0.0077 \cdot weight$$
Let’s interpret this equation. We will begin by interpreting the value of the coefficient, \(\beta\). The regression estimated that \(\beta\) equals -0.0077. That tells us that, on average, an increase in weight of 1 pound implies a reduction in fuel economy of 0.0077 mpg. Why a reduction? Because the coefficient \(\beta\) is negative. To see this, assume a car has a weight of 3,000 pounds. If we plug 3,000 into our equation, we obtain a predicted fuel economy of \(46.3174 - 0.0077 \cdot 3000 = 23.2174\). If we increase the weight to 3,001 pounds, the predicted fuel economy is \(46.3174 - 0.0077 \cdot 3001 = 23.2097\). The difference is \(23.2097 - 23.2174 = -0.0077\), the value of the coefficient.
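If you want to see this arithmetic in code, here is a minimal sketch. The variable names are ours, and the values 46.3174 and -0.0077 come from the regression output shown later in the chapter:

alpha = 46.3174    # estimated intercept
beta = -0.0077     # estimated coefficient on weight
mpg_3000 = alpha + beta * 3000    # predicted mpg for a 3,000-pound car
mpg_3001 = alpha + beta * 3001    # predicted mpg for a 3,001-pound car
print(mpg_3000, mpg_3001, mpg_3001 - mpg_3000)    # the difference equals the coefficient, -0.0077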
First, let's ask: does this matter? Is the coefficient economically meaningful? I say it is. It implies that a 100-pound reduction in weight improves fuel economy by 0.77 mpg, which is almost 1 mpg. Auto engineers and executives can then estimate the cost to reduce weight by 100 pounds, and estimate the extra amount customers would be willing to pay for 0.77 mpg.
What about the \(\alpha\) term, 46.3174? What does that tell us? That term is called the intercept. It is like a fixed cost. You can think of it as the maximum theoretical fuel economy. If a car had zero weight, in theory its fuel economy would be 46.3174. However, we caution you against this interpretation in this case. There are no vehicles with weights close to zero. In fact, the lightest car in the dataset is about 1,600 pounds. In general, you should be wary of using a regression to make predictions outside its relevant range. Let’s elaborate on this concept with two graphs. See the graphs below and, for now, don’t worry about the code behind them.
from bokeh.plotting import figure, show
from bokeh.io import output_notebook
from bokeh.models import Slope, BoxAnnotation
from bokeh.layouts import row
output_notebook()
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Fit the regression of mpg on weight; its intercept and slope define the trend line
results = smf.ols('mpg ~ weight', data=dfAuto).fit()

# Create left graph: scatter plot of mpg vs. weight with the regression line superimposed
pLeft = figure(width=450, height=350,
               title='MPG vs. Weight',
               y_axis_label="Miles per Gallon", x_axis_label="Weight (pounds)")
pLeft.scatter(x=dfAuto['weight'], y=dfAuto['mpg'])
slope = Slope(y_intercept=results.params['Intercept'], gradient=results.params['weight'],
              line_color='firebrick', line_dash='dashed', line_width=2.5)
pLeft.add_layout(slope)

# Create right graph: same plot, but with the x-axis forced to start at 0
# and the relevant range of weights shaded
pRight = figure(width=450, height=350,
                title='MPG vs. Weight',
                y_axis_label="Miles per Gallon", x_axis_label="Weight (pounds)", x_range=(0, 5500))
pRight.scatter(x=dfAuto['weight'], y=dfAuto['mpg'])
slope = Slope(y_intercept=results.params['Intercept'], gradient=results.params['weight'],
              line_color='firebrick', line_dash='dashed', line_width=2.5)
pRight.add_layout(slope)
box = BoxAnnotation(left=1570, right=5200, fill_alpha=0.1, fill_color='red')
pRight.add_layout(box)

# Show the graphs, side by side
show(row(pLeft, pRight))
The left graph shows a scatter plot of mpg versus weight. Notice a downward trend in the points that corroborates the correlation of -0.83 that we observed earlier. The dashed red line is a trend line, or regression line, that is superimposed on the data. This line confirms the negative relationship. The term \(\alpha\), whose value the regression determined to be 46.3174, is the y-intercept of the regression line. It shows the hypothetical mpg of a car with zero weight.
The left graph shows a scatter plot with an x-axis that ranges from the smallest to the largest weight in the dataset. When we force the x-axis to start at 0 (right graph), we see that the points are clustered in a range (highlighted by the pink shading). This range is called the relevant range. Notice that there are no cars with weight less than about 1,600 pounds, and no cars with weight greater than about 5,200 pounds. It is usually unwise to use a regression model to forecast outside the relevant range, because the underlying process that generates the data might change drastically there. For example, imagine someone made a super-light car that weighed 500 pounds. The technology would be different, and therefore our model, which was derived from traditional cars, might not predict its fuel economy well.
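A simple way to check whether a prediction falls inside the relevant range is to compare the new weight against the minimum and maximum weights in the data. Here is a minimal sketch (the 500-pound car is the hypothetical example from above):

minWeight = dfAuto['weight'].min()
maxWeight = dfAuto['weight'].max()
newWeight = 500    # hypothetical super-light car
# True only if the new weight lies inside the relevant range
print(minWeight <= newWeight <= maxWeight)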
Regression in Python#
In this course, you will do regression in Python. While there are many software packages that can do regression (e.g. Excel, R, SAS, Stata, SPSS), we will use Python for continuity with the rest of the course.
There are many Python packages that can perform regression. We recommend statsmodels because it has an easy syntax for specifying your regression. To use this package for linear regression, two import statements are needed:
import statsmodels.api as sm
import statsmodels.formula.api as smf
Once those have been imported, you can run a regression with one line of code. The following code runs the regression of mpg on weight.
results = smf.ols('mpg ~ weight', data=dfAuto).fit()
Let's examine this code before we run it. The `ols` method creates a regression object (note: ols stands for "ordinary least squares", which is the fancy term for a simple regression). The arguments to `ols` are:
- `'mpg ~ weight'`. This tells the `ols` function to use the values in the mpg column as the dependent variable, and the values in the weight column as the independent variable.
- `data=dfAuto`. This tells the `ols` function to use the DataFrame `dfAuto` as the source data.
Notice the `fit` method. This tells the regression object to actually run the regression and fit a line. Finally, notice that the `fit` method returns the results. We saved those in a variable called `results`. You can use any variable name you want.
To print the results, run the command:
print(results.summary())
Run the cell below to run the regression of mpg on weight.
import statsmodels.api as sm
import statsmodels.formula.api as smf
results = smf.ols('mpg ~ weight', data=dfAuto).fit()
print(results.summary())
OLS Regression Results
==============================================================================
Dep. Variable: mpg R-squared: 0.692
Model: OLS Adj. R-squared: 0.691
Method: Least Squares F-statistic: 888.9
Date: Tue, 14 Jan 2025 Prob (F-statistic): 2.97e-103
Time: 01:36:46 Log-Likelihood: -1148.4
No. Observations: 398 AIC: 2301.
Df Residuals: 396 BIC: 2309.
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 46.3174 0.795 58.243 0.000 44.754 47.881
weight -0.0077 0.000 -29.814 0.000 -0.008 -0.007
==============================================================================
Omnibus: 40.423 Durbin-Watson: 0.797
Prob(Omnibus): 0.000 Jarque-Bera (JB): 56.695
Skew: 0.713 Prob(JB): 4.89e-13
Kurtosis: 4.176 Cond. No. 1.13e+04
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.13e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
Let’s examine the regression output. In this course, we will focus on three aspects of the regression output:
- Coefficients
- p-values
- R-squared
Interpreting the Coefficients#
The intercept is the point where the regression line crosses the y-axis (in a one-variable regression). It is the predicted value of the dependent variable when all independent variables are zero. A common use for the intercept is in estimating cost functions, where one tries to predict a cost using one or more cost drivers. For example, how does UPS fuel cost vary with miles driven by delivery trucks? In such a regression, the intercept would be the estimate of the fixed fuel cost.
The coefficient is an estimate of the change in the dependent variable that will occur when the independent variable changes by one unit. In the cost example, the coefficient is the estimate of the variable cost per unit, i.e. the additional fuel cost that would result from one additional mile driven.
The signs of the intercept and coefficients are as important as their magnitudes. A negative sign, as seen on the coefficient in our mpg vs. weight regression, tells you that the dependent variable decreases as the independent variable increases.
Interpreting the p-values#
In order to understand p-values, you need to know what it means when a coefficient equals zero. If a coefficient is zero, that means that the independent variable does not affect the dependent variable. In our mpg vs. weight regression, a coefficient of zero on the weight variable would mean that mpg does not change as weight changes.
A p-value is a test of whether the coefficient is zero. A very crude way of stating it is that “the p-value is the probability that the coefficient is actually zero”. That statement implies that a high p-value indicates an unimportant independent variable, and a low p-value indicates that the independent variable is “significant”.
Many researchers use a 10% cutoff for p-values. If a p-value is less than 10%, they say the variable is significant. In our regression above, both p-values are 0.000; that means that the intercept and coefficient are not likely equal to zero, and are therefore significant. Note that regression output usually rounds p-values, so the actual values are something like 0.0000002.
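The summary table rounds the p-values, but the fitted results object stores the unrounded values. A minimal way to inspect them, using the `results` variable from the regression above:

# Unrounded p-values for the intercept and the weight coefficient
print(results.pvalues)
print(results.pvalues['weight'])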
Interpreting R-squared#
In our auto dataset, there are about 400 cars, each with its own mpg. The R-squared tells us the fraction of the variation in mpg that can be explained by the independent variables. In this case, the R-squared is 0.692, so 69.2% of the variation in mpg can be explained by weight. We suggest you not throw out a regression just because it has a low R-squared. A low R-squared usually means that other variables are missing from the regression.
Note that, since R-squared is a fraction, its value must lie between 0 and 1, i.e. \(R^2 \in [0,1]\).
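If you need the R-squared programmatically rather than reading it off the summary table, it is also available on the results object:

# R-squared and adjusted R-squared of the fitted model
print(results.rsquared)
print(results.rsquared_adj)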
Interpreting the Entire Regression#
Many students use R-squared as the primary measure of the “goodness” of a regression. We caution you against that practice. Begin by looking at the p-values. Those tell you which coefficients are significant. If a variable is significant, that means that it explains something about the dependent variable. If you have one or more significant independent variables, you have learned something. The next step is to look at the coefficient signs and magnitudes. The signs tell you the direction of the effect of the independent variables. The magnitudes tell you the impact of the independent variables. If I’m trying to predict a cost and I find a significant cost driver, but its magnitude is very, very low, then perhaps that cost driver is not economically meaningful. The final criterion should be R-squared.
We also caution you that the process we just outlined is not set in stone. If your goal is to find independent variables that are associated with the dependent variable (e.g. searching for cost drivers), then the process above is a good one. For example, in finance, sometimes a regression with an R-squared of 0.01 can be valuable! If a variable is found that is correlated with stock prices, that variable can be used to earn a return (it might take lots of trading though). It doesn’t matter if the R-squared is low. That said, if your goal is to explain most of the variation in the dependent variable, and possibly use the regression for forecasting and prediction, then R-squared is more important. For example, in machine learning, the goal is forecasting accuracy. In that case, the R-squared would be paramount.
Other Numbers in the Regression#
There are numerous other metrics in the regression output above. Many of them can be useful for you. We are not going to explain them at this time, and refer you to a statistics textbook.
Multiple Regression#
Multiple regression simply means more than one independent variable in the regression. Let's extend our regression of mpg on weight to accommodate another independent variable, horsepower. As before, we hypothesize that weight negatively affects fuel economy. We now also hypothesize that horsepower negatively affects fuel economy. We can write our regression equation as:

$$mpg = \alpha + \beta_1 \cdot weight + \beta_2 \cdot horsepower$$
Notice that the horsepower term is additive. That means that we believe that weight and horsepower are independent, and one can vary independently of the other. In reality, weight and horsepower are positively correlated (because at some point, to generate more power you need a bigger engine or additional equipment like a turbocharger). However, the correlation is not perfect. It is possible to increase horsepower while holding weight constant, or even decreasing weight. For example, different materials and technologies can be used for the engine block, crankshaft, and valves. In sum, for illustrative purposes, we will assume that weight and horsepower are independent.
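You can check how strong the relationship between these two variables actually is in our data. The following computes their correlation; expect a strongly positive value, though well below 1:

# Correlation between the two independent variables
dfAuto[['weight', 'horsepower']].corr()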
To run the regression above, we use code that is almost identical to what we had above. The only change is to the formula in the `ols` function. We will change it to:

'mpg ~ weight + horsepower'

This formula tells the `ols` method to regress mpg on weight and horsepower. Let's run this regression and interpret the output.
results = smf.ols('mpg ~ weight + horsepower', data=dfAuto).fit()
print(results.summary())
OLS Regression Results
==============================================================================
Dep. Variable: mpg R-squared: 0.706
Model: OLS Adj. R-squared: 0.705
Method: Least Squares F-statistic: 467.9
Date: Tue, 14 Jan 2025 Prob (F-statistic): 3.06e-104
Time: 01:36:46 Log-Likelihood: -1121.0
No. Observations: 392 AIC: 2248.
Df Residuals: 389 BIC: 2260.
Df Model: 2
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 45.6402 0.793 57.540 0.000 44.081 47.200
weight -0.0058 0.001 -11.535 0.000 -0.007 -0.005
horsepower -0.0473 0.011 -4.267 0.000 -0.069 -0.026
==============================================================================
Omnibus: 35.336 Durbin-Watson: 0.858
Prob(Omnibus): 0.000 Jarque-Bera (JB): 45.973
Skew: 0.683 Prob(JB): 1.04e-10
Kurtosis: 3.974 Cond. No. 1.15e+04
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.15e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
We interpret this regression as follows:
The intercept did not change much from our previous regression. That's a good sign, as it suggests that our estimate is not sensitive to the independent variables used. Since a weight of zero lies outside the relevant range, the intercept does not have physical meaning here.
The coefficient on weight is -0.0058 and is highly significant (the p-value is well below our 10% cutoff). Our new estimate of the effect of weight is that a 1-pound increase in weight decreases mpg by 0.0058, *assuming horsepower does not change*. The emphasized text is important. Each coefficient must be evaluated assuming the other variables remain the same.
The coefficient on horsepower is -0.0473. As we hypothesized, the sign is negative. This coefficient is also highly significant (p = 0.000), so, as we predicted, increasing power decreases fuel economy. We estimate that an increase of 1 horsepower decreases fuel economy by 0.0473 mpg, *assuming weight does not change*. Again, each coefficient must be evaluated assuming the other variables remain the same (we verify these readings in code just below).
The R-squared is 0.706. We can explain 70% of the variation in a car’s fuel economy with just its weight and horsepower.
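As a quick check on the interpretations above, we can pull the coefficients out of the fitted results object and compute the predicted effect of a hypothetical change (the 100-pound and 10-horsepower changes below are illustrative numbers of our choosing):

print(results.params)
# Predicted change in mpg from a hypothetical 100-pound weight reduction, holding horsepower constant
print(-100 * results.params['weight'])
# Predicted change in mpg from a hypothetical 10-horsepower increase, holding weight constant
print(10 * results.params['horsepower'])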
Categorical and Indicator Variables#
Say that we want to know the effect of the country/continent of origin on fuel economy. Does an American car have higher fuel economy than a Japanese car with identical specifications? Does a Japanese car have higher fuel economy than a European car with identical specifications? This is called a “fixed effect”, and is entirely plausible, especially since price is not one of our independent variables. For example, if European cars command higher prices, they are able to use better technology and get higher fuel economy for the same weight and horsepower.
The variable origin tells us the country/continent of origin of each car. Let’s examine that variable more closely:
dfAuto['origin'].unique()
array(['American', 'Japanese', 'European'], dtype=object)
Wait a minute! The values in the origin column are the strings ‘American’, ‘Japanese’, and ‘European’. How do we use a string variable in a regression??!!
A string variable, or any variable that takes on discrete values, is called a categorical variable, since the values represent categories. In order to use the categories in the regression, we need to convert them to numbers. We could use the numbers 0, 1, and 2 to represent 'American', 'Japanese', and 'European'. But it turns out that is a really bad idea. With that coding scheme, the regression would assume the same fixed difference between consecutive categories. Worse, that particular coding scheme forces the regression to assume that either American > Japanese > European or American < Japanese < European. That's not what we want. We want each category to be independent.
The standard way to handle this situation is to convert the categorical variable into multiple indicator variables. An indicator variable (sometimes called a "dummy" variable) is a variable that can only assume the values 0 and 1. If there are \(n\) categories, we will create \(n-1\) indicator variables. One of the categories will serve as the base category, and there will be one indicator variable for each of the other categories. For example, say we choose American as the base category. We would create a European indicator variable and a Japanese indicator variable. To see this, consider the following table that contains 6 randomly selected rows from `dfAuto`:

| index | mpg | weight | origin | European | Japanese |
|---|---|---|---|---|---|
| 56 | 26.0 | 1955 | 'American' | 0 | 0 |
| 222 | 17.0 | 4060 | 'American' | 0 | 0 |
| 275 | 17.0 | 3140 | 'European' | 1 | 0 |
| 326 | 43.4 | 2335 | 'European' | 1 | 0 |
| 108 | 20.0 | 2279 | 'Japanese' | 0 | 1 |
| 319 | 31.3 | 2542 | 'Japanese' | 0 | 1 |
Look at the indicator variable columns. For the American cars, both indicator variables are zero. For the European cars, only the European indicator variable is 1. And for the Japanese cars, only the Japanese indicator variable is 1. This is how a set of \(n\) categories is coded. We create \(n-1\) indicator variables. A value of 1 in one of the indicator variables indicates that the row (observation) belongs to that category. A value of 0 in all of the indicator variables indicates that the row belongs to the baseline category. Note that we can choose any category to be the base category.
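If you ever want to build the indicator variables yourself (for example, to inspect them), pandas can do it. Here is a minimal sketch using pd.get_dummies, which drops the base category when drop_first=True; as we show in the next section, statsmodels can also do this for you automatically:

# One indicator column per non-base category; values are 0/1 (True/False in newer pandas versions)
indicators = pd.get_dummies(dfAuto['origin'], drop_first=True)
print(indicators.head())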
Say we want to regress mpg on weight and origin. With our new indicator variables, the regression equation will be:

$$mpg = \alpha + \beta_1 \cdot weight + \beta_2 \cdot European + \beta_3 \cdot Japanese$$
Remember that the European and Japanese variables can only be 0 or 1. For American cars, both indicator variables will be zero, so the coefficients \(\beta_2\) and \(\beta_3\) do not matter; those terms will drop out of the equation for the American cars and the equation will be:

$$mpg = \alpha + \beta_1 \cdot weight + \beta_2 \cdot 0 + \beta_3 \cdot 0$$

Simplifying:

$$mpg = \alpha + \beta_1 \cdot weight$$
For a European car, the European dummy will be 1 and the Japanese dummy will be 0. That means that, for European cars, the equation will be:

$$mpg = \alpha + \beta_1 \cdot weight + \beta_2 \cdot 1 + \beta_3 \cdot 0$$

Simplifying:

$$mpg = (\alpha + \beta_2) + \beta_1 \cdot weight$$
This equation tells us that for European cars, the intercept will be \(\alpha + \beta_2\). What that means is that, relative to an American car with identical weight, the fuel economy of a European car will differ by \(\beta_2\). If \(\beta_2\) is positive, that means that European cars get better fuel economy than American cars, all else equal. If \(\beta_2\) is negative, that means that European cars get worse fuel economy than American cars, all else equal.
For a Japanese car, the Japanese dummy will be 1 and the European dummy will be 0. That means that, for Japanese cars, the equation will be:

$$mpg = \alpha + \beta_1 \cdot weight + \beta_2 \cdot 0 + \beta_3 \cdot 1$$

Simplifying:

$$mpg = (\alpha + \beta_3) + \beta_1 \cdot weight$$
This equation tells us that for Japanese cars, the intercept will be \(\alpha + \beta_3\). What that means is that, relative to an American car with identical weight, the fuel economy of a Japanese car will differ by \(\beta_3\). If \(\beta_3\) is positive, that means that Japanese cars get better fuel economy than American cars, all else equal. If \(\beta_3\) is negative, that means that Japanese cars get worse fuel economy than American cars, all else equal.
What if we want to know the difference between European and Japanese cars? Simply look at the difference between the two coefficients, \(\beta_3 - \beta_2\)!
Running a Regression with Categorical Variables#
We have some good news. When running a regression with the `statsmodels` package, you do not have to create indicator variables. Simply tell the `ols` function which variables are categorical and it will create the indicator variables for you! Pretty cool, huh? Excel won't do that. ;-)
Compared to the multiple regression code above, the only difference is that categorical variables need to be wrapped in `C()` inside the formula. To see this, consider the code below and pay particular attention to the string formula:
results = smf.ols('mpg ~ weight + C(origin)', data=dfAuto).fit()
print(results.summary())
OLS Regression Results
==============================================================================
Dep. Variable: mpg R-squared: 0.702
Model: OLS Adj. R-squared: 0.699
Method: Least Squares F-statistic: 308.6
Date: Tue, 14 Jan 2025 Prob (F-statistic): 4.86e-103
Time: 01:36:46 Log-Likelihood: -1142.0
No. Observations: 398 AIC: 2292.
Df Residuals: 394 BIC: 2308.
Df Model: 3
Covariance Type: nonrobust
=========================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------------
Intercept 43.6959 1.104 39.567 0.000 41.525 45.867
C(origin)[T.European] 1.2155 0.652 1.863 0.063 -0.067 2.498
C(origin)[T.Japanese] 2.3554 0.662 3.558 0.000 1.054 3.657
weight -0.0070 0.000 -22.059 0.000 -0.008 -0.006
==============================================================================
Omnibus: 37.803 Durbin-Watson: 0.813
Prob(Omnibus): 0.000 Jarque-Bera (JB): 54.615
Skew: 0.662 Prob(JB): 1.38e-12
Kurtosis: 4.242 Cond. No. 1.82e+04
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.82e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
Notice that the regression output has two lines for origin:
C(origin)[T.European] 1.2155 0.652 1.863 0.063 -0.067 2.498
C(origin)[T.Japanese] 2.3554 0.662 3.558 0.000 1.054 3.657
It appears that the regression treated the American category as the baseline. The first line of output shows the coefficient on the European indicator variable. That coefficient is 1.2155, indicating that a European car gets 1.2155 more mpg than an American car with identical weight. The second line of output shows the coefficient on the Japanese indicator variable. That coefficient is 2.3554, indicating that a Japanese car gets 2.3554 more mpg than an American car with identical weight. Both coefficients are significant at our 10% cutoff (the European p-value is 0.063, the Japanese p-value is essentially zero), meaning it is unlikely their true values are zero. We also learn that a Japanese car gets 2.3554 - 1.2155 = 1.1399 more mpg than a European car with identical weight.
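If you prefer to compute the Japanese-versus-European difference in code rather than by hand, the coefficients are stored in results.params under the names shown in the output above:

coefs = results.params
# Difference in predicted mpg between a Japanese and a European car of identical weight
print(coefs['C(origin)[T.Japanese]'] - coefs['C(origin)[T.European]'])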
Forecasting#
Once you have a regression equation, you can use it to forecast. For example, take one of our regression models from above:

$$mpg = 45.6402 - 0.0058 \cdot weight - 0.0473 \cdot horsepower$$
Say we want to use this model to forecast. We might do this if we are an engineer or executive designing a new car. We expect that our new car will weigh 3,000 pounds and make 150 horsepower. We simply plug our estimates into our model:

$$mpg = 45.6402 - 0.0058 \cdot 3000 - 0.0473 \cdot 150 = 21.145$$
This model yields a prediction of 21.145 mpg.
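Rather than plugging numbers in by hand, you can let statsmodels compute the forecast for you. Here is a minimal sketch; the variable names model and newCar are ours, and we refit the weight-and-horsepower regression because the last regression we ran used origin instead:

# Refit the weight + horsepower model and forecast mpg for a hypothetical new car
model = smf.ols('mpg ~ weight + horsepower', data=dfAuto).fit()
newCar = pd.DataFrame({'weight': [3000], 'horsepower': [150]})
print(model.predict(newCar))    # should be close to the 21.145 mpg computed by hand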