Regression#

Learning Objectives#

  • Understand that a regression can be used to test the strength of a relationship between variables.
  • Understand that the relationship between variables is called a hypothesis, and should be stated before running a regression.
  • Write a regression equation for a hypothesis.
  • Understand that a regression fits a line to the data.
  • Explain what a regression provides beyond a simple correlation.
  • Interpret the intercept and coefficients of a regression model.
  • Explain the concept of relevant range, and determine whether a point lies inside the relevant range.
  • Perform a linear regression in Python using the statsmodels package.
  • Interpret coefficients, p-values, and R-squared.
  • Evaluate an entire regression based on its coefficients, p-values, and R-squared.
  • Perform and evaluate a multiple regression.
  • Define categorical and indicator variables.
  • Explain how n categories can be converted into n-1 indicator variables.
  • Run a regression using categorical variables and interpret the output.

Introduction and Overview#

This chapter is about linear regression. We assume you learned about regression in a previous course and are familiar with the mathematical basics. We will build on your foundation in this chapter and focus on when to apply regression and how to interpret the output. The interpretation of regression output is paramount; as you read this chapter, we want you to think about what you can learn from a regression.

A regression is a test of the strength of the relationship between variables. Therefore, you should apply regression when you think there may be a relationship between variables in your data (this proposed relationship is called a hypothesis). For example, in a moment we will introduce a dataset about cars. Say that we think that heavier cars achieve worse fuel economy than lighter cars. We can test this hypothesis using a regression; the regression will tell us how strongly the data support our hypothesis. Now you may think that the hypothesis is obvious (i.e. that it lacks tension). This is not necessarily so. It may be that for some subsets of cars, lighter cars get worse fuel economy. For example, say the dataset contains sports cars, which are often light, and family sedans, which tend to be heavier. With such a dataset, we may find that our hypothesis is false.

Regression works by fitting the “best” line to the data. The best line is the one that minimizes the sum of squared errors in the data. This statement probably makes sense to you given your prior coursework. If it does not, then ignore it for now as we will focus on application and interpretation. Just know that the regression is trying to fit a line to the data.

Dataset for this Chapter#

The dataset for this chapter contains information on about 400 cars from the 1970s and 1980s. This dataset is very popular and is often used when teaching simple statistical concepts. The dataset is maintained at the University of California Irvine Machine Learning Repository (click here).

The cells below load the data file auto.csv, which accompanies this notebook.

import numpy as np
import pandas as pd
dfAuto = pd.read_csv('data/auto.csv')
dfAuto.head()
mpg cylinders displacement horsepower weight acceleration model year origin car name
0 18.0 8 307.0 130.0 3504 12.0 70 American chevrolet chevelle malibu
1 15.0 8 350.0 165.0 3693 11.5 70 American buick skylark 320
2 18.0 8 318.0 150.0 3436 11.0 70 American plymouth satellite
3 16.0 8 304.0 150.0 3433 12.0 70 American amc rebel sst
4 17.0 8 302.0 140.0 3449 10.5 70 American ford torino

Following is a description of each of the columns:

Variable        Description
mpg             Miles per gallon (a measure of fuel economy)
cylinders       The number of cylinders in the engine
displacement    Engine size, in cubic inches
horsepower      Engine horsepower
weight          Vehicle weight, in pounds
acceleration    Time to accelerate from 0 to 60 mph, in seconds
model year      Model year
origin          Origin of car: American, European, or Japanese
car name        Vehicle name

A Simple, One-Variable Regression#

Let’s begin with the hypothesis that heavier cars get worse fuel economy. This hypothesis suggests a relationship between two variables in the dataset. It suggests that when weight increases, mpg should decrease.

We can perform a test of this relationship using a simple correlation. For example:

dfAuto[['mpg', 'weight']].corr()
mpg weight
mpg 1.000000 -0.831741
weight -0.831741 1.000000

Miles per gallon and weight have a correlation of -0.83. That’s very strong. It tells us that, on average, as weight increases, miles per gallon decreases, and vice versa.

Are we done? Do we need regression? What can regression tell us that correlation cannot? Correlation tells us whether the variables move together. However, say we want to build a prediction model. We want to be able to predict mpg given an estimated weight for a new car. Correlation cannot do that. And say we want to add more explanatory variables to our model; for example, we hypothesize that fuel economy is a function of weight and horsepower. Correlation definitely cannot do that. So let’s move on to regression.

Our hypothesis is that as weight increases, fuel economy will decrease. Our hypothesis implies that weight is the independent variable and that mpg is the dependent variable. In other words, weight affects mpg and not the other way around. This makes sense. Car designers have control over weight, and as weight increases, more energy is needed to accelerate a car and to overcome rolling resistance in the tires. A greater energy need implies greater fuel consumption.

If we regress mpg, the dependent variable, on weight, the independent variable, the regression will attempt to fit the following line:

\[mpg = \alpha + \beta \cdot weight\]

The regression will use the data to find the “best” values of \(\alpha\) and \(\beta\), the ones that create a line that best fits the data. Once we know those values, we can use the equation to predict fuel economy for a hypothetical car. The regression equation is our model of the process of fuel economy.

Regressing mpg on weight yields the following equation (we will show you how to perform the regression momentarily):

\[mpg = 46.3174 - 0.0077 \cdot weight\]

Let’s interpret this equation. We will begin by interpreting the value of the coefficient, \(\beta\). The regression estimated that \(\beta\) equals -0.0077. That tells us that, on average, an increase in weight of 1 pound implies a reduction in fuel economy of 0.0077 mpg. Why a reduction? Because the coefficient \(\beta\) is negative. To see this, assume a car has a weight of 3,000 pounds. If we plug 3,000 into our equation, we obtain a predicted fuel economy of \(46.3174 - 0.0077 \cdot 3000 = 23.2174\). If we increase the weight to 3,001 pounds, the predicted fuel economy is \(46.3174 - 0.0077 \cdot 3001 = 23.2097\). The difference is \(23.2097 - 23.2174 = -0.0077\), the value of the coefficient.
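You can verify this arithmetic with a couple of lines of code. The sketch below simply plugs the two weights into the fitted equation (the 3,000-pound car is a hypothetical example):

# Predictions from the fitted line: mpg = 46.3174 - 0.0077 * weight
alpha, beta = 46.3174, -0.0077
mpg_3000 = alpha + beta * 3000      # about 23.2174
mpg_3001 = alpha + beta * 3001      # about 23.2097
print(mpg_3001 - mpg_3000)          # the difference equals beta, -0.0077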

First, let’s ask: does this matter? Is the coefficient economically meaningful? We think it is. It implies that a 100-pound reduction in weight improves fuel economy by 0.77 mpg, which is almost 1 mpg. Auto engineers and executives can then estimate the cost of reducing weight by 100 pounds, and estimate the extra amount customers would be willing to pay for 0.77 mpg.

What about the \(\alpha\) term, 46.3174? What does that tell us? That term is called the intercept. It is like a fixed cost. You can think of it as the maximum theoretical fuel economy: if a car had zero weight, in theory its fuel economy would be 46.3174 mpg. However, we caution you against this interpretation in this case. There are no vehicles with weights close to zero; in fact, the lightest car in the dataset weighs about 1,600 pounds. In general, you should be wary of using a regression to make predictions outside its relevant range. Let’s elaborate on this concept with two graphs. See the graphs below and, for now, don’t worry about the code behind them.

from bokeh.plotting import figure, show
from bokeh.io import output_notebook
from bokeh.models import Slope, BoxAnnotation
from bokeh.layouts import row
output_notebook()
Loading BokehJS ...
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Fit the regression once; both graphs display the same fitted line
results = smf.ols('mpg ~ weight', data=dfAuto).fit()

# Create left graph
pLeft = figure(width=450, height=350,
               title='MPG vs. Weight',
               y_axis_label="Miles per Gallon", x_axis_label="Weight (pounds)")
pLeft.scatter(x=dfAuto['weight'], y=dfAuto['mpg'])
slope = Slope(y_intercept=results.params['Intercept'], gradient=results.params['weight'],
              line_color='firebrick', line_dash='dashed', line_width=2.5)
pLeft.add_layout(slope)

# Create right graph, forcing the x-axis to start at zero
pRight = figure(width=450, height=350,
                title='MPG vs. Weight',
                y_axis_label="Miles per Gallon", x_axis_label="Weight (pounds)", x_range=(0, 5500))
pRight.scatter(x=dfAuto['weight'], y=dfAuto['mpg'])
slope = Slope(y_intercept=results.params['Intercept'], gradient=results.params['weight'],
              line_color='firebrick', line_dash='dashed', line_width=2.5)
pRight.add_layout(slope)

# Shade the relevant range: the span of weights actually observed in the data
box = BoxAnnotation(left=1570, right=5200, fill_alpha=0.1, fill_color='red')
pRight.add_layout(box)

# Show the graphs, side by side
show(row(pLeft, pRight))

The left graph shows a scatter plot of mpg versus weight. Notice a downward trend in the points that corroborates the correlation of -0.83 that we observed earlier. The dashed red line is a trend line, or regression line, that is superimposed on the data. This line confirms the negative relationship. The term \(\alpha\), whose value the regression determined to be 46.3174, is the y-intercept of the regression line. It shows the hypothetical mpg of a car with zero weight.

The left graph shows a scatter plot with an x-axis that ranges from the smallest to the largest weight in the dataset. When we force the x-axis to start at 0 (right graph), we see that the points are clustered in a range (highlighted by the pink shading). This range is called the relevant range. Notice that there are no cars with weight less than about 1,600 pounds, and no car with weight greater than about 5,200 pounds. It is usually unwise to use a regression model to forecast outside the relevant range. The reason is that the underlying process that generates the data might change drastically outside the relevant range. For example, imagine someone made a super-light car that weighed 500 pounds. The technology would be so different that our model, which was derived from traditional cars, might not apply.
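One easy way to determine whether a point lies inside the relevant range is to compare it to the smallest and largest values actually observed in the data. The sketch below does this for weight; the 500-pound car is the hypothetical example from the previous paragraph:

# The relevant range for weight is the span of observed weights
low, high = dfAuto['weight'].min(), dfAuto['weight'].max()
print(low, high)                    # roughly 1,600 and 5,200 pounds

new_weight = 500                    # hypothetical super-light car
print(low <= new_weight <= high)    # False: outside the relevant range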

Regression in Python#

In this course, you will do regression in Python. While there are many software packages that can do regression (e.g. Excel, R, SAS, Stata, SPSS), we will use Python for continuity with the rest of the course.

There are many Python packages that can perform regression. We recommend statsmodels because it has an easy syntax for specifying your regression. To use this package for linear regression, two import statements are needed:

import statsmodels.api as sm
import statsmodels.formula.api as smf

Once those have been imported, you can run a regression with one line of code. The following code runs the regression of mpg on weight.

results = smf.ols('mpg ~ weight', data=dfAuto).fit()

Let’s examine this code before we run it. The ols method creates a regression object (note: ols stands for “ordinary least squares”, the standard method for fitting a linear regression). The arguments to ols are:

  • 'mpg ~ weight'. This tells the ols function to use the values in the mpg column as the dependent variable, and the values in the weight column as the independent variable.

  • data=dfAuto. This tells the ols function to use the DataFrame dfAuto as the source data.

Notice the fit method chained after ols. It tells the regression object to actually run the regression and fit a line. Also notice that fit returns the results, which we saved in a variable called results. You can use any variable name you want.

To print the results, run the command:

print(results.summary())

Run the cell below to run the regression of mpg on weight.

import statsmodels.api as sm
import statsmodels.formula.api as smf

results = smf.ols('mpg ~ weight', data=dfAuto).fit()
print(results.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                    mpg   R-squared:                       0.692
Model:                            OLS   Adj. R-squared:                  0.691
Method:                 Least Squares   F-statistic:                     888.9
Date:                Tue, 14 Jan 2025   Prob (F-statistic):          2.97e-103
Time:                        01:36:46   Log-Likelihood:                -1148.4
No. Observations:                 398   AIC:                             2301.
Df Residuals:                     396   BIC:                             2309.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     46.3174      0.795     58.243      0.000      44.754      47.881
weight        -0.0077      0.000    -29.814      0.000      -0.008      -0.007
==============================================================================
Omnibus:                       40.423   Durbin-Watson:                   0.797
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               56.695
Skew:                           0.713   Prob(JB):                     4.89e-13
Kurtosis:                       4.176   Cond. No.                     1.13e+04
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.13e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

Let’s examine the regression output. In this course, we will focus on three aspects of the regression output:

  • Coefficients

  • p-values

  • R-squared

Interpreting the Coefficients#

The intercept is the point where the regression line crosses the y-axis (in a one-variable regression). It is the predicted value of the dependent variable when all independent variables are zero. A common use for the intercept arises when estimating cost functions, where one tries to predict a cost using one or more cost drivers. For example, how does UPS fuel cost vary with miles driven by delivery trucks? In such a regression, the intercept would be the estimate of the fixed fuel cost.

The coefficient is an estimate of the change in the dependent variable that will occur when the independent variable changes by one unit. In the cost example, the coefficient is the estimate of the variable cost per unit, i.e. the additional fuel cost that would result from one additional mile driven.

The signs of the intercept and coefficients are as important as their magnitudes. A negative sign, as seen in the coefficient on our mpg vs. weight regression, tells you that the dependent variable decreases as the independent variable increases.

Interpreting the p-values#

In order to understand p-values, you need to know what it means when a coefficient equals zero. If a coefficient is zero, that means that the independent variable does not affect the dependent variable. In our mpg vs. weight regression, a coefficient of zero on the weight variable would mean that mpg does not change as weight changes.

A p-value is a test of whether the coefficient is zero. A very crude way of stating it is that “the p-value is the probability that the coefficient is actually zero”. That statement implies that a high p-value indicates an unimportant independent variable, and a low p-value indicates that the independent variable is “significant”.

Many researchers use a 10% cutoff for p-values. If a p-value is less than 10%, they say the variable is significant. In our regression above, both p-values are 0.000; that means that the intercept and coefficient are not likely equal to zero, and are therefore significant. Note that regression output usually rounds p-values, so the actual values are something like 0.0000002.

Interpreting R-squared#

In our auto dataset, there are roughly 400 cars, and mpg varies from car to car. The R-squared tells us the fraction of that variation in mpg that can be explained by the independent variables. In this case, the R-squared is 0.692, so 69.2% of the variation in mpg can be explained by weight. We suggest you not throw out a regression just because it has a low R-squared. A low R-squared usually means that other explanatory variables are missing from the regression.

Note that, since R-squared is a fraction, its value must lie between 0 and 1, i.e. \(R^2 \in [0,1]\).
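If you want to use these numbers in code rather than reading them off the printed summary, the fitted results object exposes them as attributes. A minimal sketch, assuming the results variable from the regression above:

print(results.params)       # intercept and coefficient estimates
print(results.pvalues)      # p-value for each estimate
print(results.rsquared)     # R-squared, about 0.692

# Keep only the estimates that clear the 10% significance cutoff
print(results.pvalues[results.pvalues < 0.10])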

Interpreting the Entire Regression#

Many students use R-squared as the primary measure of the “goodness” of a regression. We caution you against that practice. Begin by looking at the p-values. Those tell you which coefficients are significant. If a variable is significant, it explains something about the dependent variable, so one or more significant independent variables means you have learned something. The next step is to look at the coefficient signs and magnitudes. The signs tell you the direction of the effect of the independent variables; the magnitudes tell you the size of their impact. If we are trying to predict a cost and find a significant cost driver, but its magnitude is tiny, then perhaps that cost driver is not economically meaningful. The final criterion should be R-squared.

We also caution you that the process we just outlined is not set in stone. If your goal is to find independent variables that are associated with the dependent variable (e.g. searching for cost drivers), then the process above is a good one. For example, in finance, sometimes a regression with an R-squared of 0.01 can be valuable! If a variable is found that is correlated with stock prices, that variable can be used to earn a return (it might take lots of trading though). It doesn’t matter if the R-squared is low. That said, if your goal is to explain most of the variation in the dependent variable, and possibly use the regression for forecasting and prediction, then R-squared is more important. For example, in machine learning, the goal is forecasting accuracy. In that case, the R-squared would be paramount.

Other Numbers in the Regression#

There are numerous other metrics in the regression output above. Many of them can be useful for you. We are not going to explain them at this time, and refer you to a statistics textbook.

Multiple Regression#

Multiple regression simply means more than one independent variable in the regression. Let’s extend our regression of mpg on weight to accommodate another independent variable, horsepower. As before, we hypothesize that weight negatively affects fuel economy. We now also hypothesize that horsepower negatively affects fuel economy. We can write our regression equation as:

\[mpg = \alpha + \beta_1 \cdot weight + \beta_2 \cdot horsepower\]

Notice that the horsepower term is additive. That means that we believe that weight and horsepower are independent, and one can vary independently of the other. In reality, weight and horsepower are positively correlated (because at some point, to generate more power you need a bigger engine or additional equipment like a turbocharger). However, the correlation is not perfect. It is possible to increase horsepower while holding weight constant, or even decreasing weight. For example, different materials and technologies can be used for the engine block, crankshaft, and valves. In sum, for illustrative purposes, we will assume that weight and horsepower are independent.
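You can check that imperfect correlation directly in the data, using the same corr method we used earlier:

# Weight and horsepower move together, but the correlation is not perfect
dfAuto[['weight', 'horsepower']].corr()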

To run the regression above, we use code that is almost identical to what we had above. The only change will be to the formula in the ols function. We will change that to:

'mpg ~ weight + horsepower'

This formula tells the ols method to regress mpg on weight and horsepower. Let’s run this regression and interpret the output.

results = smf.ols('mpg ~ weight + horsepower', data=dfAuto).fit()
print(results.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                    mpg   R-squared:                       0.706
Model:                            OLS   Adj. R-squared:                  0.705
Method:                 Least Squares   F-statistic:                     467.9
Date:                Tue, 14 Jan 2025   Prob (F-statistic):          3.06e-104
Time:                        01:36:46   Log-Likelihood:                -1121.0
No. Observations:                 392   AIC:                             2248.
Df Residuals:                     389   BIC:                             2260.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     45.6402      0.793     57.540      0.000      44.081      47.200
weight        -0.0058      0.001    -11.535      0.000      -0.007      -0.005
horsepower    -0.0473      0.011     -4.267      0.000      -0.069      -0.026
==============================================================================
Omnibus:                       35.336   Durbin-Watson:                   0.858
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               45.973
Skew:                           0.683   Prob(JB):                     1.04e-10
Kurtosis:                       3.974   Cond. No.                     1.15e+04
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.15e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

We interpret this regression as follows:

  • The intercept did not change much from our previous regression. That’s a good sign, as it suggests that our estimate is not sensitive to the set of independent variables used. Since the intercept corresponds to a car with zero weight and zero horsepower, which lies far outside the relevant range, it has no physical meaning here.

  • The coefficient on weight is -0.0058 and is highly significant (the p-value is well below our 10% cutoff). Our new estimate of the effect of weight is that a 1 pound increase in weight decreases mpg by 0.0058, assuming horsepower does not change. That qualifier is important: each coefficient must be evaluated assuming the other variables remain the same.

  • The coefficient on horsepower is -0.0473. As we hypothesized, the sign is negative. This coefficient is also highly significant (p = 0.000), so as we predicted, increasing power decreases fuel economy. We estimate that an increase of 1 horsepower decreases fuel economy by 0.0473 mpg, assuming weight does not change. Again, each coefficient must be evaluated assuming the other variables remain the same.

  • The R-squared is 0.706. We can explain 70% of the variation in a car’s fuel economy with just its weight and horsepower.

Categorical and Indicator Variables#

Say that we want to know the effect of the country/continent of origin on fuel economy. Does an American car have higher fuel economy than a Japanese car with identical specifications? Does a Japanese car have higher fuel economy than a European car with identical specifications? This is called a “fixed effect”, and is entirely plausible, especially since price is not one of our independent variables. For example, if European cars command higher prices, they are able to use better technology and get higher fuel economy for the same weight and horsepower.

The variable origin tells us the country/continent of origin of each car. Let’s examine that variable more closely:

dfAuto['origin'].unique()
array(['American', 'Japanese', 'European'], dtype=object)

Wait a minute! The values in the origin column are the strings ‘American’, ‘Japanese’, and ‘European’. How do we use a string variable in a regression??!!

A string variable, or any variable that takes on discrete values, is called a categorical variable, since its values represent categories. In order to use the categories in a regression, we need to convert them to numbers. We could use the numbers 0, 1, and 2 to represent ‘American’, ‘Japanese’, and ‘European’. But it turns out that is a really bad idea. With that coding scheme, the regression treats the categories as points on a single numeric scale, forcing the differences between adjacent categories to be equal. Worse, that particular coding forces the regression to assume that either American > Japanese > European or American < Japanese < European. That’s not what we want. We want each category to have its own, independent effect.

The standard way to handle this situation is to convert the categorical variable into multiple indicator variables. An indicator variable (sometimes called a “dummy” variable) is a variable that can only assume the values 0 and 1. If there are \(n\) categories, we will create \(n-1\) indicator variables. One of the categories will serve as the base category, and there will be one indicator variable for each of the other categories. For example, say we choose American as the base category. We would create a European indicator variable and a Japanese indicator variable. To see this, consider the following table that contains 6 randomly selected rows from dfAuto:

index   mpg    weight   origin        European   Japanese
56      26.0   1955     'American'    0          0
222     17.0   4060     'American'    0          0
275     17.0   3140     'European'    1          0
326     43.4   2335     'European'    1          0
108     20.0   2279     'Japanese'    0          1
319     31.3   2542     'Japanese'    0          1

Look at the indicator variable columns. For the American cars, both indicator variables are zero. For the European cars, only the European indicator variable is 1. And for the Japanese cars, only the Japanese indicator variable is 1. This is how a set of \(n\) categories is coded. We create \(n-1\) indicator variables. A value of 1 in one of the indicator variables indicates that the row (observation) belongs to that category. A value of 0 in all of the indicator variables indicates that the row belongs to the baseline category. Note that we can choose any category to be the base category.
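As you will see in the next section, statsmodels can build these indicator variables for you automatically. If you ever want to create them yourself, though, pandas offers the get_dummies function. A minimal sketch (drop_first keeps n-1 of the n categories, and dfWithDummies is just an illustrative name):

# Convert origin into n-1 indicator (0/1) columns; the dropped category is the base
indicators = pd.get_dummies(dfAuto['origin'], drop_first=True, dtype=int)
print(indicators.head())

# Attach the indicator columns to the original DataFrame
dfWithDummies = pd.concat([dfAuto, indicators], axis=1)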

Say we want to regress mpg on weight and origin. With our new indicator variables, the regression equation will be:

\[mpg = \alpha +\beta_1 \cdot weight + \beta_2 \cdot European + \beta_3 \cdot Japanese\]

Remember that the European and Japanese variables can only be 0 or 1. For American cars, both indicator variables will be zero so the coefficients \(\beta_2\) and \(\beta_3\) do not matter; those terms will drop out of the equation for the American cars and the equation will be:

\[mpg = \alpha +\beta_1 \cdot weight + \beta_2 \cdot 0 + \beta_3 \cdot 0\]

Simplifying:

\[mpg = \alpha + \beta_1 \cdot weight\]

For a European car, the European dummy will be 1 and the Japanese dummy will be 0. That means that, for European cars, the equation will be:

\[mpg = \alpha +\beta_1 \cdot weight + \beta_2 \cdot 1 + \beta_3 \cdot 0\]

Simplifying:

\[mpg = (\alpha + \beta_2) + \beta_1 \cdot weight\]

This equation tells us that for European cars, the intercept will be \(\alpha + \beta_2\). What that means is that, relative to an American car with identical weight, the fuel economy of a European car will differ by \(\beta_2\). If \(\beta_2\) is positive, that means that European cars get better fuel economy than American cars, all else equal. If \(\beta_2\) is negative, that means that European cars get worse fuel economy than American cars, all else equal.

For a Japanese car, the Japanese dummy will be 1 and the European dummy will be 0. That means that, for Japanese cars, the equation will be:

\[mpg = \alpha +\beta_1 \cdot weight + \beta_2 \cdot 0 + \beta_3 \cdot 1\]

Simplifying:

\[mpg = (\alpha + \beta_3) + \beta_1 \cdot weight\]

This equation tells us that for Japanese cars, the intercept will be \(\alpha + \beta_3\). What that means is that, relative to an American car with identical weight, the fuel economy of a Japanese car will differ by \(\beta_3\). If \(\beta_3\) is positive, that means that Japanese cars get better fuel economy than American cars, all else equal. If \(\beta_3\) is negative, that means that Japanese cars get worse fuel economy than American cars, all else equal.

What if we want to know the difference between European and Japanese cars? Simply look at the difference between \(\beta_2\) and \(\beta_3\)!

Running a Regression with Categorical Variables#

We have some good news. When running a regression with the statsmodels package, you do not have to create indicator variables. Simply tell the ols function which variables are categorical and it will create the indicator variables for you! Pretty cool, huh? Excel won’t do that. ;-)

Compared to the multiple regression code above, the only difference is that categorical variables are wrapped in C() inside the formula. To see this, consider the code below and pay particular attention to the formula string:

results = smf.ols('mpg ~ weight + C(origin)', data=dfAuto).fit()
print(results.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                    mpg   R-squared:                       0.702
Model:                            OLS   Adj. R-squared:                  0.699
Method:                 Least Squares   F-statistic:                     308.6
Date:                Tue, 14 Jan 2025   Prob (F-statistic):          4.86e-103
Time:                        01:36:46   Log-Likelihood:                -1142.0
No. Observations:                 398   AIC:                             2292.
Df Residuals:                     394   BIC:                             2308.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
=========================================================================================
                            coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------
Intercept                43.6959      1.104     39.567      0.000      41.525      45.867
C(origin)[T.European]     1.2155      0.652      1.863      0.063      -0.067       2.498
C(origin)[T.Japanese]     2.3554      0.662      3.558      0.000       1.054       3.657
weight                   -0.0070      0.000    -22.059      0.000      -0.008      -0.006
==============================================================================
Omnibus:                       37.803   Durbin-Watson:                   0.813
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               54.615
Skew:                           0.662   Prob(JB):                     1.38e-12
Kurtosis:                       4.242   Cond. No.                     1.82e+04
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.82e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

Notice that the regression output has two lines for origin:

C(origin)[T.European]     1.2155      0.652      1.863      0.063      -0.067       2.498
C(origin)[T.Japanese]     2.3554      0.662      3.558      0.000       1.054       3.657

It appears that the regression treated the American category as the baseline. The first line of output shows the coefficient on the European indicator variable. That coefficient is 1.2155, indicating that a European car gets 1.2155 more mpg than an American car of identical weight. The second line of output shows the coefficient on the Japanese indicator variable. That coefficient is 2.3554, indicating that a Japanese car gets 2.3554 more mpg than an American car of identical weight. Both coefficients are significant at our 10% cutoff (meaning it’s unlikely their true values are zero). We also learn that a Japanese car gets 2.3554 - 1.2155 = 1.1399 more mpg than a European car of identical weight.
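If you want to compute that difference in code, the labels shown in the summary are also the coefficient names in results.params. A small sketch, assuming the results variable from the categorical regression above:

# Difference between the Japanese and European fixed effects
b_european = results.params['C(origin)[T.European]']
b_japanese = results.params['C(origin)[T.Japanese]']
print(b_japanese - b_european)      # about 1.14 mpg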

Forecasting#

Once you have a regression equation, you can use it to forecast. For example, take one of our regression models from above:

\[mpg = 45.6402 - 0.0058 \cdot weight - 0.0473 \cdot horsepower\]

Say we want to use this model to forecast. We might do this if we are an engineer or executive designing a new car. We expect that our new car will weigh 3000 pounds and make 150 horsepower. We simply plug our estimates into our model:

\[mpg = 45.6402 - 0.0058 \cdot 3000 - 0.0473 \cdot 150\]

This model yields a prediction of 21.145 mpg.
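You can also let statsmodels do this arithmetic for you with the predict method of the fitted results. A minimal sketch, assuming we refit the weight-and-horsepower model (the 3,000 pounds and 150 horsepower are the hypothetical design targets from above):

# Refit the two-variable model, then predict mpg for a hypothetical new car
results = smf.ols('mpg ~ weight + horsepower', data=dfAuto).fit()
newCar = pd.DataFrame({'weight': [3000], 'horsepower': [150]})
print(results.predict(newCar))      # about 21.1 mpg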