Regression Details
This notebook will explore a few additional facets of linear regression with a single variable. Begin by loading the CEOs data from the `wooldridge` module. We'll print out summary statistics of the DataFrame to see a snapshot of what the data looks like.
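Here is a minimal sketch of that step, assuming the `wooldridge` package is installed and that we store the data in a DataFrame named `df`:

```python
# Load the CEO salary data (the "ceosal1" dataset) from the wooldridge package
import wooldridge as woo

df = woo.data('ceosal1')

# Summary statistics for every column
print(df.describe())
```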
| | salary | pcsalary | sales | roe | pcroe | ros | indus | finance | consprod | utility | lsalary | lsales |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 209.000000 | 209.000000 | 209.000000 | 209.000000 | 209.000000 | 209.000000 | 209.000000 | 209.000000 | 209.000000 | 209.000000 | 209.000000 | 209.000000 |
| mean | 1281.119617 | 13.282297 | 6923.793282 | 17.184211 | 10.800478 | 61.803828 | 0.320574 | 0.220096 | 0.287081 | 0.172249 | 6.950386 | 8.292265 |
| std | 1372.345308 | 32.633921 | 10633.271088 | 8.518509 | 97.219399 | 68.177052 | 0.467818 | 0.415306 | 0.453486 | 0.378503 | 0.566374 | 1.013161 |
| min | 223.000000 | -61.000000 | 175.199997 | 0.500000 | -98.900002 | -58.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 5.407172 | 5.165928 |
| 25% | 736.000000 | -1.000000 | 2210.300049 | 12.400000 | -21.200001 | 21.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 6.601230 | 7.700883 |
| 50% | 1039.000000 | 9.000000 | 3705.199951 | 15.500000 | -3.000000 | 52.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 6.946014 | 8.217492 |
| 75% | 1407.000000 | 20.000000 | 7177.000000 | 20.000000 | 19.500000 | 81.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 7.249215 | 8.878636 |
| max | 14822.000000 | 212.000000 | 97649.898438 | 56.299999 | 977.000000 | 418.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 9.603868 | 11.489144 |
We’ve already computed ordinary least squares regression coefficients “by hand” once, which is a rite of passage that will surely stay with you for the rest of your life. But now that we’ve seen how it’s done, it’s time to speed things up by asking Python to do the regression for us.
The module we'll use for this is `statsmodels`. Within `statsmodels`, the `formula.api` suite of tools will allow us to perform linear regression with ease. The canonical import for `statsmodels.formula.api` is `smf`.
Regression in `statsmodels` happens in two steps. In the first step, we tell Python about the regression formula that we want to use. The linear equation

$$y = \beta_0 + \beta_1 x + u$$

is described as `y ~ x` in `smf`.
Note that we do not need to specify an intercept term. The `statsmodels` module will add one in by default. The reason why will be considered later.
After creating a variable that holds the regression information, we must then instruct `statsmodels` to actually perform the regression. The reason why this task is split into two steps will be shown later when we address heteroskedasticity. The output from a `.fit()` call will be a variable that stores all kinds of useful information.
For an overview of the regression estimation, use the `.summary()` method on the output variable. If you put the `.summary()` call inside of a `print()` statement, the output will be displayed on your screen a bit more cleanly.
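Putting the two steps together, a minimal sketch (assuming the DataFrame from above is named `df`) might look like this:

```python
import statsmodels.formula.api as smf

# Step 1: describe the regression we want, salary ~ roe
model = smf.ols('salary ~ roe', data=df)

# Step 2: instruct statsmodels to perform the estimation
results = model.fit()

# Wrapping .summary() in print() keeps the table nicely aligned
print(results.summary())
```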
OLS Regression Results
==============================================================================
Dep. Variable: salary R-squared: 0.013
Model: OLS Adj. R-squared: 0.008
Method: Least Squares F-statistic: 2.767
Date: Mon, 06 Dec 2021 Prob (F-statistic): 0.0978
Time: 12:05:57 Log-Likelihood: -1804.5
No. Observations: 209 AIC: 3613.
Df Residuals: 207 BIC: 3620.
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 963.1913 213.240 4.517 0.000 542.790 1383.592
roe 18.5012 11.123 1.663 0.098 -3.428 40.431
==============================================================================
Omnibus: 311.096 Durbin-Watson: 2.105
Prob(Omnibus): 0.000 Jarque-Bera (JB): 31120.902
Skew: 6.915 Prob(JB): 0.00
Kurtosis: 61.158 Cond. No. 43.3
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
There's a lot of output here; some of it deals with topics we have already covered, while much of it deals with things we have not yet discussed.
The `Intercept` ($\hat{\beta}_0$) is estimated to be 963.1913. The slope coefficient on `roe` ($\hat{\beta}_1$) is estimated to be 18.5012.
Along with the ability to display summary output, the output variable also stores information about $\hat{y}$ and $\hat{u}$. The information on $\hat{y}$ is in `.fittedvalues`, while the information about $\hat{u}$ is in `.resid`. Of course, $\hat{u}$ is merely $y - \hat{y}$, so once we have `.fittedvalues` we could easily calculate $\hat{u}$ ourselves without getting the data from `.resid` (but `statsmodels` is just super generous).
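As a sketch, assuming `results` holds the fitted regression from above, we can store both next to the original data (the column names `salary_hat` and `u_hat` are our own choices):

```python
# Fitted values and residuals from the estimated regression
df['salary_hat'] = results.fittedvalues
df['u_hat'] = results.resid

# Equivalently: df['u_hat'] = df['salary'] - df['salary_hat']
print(df.head())
```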
| | salary | pcsalary | sales | roe | pcroe | ros | indus | finance | consprod | utility | lsalary | lsales | salary_hat | u_hat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1095 | 20 | 27595.000000 | 14.1 | 106.400002 | 191 | 1 | 0 | 0 | 0 | 6.998509 | 10.225389 | 1224.058071 | -129.058071 |
| 1 | 1001 | 32 | 9958.000000 | 10.9 | -30.600000 | 13 | 1 | 0 | 0 | 0 | 6.908755 | 9.206132 | 1164.854261 | -163.854261 |
| 2 | 1122 | 9 | 6125.899902 | 23.5 | -16.299999 | 14 | 1 | 0 | 0 | 0 | 7.022868 | 8.720281 | 1397.969216 | -275.969216 |
| 3 | 578 | -9 | 16246.000000 | 5.9 | -25.700001 | -21 | 1 | 0 | 0 | 0 | 6.359574 | 9.695602 | 1072.348338 | -494.348338 |
| 4 | 1368 | 7 | 21783.199219 | 13.8 | -3.000000 | 56 | 1 | 0 | 0 | 0 | 7.221105 | 9.988894 | 1218.507712 | 149.492288 |
Let's turn our attention to $R^2$, a measure of the goodness of fit for a linear regression. Define

$$SST = \sum_{i=1}^n (y_i - \bar{y})^2$$

to be the total variation in the dependent variable. Note that the variance of the dependent variable is SST divided by $n$. Also define

$$SSE = \sum_{i=1}^n (\hat{y}_i - \bar{y})^2$$

to be the variation explained by the regression. Finally, let

$$SSR = \sum_{i=1}^n \hat{u}_i^2$$

be the variation in the dependent variable that is not explained by the regression.

Then, the $R^2$ of the regression model is

$$R^2 = \frac{SSE}{SST} = 1 - \frac{SSR}{SST}.$$
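One way to compute these pieces is to build the squared terms as columns of the DataFrame; a sketch, where the column names `ST`, `SE`, and `SR` (for the terms whose sums give SST, SSE, and SSR) are our own choices:

```python
# Squared deviations whose sums give SST, SSE, and SSR
ybar = df['salary'].mean()
df['ST'] = (df['salary'] - ybar) ** 2
df['SE'] = (df['salary_hat'] - ybar) ** 2
df['SR'] = df['u_hat'] ** 2

print(df[['ST', 'SE', 'SR']].describe())
```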
| | ST | SE | SR |
|---|---|---|---|
| count | 2.090000e+02 | 209.000000 | 2.090000e+02 |
| mean | 1.874320e+06 | 24719.708325 | 1.849601e+06 |
| std | 1.449754e+07 | 59047.801988 | 1.438029e+07 |
| min | 3.535839e+00 | 2.427324 | 3.847426e-01 |
| 25% | 4.457149e+04 | 1532.299834 | 4.076691e+04 |
| 50% | 1.773417e+05 | 6882.895491 | 1.834925e+05 |
| 75% | 4.544373e+05 | 20740.935509 | 3.904399e+05 |
| max | 1.833554e+08 | 523725.039775 | 1.822469e+08 |
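Summing those columns delivers the $R^2$:

```python
# R-squared = SSE / SST
r_squared = df['SE'].sum() / df['ST'].sum()
print(r_squared)  # roughly 0.013
```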
This value for $R^2$ matches the one shown in the `.summary()` output above.
Likewise, if we assume that the variance of $u$ is independent of $x$ (meaning, in this instance, that the variance of the regression error does not depend on the level of `roe`), then the formula for $\mathrm{var}(\hat{\beta}_1)$ is given by

$$\mathrm{var}(\hat{\beta}_1) = \frac{\mathrm{var}(u)}{\sum_{i=1}^n (x_i - \bar{x})^2},$$

where we can estimate $\mathrm{var}(u)$ with

$$\widehat{\mathrm{var}}(u) = \frac{SSR}{n-2}.$$

Often, we report the square root of $\mathrm{var}(\hat{\beta}_1)$, which is referred to as the standard error of $\hat{\beta}_1$.
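A sketch of that calculation, reusing the `SR` column from above:

```python
import numpy as np

n = df.shape[0]

# Estimate var(u) with SSR / (n - 2), then divide by the variation in roe
se_beta1 = np.sqrt((df['SR'].sum() / (n - 2))
                   / ((df['roe'] - df['roe'].mean()) ** 2).sum())
print(se_beta1)  # roughly 11.12
```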
This output again matches the regression output from `.summary()` above.
To see it a bit more slowly, here is the same calculation with each intermediate quantity named:
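```python
# The same calculation, one piece at a time
n = df.shape[0]
ssr = df['SR'].sum()                                 # sum of squared residuals
sigma2_hat = ssr / (n - 2)                           # estimate of var(u)
sst_x = ((df['roe'] - df['roe'].mean()) ** 2).sum()  # variation in roe
var_beta1 = sigma2_hat / sst_x                       # variance of the slope estimate
se_beta1 = var_beta1 ** 0.5                          # standard error of the slope
print(se_beta1)
```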
Knowledge of the standard error is important: it drives the t-statistics and confidence intervals reported in the summary output.
The formulas shown above assume that the variance of $u$ is independent of $x$. We can allow for the two to be related by passing the `cov_type='HC3'` option to `.fit()`.
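A sketch, reusing the `model` object created earlier (this is exactly why model creation and fitting are separate steps):

```python
# Refit with heteroskedasticity-robust (HC3) standard errors
results_robust = model.fit(cov_type='HC3')
print(results_robust.summary())
```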
OLS Regression Results
==============================================================================
Dep. Variable: salary R-squared: 0.013
Model: OLS Adj. R-squared: 0.008
Method: Least Squares F-statistic: 7.191
Date: Mon, 06 Dec 2021 Prob (F-statistic): 0.00792
Time: 12:05:57 Log-Likelihood: -1804.5
No. Observations: 209 AIC: 3613.
Df Residuals: 207 BIC: 3620.
Df Model: 1
Covariance Type: HC3
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 963.1913 122.072 7.890 0.000 723.935 1202.448
roe 18.5012 6.899 2.682 0.007 4.979 32.024
==============================================================================
Omnibus: 311.096 Durbin-Watson: 2.105
Prob(Omnibus): 0.000 Jarque-Bera (JB): 31120.902
Skew: 6.915 Prob(JB): 0.00
Kurtosis: 61.158 Cond. No. 43.3
==============================================================================
Notes:
[1] Standard Errors are heteroscedasticity robust (HC3)
Note here that, unlike before, the 95% confidence interval for `roe` excludes 0.