Regression Details
This notebook will explore a few additional facets of linear regression with a single variable. Begin by loading the CEOs data from the `wooldridge` module. We'll use `.describe()` on the DataFrame to see a snapshot of what the data looks like.
import wooldridge as woo
ceos = woo.data('ceosal1')
ceos.describe()
| | salary | pcsalary | sales | roe | pcroe | ros | indus | finance | consprod | utility | lsalary | lsales |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 209.000000 | 209.000000 | 209.000000 | 209.000000 | 209.000000 | 209.000000 | 209.000000 | 209.000000 | 209.000000 | 209.000000 | 209.000000 | 209.000000 |
| mean | 1281.119617 | 13.282297 | 6923.793282 | 17.184211 | 10.800478 | 61.803828 | 0.320574 | 0.220096 | 0.287081 | 0.172249 | 6.950386 | 8.292265 |
| std | 1372.345308 | 32.633921 | 10633.271088 | 8.518509 | 97.219399 | 68.177052 | 0.467818 | 0.415306 | 0.453486 | 0.378503 | 0.566374 | 1.013161 |
| min | 223.000000 | -61.000000 | 175.199997 | 0.500000 | -98.900002 | -58.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 5.407172 | 5.165928 |
| 25% | 736.000000 | -1.000000 | 2210.300049 | 12.400000 | -21.200001 | 21.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 6.601230 | 7.700883 |
| 50% | 1039.000000 | 9.000000 | 3705.199951 | 15.500000 | -3.000000 | 52.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 6.946014 | 8.217492 |
| 75% | 1407.000000 | 20.000000 | 7177.000000 | 20.000000 | 19.500000 | 81.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 7.249215 | 8.878636 |
| max | 14822.000000 | 212.000000 | 97649.898438 | 56.299999 | 977.000000 | 418.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 9.603868 | 11.489144 |
We’ve already computed ordinary least squares regression coefficients “by hand” once, which is a rite of passage that will surely stay with you for the rest of your life. But now that we’ve seen how it’s done, it’s time to speed things up by asking Python to do the regression for us.
The module we'll use for this is `statsmodels`. Within `statsmodels`, the `formula.api` suite of tools will allow us to perform linear regression with ease. The canonical import for `statsmodels.formula.api` is `smf`.
import statsmodels.formula.api as smf
Regression in `statsmodels` happens in two steps. In the first step, we tell Python about a regression formula that we want to use. The linear equation

\[y = \beta_0 + \beta_1 x + u\]

is described as `y ~ x` in `smf`.
reg = smf.ols(formula='salary ~ roe', data=ceos)
Note that we do not need to specify an intercept term; the `statsmodels` module will add one in by default. The reason why will be considered later.
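As an aside, if you ever do want a regression without an intercept, the formula syntax lets you drop it explicitly (a quick sketch; `- 1` is standard formula notation for removing the constant term):

reg_no_const = smf.ols(formula='salary ~ roe - 1', data=ceos)  # '- 1' suppresses the automatic intercept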
After creating a variable that holds the regression information, we must then instruct `statsmodels` to actually perform the regression. The reason this task is split into two steps will be shown later, when we address heteroskedasticity.
res = reg.fit()
The output from a `.fit()` call will be a variable that stores all kinds of useful information. For an overview of the regression estimation, use the `.summary()` command on the output variable. If you put the `.summary()` call inside of a `print()` statement, the output will be displayed a bit more cleanly on your screen.
print(res.summary())
OLS Regression Results
==============================================================================
Dep. Variable: salary R-squared: 0.013
Model: OLS Adj. R-squared: 0.008
Method: Least Squares F-statistic: 2.767
Date: Mon, 06 Dec 2021 Prob (F-statistic): 0.0978
Time: 12:05:57 Log-Likelihood: -1804.5
No. Observations: 209 AIC: 3613.
Df Residuals: 207 BIC: 3620.
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 963.1913 213.240 4.517 0.000 542.790 1383.592
roe 18.5012 11.123 1.663 0.098 -3.428 40.431
==============================================================================
Omnibus: 311.096 Durbin-Watson: 2.105
Prob(Omnibus): 0.000 Jarque-Bera (JB): 31120.902
Skew: 6.915 Prob(JB): 0.00
Kurtosis: 61.158 Cond. No. 43.3
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
There's a lot of output here; some of it deals with topics we have already covered, while much of it deals with things we have not yet discussed.
The `Intercept` (\(\beta_0\)) is estimated to be \(963.1913\). The slope coefficient on `roe` (\(\beta_1\)) is estimated to be \(18.5012\).
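If you just want the point estimates rather than the whole table, the results object exposes them directly (these attribute names are part of the `statsmodels` results API):

print(res.params)         # both coefficient estimates, indexed by variable name
print(res.params['roe'])  # just the slope coefficient on roe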
Along with the ability to display summary output, the output variable also stores information about \(\hat{y}\) and \(\hat{u}\). The information on \(\hat{y}\) is in `.fittedvalues`, while the information about \(\hat{u}\) is in `.resid`. Of course, \(\hat{u}\) is merely \(y - \hat{y}\), so once we have `.fittedvalues` we could easily calculate \(\hat{u}\) ourselves without getting the data from `.resid` (but `statsmodels` is just super generous).
ceos['salary_hat'] = res.fittedvalues # creates a column of y hat values for our DataFrame
ceos['u_hat'] = res.resid # creates a column of u hat values for our DataFrame
ceos.head()
| | salary | pcsalary | sales | roe | pcroe | ros | indus | finance | consprod | utility | lsalary | lsales | salary_hat | u_hat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1095 | 20 | 27595.000000 | 14.1 | 106.400002 | 191 | 1 | 0 | 0 | 0 | 6.998509 | 10.225389 | 1224.058071 | -129.058071 |
| 1 | 1001 | 32 | 9958.000000 | 10.9 | -30.600000 | 13 | 1 | 0 | 0 | 0 | 6.908755 | 9.206132 | 1164.854261 | -163.854261 |
| 2 | 1122 | 9 | 6125.899902 | 23.5 | -16.299999 | 14 | 1 | 0 | 0 | 0 | 7.022868 | 8.720281 | 1397.969216 | -275.969216 |
| 3 | 578 | -9 | 16246.000000 | 5.9 | -25.700001 | -21 | 1 | 0 | 0 | 0 | 6.359574 | 9.695602 | 1072.348338 | -494.348338 |
| 4 | 1368 | 7 | 21783.199219 | 13.8 | -3.000000 | 56 | 1 | 0 | 0 | 0 | 7.221105 | 9.988894 | 1218.507712 | 149.492288 |
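As a quick sanity check on the claim that \(\hat{u} = y - \hat{y}\), computing the residuals by hand from the fitted values reproduces `.resid` up to floating-point error:

u_hat_manual = ceos['salary'] - ceos['salary_hat']  # u hat computed directly as y minus y hat
print((u_hat_manual - ceos['u_hat']).abs().max())   # should be (essentially) zero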
Let's turn our attention to \(R^2\), a measure of the goodness of fit of a linear regression. Define

\[SST = \sum_{i=1}^{n} (y_i - \bar{y})^2\]

to be the total variation in the dependent variable. Note that the variance of the dependent variable is SST divided by \(n\). Also define

\[SSE = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2\]

to be the variation explained by the regression. Finally, let

\[SSR = \sum_{i=1}^{n} \hat{u}_i^2\]

be the variation in the dependent variable that is not explained by the regression.

Then, the \(R^2\) of the regression model is

\[R^2 = \frac{SSE}{SST} = 1 - \frac{SSR}{SST}.\]
ceos['ST'] = (ceos['salary'] - ceos['salary'].mean())**2     # squared deviations of salary from its mean (sums to SST)
ceos['SE'] = (ceos['salary_hat'] - ceos['salary'].mean())**2 # squared deviations of the fitted values from that mean (sums to SSE)
ceos['SR'] = (ceos['u_hat'])**2                              # squared residuals (sums to SSR)
ceos[['ST','SE','SR']].describe()
| | ST | SE | SR |
|---|---|---|---|
| count | 2.090000e+02 | 209.000000 | 2.090000e+02 |
| mean | 1.874320e+06 | 24719.708325 | 1.849601e+06 |
| std | 1.449754e+07 | 59047.801988 | 1.438029e+07 |
| min | 3.535839e+00 | 2.427324 | 3.847426e-01 |
| 25% | 4.457149e+04 | 1532.299834 | 4.076691e+04 |
| 50% | 1.773417e+05 | 6882.895491 | 1.834925e+05 |
| 75% | 4.544373e+05 | 20740.935509 | 3.904399e+05 |
| max | 1.833554e+08 | 523725.039775 | 1.822469e+08 |
print(ceos['SE'].sum() / ceos['ST'].sum())
print(1 - ceos['SR'].sum() / ceos['ST'].sum())
0.013188624081034115
0.01318862408103405
This value for \(R^2\) matches the one shown in the `.summary()` output above.
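It also matches the value that `statsmodels` stores on the results object:

print(res.rsquared)  # the R-squared reported by .summary()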
Likewise, if we assume that the variance of \(u\) is independent of \(x\) (meaning, in this instance, that the regression error has a variance that does not depend on the level of `roe`), then the formula for \(var(\hat{\beta}_1)\) is given by

\[var(\hat{\beta}_1) = \frac{var(u)}{\sum_{i=1}^{n} (x_i - \bar{x})^2},\]

where we can estimate \(var(u)\) with

\[\widehat{var}(u) = \frac{1}{n-2} \sum_{i=1}^{n} \hat{u}_i^2.\]

Often, we report the square root of \(var(\hat{\beta}_1)\), which is referred to as the standard error of \(\hat{\beta}_1\).
import numpy as np
se_b1 = np.sqrt( ((res.resid**2).sum()/(len(ceos['SR'])-2)) / ((ceos['roe'] - ceos['roe'].mean())**2).sum() )
print(se_b1)
11.123250903287637
This output again matches the regression output from `.summary()` above.
To see it a bit more slowly:
n = len(ceos['SR'])                                                # number of observations
varHat_u = (1/(n-2)) * (res.resid**2).sum()                        # estimate of var(u), using n - 2 degrees of freedom
var_b1 = varHat_u / ((ceos['roe'] - ceos['roe'].mean())**2).sum()  # estimated variance of the slope estimator
se_b1 = np.sqrt(var_b1)                                            # standard error of the slope
print(se_b1)
11.123250903287637
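Equivalently, rather than recomputing anything by hand, the fitted standard errors are stored on the results object:

print(res.bse['roe'])  # standard error of the roe coefficient, as reported by .summary()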
Knowledge of the standard error is important: it is the basis for the t statistics and confidence intervals reported in the `.summary()` output, as the sketch below illustrates.
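For instance, here is a minimal sketch of how the summary's t statistic and 95% confidence interval for `roe` follow from \(\hat{\beta}_1\) and its standard error (using `scipy` only for the t critical value):

from scipy import stats

b1 = res.params['roe']
t_b1 = b1 / se_b1                                        # matches the t value in the summary (about 1.663)
t_crit = stats.t.ppf(0.975, res.df_resid)                # critical value with n - 2 degrees of freedom
print(t_b1, (b1 - t_crit * se_b1, b1 + t_crit * se_b1))  # interval matches the [0.025, 0.975] columns above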
The formulas shown above assume that the variance of \(u\) is independent of \(x\). We can allow for the two to be related with the `cov_type='HC3'` option to `.fit()`.
res2 = reg.fit(cov_type='HC3')
print(res2.summary())
OLS Regression Results
==============================================================================
Dep. Variable: salary R-squared: 0.013
Model: OLS Adj. R-squared: 0.008
Method: Least Squares F-statistic: 7.191
Date: Mon, 06 Dec 2021 Prob (F-statistic): 0.00792
Time: 12:05:57 Log-Likelihood: -1804.5
No. Observations: 209 AIC: 3613.
Df Residuals: 207 BIC: 3620.
Df Model: 1
Covariance Type: HC3
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 963.1913 122.072 7.890 0.000 723.935 1202.448
roe 18.5012 6.899 2.682 0.007 4.979 32.024
==============================================================================
Omnibus: 311.096 Durbin-Watson: 2.105
Prob(Omnibus): 0.000 Jarque-Bera (JB): 31120.902
Skew: 6.915 Prob(JB): 0.00
Kurtosis: 61.158 Cond. No. 43.3
==============================================================================
Notes:
[1] Standard Errors are heteroscedasticity robust (HC3)
Note here that, unlike before, the 95% confidence interval for `roe` excludes \(0\).
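To see the impact of the robust option side by side, we can line up the two sets of standard errors (a small sketch; `pandas` is imported here only to build the comparison table):

import pandas as pd

se_compare = pd.DataFrame({'nonrobust': res.bse, 'HC3': res2.bse})
print(se_compare)

For these data the HC3 standard errors are noticeably smaller than the nonrobust ones, which is why the `roe` coefficient is now statistically distinguishable from zero.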