Regression Details

This notebook will explore a few additional facets of linear regression with a single variable. Begin by loading the CEO salary data from the wooldridge module. We’ll print out summary statistics of the DataFrame with .describe() to get a snapshot of what the data looks like.

import wooldridge as woo
ceos = woo.data('ceosal1')
ceos.describe()
             salary    pcsalary         sales         roe       pcroe         ros       indus     finance    consprod     utility     lsalary      lsales
count    209.000000  209.000000    209.000000  209.000000  209.000000  209.000000  209.000000  209.000000  209.000000  209.000000  209.000000  209.000000
mean    1281.119617   13.282297   6923.793282   17.184211   10.800478   61.803828    0.320574    0.220096    0.287081    0.172249    6.950386    8.292265
std     1372.345308   32.633921  10633.271088    8.518509   97.219399   68.177052    0.467818    0.415306    0.453486    0.378503    0.566374    1.013161
min      223.000000  -61.000000    175.199997    0.500000  -98.900002  -58.000000    0.000000    0.000000    0.000000    0.000000    5.407172    5.165928
25%      736.000000   -1.000000   2210.300049   12.400000  -21.200001   21.000000    0.000000    0.000000    0.000000    0.000000    6.601230    7.700883
50%     1039.000000    9.000000   3705.199951   15.500000   -3.000000   52.000000    0.000000    0.000000    0.000000    0.000000    6.946014    8.217492
75%     1407.000000   20.000000   7177.000000   20.000000   19.500000   81.000000    1.000000    0.000000    1.000000    0.000000    7.249215    8.878636
max    14822.000000  212.000000  97649.898438   56.299999  977.000000  418.000000    1.000000    1.000000    1.000000    1.000000    9.603868   11.489144

We’ve already computed ordinary least squares regression coefficients “by hand” once, which is a rite of passage that will surely stay with you for the rest of your life. But now that we’ve seen how it’s done, it’s time to speed things up by asking Python to do the regression for us.

The module we’ll use for this is statsmodels. Within statsmodels, the formula.api suite of tools will allow us to perform linear regression with ease. The canonical import alias for statsmodels.formula.api is smf.

import statsmodels.formula.api as smf

Regression in statsmodels happens in two steps. In the first step, we tell Python about a regression formula that we want to use. The linear equation

$y = \beta_0 + \beta_1 x + u$

is described as y ~ x in smf.

reg = smf.ols(formula='salary ~ roe', data=ceos)

Note that we do not need to specify an intercept term; statsmodels adds one in by default. The reason why will be considered later.
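
As an aside: if you ever do want a regression without an intercept, the formula syntax (provided by the patsy library) lets you remove it by appending - 1 to the formula. A minimal sketch, with the variable names chosen just for illustration:

reg_noint = smf.ols(formula='salary ~ roe - 1', data=ceos) # '- 1' removes the intercept
res_noint = reg_noint.fit()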

After creating a variable that holds the regression information, we must then instruct statsmodels to actually perform the regression. The reason this task is split into two steps will become clear later, when we address heteroskedasticity.

res = reg.fit()

The output from a .fit() call is a results object that stores all kinds of useful information.
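
If you are curious about everything the results object holds, a standard Python trick (not specific to statsmodels) is to list its public attributes:

print([attr for attr in dir(res) if not attr.startswith('_')]) # names of available attributes and methods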

For an overview of the regression estimation, use the .summary() method on the results object. If you put the .summary() call inside of a print() statement, the output will be formatted more readably on your screen.

print(res.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                 salary   R-squared:                       0.013
Model:                            OLS   Adj. R-squared:                  0.008
Method:                 Least Squares   F-statistic:                     2.767
Date:                Mon, 06 Dec 2021   Prob (F-statistic):             0.0978
Time:                        12:05:57   Log-Likelihood:                -1804.5
No. Observations:                 209   AIC:                             3613.
Df Residuals:                     207   BIC:                             3620.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    963.1913    213.240      4.517      0.000     542.790    1383.592
roe           18.5012     11.123      1.663      0.098      -3.428      40.431
==============================================================================
Omnibus:                      311.096   Durbin-Watson:                   2.105
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            31120.902
Skew:                           6.915   Prob(JB):                         0.00
Kurtosis:                      61.158   Cond. No.                         43.3
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

There’s a lot of output here; some of it deals with topics we have already covered, while much of it deals with things we have not yet discussed.

The intercept ($\hat{\beta}_0$) is estimated to be 963.1913. The slope coefficient on roe ($\hat{\beta}_1$) is estimated to be 18.5012.
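
These estimates can also be pulled out of the results object programmatically through its .params attribute, which is convenient for follow-up calculations. For example, since salary is measured in thousands of dollars, a back-of-the-envelope prediction for a firm with a return on equity of 30 looks like this (the value 30 is just an illustrative choice):

b0 = res.params['Intercept'] # estimated intercept
b1 = res.params['roe'] # estimated slope on roe
print(b0 + b1 * 30) # predicted salary, in thousands of dollars, when roe = 30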

Along with the ability to display summary output, the results object also stores information about $\hat{y}$ and $\hat{u}$. The information on $\hat{y}$ is in .fittedvalues, while the information about $\hat{u}$ is in .resid. Of course, $\hat{u}$ is merely $y - \hat{y}$, so once we have .fittedvalues we could easily calculate $\hat{u}$ ourselves without getting the data from .resid (but statsmodels is just super generous).

ceos['salary_hat'] = res.fittedvalues # creates a column of y hat values for our DataFrame
ceos['u_hat'] = res.resid # creates a column of u hat values for our DataFrame
ceos.head()
   salary  pcsalary         sales   roe       pcroe  ros  indus  finance  consprod  utility   lsalary     lsales   salary_hat        u_hat
0    1095        20  27595.000000  14.1  106.400002  191      1        0         0        0  6.998509  10.225389  1224.058071  -129.058071
1    1001        32   9958.000000  10.9  -30.600000   13      1        0         0        0  6.908755   9.206132  1164.854261  -163.854261
2    1122         9   6125.899902  23.5  -16.299999   14      1        0         0        0  7.022868   8.720281  1397.969216  -275.969216
3     578        -9  16246.000000   5.9  -25.700001  -21      1        0         0        0  6.359574   9.695602  1072.348338  -494.348338
4    1368         7  21783.199219  13.8   -3.000000   56      1        0         0        0  7.221105   9.988894  1218.507712   149.492288
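
As a quick sanity check (purely illustrative, not part of the workflow above), we can confirm that .resid really does equal $y - \hat{y}$, and that the residuals sum to essentially zero, as they always do in OLS with an intercept:

import numpy as np
print(np.allclose(ceos['salary'] - ceos['salary_hat'], ceos['u_hat'])) # expect: True
print(ceos['u_hat'].sum()) # essentially zero, up to floating-point error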

Let’s turn our attention to $R^2$, a measure of the goodness of fit of a linear regression. Define:

Sum of Squares, Total: $\text{SST} = \sum_{i=1}^{n} (y_i - \bar{y})^2$

to be the total variation in the dependent variable. Note that the variance of the dependent variable is SST divided by $n$ (or by $n - 1$, for the usual sample variance). Also define

Sum of Squares, Explained: $\text{SSE} = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$

to be the variation explained by the regression. Finally, let

Sum of Squares, Residual: $\text{SSR} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

be the variation in the dependent variable that is not explained by the regression.

Then, the $R^2$ of the regression model is

$R^2 = \dfrac{\text{SSE}}{\text{SST}} = 1 - \dfrac{\text{SSR}}{\text{SST}}.$

ceos['ST'] = (ceos['salary'] - ceos['salary'].mean())**2 # squared deviations from the mean: the terms of SST
ceos['SE'] = (ceos['salary_hat'] - ceos['salary'].mean())**2 # squared deviations of the fitted values: the terms of SSE
ceos['SR'] = (ceos['u_hat'])**2 # squared residuals: the terms of SSR
ceos[['ST','SE','SR']].describe()
                 ST             SE            SR
count  2.090000e+02     209.000000  2.090000e+02
mean   1.874320e+06   24719.708325  1.849601e+06
std    1.449754e+07   59047.801988  1.438029e+07
min    3.535839e+00       2.427324  3.847426e-01
25%    4.457149e+04    1532.299834  4.076691e+04
50%    1.773417e+05    6882.895491  1.834925e+05
75%    4.544373e+05   20740.935509  3.904399e+05
max    1.833554e+08  523725.039775  1.822469e+08
print(ceos['SE'].sum() / ceos['ST'].sum()) # R-squared as SSE / SST
print(1 - ceos['SR'].sum() / ceos['ST'].sum()) # R-squared as 1 - SSR / SST
0.013188624081034115
0.01318862408103405

This value for $R^2$ matches the one shown in the .summary() output above. (The two formulas agree because SST = SSE + SSR whenever the regression includes an intercept.)
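
There is no need to compute $R^2$ by hand every time: the results object stores it directly in its .rsquared attribute.

print(res.rsquared) # same value as the manual calculation above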

Likewise, if we assume that the variance of $u$ is independent of $x$ (meaning, in this instance, that the variance of the regression error does not depend on the level of roe), then the formula for $\mathrm{var}(\hat{\beta}_1)$ is given by

$\mathrm{var}(\hat{\beta}_1) = \dfrac{\mathrm{var}(u)}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$

where we can estimate $\mathrm{var}(u)$ with

$\widehat{\mathrm{var}}(u) = \dfrac{1}{n-2} \sum_{i=1}^{n} \hat{u}_i^2$

Often, we report the square root of $\mathrm{var}(\hat{\beta}_1)$, which is referred to as the standard error of $\hat{\beta}_1$.

import numpy as np
se_b1 = np.sqrt( ((res.resid**2).sum()/(len(ceos['SR'])-2)) / ((ceos['roe'] - ceos['roe'].mean())**2).sum() )
print(se_b1)
11.123250903287637

This output again matches the regression output from .summary() above.

To see it a bit more slowly:

n = len(ceos['SR']) # number of observations
varHat_u = (1/(n-2)) * (res.resid**2).sum() # estimate of var(u)
var_b1 = varHat_u / ((ceos['roe'] - ceos['roe'].mean())**2).sum() # estimate of var(beta_1 hat)
se_b1 = np.sqrt(var_b1) # standard error of beta_1 hat
print(se_b1)
11.123250903287637
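
As with $R^2$, there is no need to do this by hand in practice: the results object stores the standard errors of all coefficients in its .bse attribute.

print(res.bse) # standard errors for Intercept and roe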

Knowledge of the standard error is important: it measures how precisely $\hat{\beta}_1$ is estimated, and it underlies the t statistics and confidence intervals reported in the summary output.

The formulas shown above assume that the variance of $u$ is independent of $x$. We can allow for the two to be related by passing the cov_type='HC3' option to .fit().

res2 = reg.fit(cov_type='HC3')
print(res2.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                 salary   R-squared:                       0.013
Model:                            OLS   Adj. R-squared:                  0.008
Method:                 Least Squares   F-statistic:                     7.191
Date:                Mon, 06 Dec 2021   Prob (F-statistic):            0.00792
Time:                        12:05:57   Log-Likelihood:                -1804.5
No. Observations:                 209   AIC:                             3613.
Df Residuals:                     207   BIC:                             3620.
Df Model:                           1                                         
Covariance Type:                  HC3                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    963.1913    122.072      7.890      0.000     723.935    1202.448
roe           18.5012      6.899      2.682      0.007       4.979      32.024
==============================================================================
Omnibus:                      311.096   Durbin-Watson:                   2.105
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            31120.902
Skew:                           6.915   Prob(JB):                         0.00
Kurtosis:                      61.158   Cond. No.                         43.3
==============================================================================

Notes:
[1] Standard Errors are heteroscedasticity robust (HC3)

Note here that, unlike before, the 95% confidence interval for roe excludes 0.
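
If you want these confidence intervals as data rather than as printed text, the results object can return them directly via its .conf_int() method:

print(res2.conf_int()) # 95% confidence intervals using the HC3 standard errors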