Regression Details

This notebook will explore a few additional facets of linear regression with a single variable. Begin by loading the CEO salary data (ceosal1) from the wooldridge module. We'll print out summary statistics for the DataFrame to get a snapshot of what the data look like.

import wooldridge as woo
ceos = woo.data('ceosal1')
ceos.describe()
salary pcsalary sales roe pcroe ros indus finance consprod utility lsalary lsales
count 209.000000 209.000000 209.000000 209.000000 209.000000 209.000000 209.000000 209.000000 209.000000 209.000000 209.000000 209.000000
mean 1281.119617 13.282297 6923.793282 17.184211 10.800478 61.803828 0.320574 0.220096 0.287081 0.172249 6.950386 8.292265
std 1372.345308 32.633921 10633.271088 8.518509 97.219399 68.177052 0.467818 0.415306 0.453486 0.378503 0.566374 1.013161
min 223.000000 -61.000000 175.199997 0.500000 -98.900002 -58.000000 0.000000 0.000000 0.000000 0.000000 5.407172 5.165928
25% 736.000000 -1.000000 2210.300049 12.400000 -21.200001 21.000000 0.000000 0.000000 0.000000 0.000000 6.601230 7.700883
50% 1039.000000 9.000000 3705.199951 15.500000 -3.000000 52.000000 0.000000 0.000000 0.000000 0.000000 6.946014 8.217492
75% 1407.000000 20.000000 7177.000000 20.000000 19.500000 81.000000 1.000000 0.000000 1.000000 0.000000 7.249215 8.878636
max 14822.000000 212.000000 97649.898438 56.299999 977.000000 418.000000 1.000000 1.000000 1.000000 1.000000 9.603868 11.489144

We’ve already computed ordinary least squares regression coefficients “by hand” once, which is a rite of passage that will surely stay with you for the rest of your life. But now that we’ve seen how it’s done, it’s time to speed things up by asking Python to do the regression for us.

The module we'll use for this is statsmodels. Within statsmodels, the formula.api suite of tools will allow us to perform linear regression with ease. The canonical import for statsmodels.formula.api is smf.

import statsmodels.formula.api as smf

Regression in statsmodels happens in two steps. In the first step, we tell Python about a regression formula that we want to use. The linear equation

\[ y = \beta_0 + \beta_1 x + u \]

is described as y ~ x in smf.

reg = smf.ols(formula='salary ~ roe', data=ceos)

Note that we do not need to specify an intercept term. The statsmodels module will add one in by default. The reason why will be considered later.
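
As an aside, if you ever do want a model without an intercept, the formula syntax lets you remove it by appending - 1 (or + 0) to the formula. This is only a sketch of the syntax; we will not use such a model here.

reg_no_const = smf.ols(formula='salary ~ roe - 1', data=ceos) # hypothetical: regression through the origin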

After creating a variable that holds the regression information, we must then instruct statsmodels to actually perform the regression. The reason this task is split into two steps will become clear later, when we address heteroskedasticity.

res = reg.fit()

The output from a .fit() call will be a variable that stores all kinds of useful information.

For an overview of the regression estimation, use the .summary() command on the output variable. If you put the .summary() call inside a print() statement, the output will display a bit more cleanly on your screen.

print(res.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                 salary   R-squared:                       0.013
Model:                            OLS   Adj. R-squared:                  0.008
Method:                 Least Squares   F-statistic:                     2.767
Date:                Mon, 06 Dec 2021   Prob (F-statistic):             0.0978
Time:                        12:05:57   Log-Likelihood:                -1804.5
No. Observations:                 209   AIC:                             3613.
Df Residuals:                     207   BIC:                             3620.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    963.1913    213.240      4.517      0.000     542.790    1383.592
roe           18.5012     11.123      1.663      0.098      -3.428      40.431
==============================================================================
Omnibus:                      311.096   Durbin-Watson:                   2.105
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            31120.902
Skew:                           6.915   Prob(JB):                         0.00
Kurtosis:                      61.158   Cond. No.                         43.3
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

There's a lot of output here; some of it deals with topics we have already covered, while much of it deals with things we have not yet discussed.

The Intercept (\(\beta_0\)) is estimated to be \(963.1913\). The slope coefficient on roe (\(\beta_1\)) is estimated to be \(18.5012\).
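
If you just want the estimated coefficients rather than the full table, they are stored in the .params attribute of the output variable. A quick sketch:

print(res.params['Intercept']) # estimated beta_0
print(res.params['roe']) # estimated beta_1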

Along with the ability to display summary output, the output variable also stores information about \(\hat{y}\) and \(\hat{u}\). The information on \(\hat{y}\) is in .fittedvalues while the information about \(\hat{u}\) is in .resid. Of course, \(\hat{u}\) is merely \(y - \hat{y}\), so once we have .fittedvalues we could easily calculate \(\hat{u}\) ourselves without getting the data from .resid (but statsmodels is just super generous).

ceos['salary_hat'] = res.fittedvalues # creates a column of y hat values for our DataFrame
ceos['u_hat'] = res.resid # creates a column of u hat values for our DataFrame
ceos.head()
salary pcsalary sales roe pcroe ros indus finance consprod utility lsalary lsales salary_hat u_hat
0 1095 20 27595.000000 14.1 106.400002 191 1 0 0 0 6.998509 10.225389 1224.058071 -129.058071
1 1001 32 9958.000000 10.9 -30.600000 13 1 0 0 0 6.908755 9.206132 1164.854261 -163.854261
2 1122 9 6125.899902 23.5 -16.299999 14 1 0 0 0 7.022868 8.720281 1397.969216 -275.969216
3 578 -9 16246.000000 5.9 -25.700001 -21 1 0 0 0 6.359574 9.695602 1072.348338 -494.348338
4 1368 7 21783.199219 13.8 -3.000000 56 1 0 0 0 7.221105 9.988894 1218.507712 149.492288
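
As a quick sanity check (a sketch using numpy), we can confirm that the residuals stored in .resid really are just salary minus salary_hat:

import numpy as np
print(np.allclose(ceos['u_hat'], ceos['salary'] - ceos['salary_hat'])) # True: u hat equals y minus y hat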

Let’s turn our attention to \(R^2\), a measure for the goodness of fit for a linear regression. Define:

\[ \text{Sum of Squares, Total (SST)} = \sum_{i=1}^n (y_i - \bar{y})^2 \]

to be the total variation in the dependent variable. Note that the sample variance of the dependent variable is SST divided by \(n - 1\). Also define

\[ \text{Sum of Squares, Explained (SSE)} = \sum_{i=1}^n (\hat{y}_i - \bar{y})^2 \]

to be the variation explained by the regression. Finally, define

\[ \text{Sum of Squares, Residual (SSR)} = \sum_{i=1}^n (y_i - \hat{y}_i)^2 \]

to be the variation in the dependent variable that is not explained by the regression.

Then, the \(R^2\) of the regression model is

\[ R^2 = \frac{SSE}{SST} = 1 - \frac{SSR}{SST}. \]
ceos['ST'] = (ceos['salary'] - ceos['salary'].mean())**2 # squared deviations of salary from its mean (terms of SST)
ceos['SE'] = (ceos['salary_hat'] - ceos['salary'].mean())**2 # terms of SSE
ceos['SR'] = (ceos['u_hat'])**2 # terms of SSR (squared residuals)
ceos[['ST','SE','SR']].describe()
ST SE SR
count 2.090000e+02 209.000000 2.090000e+02
mean 1.874320e+06 24719.708325 1.849601e+06
std 1.449754e+07 59047.801988 1.438029e+07
min 3.535839e+00 2.427324 3.847426e-01
25% 4.457149e+04 1532.299834 4.076691e+04
50% 1.773417e+05 6882.895491 1.834925e+05
75% 4.544373e+05 20740.935509 3.904399e+05
max 1.833554e+08 523725.039775 1.822469e+08
print(ceos['SE'].sum() / ceos['ST'].sum()) # R-squared as SSE/SST
print(1 - ceos['SR'].sum() / ceos['ST'].sum()) # R-squared as 1 - SSR/SST
0.013188624081034115
0.01318862408103405

This value for \(R^2\) matches the one shown in the .summary() output above.
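
In fact, we did not need to build these columns by hand just to get \(R^2\): statsmodels stores the value directly in the .rsquared attribute of the output variable.

print(res.rsquared) # same value as computed above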

In the same spirit, we can also reproduce the reported standard error by hand. If we assume that the variance of \(u\) is independent of \(x\) (meaning, in this instance, that the variance of the regression error does not depend on the level of roe), then the formula for \(var(\hat{\beta}_1)\) is given by

\[ var(\hat{\beta}_1) = \frac{var(u)}{\sum_{i=1}^n(x_i - \bar{x})^2} \]

where we can estimate \(var(u)\) with

\[ \hat{var}(u) = \frac{1}{n-2}\sum_{i=1}^n \hat{u}_i^2 \]

Often, we report the square root of \(var(\hat{\beta}_1)\), which is referred to as the standard error of \(\hat{\beta}_1\).

import numpy as np
se_b1 = np.sqrt( ((res.resid**2).sum()/(len(ceos['SR'])-2)) / ((ceos['roe'] - ceos['roe'].mean())**2).sum() )
print(se_b1)
11.123250903287637

This output again matches the regression output from .summary() above.

To see it a bit more slowly:

n = len(ceos['SR'])
varHat_u = (1/(n-2)) * (res.resid**2).sum()
var_b1 = varHat_u / ((ceos['roe'] - ceos['roe'].mean())**2).sum()
se_b1 = np.sqrt(var_b1)
print(se_b1)
11.123250903287637
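
As with the coefficients and \(R^2\), statsmodels also stores the standard errors directly, in the .bse attribute of the output variable. A quick sketch:

print(res.bse['roe']) # standard error of the roe coefficient, same value as computed above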

Knowledge of the standard error is important: it is what we use to build test statistics and confidence intervals for \(\hat{\beta}_1\), which tell us how precisely the slope has been estimated.
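
For example, here is a sketch of how the t statistic and a rough 95% confidence interval for the roe coefficient can be rebuilt from the pieces we already have (using 1.96 from the normal approximation rather than the exact t critical value, so the interval will differ slightly from the summary table):

b1 = res.params['roe']
t_b1 = b1 / se_b1 # matches the t value of about 1.663 in the summary
ci_lower = b1 - 1.96 * se_b1 # rough 95% confidence interval
ci_upper = b1 + 1.96 * se_b1
print(t_b1, ci_lower, ci_upper)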

The formulas shown above assume that the variance of \(u\) is independent of \(x\). We can allow for the two to be related by passing the cov_type='HC3' option to .fit().

res2 = reg.fit(cov_type='HC3')
print(res2.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                 salary   R-squared:                       0.013
Model:                            OLS   Adj. R-squared:                  0.008
Method:                 Least Squares   F-statistic:                     7.191
Date:                Mon, 06 Dec 2021   Prob (F-statistic):            0.00792
Time:                        12:05:57   Log-Likelihood:                -1804.5
No. Observations:                 209   AIC:                             3613.
Df Residuals:                     207   BIC:                             3620.
Df Model:                           1                                         
Covariance Type:                  HC3                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    963.1913    122.072      7.890      0.000     723.935    1202.448
roe           18.5012      6.899      2.682      0.007       4.979      32.024
==============================================================================
Omnibus:                      311.096   Durbin-Watson:                   2.105
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            31120.902
Skew:                           6.915   Prob(JB):                         0.00
Kurtosis:                      61.158   Cond. No.                         43.3
==============================================================================

Notes:
[1] Standard Errors are heteroscedasticity robust (HC3)

Note here that, unlike before, the 95% confidence interval for roe excludes \(0\).
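
If you want the confidence intervals themselves rather than reading them off the summary tables, the .conf_int() method returns them. A quick sketch comparing the two fits:

print(res.conf_int()) # nonrobust: the roe interval includes 0
print(res2.conf_int()) # HC3 robust: the roe interval excludes 0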