Regression

What function we use to estimate a relationship between \(y\) and the regressor(s) depends on the data that we have. As practice using multiple x-variables, let's simulate a dataset that is generated by the following equation:

$$
y = 4 + 0.5 x + 3 x^2 + u
$$

The relationship between \(y\) and \(x\) in the above equation is said to be nonlinear in \(x\).

import pandas as pd
import numpy as np
import seaborn as sns
import statsmodels.formula.api as smf

np.random.seed(0)

df = pd.DataFrame(columns=['y', 'x', 'x2'])
df['x'] = np.random.normal(0, 1, 100)
df['x2'] = df['x'] ** 2
df['y'] = 4 + 0.5 * df['x'] + 3 * df['x2'] + np.random.normal(0, 0.5, 100)
df.head()
           y         x        x2
0  15.159244  1.764052  3.111881
1   4.006576  0.400157  0.160126
2   6.727911  0.978738  0.957928
3  20.669952  2.240893  5.021602
4  14.810536  1.867558  3.487773

Note that to square column 'x' to produce 'x2', all we need to do is type df['x'] ** 2, since ** is the operator for exponentiation.
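
As an aside, the formula interface we use below (smf.ols) can also square a regressor on the fly with the I() transform, so the separate 'x2' column is a convenience rather than a requirement. A minimal sketch of that alternative, not used in the rest of this section:

# Alternative: square x inside the formula itself via I(), no 'x2' column needed
model_alt = smf.ols('y ~ x + I(x ** 2)', data=df).fit()
print(model_alt.params)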

Next, plot a relationship between just y and x, assuming a linear fit.

sns.lmplot(x='x', y='y', data=df)
[Figure: scatter plot of y against x with a fitted straight line]

The estimated straight line gets things horribly wrong! The line is pretty far away from a lot of the actual data points.

The scatter plot above makes clear that the relationship between \(x\) and \(y\) is nonlinear.

What we ought to do is fit a curve given by

$$
\hat{y} = \hat{c} + \hat{\beta}_1 x + \hat{\beta}_2 x^2
$$

where there are two beta coefficients: one for \(x\) and one for \(x^2\).

model = smf.ols('y ~ x + x2', data=df).fit()
print(model.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.983
Model:                            OLS   Adj. R-squared:                  0.983
Method:                 Least Squares   F-statistic:                     2885.
Date:                Mon, 06 Dec 2021   Prob (F-statistic):           3.91e-87
Time:                        12:06:14   Log-Likelihood:                -75.005
No. Observations:                 100   AIC:                             156.0
Df Residuals:                      97   BIC:                             163.8
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      4.0714      0.066     61.584      0.000       3.940       4.203
x              0.5615      0.052     10.829      0.000       0.459       0.664
x2             2.9666      0.040     73.780      0.000       2.887       3.046
==============================================================================
Omnibus:                        7.876   Durbin-Watson:                   2.003
Prob(Omnibus):                  0.019   Jarque-Bera (JB):                3.580
Skew:                           0.188   Prob(JB):                        0.167
Kurtosis:                       2.152   Cond. No.                         2.47
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

The variable model stores a lot of data. Beyond holding summary output for the performance of the regression (which we've accessed via the model.summary() command), we can reference the estimated parameters directly via .params. For example:

print('Intercept: ', model.params['Intercept'])
print('beta for x:', model.params['x'])
print('beta for x2:', model.params['x2'])
Intercept:  4.071350754098632
beta for x: 0.5615154693913407
beta for x2: 2.9666242239334637

As an aside, the fact that we're using square brackets to reference items inside of model.params is a clue that the .params component of the variable model was built using a dictionary-like structure.
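
In fact, .params is a pandas Series, which supports exactly this dictionary-style lookup by key. A small illustrative check of its type and available keys:

# .params is a pandas Series indexed by term name,
# so it supports dictionary-like access by key
print(type(model.params))
print(list(model.params.index))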

One way to tell that the quadratic-fit estimates in model are better than a simple straight-line fit is to look at the R-squared value in the summary output. This measure takes a value between 0 and 1, with a score of 1 indicating perfect fit. For comparison purposes, consider the linear-fit model below:

model_linear = smf.ols('y ~ x', data=df).fit()
print(model_linear.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.056
Model:                            OLS   Adj. R-squared:                  0.046
Method:                 Least Squares   F-statistic:                     5.767
Date:                Mon, 06 Dec 2021   Prob (F-statistic):             0.0182
Time:                        12:06:14   Log-Likelihood:                -277.26
No. Observations:                 100   AIC:                             558.5
Df Residuals:                      98   BIC:                             563.7
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      7.0734      0.392     18.054      0.000       6.296       7.851
x              0.9318      0.388      2.401      0.018       0.162       1.702
==============================================================================
Omnibus:                       49.717   Durbin-Watson:                   1.848
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              128.463
Skew:                           1.877   Prob(JB):                     1.27e-28
Kurtosis:                       7.091   Cond. No.                         1.06
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

The linear-fit model produced an R-squared of 0.056, while the quadratic-fit model yielded an R-squared of 0.983.
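
Rather than reading these values off the summary tables, we can also pull them straight from each fitted results object via the .rsquared attribute. A quick check along those lines:

# R-squared is stored directly on each fitted results object
print('Quadratic fit R-squared:', model.rsquared)
print('Linear fit R-squared:   ', model_linear.rsquared)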

Another informative way to judge a modeled relationship is by plotting the residuals. The residuals are the "unexpected" part of the equation. For example, in the linear-fit model, the residual is defined as

$$
\hat{u} := y - \hat{y} = y - \hat{c} - \hat{\beta} x
$$

and in the quadratic-fit model the residual, \(\hat{u}\), is given by

$$
\hat{u} := y - \hat{y} = y - \hat{c} - \hat{\beta}_1 x - \hat{\beta}_2 x^2.
$$

We could calculate these by hand from the information in .params, but statsmodels also stores them for us in the .resid attribute of each fitted model.

df['resid'] = model.resid
df['resid_linear'] = model_linear.resid
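
Here we rely on the .resid attribute that statsmodels stores on each fitted model. If we wanted to verify the residual definition above by hand, a sketch using the estimates in .params might look like this:

# Recompute the quadratic-model residuals manually from .params
# and confirm they match the values stored in model.resid
manual_resid = (df['y']
                - model.params['Intercept']
                - model.params['x'] * df['x']
                - model.params['x2'] * df['x2'])
print(np.allclose(manual_resid, model.resid))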

Next, plot the two sets of residuals.

sns.scatterplot(x='x', y='resid_linear', data=df)
[Figure: linear-fit residuals (resid_linear) plotted against x]

In the straight-line model, we can see that the errors have a noticeable pattern to them. This is an indication that a more complicated function of \(x\) would be a better description for the relationship between \(x\) and \(y\).

sns.scatterplot(x='x', y='resid', data=df)
[Figure: quadratic-fit residuals (resid) plotted against x]

In comparison, the residuals from the quadratic-fit model look more random. Additionally, they're substantially smaller on average, with almost all residuals having an absolute value less than one. This indicates a much better fit than the straight-line model, in which the residual values were often much larger.
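
To put a number on "substantially smaller on average," we can compare the mean absolute residual of the two models. A short illustrative calculation:

# Mean absolute residual for the quadratic fit (resid) and the linear fit (resid_linear)
print(df[['resid', 'resid_linear']].abs().mean())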

Looking Beyond OLS

OLS works well when the \(y\) variable in our model is a linear combination of \(x\) variables. Note that the relationship between \(y\) and a given regressor may be nonlinear, as in the case of \(y\) being a function of \(x\) and \(x^2\). However, while we may say that \(y\) is a nonlinear function of \(x\) in this case, the variable \(y\) is still a linear function of \(x\) and \(x^2\). To clarify:

$$
y = \alpha + \beta x + u
$$

is linear in \(x\). Likewise:

$$
y = \alpha + \beta_1 x + \beta_2 x^2 + u
$$

is linear in \(x\) and \(x^2\). In contrast, the function

$$
y = \frac{e^{\alpha + \beta x + u}}{1 + e^{\alpha + \beta x + u}}
$$

is *nonlinear*. This last equation may look terrifyingly unnatural, but it's actually very useful. Let's get a sense of what the equation looks like by plotting the function

$$
y = \frac{e^{x}}{1 + e^{x}}.
$$

np.random.seed(0)
curve = pd.DataFrame(columns=['x','y'])
curve['x'] = np.random.randint(low=-10, high=10, size=100)
curve['y'] = np.exp(curve['x']) / (1 + np.exp(curve['x']))
print(curve.head())
sns.lineplot(x='x', y='y', data=curve)
    x         y
0   2  0.880797
1   5  0.993307
2 -10  0.000045
3  -7  0.000911
4  -7  0.000911
[Figure: line plot of y = e^x / (1 + e^x) against x]

The function \(y = e^{x}/(1+e^{x})\) creates an S-curve with a lower bound of \(0\) and an upper bound of \(1\).

These bounds are useful in analytics. Often, our task is to estimate probabilities. For instance, what is the likelihood that a borrower defaults on their mortgage, the likelihood that a credit card transaction is fraudulent, or the likelihood that a company will violate its capital expenditure covenant on an outstanding loan? All of these questions require us to estimate a probability. The above function is useful because it considers a situation in which the \(y\) variable is necessarily between \(0\) and \(1\) (i.e. \(0\%\) probability and \(100\%\) probability).

Suppose that we instead took the curve data above and tried to fit it with linear regression.

sns.lmplot(x='x', y='y', data=curve)
[Figure: S-curve data with a fitted straight line]

Note that in the above plot, there are estimated "probabilities" (the \(\hat{y}\) values) that are either lower than \(0\) or higher than \(1\). That is impossible for a probability.

That fancy S-shaped function above is called the inverse logit function. That is because \(e^x / (1+e^x)\) is the inverse of the logit function, given by \(\log(x / (1-x))\). Just like we can invert the equation \(y = f(x)\) to get \(f^{-1}(y) = x\), inverting \(y = e^x / (1+e^x)\) gives \(\log(y / (1-y)) = x\).
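
We can verify this inverse relationship numerically: pushing a few values through the inverse logit and then through the logit should return the original values. A small sketch:

# Apply the inverse logit, then the logit, and recover the original values
x_vals = np.array([-2.0, 0.0, 1.5])
y_vals = np.exp(x_vals) / (1 + np.exp(x_vals))  # inverse logit
x_back = np.log(y_vals / (1 - y_vals))          # logit
print(np.allclose(x_vals, x_back))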

When we model probabilities, the \(y\) variable will be either \(0\) or \(1\) for each observation. Observations where the event occurred are recorded with \(y=1\). For instance, in a dataset about mortgage default, borrowers who default on their mortgage would have \(y=1\) and everyone else would have \(y=0\).

Suppose that we estimate a model given by

$$
\log\Big(\frac{y}{1-y}\Big) = \alpha + \beta x + u.
$$

Then the estimated value \(\hat{\alpha} + \hat{\beta}x\) for a given observation would be \(\log(\hat{y}/(1-\hat{y}))\). This is what we refer to as the *log odds ratio*. The odds ratio is the probability that \(y\) equals \(1\) divided by the probability that \(y\) equals \(0\); this is what \(\hat{y}/(1-\hat{y})\) tells us. Ultimately, our estimate \(\hat{y}\) then tells us the likelihood that the true value for \(y\) is equal to \(1\).
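
To make the mechanics concrete, an estimated log odds value can be converted back into a probability with the inverse logit. A worked sketch using a made-up log odds value:

# Convert a hypothetical estimated log odds into a fitted probability
log_odds = 0.75                                    # hypothetical alpha_hat + beta_hat * x
p_hat = np.exp(log_odds) / (1 + np.exp(log_odds))  # inverse logit gives y_hat
print(p_hat)                                       # approximately 0.68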

This type of model is called a logistic regression model. Let’s simulate some data.

np.random.seed(0)
df2 = pd.DataFrame(columns=['x', 'y'])
df2['x'] = np.random.normal(2, 1, 1000)
df2['xb'] = -9 + 4 * df2['x']
df2['y'] = np.random.binomial(n=1, p= np.exp(df2['xb']) / (1+np.exp(df2['xb'])) )
df2.head()
          x  y        xb
0  3.764052  1  6.056209
1  2.400157  0  0.600629
2  2.978738  1  2.914952
3  4.240893  1  7.963573
4  3.867558  1  6.470232

Now try fitting a model using linear regression (ordinary least squares). When OLS is applied to a \(y\) variable that only takes value 0 or 1, the model is referred to as a linear probability model (LPM).

lpm = smf.ols('y ~ x', data=df2).fit()
print(lpm.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.514
Model:                            OLS   Adj. R-squared:                  0.514
Method:                 Least Squares   F-statistic:                     1056.
Date:                Mon, 06 Dec 2021   Prob (F-statistic):          1.41e-158
Time:                        12:06:16   Log-Likelihood:                -346.17
No. Observations:                1000   AIC:                             696.3
Df Residuals:                     998   BIC:                             706.2
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -0.2928      0.024    -12.187      0.000      -0.340      -0.246
x              0.3564      0.011     32.493      0.000       0.335       0.378
==============================================================================
Omnibus:                      113.684   Durbin-Watson:                   2.045
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               37.844
Skew:                           0.207   Prob(JB):                     6.06e-09
Kurtosis:                       2.142   Cond. No.                         5.70
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Now fit the model using logistic regression.

logit = smf.logit('y ~ x', data=df2).fit()
print(logit.summary())
Optimization terminated successfully.
         Current function value: 0.292801
         Iterations 8
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                      y   No. Observations:                 1000
Model:                          Logit   Df Residuals:                      998
Method:                           MLE   Df Model:                            1
Date:                Mon, 06 Dec 2021   Pseudo R-squ.:                  0.5660
Time:                        12:06:16   Log-Likelihood:                -292.80
converged:                       True   LL-Null:                       -674.60
Covariance Type:            nonrobust   LLR p-value:                4.432e-168
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -8.5975      0.567    -15.154      0.000      -9.709      -7.486
x              3.8987      0.259     15.062      0.000       3.391       4.406
==============================================================================

The estimated parameters (intercept and \(\beta\) coefficient) are much closer to their true values under logistic regression. While popular, linear probability models often do not give meaningful estimates. This is especially true when outcomes are rare (meaning nearly all \(y\) values are 0, or nearly all are 1, rather than a relatively even balance between the two).
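
One practical way to see the difference is to compare fitted probabilities from the two models: the linear probability model can produce fitted values below \(0\) or above \(1\), while the logistic model cannot. A quick check along these lines, assuming the lpm and logit fits above are still in memory:

# Compare the range of fitted probabilities from the LPM and the logit model
lpm_fitted = lpm.predict(df2)
logit_fitted = logit.predict(df2)
print('LPM fitted range:  ', lpm_fitted.min(), lpm_fitted.max())
print('Logit fitted range:', logit_fitted.min(), logit_fitted.max())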