Regression

The purpose of regression is to fit a line. This line is used to describe the relationship between one or more “x” variables and a “y” variable. For example:

\[ y = \alpha + \beta x \]

describes a straight line with intercept \(\alpha\) and slope \(\beta\). The variable \(x\) is referred to as the regressor in this model. Models can have multiple regressors.
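For example, a model with two regressors \(x_1\) and \(x_2\) would be written

\[ y = \alpha + \beta_1 x_1 + \beta_2 x_2 \]

where each slope coefficient describes the relationship between its regressor and \(y\), holding the other regressor fixed.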

We can generate some data that draws such a line. In this example, take \(\alpha=4\) and \(\beta=0.5\).

import pandas as pd
import numpy as np

np.random.seed(0)
line = pd.DataFrame(columns=['x','y']) # creates an empty DataFrame (a DataFrame with no rows) named 'line' that has column names 'x' and 'y'
line['x'] = np.random.randint(low=-10, high=10, size=100)
line['y'] = 4 + 0.5 * line['x']
line.head()
x y
0 2 5.0
1 5 6.5
2 -10 -1.0
3 -7 0.5
4 -7 0.5

Using the seaborn plot function lineplot(), we can visualize this line.

import seaborn as sns

sns.lineplot(x='x', y='y', data=line)
[Figure: line plot of y versus x]

No surprises here: we get a straight line. The slope is one-half, and the intercept is 4.

Concept review: Report the descriptive statistics for x and y below.

line.describe()
x y
count 100.000000 100.000000
mean -1.120000 3.440000
std 6.148466 3.074233
min -10.000000 -1.000000
25% -7.000000 0.500000
50% -1.500000 3.250000
75% 5.000000 6.500000
max 9.000000 8.500000

In the real world, relationships between \(x\)’s and \(y\)’s are usually complicated. That is, there is no line that perfectly describes the connection between \(x\) and \(y\). Our goal is to find the best possible approximation of a relationship. For instance, change the above equation to

\[ y = \alpha + \beta x + u \]

where \(u\) is some random, unobserved phenomenon that makes any pair of values \((x,y)\) partially unpredictable. Consider the following simulated dataset, where \(y\) and \(x\) have a known relationship (we know \(\alpha=4\) and \(\beta=0.5\)) with some error thrown into the mix.

np.random.seed(0)
df = pd.DataFrame(columns=['x','y'])
df['x'] = np.random.normal(2, 1, 100)
df['y'] = 4 + 0.5 * df['x'] + np.random.normal(0, 0.5, 100)
sns.scatterplot(x='x', y='y', data=df)
[Figure: scatter plot of y versus x]

In this example, the random error component \(u\) makes it so that \(y\) is no longer perfectly described by \(x\) and the parameters \(\alpha,\beta\).

Concept review: Create a new column of data called 'u'. Set 'u' equal to 'y' minus (4 plus one-half 'x'), then plot the distribution of 'u'.

df['u'] = df['y'] - 4 - .5*df['x']
sns.displot(df['u'], kde=True)
[Figure: distribution plot of u with kernel density estimate]

When performing analytics, we’ll be given data. Suppose that data looks like what’s described in df. We can summarize that data with descriptive statistics:

df.describe()
x y u
count 100.000000 100.000000 100.000000
mean 2.059808 5.070910 0.041006
std 1.012960 0.765313 0.519940
min -0.552990 3.120177 -1.111702
25% 1.356143 4.509295 -0.372715
50% 2.094096 5.037010 0.012327
75% 2.737077 5.754808 0.423740
max 4.269755 6.823602 1.191572

When all we have is the data, the values of \(\alpha\) and \(\beta\) are unknown to us. In this case, our mission is to use linear regression so that we have the best possible linear fit between \(y\) and \(x\).

Such a “line of best fit” is shown below.

sns.lmplot(x='x', y='y', data=df)
[Figure: scatter plot of y versus x with a fitted regression line]

This line is computed by seaborn using linear regression. But seaborn only fits the regression for the purpose of making the plot: it does not tell you what the slope coefficient \(\beta\) is estimated to be, nor does it report the estimated intercept \(\alpha\).

When parameters like \(\alpha\) and \(\beta\) are estimated, we usually put “hats” on them to denote that they are estimated values. Thus, the estimated relationship is given by

\[ \hat{y} = \hat{\alpha} + \hat{\beta} x \]

which tells us that, using the estimated values for \(\alpha\) and \(\beta\), one would estimate the value for \(y\) to be \(\hat{y}\), given the value \(x\).
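For instance, suppose the estimation procedure returned \(\hat{\alpha}=4.1\) and \(\hat{\beta}=0.45\) (hypothetical numbers, not the true \(\alpha=4\) and \(\beta=0.5\) used to simulate the data). Then for an observation with \(x=2\), the predicted value would be \(\hat{y} = 4.1 + 0.45 \times 2 = 5.0\).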

There are many commands that can be used to generate this linear regression fit, and those will be discussed in a little bit.
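As a preview, one commonly used option is the ols() function from the statsmodels module. The sketch below (an illustration, not necessarily the command used later in this chapter) fits the regression on df and reports the estimated intercept and slope.

import statsmodels.formula.api as smf

# Fit y on x by ordinary least squares using the simulated DataFrame df
results = smf.ols('y ~ x', data=df).fit()

# results.params holds the estimated intercept (alpha hat) and slope (beta hat)
results.params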

First, let’s consider the effect of an outlier on the sample.

Let’s hard-code a data error in the pandas DataFrame variable df by using the .loc accessor to change the y value for one of the rows.

df.loc[0,'y'] = df.loc[0,'y']*10

What we’ve done in the above line of code is increase the y value of the first observation by a factor of 10. This is a reasonable error to consider: with real (non-simulated) data, someone could easily hit an extra zero when entering rows of x and y values by hand.
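Before plotting, one quick way to scan for suspicious values is to list the largest entries of y; this is just an illustrative check, not a formal outlier test.

# The largest y values; an entry far above the rest suggests a possible data error
df['y'].nlargest(5)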

Now, let’s look at the distribution of y.

sns.displot(df['y'])
[Figure: distribution plot of y]

The outlier we’ve entered stands out clearly. We can verify that the outlier we’re seeing is in fact the first row we modified by typing:

df.iloc[0]
x     3.764052
y    68.236015
u     0.941575
Name: 0, dtype: float64

and noticing that the y value there is indeed the point we’re seeing as an outlier in the data.

What happens if we don’t correct for this issue when we perform an analytics task like linear regression? Note that the scatter plot likewise shows a big problem due to the outlier.

sns.scatterplot(x='x', y='y', data=df)
[Figure: scatter plot of y versus x, now including the outlier]

How can we expect Python to find a line of best fit through that scatter plot, when there is such an egregious outlier in the plot?

Look what happens in the regression:

sns.lmplot(x='x', y='y', data=df)
[Figure: scatter plot of y versus x with a fitted regression line, including the outlier]

While the change may be hard to see, note that the line of best fit now sits above all of the y values on the right-hand side of the scatter plot and dips below them on the left-hand side; in the earlier regression, this did not happen. The line of best fit now has a much steeper slope than before, because the outlier, which lies toward the right side of the x distribution and far above the rest of the y data, pulls the fitted line toward a steeper positive slope.
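One quick way to see this numerically is to compare the slope estimated with and without the outlier. The sketch below uses numpy's polyfit() for the comparison; this is just a convenient least-squares routine used here for illustration.

# Slope and intercept estimated from the data that still contains the outlier
slope_all, intercept_all = np.polyfit(df['x'], df['y'], 1)

# Slope and intercept estimated after dropping the outlying observation
clean = df[df['y'] < 20]
slope_clean, intercept_clean = np.polyfit(clean['x'], clean['y'], 1)

# The slope estimated with the outlier included should be noticeably steeper
print(slope_all, slope_clean)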

For now, we’ll remove the outlier on an ad hoc basis by simply filtering it out of the data before plotting.

sns.lmplot(x='x', y='y', data=df[df['y'] < 20])
[Figure: scatter plot and fitted regression line with the outlier removed]
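A slightly more systematic version of the same idea (an illustrative rule of thumb, not a formal outlier test) is to drop observations whose y value lies more than three standard deviations from the mean.

# Flag observations whose y value is more than three standard deviations from the mean of y
y_mean, y_std = df['y'].mean(), df['y'].std()
is_outlier = (df['y'] - y_mean).abs() > 3 * y_std

# Re-plot the regression using only the non-flagged observations
sns.lmplot(x='x', y='y', data=df[~is_outlier])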

For some real-world sample data, we will make use of the wooldridge module, which provides the datasets that accompany the introductory econometrics textbook by Wooldridge.

In Jupyter, we can install user-written modules by running a !pip command from a notebook cell.

!pip install wooldridge --user
Requirement already satisfied: wooldridge in /home/james/.local/lib/python3.8/site-packages (0.4.4)
Requirement already satisfied: pandas in /mnt/software/anaconda3/lib/python3.8/site-packages (from wooldridge) (1.2.2)
Requirement already satisfied: python-dateutil>=2.7.3 in /mnt/software/anaconda3/lib/python3.8/site-packages (from pandas->wooldridge) (2.8.1)
Requirement already satisfied: pytz>=2017.3 in /mnt/software/anaconda3/lib/python3.8/site-packages (from pandas->wooldridge) (2021.1)
Requirement already satisfied: numpy>=1.16.5 in /mnt/software/anaconda3/lib/python3.8/site-packages (from pandas->wooldridge) (1.19.2)
Requirement already satisfied: six>=1.5 in /mnt/software/anaconda3/lib/python3.8/site-packages (from python-dateutil>=2.7.3->pandas->wooldridge) (1.15.0)

Load the 'ceosal1' data and describe it.

import wooldridge as woo
ceos = woo.data('ceosal1')
ceos.describe()
salary pcsalary sales roe pcroe ros indus finance consprod utility lsalary lsales
count 209.000000 209.000000 209.000000 209.000000 209.000000 209.000000 209.000000 209.000000 209.000000 209.000000 209.000000 209.000000
mean 1281.119617 13.282297 6923.793282 17.184211 10.800478 61.803828 0.320574 0.220096 0.287081 0.172249 6.950386 8.292265
std 1372.345308 32.633921 10633.271088 8.518509 97.219399 68.177052 0.467818 0.415306 0.453486 0.378503 0.566374 1.013161
min 223.000000 -61.000000 175.199997 0.500000 -98.900002 -58.000000 0.000000 0.000000 0.000000 0.000000 5.407172 5.165928
25% 736.000000 -1.000000 2210.300049 12.400000 -21.200001 21.000000 0.000000 0.000000 0.000000 0.000000 6.601230 7.700883
50% 1039.000000 9.000000 3705.199951 15.500000 -3.000000 52.000000 0.000000 0.000000 0.000000 0.000000 6.946014 8.217492
75% 1407.000000 20.000000 7177.000000 20.000000 19.500000 81.000000 1.000000 0.000000 1.000000 0.000000 7.249215 8.878636
max 14822.000000 212.000000 97649.898438 56.299999 977.000000 418.000000 1.000000 1.000000 1.000000 1.000000 9.603868 11.489144

We will consider the relationship between return on equity ('roe') and CEO salary.

Concept check: Plot the line of best fit between 'salary' (y-axis) and 'roe' (x-axis).

sns.lmplot(x='roe', y='salary', data=ceos)
[Figure: scatter plot of salary versus roe with a fitted regression line]

Note the outliers. To “zoom in”, let’s ignore those data points.

sns.lmplot(x='roe', y='salary', data=ceos[ceos['salary'] < 4000])
[Figure: scatter plot of salary versus roe with a fitted regression line, excluding salaries above 4000]

There seems to be a positive relationship, as indicated by the upward-sloping line, but the fit is not good. Why? This is something that we will investigate further.
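One simple way to quantify how weak the fit is (a sketch, relying on the fact that for a one-regressor model the R-squared equals the squared correlation between the two variables):

# Squared correlation between roe and salary; for a simple regression this equals the R-squared of the fit
r = ceos['roe'].corr(ceos['salary'])
r**2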