# DataFrames

Let’s continue with the example of stock return data, and now suppose that we want to run a CAPM equation to estimate a firm’s cost of equity. Recall that CAPM states

$$r_{i,t}-r_{f,t} = \beta_i(r_{m,t} - r_{f,t})$$

where $$r_{i,t}$$ is the stock return for firm $$i$$ at time $$t$$, $$r_{f,t}$$ is the risk-free rate at time $$t$$, and $$r_{m,t}$$ is the market return at time $$t$$. The difference $$r_{m,t}-r_{f,t}$$ is called the market risk premium.

For the sake of continuity, the daily stock return data will match that of the previous section, so that the keys in daily_dict correspond to $$\text{Date}$$ and the values in daily_dict correspond to $$r_{i,t}$$ in the table below. Now let’s add in information on the market return and the risk-free rate.

| Date | $$r_i$$ | $$r_m$$ | $$r_f$$ |
|---------|-------|-------|------|
| 7/8/19  | 0.03  | 0.02  | 0.01 |
| 7/9/19  | 0.01  | 0.01  | 0.01 |
| 7/10/19 | -0.02 | -0.03 | 0.01 |
| 7/11/19 | 0.01  | 0.02  | 0.01 |
| 7/12/19 | -0.01 | -0.03 | 0.01 |

The question becomes: how do we store this information in Python?

In order to estimate a CAPM equation (which is covered in the next chapter), we will need to be able to reference $$r_{i,t}$$, $$r_{m,t}$$, and $$r_{f,t}$$ individually, and tell Python what to do with each of these series of returns.

If we use lists to keep track of everything then we will have the following:

r_i = [0.03, 0.01, -0.02, 0.01, -0.01]
r_m = [0.02, 0.01, -0.03, 0.02, -0.03]
r_f = [0.01, 0.01, 0.01, 0.01, 0.01]
t = ['7/8/19', '7/9/19', '7/10/19', '7/11/19', '7/12/19']


This is bad practice for at least a couple of reasons. First, one has to be extremely careful that the data line up properly, so that r_i[k], r_m[k], and r_f[k] correspond to returns on the same date for every k. Suppose that there is some missing data for some firm $$j$$ such that r_j = [0.05, 0.02, 0.03, -0.04]. Without any additional information, it is impossible to track down which data point for firm $$j$$ is missing. Second, it’s difficult to reference returns at a particular date. Suppose that you want to know the market return on 7/11/19. You would first have to determine that t[k] = '7/11/19' when k = 3 and then look up r_m[3] to get the market return. Note that neither of these considerations implies that it is impossible to use lists to hold all the requisite data; it’s just inconvenient for the programmer. Remember, Python is supposed to make your life easier.
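To make the inconvenience concrete, here is a sketch of the two-step lookup described above, using the example lists from this section:

```python
r_m = [0.02, 0.01, -0.03, 0.02, -0.03]
t = ['7/8/19', '7/9/19', '7/10/19', '7/11/19', '7/12/19']

# Step 1: search the date list for the position k of the date we want
k = t.index('7/11/19')   # k is 3

# Step 2: use that position to index into the market return list
market_return = r_m[k]   # 0.02
```

Nothing here is difficult, but every lookup requires this two-step dance, and it silently breaks if any one list falls out of alignment with the others.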

Suppose instead that we use a dictionary based approach to hold all of the data. One way to do this is to construct a nested dictionary. Nested dictionaries look like:

nested_dict = {'finance':
                   {'verb': 'to supply with money or capital',
                    'noun': 'the management of revenues'},
               'rate':
                   {'verb': 'to estimate the value or worth of',
                    'noun': 'the amount of a charge or payment with reference to some basis of calculation'}
              }


Here there is a dictionary with keys 'finance' and 'rate'. The key 'finance' has a value which is itself a dictionary: one key named 'verb' which has value equal to the definition of “finance” used as a verb and another key named 'noun' which has value equal to the definition of “finance” used as a noun. Likewise the key 'rate' has a value which is itself a dictionary.

If one enters nested_dict['finance'], Python will return the dictionary {'verb':'to supply with money or capital', 'noun':'the management of revenues'}. Thus, to look up what the noun usage of “finance” is, one simply types nested_dict['finance']['noun'], which tells Python to look up the noun value to the dictionary value given by the key 'finance'.
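The two-level lookup can be verified directly (the dictionary is repeated here so the snippet is self-contained):

```python
nested_dict = {'finance':
                   {'verb': 'to supply with money or capital',
                    'noun': 'the management of revenues'},
               'rate':
                   {'verb': 'to estimate the value or worth of',
                    'noun': 'the amount of a charge or payment with reference to some basis of calculation'}
              }

# The first key selects the word, the second key selects the part of speech
definition = nested_dict['finance']['noun']
print(definition)   # the management of revenues
```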

In the context of the current problem, a nested dictionary might look like:

daily_returns = {'r_i': {'7/8/19': 0.03, '7/9/19': 0.01, '7/10/19': -0.02, '7/11/19': 0.01, '7/12/19': -0.01},
                 'r_m': {'7/8/19': 0.02, '7/9/19': 0.01, '7/10/19': -0.03, '7/11/19': 0.02, '7/12/19': -0.03},
                 'r_f': {'7/8/19': 0.01, '7/9/19': 0.01, '7/10/19': 0.01, '7/11/19': 0.01, '7/12/19': 0.01}}


The dictionary approach adds a convenience to the problem. Now, if one wants to look up the market return on 7/11/19 one simply has to type daily_returns['r_m']['7/11/19'].

Unfortunately, this approach still isn’t perfect. Remember, CAPM tells us that the excess return $$r_{i,t}-r_{f,t}$$ is defined by a linear function of the market risk premium $$r_{m,t}-r_{f,t}$$. Therefore, what we want to do is to be able to tell Python to calculate the excess return and the market risk premium. This is doable with a nested dictionary format, but it’s a little complicated. One would have to do something like the following (feel free to skip this code):

daily_returns['r_excess'] = {}
for t in daily_returns['r_f'].keys():
    daily_returns['r_excess'][t] = daily_returns['r_i'][t] - daily_returns['r_f'][t]


That is several lines of looping code just to add one new series. It means extra typing and plenty of room for typos. Nested dictionaries are not the recommended way to hold and manage data.

So what is the recommended way to hold data in Python? We need to use pandas.

Pandas is the standard module for working with datasets in Python, originally written by Wes McKinney while at AQR Capital Management. A module is a collection of user-defined functions, written by one or more people, and made available for everyone to use. The wacc() function from earlier in this chapter is an example of a user-defined function. When a set of useful and related functions is created, the author(s) of those functions sometimes choose to share them with the world to make Python an even more convenient language for everyone else.

Modules that are not part of the standard Python library need to be imported using an import statement like:

import pandas


In the above, the module pandas is imported. This means that the functions available under pandas are now available for you to use. If you do not import the pandas module, the functions inside of it are unknown to Python.

To reference the functions inside of the pandas module, we use pandas.<function_name>(). For example, pandas has a function called read_csv(), so we would use this function by typing pandas.read_csv().

There are some modules, like numpy and pandas, that are used so frequently in Python programming that they have unofficial “standard” abbreviations for importing the module, for our later convenience. Two examples are given below:

import pandas as pd
import numpy as np


Basically, import numpy as np tells Python that we want to import numpy, but give it a nickname of np so that we can refer to the module by that nickname.

Now, instead of having to type pandas.read_csv() as before, we simply refer to pandas by its nickname and type pd.read_csv(). It may not seem like a huge shortcut, but it’s very common.

We will use pandas for a lot of our work in this class. Note that tutorials and recipes (example pieces of code to perform fairly common tasks) are easy to find online.

The stock return data that we will use is included with this handout, so we will read the data from CoCalc. This highlights two important features of the read_csv() function. First, it can pull files from a website directly (as long as the file is formatted as a csv file). Second, it can read compressed files (note that the file name used here ends with .gz, a type of file compression). File compression is an important option because of the convenience it provides. The stock data we have here would be too large to host for free on GitHub if the file were not compressed. Moreover, other cloud-based storage solutions like Amazon S3 charge based on the amount of storage you use, so the ability to read compressed files means you spend less money on storage costs.

The read_csv() function returns a value. The type of that value is not a number, string, list, or dictionary, but rather a pandas-specific variable type called a DataFrame. In a moment, you’ll see what one looks like. DataFrames are essentially a nested dictionary type of structure, but with a ton of convenience features added in to make them easy to work with. For now, use the .read_csv() function and then tell Python that the returned value should be set to a variable named data.

data = pd.read_csv('https://raw.githubusercontent.com/learning-fintech/data/master/stocks/all_stocks_5yr.csv.gz')


The data variable is a pandas DataFrame. Just like string variables have the .upper() function to translate them into all upper case, and dictionary variables have the .keys() function to get the keys (words) in the dictionary, DataFrame variables have some special functions specific to them. One of these is called .head(), which prints out a snippet of the data file.

data.head()

date open high low close volume Name ret Mkt SMB HML RMW CMA RF
0 2013-02-08 45.07 45.35 45.00 45.08 1824755 A NaN 0.58 0.03 -0.21 0.12 -0.45 0.0
1 2013-02-11 45.17 45.18 44.45 44.60 2915405 A -0.010648 -0.08 -0.03 0.39 -0.18 0.28 0.0
2 2013-02-12 44.81 44.95 44.50 44.62 2373731 A 0.000448 0.16 0.26 0.71 -0.54 0.45 0.0
3 2013-02-13 44.81 45.24 44.68 44.75 2052338 A 0.002913 0.14 0.08 -0.01 -0.01 -0.04 0.0
4 2013-02-14 44.72 44.78 44.36 44.58 3826245 A -0.003799 0.11 0.32 0.12 -0.30 -0.21 0.0

In the printout, names are marked in bold font. The csv file has column headers ['date', 'open', 'high', 'low', 'close', 'volume', 'Name', 'ret', 'Mkt', 'SMB', 'HML', 'RMW', 'CMA', 'RF'] in the first row of the file (this would be row 1 if we opened the csv file in Excel). The pandas function read_csv() by default will take this first row to be the names of the columns. This allows us to reference the column of data:

data['close']

0         45.08
1         44.60
2         44.62
3         44.75
4         44.58
...
619035    77.82
619036    76.78
619037    73.83
619038    73.27
619039    73.86
Name: close, Length: 619040, dtype: float64


Here, Python smartly prints out only a small subset of the 'close' column of data. The first five rows (row numbers 0-4) are printed, as well as the last five rows (row numbers 619035-619039).

Be careful, names are given as strings. Python is fine with data['close'], but if you type data[close] then Python will think you’re trying to use a variable close to store the column name. For example if one were to type

close = 'open'
data[close]


then Python would give back the data in column 'open'.

Note that there are bold numbers 0-4 running down the left hand side of the .head() printout. This is because the first row of data (row 2, in Excel) is read in and by default given the name 0 by Python (because Python starts counting at zero). Hence, we could refer to a row based on this name, just like we do with a column name. Note that if you type data[0], Python will get confused: the name inside square brackets is, according to pandas’ rules for DataFrame variables, expected to be a column name and not a row name. For the moment, let’s skip over how to refer to the first row of data by using the name 0 printed on the left hand side of the printout above.
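For curious readers, a brief preview of label-based row lookup, which pandas provides through the .loc accessor (a tiny stand-in DataFrame is used here rather than the full dataset):

```python
import pandas as pd

# A small stand-in for `data`, with the same first three closing prices
df = pd.DataFrame({'close': [45.08, 44.60, 44.62]})

# df[0] would raise a KeyError, because square brackets expect a column name.
# Row lookup by name (the bold number on the left) uses .loc instead:
first_row = df.loc[0]
print(first_row['close'])   # 45.08
```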

A very typical use case with DataFrames is to look up a row (or set of rows) based on a value stored in one of the columns of the DataFrame. For example, we can select a row based on the value for date. This would be achieved by entering:

data[ data['date'] == '2013-02-14' ]

date open high low close volume Name ret Mkt SMB HML RMW CMA RF
4 2013-02-14 44.7200 44.7800 44.3600 44.5800 3826245 A -0.003799 0.11 0.32 0.12 -0.3 -0.21 0.0
1263 2013-02-14 14.9400 14.9600 13.1600 13.9900 31879900 AAL -0.045703 0.11 0.32 0.12 -0.3 -0.21 0.0
2522 2013-02-14 78.6600 79.7200 78.5850 78.8400 1005376 AAP -0.001646 0.11 0.32 0.12 -0.3 -0.21 0.0
3781 2013-02-14 66.3599 67.3771 66.2885 66.6556 88809154 AAPL -0.000899 0.11 0.32 0.12 -0.3 -0.21 0.0
5040 2013-02-14 35.0500 36.7300 35.0100 36.5700 7570052 ABBV 0.036859 0.11 0.32 0.12 -0.3 -0.21 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
612749 2013-02-14 28.2200 28.6150 28.2050 28.4700 927050 XYL 0.007431 0.11 0.32 0.12 -0.3 -0.21 0.0
614008 2013-02-14 64.2700 64.5360 63.7700 63.8900 3949356 YUM -0.008073 0.11 0.32 0.12 -0.3 -0.21 0.0
615267 2013-02-14 75.8600 76.4000 75.7500 76.3400 860104 ZBH 0.004474 0.11 0.32 0.12 -0.3 -0.21 0.0
616526 2013-02-14 24.6800 24.7850 24.5100 24.6300 2789414 ZION -0.004446 0.11 0.32 0.12 -0.3 -0.21 0.0
617785 2013-02-14 33.3900 33.4500 32.7700 33.2700 2954324 ZTS -0.008346 0.11 0.32 0.12 -0.3 -0.21 0.0

476 rows × 14 columns

At the bottom of this truncated printout, Python informs us that there are 476 rows of data. That is, we have 476 different stocks with return data for February 14, 2013.

Let’s break down what the line of code does:

$$\color{red}{\texttt{data[}}\color{blue}{\texttt{data['date']}}\color{black}{\textbf{==}\texttt{'2013-02-14'}}\color{red}{\texttt{]}}$$

Work from the inside (stuff inside the red text) out.

• $$\color{blue}{\texttt{data['date']}}$$: look up column of data named date

• $$\textbf{==}\texttt{'2013-02-14'}$$: check whether each row in the column named date equals '2013-02-14'

Thus, $$\color{blue}{\texttt{data['date']}}\color{black}{==\texttt{'2013-02-14'}}$$ compares each row in date to the value '2013-02-14'. Rows where the value in date matches this string are marked True, while rows that do not match this string are marked False. Suppose that this list of rows looks like $$\texttt{[True, False, False, True, ...]}$$.

Then, $$\color{red}{\texttt{data[}}\color{black}{\texttt{[True, False, False, True, ...]}}\color{red}{\texttt{]}}$$ would select the rows that are marked True and ignore the rows marked False.

As we saw earlier, we can test whether the date in a given row matches our day of interest with the Python operator ==. Recall:

x = 1
y = 2


if we want to test whether x is equal to y, we should not use

x = y


because Python will interpret this as setting x equal to whatever value y is. We should instead use

x == y


which in this case will return False, because x is currently set equal to 1 and y is set equal to 2.

Therefore, if we issue the command:

data['date'] == '2013-02-14'


what Python will do is check each row for whether the date in that row is equal to the value '2013-02-14'. Again, be careful about quotations!

• data['date'] == '2013-02-14' : Python checks for whether date is equal to the string value '2013-02-14'

• data['date'] == 2013-02-14 : Python checks for whether date is equal to the arithmetic expression $$2013-02-14$$, and ends up confused because 02 (with its leading zero) is not a valid number and thus not something that can be subtracted from 2013.

data['date'] == '2013-02-14'

0         False
1         False
2         False
3         False
4          True
...
619035    False
619036    False
619037    False
619038    False
619039    False
Name: date, Length: 619040, dtype: bool


Given this list of True and False values, we select from data the rows marked True. If we take our DataFrame $$\color{red}{\texttt{data}}$$ and give it a list of rows inside square brackets $$\color{red}{\textbf{[}\text{ }\textbf{]}}$$, Python will only use rows marked True and will exclude rows marked False.
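These two steps can be demonstrated end to end with a toy DataFrame (the values below are made up for illustration; only the shapes of the steps matter):

```python
import pandas as pd

toy = pd.DataFrame({'date': ['2013-02-13', '2013-02-14', '2013-02-14'],
                    'ret':  [0.002913, -0.003799, -0.045703]})

# Step 1: the comparison produces one True/False per row
mask = toy['date'] == '2013-02-14'
print(mask.tolist())   # [False, True, True]

# Step 2: indexing the DataFrame with that mask keeps only the True rows
matches = toy[mask]
print(len(matches))    # 2
```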

Properties of the DataFrame variable type will be introduced as they become necessary. For now, return to the problem of calculating an excess return and a market risk premium. Recall that the nested dictionary solution to this problem was tedious. In contrast, the DataFrame solution is to simply type:

data['r_excess'] = data['ret'] - data['RF']


Short, simple, and easily readable. To verify that the solution is correct, print the data:

data.head()

date open high low close volume Name ret Mkt SMB HML RMW CMA RF r_excess
0 2013-02-08 45.07 45.35 45.00 45.08 1824755 A NaN 0.58 0.03 -0.21 0.12 -0.45 0.0 NaN
1 2013-02-11 45.17 45.18 44.45 44.60 2915405 A -0.010648 -0.08 -0.03 0.39 -0.18 0.28 0.0 -0.010648
2 2013-02-12 44.81 44.95 44.50 44.62 2373731 A 0.000448 0.16 0.26 0.71 -0.54 0.45 0.0 0.000448
3 2013-02-13 44.81 45.24 44.68 44.75 2052338 A 0.002913 0.14 0.08 -0.01 -0.01 -0.04 0.0 0.002913
4 2013-02-14 44.72 44.78 44.36 44.58 3826245 A -0.003799 0.11 0.32 0.12 -0.30 -0.21 0.0 -0.003799
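The market risk premium from the CAPM equation can be computed the same way, as one more vectorized subtraction. A sketch with a tiny stand-in DataFrame (two rows of values copied from the printout above; the column name 'mrp' is our choice for illustration, not part of the dataset):

```python
import pandas as pd

df = pd.DataFrame({'ret': [-0.010648, 0.000448],
                   'Mkt': [-0.08, 0.16],
                   'RF':  [0.0, 0.0]})

# Excess return r_i - r_f and market risk premium r_m - r_f,
# each computed for every row at once
df['r_excess'] = df['ret'] - df['RF']
df['mrp'] = df['Mkt'] - df['RF']   # 'mrp' is a hypothetical column name
```

Compare this to the nested-dictionary loop from earlier in the section: each new series is one line, with no explicit iteration over dates.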