Selecting Pieces of DataFrames

As stated above, a very common use case with DataFrames is to select a row or subset of rows that satisfy a propery, as we did with data[data['date'] == '2013-02-14'.

There are two other ways to refer to a piece of a DataFrame.

Option 1: refer to rows and columns by their numbering

For example, data.iloc[0] gives you the 0th row:

import pandas as pd
data = pd.read_csv('https://raw.githubusercontent.com/learning-fintech/data/master/stocks/all_stocks_5yr.csv.gz')

data.iloc[0]
date      2013-02-08
open           45.07
high           45.35
low             45.0
close          45.08
volume       1824755
Name               A
ret              NaN
Mkt             0.58
SMB             0.03
HML            -0.21
RMW             0.12
CMA            -0.45
RF               0.0
Name: 0, dtype: object

And likewise data.iloc[0:2] gives you rows 0 and 1.

data.iloc[0:2]
date open high low close volume Name ret Mkt SMB HML RMW CMA RF
0 2013-02-08 45.07 45.35 45.00 45.08 1824755 A NaN 0.58 0.03 -0.21 0.12 -0.45 0.0
1 2013-02-11 45.17 45.18 44.45 44.60 2915405 A -0.010648 -0.08 -0.03 0.39 -0.18 0.28 0.0

Here we see an example of a quirk like what we experienced with range(k,n) earlier. Recall that range(k,n) will give you the numbers: k, k+1, k+2, …, n-1. For example range(0,2) gives you: 0, 1 (but not 2!). Likewise, if we want to refer to a range of numbers within a list, set of rows in a dataframe, etc. we can use the k:n to get elements k through n-1. For example, suppose that we have a shopping list:

shopping_list = ['apples', 'bananas', 'peas']

and, since everybody hates peas, we want to exclude that from our list:

shopping_list[0:2]

would select only the fruits.

To get refer to column indicies rather than row numbers, we need to use the fact that DataFrames have a [row,column] organization to them. So if you want to get row 0, column 2, you would type

data.iloc[0:2,5]
0    1824755
1    2915405
Name: volume, dtype: int64

Note that if we want to select every column of row 0, using the command

data.iloc[0]

works because Python understand that the 0 refers to the row number, and by default it assumes that you would want all the columns. If instead you want to select every row of column 0, we need to do

data.iloc[:,0]

Here, in the row-specification, the statement : is a shortcut for saying “all the rows”. That is, if instead of writing k:n you instead just write :, Python understands this to mean that k should be 0 and n should be the biggest possible number that would make sense (i.e. the number of rows).

Option 2: refer to rows and columns by their names

Return once again to the first snapshot of this data:

data.head()
date open high low close volume Name ret Mkt SMB HML RMW CMA RF
0 2013-02-08 45.07 45.35 45.00 45.08 1824755 A NaN 0.58 0.03 -0.21 0.12 -0.45 0.0
1 2013-02-11 45.17 45.18 44.45 44.60 2915405 A -0.010648 -0.08 -0.03 0.39 -0.18 0.28 0.0
2 2013-02-12 44.81 44.95 44.50 44.62 2373731 A 0.000448 0.16 0.26 0.71 -0.54 0.45 0.0
3 2013-02-13 44.81 45.24 44.68 44.75 2052338 A 0.002913 0.14 0.08 -0.01 -0.01 -0.04 0.0
4 2013-02-14 44.72 44.78 44.36 44.58 3826245 A -0.003799 0.11 0.32 0.12 -0.30 -0.21 0.0

and observe that the columns have names: date, open, high, low, close, volume, and Name. Note that these names are case sensitive so that the last column can be accessed by typing ret['Name'] but not ret['name']. This dataset that we loaded does not have row names, so by default Pandas will give each row a name equal to its row numer: 0, 1, 2, etc.

Whereas ret.iloc[:,0] gave you all the data in column 0, we could instead get all the data from column date with ret.loc[:,'date']. The shortcut for this, since it is the usual way we slice data, is to type ret['date'].

The first date in the dataset can be accesed with row name 0 and column name date by typing:

data.loc[0, 'volume']
1824755