Selecting Pieces of DataFrames¶
As stated above, a very common use case with DataFrames is to select a row or subset of rows that satisfy a propery, as we did with data[data['date'] == '2013-02-14'
.
There are two other ways to refer to a piece of a DataFrame.
Option 1: refer to rows and columns by their numbering
For example, data.iloc[0]
gives you the 0th row:
import pandas as pd
data = pd.read_csv('https://raw.githubusercontent.com/learning-fintech/data/master/stocks/all_stocks_5yr.csv.gz')
data.iloc[0]
date 2013-02-08
open 45.07
high 45.35
low 45.0
close 45.08
volume 1824755
Name A
ret NaN
Mkt 0.58
SMB 0.03
HML -0.21
RMW 0.12
CMA -0.45
RF 0.0
Name: 0, dtype: object
And likewise data.iloc[0:2]
gives you rows 0 and 1.
data.iloc[0:2]
date | open | high | low | close | volume | Name | ret | Mkt | SMB | HML | RMW | CMA | RF | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2013-02-08 | 45.07 | 45.35 | 45.00 | 45.08 | 1824755 | A | NaN | 0.58 | 0.03 | -0.21 | 0.12 | -0.45 | 0.0 |
1 | 2013-02-11 | 45.17 | 45.18 | 44.45 | 44.60 | 2915405 | A | -0.010648 | -0.08 | -0.03 | 0.39 | -0.18 | 0.28 | 0.0 |
Here we see an example of a quirk like what we experienced with range(k,n)
earlier. Recall that range(k,n)
will give you the numbers: k, k+1, k+2, …, n-1. For example range(0,2)
gives you: 0, 1 (but not 2!). Likewise, if we want to refer to a range of numbers within a list, set of rows in a dataframe, etc. we can use the k:n
to get elements k through n-1. For example, suppose that we have a shopping list:
shopping_list = ['apples', 'bananas', 'peas']
and, since everybody hates peas, we want to exclude that from our list:
shopping_list[0:2]
would select only the fruits.
To get refer to column indicies rather than row numbers, we need to use the fact that DataFrames have a [row,column] organization to them. So if you want to get row 0, column 2, you would type
data.iloc[0:2,5]
0 1824755
1 2915405
Name: volume, dtype: int64
Note that if we want to select every column of row 0, using the command
data.iloc[0]
works because Python understand that the 0 refers to the row number, and by default it assumes that you would want all the columns. If instead you want to select every row of column 0, we need to do
data.iloc[:,0]
Here, in the row-specification, the statement :
is a shortcut for saying “all the rows”. That is, if instead of writing k:n
you instead just write :
, Python understands this to mean that k
should be 0 and n
should be the biggest possible number that would make sense (i.e. the number of rows).
Option 2: refer to rows and columns by their names
Return once again to the first snapshot of this data:
data.head()
date | open | high | low | close | volume | Name | ret | Mkt | SMB | HML | RMW | CMA | RF | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2013-02-08 | 45.07 | 45.35 | 45.00 | 45.08 | 1824755 | A | NaN | 0.58 | 0.03 | -0.21 | 0.12 | -0.45 | 0.0 |
1 | 2013-02-11 | 45.17 | 45.18 | 44.45 | 44.60 | 2915405 | A | -0.010648 | -0.08 | -0.03 | 0.39 | -0.18 | 0.28 | 0.0 |
2 | 2013-02-12 | 44.81 | 44.95 | 44.50 | 44.62 | 2373731 | A | 0.000448 | 0.16 | 0.26 | 0.71 | -0.54 | 0.45 | 0.0 |
3 | 2013-02-13 | 44.81 | 45.24 | 44.68 | 44.75 | 2052338 | A | 0.002913 | 0.14 | 0.08 | -0.01 | -0.01 | -0.04 | 0.0 |
4 | 2013-02-14 | 44.72 | 44.78 | 44.36 | 44.58 | 3826245 | A | -0.003799 | 0.11 | 0.32 | 0.12 | -0.30 | -0.21 | 0.0 |
and observe that the columns have names: date
, open
, high
, low
, close
, volume
, and Name
. Note that these names are case sensitive so that the last column can be accessed by typing ret['Name']
but not ret['name']
. This dataset that we loaded does not have row names, so by default Pandas will give each row a name equal to its row numer: 0, 1, 2, etc.
Whereas ret.iloc[:,0]
gave you all the data in column 0, we could instead get all the data from column date
with ret.loc[:,'date']
. The shortcut for this, since it is the usual way we slice data, is to type ret['date']
.
The first date in the dataset can be accesed with row name 0
and column name date
by typing:
data.loc[0, 'volume']
1824755