Selecting Pieces of DataFrames¶
As stated above, a very common use case with DataFrames is to select a row or subset of rows that satisfy a propery, as we did with
data[data['date'] == '2013-02-14'.
There are two other ways to refer to a piece of a DataFrame.
Option 1: refer to rows and columns by their numbering
data.iloc gives you the 0th row:
import pandas as pd data = pd.read_csv('https://raw.githubusercontent.com/learning-fintech/data/master/stocks/all_stocks_5yr.csv.gz') data.iloc
date 2013-02-08 open 45.07 high 45.35 low 45.0 close 45.08 volume 1824755 Name A ret NaN Mkt 0.58 SMB 0.03 HML -0.21 RMW 0.12 CMA -0.45 RF 0.0 Name: 0, dtype: object
data.iloc[0:2] gives you rows 0 and 1.
Here we see an example of a quirk like what we experienced with
range(k,n) earlier. Recall that
range(k,n) will give you the numbers: k, k+1, k+2, …, n-1. For example
range(0,2) gives you: 0, 1 (but not 2!). Likewise, if we want to refer to a range of numbers within a list, set of rows in a dataframe, etc. we can use the
k:n to get elements k through n-1. For example, suppose that we have a shopping list:
shopping_list = ['apples', 'bananas', 'peas']
and, since everybody hates peas, we want to exclude that from our list:
would select only the fruits.
To get refer to column indicies rather than row numbers, we need to use the fact that DataFrames have a [row,column] organization to them. So if you want to get row 0, column 2, you would type
0 1824755 1 2915405 Name: volume, dtype: int64
Note that if we want to select every column of row 0, using the command
works because Python understand that the 0 refers to the row number, and by default it assumes that you would want all the columns. If instead you want to select every row of column 0, we need to do
Here, in the row-specification, the statement
: is a shortcut for saying “all the rows”. That is, if instead of writing
k:n you instead just write
:, Python understands this to mean that
k should be 0 and
n should be the biggest possible number that would make sense (i.e. the number of rows).
Option 2: refer to rows and columns by their names
Return once again to the first snapshot of this data:
and observe that the columns have names:
Name. Note that these names are case sensitive so that the last column can be accessed by typing
ret['Name'] but not
ret['name']. This dataset that we loaded does not have row names, so by default Pandas will give each row a name equal to its row numer: 0, 1, 2, etc.
ret.iloc[:,0] gave you all the data in column 0, we could instead get all the data from column
ret.loc[:,'date']. The shortcut for this, since it is the usual way we slice data, is to type
The first date in the dataset can be accesed with row name
0 and column name
date by typing: