American Community Survey¶
The American Community Survey (ACS) is the largest survey conducted by the U.S. Census Bureau. It collects basic demographic information about people living across the United States.
The survey is conducted every year and, due to processing time, is released with a lag. For instance, the 2018 survey data is released in the latter part of 2019. The ACS data is released not only in one-year estimates (e.g. 2018 survey data) but also in five-year estimates (e.g. 2014-2018 survey data). The one-year estimates are, naturally, more recent. However, five-year estimates may be necessary in some applications: due to anonymity concerns, the Census Bureau does not release survey information for small populations every year. Rather, some information is available exclusively in the five-year survey data (which averages over a five-year period).
A brief overview of the data available through the ACS is available here. Note that the ACS data includes things like age, marital status, income, employment status, and educational attainment.
The full list of ACS data is available here.
Register for an API key with the U.S. Census Bureau here. This step is required to continue with the lecture notes!
To keep my key secret, I’ve pickled the string that stores it, and will re-load it here.
import pickle
with open('../pickle_jar/census.p', 'rb') as f:
api_key = pickle.load(f)
print(api_key[0:5])
6efdf
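If you are following along, you can create the same pickle file once with your own key. A minimal one-time setup sketch, assuming the same '../pickle_jar/census.p' path loaded above (any path works so long as the load step matches):
import pickle

api_key = 'PASTE-YOUR-KEY-HERE'  # placeholder; use the key emailed to you by the Census Bureau
with open('../pickle_jar/census.p', 'wb') as f:
    pickle.dump(api_key, f)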
The formula to get data from the ACS is:
'https://api.census.gov/data/'
+ <year>
+ '/acs/acs1?get=NAME,'
+ <variable name>
+ '&for='
+ <geography>
+ ':*&key='
+ <API key>
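To avoid retyping this concatenation, we could wrap the pattern in a small helper function. This is just a convenience sketch of our own (acs_url is not part of the Census API):
def acs_url(year, variable, geography, key):
    # Assemble the ACS one-year endpoint URL from its pieces
    return ('https://api.census.gov/data/' + str(year)
            + '/acs/acs1?get=NAME,' + variable
            + '&for=' + geography + ':*&key=' + key)
# e.g. acs_url(2010, 'B19013_001E', 'us', api_key)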
For instance, to get median household income (which has <variable name> = B19013_001E) for <year> = 2010 over the entire U.S. (<geography> = us), we would use the string
'https://api.census.gov/data/' + '2010' + '/acs/acs1?get=NAME,' + 'B19013_001E' + '&for=' + 'us' + ':*&key=' + api_key
as our URL. An example request is shown below:
NOTE: Starting Fall 2021, we began to encounter issues with students receiving API keys from the Census Bureau that do not appear to work correctly. If the code below does not work for you, please replace it with the following Python code:
import requests
r = requests.get('https://raw.githubusercontent.com/learning-fintech/data/main/census/census1.json').json()
print(r)
import requests
r = requests.get('https://api.census.gov/data/2018/acs/acs1?get=NAME,B19013_001E&for=us:*&key=' + api_key).json()
print(r)
[['NAME', 'B19013_001E', 'us'], ['United States', '61937', '1']]
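Each element of the response is a list of strings, so to extract the income figure itself we index into the data row and convert the string (a quick illustration; median_income is our own variable name):
# r[0] is the header row; r[1] is the data row, with the income value in position 1
median_income = int(r[1][1])
print(median_income)  # 61937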
Likewise, to get data for every county in the country, we would replace us with county following the for= piece of the string. This will return data on the many counties for which median household income data is available via the ACS. Instead of printing them all, below we print just a sample.
NOTE: Starting Fall 2021, we began to encounter issues with students receiving API keys from the Census Bureau that do not appear to work correctly. If the code below does not work for you, please replace it with the following Python code:
import requests
r = requests.get('https://raw.githubusercontent.com/learning-fintech/data/main/census/census2.json').json()
print(r)
r = requests.get('https://api.census.gov/data/2018/acs/acs1?get=NAME,B19013_001E&for=county:*&key=' + api_key).json()
print(r[0:10])
[['NAME', 'B19013_001E', 'state', 'county'], ['Baldwin County, Alabama', '56813', '01', '003'], ['Calhoun County, Alabama', '45818', '01', '015'], ['Cullman County, Alabama', '44612', '01', '043'], ['DeKalb County, Alabama', '36998', '01', '049'], ['Elmore County, Alabama', '60796', '01', '051'], ['Etowah County, Alabama', '45868', '01', '055'], ['Houston County, Alabama', '48105', '01', '069'], ['Jefferson County, Alabama', '55206', '01', '073'], ['Lauderdale County, Alabama', '49014', '01', '077']]
Note that the first item in this list is:
r[0]
['NAME', 'B19013_001E', 'state', 'county']
which is simply a set of headers. That is, for each additional item in the list, the data in position 0 is the county name and the data in position 1 corresponds to the value for the variable B19013_001E. The numbers in positions 2 and 3 are the county's FIPS code. States have 2-digit FIPS codes, and counties have 3-digit FIPS codes. If we put the state code and county code together into a 5-digit number (state code then county code), we have a unique identifier for the county. This FIPS code is used by many different data providers.
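As a quick illustration of how the two codes combine into one identifier (the variable names here are our own):
state_fips = '01'    # Alabama's 2-digit state FIPS code
county_fips = '003'  # Baldwin County's 3-digit county FIPS code
print(state_fips + county_fips)  # '01003' uniquely identifies Baldwin County, Alabama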
Recall that we can remove an element from a list with the .pop() function. For instance, let's remove the first element of our requested ACS data (the set of headers).
headers = r.pop(0) # delete item 0 from the list and simultaneously store it in the variable "headers"
Do not run the above line of code multiple times! Python will continue popping out elements of your list.
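If you are worried about accidentally re-running the cell, a non-destructive alternative is to slice rather than pop. This is a sketch of our own (the notes below continue with the .pop() approach):
# Read the header and the remaining rows without mutating r
headers = r[0]
rows = r[1:]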
The value for headers is thus:
print(headers)
['NAME', 'B19013_001E', 'state', 'county']
as expected. Now, the requested ACS data no longer has that element. Rather, what is left is
print(r[0:3])
[['Baldwin County, Alabama', '56813', '01', '003'], ['Calhoun County, Alabama', '45818', '01', '015'], ['Cullman County, Alabama', '44612', '01', '043']]
simply data. This is useful, because we can now store all of the data we have in a DataFrame. The command to do this is pd.DataFrame(). This function takes two arguments. First, Python expects to receive a list of lists, corresponding to the data that we want to store: a list of rows of data, where each row may be a list of multiple values (multiple columns). The second argument is a list of column names.
import pandas as pd
census = pd.DataFrame(r, columns=headers)
census.head()
 | NAME | B19013_001E | state | county
---|---|---|---|---
0 | Baldwin County, Alabama | 56813 | 01 | 003
1 | Calhoun County, Alabama | 45818 | 01 | 015
2 | Cullman County, Alabama | 44612 | 01 | 043
3 | DeKalb County, Alabama | 36998 | 01 | 049
4 | Elmore County, Alabama | 60796 | 01 | 051
Note that the state and county variables display with leading zeroes. That is, a number 1 is printed out as 01. This is a telltale sign that these columns are strings.
We will want these columns to be integers. The reason why will become apparent once we load another dataset. However, it's convenient for us to convert the data type now. Remember that we could do int('01') to convert one string to one integer. In pandas, we can use the .astype() function on a column of string data to convert all rows of that column.
census['state'] = census['state'].astype(int)
census['county'] = census['county'].astype(int)
We can check the variable types formally with the .dtypes command.
census.dtypes
NAME object
B19013_001E object
state int64
county int64
dtype: object
Oh no! It looks like B19013_001E (the Census variable for median household income) is also a string (which pandas lumps into a group called object). So, we'll want to convert median household income to an integer as well. This will allow us to use it in a regression later.
census['B19013_001E'] = census['B19013_001E'].astype(int)
Zillow Home Values¶
County median incomes give a sense of the economic well-being of a geographic area. Suppose that our interest is in how home prices (for most households, the most valuable asset the household owns) correlate with income.
Housing data is available from Zillow here.
The URL for data on all home prices in a county (single-family residential and condos) is something like 'https://files.zillowstatic.com/research/public_csvs/zhvi/County_zhvi_uc_sfrcondo_tier_0.33_0.67_sm_sa_month.csv'. You can find these links from the page linked above. Pandas can read online CSV files directly into Python. The precise link from Zillow changes as they update their data, and old links can die over time. To avoid link rot, a copy of the Zillow data to use is made available at:
zhvi = pd.read_csv('https://raw.githubusercontent.com/learning-fintech/data/main/zhvi/County_zhvi_uc_sfrcondo_tier_0.33_0.67_sm_sa_month.csv')
The .shape property of a Pandas DataFrame reports the number of rows and columns.
zhvi.shape
(2822, 269)
There are a lot of columns in this file! Recall that the .columns property of a DataFrame returns a list of columns. Recall too that we can subset a list named x to items m through n via x[m:n]. Thus, print out only the first 20 columns of the DataFrame below (to get a sense of what's included in this DataFrame).
zhvi.columns[0:20]
Index(['RegionID', 'SizeRank', 'RegionName', 'RegionType', 'StateName',
'State', 'Metro', 'StateCodeFIPS', 'MunicipalCodeFIPS', '2000-01-31',
'2000-02-29', '2000-03-31', '2000-04-30', '2000-05-31', '2000-06-30',
'2000-07-31', '2000-08-31', '2000-09-30', '2000-10-31', '2000-11-30'],
dtype='object')
Likewise, the last n items of list x are accessible via x[-n:].
zhvi.columns[-20:]
Index(['2020-01-31', '2020-02-29', '2020-03-31', '2020-04-30', '2020-05-31',
'2020-06-30', '2020-07-31', '2020-08-31', '2020-09-30', '2020-10-31',
'2020-11-30', '2020-12-31', '2021-01-31', '2021-02-28', '2021-03-31',
'2021-04-30', '2021-05-31', '2021-06-30', '2021-07-31', '2021-08-31'],
dtype='object')
Given what we observe about the data, define a list of column names to keep. All other columns outside of this list will be removed.
keep_list = [col for col in zhvi.columns[0:9]] + ['2018-12-31']
print(keep_list)
['RegionID', 'SizeRank', 'RegionName', 'RegionType', 'StateName', 'State', 'Metro', 'StateCodeFIPS', 'MunicipalCodeFIPS', '2018-12-31']
Tell Python to only keep those columns.
zhvi = zhvi[keep_list]
To inspect the data, print out a head.
zhvi.head()
 | RegionID | SizeRank | RegionName | RegionType | StateName | State | Metro | StateCodeFIPS | MunicipalCodeFIPS | 2018-12-31
---|---|---|---|---|---|---|---|---|---|---
0 | 3101 | 0 | Los Angeles County | County | CA | CA | Los Angeles-Long Beach-Anaheim | 6 | 37 | 633036.0
1 | 139 | 1 | Cook County | County | IL | IL | Chicago-Naperville-Elgin | 17 | 31 | 256785.0
2 | 1090 | 2 | Harris County | County | TX | TX | Houston-The Woodlands-Sugar Land | 48 | 201 | 197363.0
3 | 2402 | 3 | Maricopa County | County | AZ | AZ | Phoenix-Mesa-Scottsdale | 4 | 13 | 273091.0
4 | 2841 | 4 | San Diego County | County | CA | CA | San Diego-Carlsbad | 6 | 73 | 593304.0
We will combine datasets using state and county codes. Note that in the Zillow data, state and county codes were imported as numbers. We can check that below. The columns of interest are StateCodeFIPS and MunicipalCodeFIPS.
zhvi.dtypes
RegionID int64
SizeRank int64
RegionName object
RegionType object
StateName object
State object
Metro object
StateCodeFIPS int64
MunicipalCodeFIPS int64
2018-12-31 float64
dtype: object
Now we can merge. The format for merging df1 and df2 is to use df1.merge(df2, left_on=, right_on=). The left_on and right_on arguments specify the variables on the left (df1) and right (df2) datasets that link the datasets together. Note that df1 is the left dataset because it appears before .merge, whereas df2 is the right dataset because it appears after .merge. It is also possible to use df2.merge(df1, left_on=, right_on=). In this scenario, the lists passed to left_on and right_on would flip from the earlier usage (since now df2 is the left dataset). A sketch of this flipped call appears just after the merge below.
df = zhvi.merge(census, left_on=['StateCodeFIPS', 'MunicipalCodeFIPS'], right_on=['state', 'county'])
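As promised, here is a sketch of the flipped version (df_alt is our own name; it matches the same rows, with the census columns simply appearing first):
# Equivalent merge with census as the left dataset and zhvi as the right
df_alt = census.merge(zhvi, left_on=['state', 'county'],
                      right_on=['StateCodeFIPS', 'MunicipalCodeFIPS'])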
Print the head to see a snapshot of the data.
df.head()
 | RegionID | SizeRank | RegionName | RegionType | StateName | State | Metro | StateCodeFIPS | MunicipalCodeFIPS | 2018-12-31 | NAME | B19013_001E | state | county
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 3101 | 0 | Los Angeles County | County | CA | CA | Los Angeles-Long Beach-Anaheim | 6 | 37 | 633036.0 | Los Angeles County, California | 68093 | 6 | 37
1 | 139 | 1 | Cook County | County | IL | IL | Chicago-Naperville-Elgin | 17 | 31 | 256785.0 | Cook County, Illinois | 63353 | 17 | 31
2 | 1090 | 2 | Harris County | County | TX | TX | Houston-The Woodlands-Sugar Land | 48 | 201 | 197363.0 | Harris County, Texas | 60232 | 48 | 201
3 | 2402 | 3 | Maricopa County | County | AZ | AZ | Phoenix-Mesa-Scottsdale | 4 | 13 | 273091.0 | Maricopa County, Arizona | 65252 | 4 | 13
4 | 2841 | 4 | San Diego County | County | CA | CA | San Diego-Carlsbad | 6 | 73 | 593304.0 | San Diego County, California | 79079 | 6 | 73
This merged dataset has both county median income and an index of home values in the area.
We want to study median income and home prices, but, let's be honest, the variable name B19013_001E is awful. It's tedious to type out, and it's not particularly easy to remember. The pandas module has a tool to rename columns, so let's use that. The syntax is <dataframe name>.rename(columns={<old name>: <new name>}). This does not update the dataset in place with the new column name. Rather, the .rename() function returns a copy of the dataset with the new name. So, you need to do
<dataframe name> = <dataframe name>.rename(columns={<old name>: <new name>})
to have the change take effect.
df = df.rename(columns={'B19013_001E': 'med_inc', '2018-12-31': 'home_value'})
Are income and home prices correlated? We can quickly visualize the question with seaborn.
import seaborn as sns
sns.lmplot(x='med_inc', y='home_value', data=df)
<seaborn.axisgrid.FacetGrid at 0x7f7790d79e80>

There do appear to be some outliers here. To check, try a boxplot.
sns.boxplot(x=df['home_value'])
<AxesSubplot:xlabel='home_value'>

Additionally, print out a description of the home value index data.
df['home_value'].describe()
count 8.180000e+02
mean 2.349062e+05
std 1.489023e+05
min 5.683700e+04
25% 1.465330e+05
50% 1.970455e+05
75% 2.733452e+05
max 1.431949e+06
Name: home_value, dtype: float64
One method of removing outliers is to ignore (i.e. delete) the data that is “far out” on the boxplot (i.e., well beyond the whiskers).
The numpy module has a nanquantile() function to retrieve the 25th and 75th percentiles, as reported above in the summary statistics.
import numpy as np
quantiles = np.nanquantile(df['home_value'], q=[.25, .75])
print(quantiles)
[146533. 273345.25]
The interquartile range (IQR) is defined as the difference between the 75th and 25th percentiles of data. One reasonable way to remove outliers is to eliminate observations that exceed the 75th percentile by more than 1.5*IQR or are less than the 25th percentile by more than 1.5*IQR.
iqr = quantiles[1] - quantiles[0]
minidf = df[ (df['home_value'] < quantiles[1]+1.5*iqr) & (df['home_value'] > quantiles[0]-1.5*iqr) ].copy()
sns.lmplot(x='med_inc', y='home_value', data=minidf)
<seaborn.axisgrid.FacetGrid at 0x7f778b6c0f40>

import statsmodels.formula.api as smf
reg = smf.ols(formula='home_value ~ med_inc', data=minidf)
res = reg.fit(cov_type='HC3')
print(res.summary())
OLS Regression Results
==============================================================================
Dep. Variable: home_value R-squared: 0.501
Model: OLS Adj. R-squared: 0.500
Method: Least Squares F-statistic: 898.7
Date: Mon, 06 Dec 2021 Prob (F-statistic): 2.71e-131
Time: 12:06:12 Log-Likelihood: -9537.5
No. Observations: 769 AIC: 1.908e+04
Df Residuals: 767 BIC: 1.909e+04
Df Model: 1
Covariance Type: HC3
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
Intercept -3.785e+04 8162.170 -4.637 0.000 -5.38e+04 -2.18e+04
med_inc 4.0613 0.135 29.978 0.000 3.796 4.327
==============================================================================
Omnibus: 180.637 Durbin-Watson: 1.939
Prob(Omnibus): 0.000 Jarque-Bera (JB): 365.267
Skew: 1.323 Prob(JB): 4.82e-80
Kurtosis: 5.097 Cond. No. 2.65e+05
==============================================================================
Notes:
[1] Standard Errors are heteroscedasticity robust (HC3)
[2] The condition number is large, 2.65e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
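If you want the estimates as numbers rather than reading them off the summary table, the fitted results object exposes them directly. A small follow-up sketch using standard statsmodels results attributes:
# The fitted coefficients are a pandas Series indexed by term name
print(res.params['med_inc'])    # slope: roughly 4.06 dollars of home value per dollar of median income
print(res.params['Intercept'])  # intercept estimate
print(res.rsquared)             # R-squared, roughly 0.50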