Tutorial: Accessing and Exploring the Data

This tutorial walks through fetching, cleaning, and exploring the Utah housing affordability dataset.


Installation

Install the utah-housing package and set your Census API key before running any code cells.

pip install utah-housing

Set your Census API key (obtain a free key at https://api.census.gov/data/key_signup.html):

import os
os.environ["CENSUS_API_KEY"] = "your_census_api_key_here"
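
Hardcoding the key in a notebook makes it easy to commit by accident. The following is a minimal sketch, assuming you export CENSUS_API_KEY in your shell before starting Python. The helper name require_census_key is hypothetical and is not part of utah-housing:

```python
import os

def require_census_key(env_var="CENSUS_API_KEY"):
    """Return the key from the environment, or raise a clear error if it is missing.

    Hypothetical helper for illustration; not part of the utah-housing package.
    """
    key = os.environ.get(env_var, "").strip()
    # Treat the tutorial's placeholder string as "not set"
    if not key or key == "your_census_api_key_here":
        raise RuntimeError(
            f"{env_var} is not set. Request a key at "
            "https://api.census.gov/data/key_signup.html and export it first."
        )
    return key
```

Calling require_census_key() at the top of a notebook surfaces a missing key immediately rather than partway through a long fetch.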

Reading in the Data

import pandas as pd
from utah_housing import fetch_all_years, OUTCOME, BASE_PREDICTORS, COMPLEX_PREDICTORS

# Fetch ACS 5-year estimates for Utah census tracts, 2009–2023
df = fetch_all_years(years=range(2009, 2024))
df.head()
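# Save to CSV so you can reload without hitting the API again
# (os was imported in the Installation step; to_csv fails if data/ does not exist)
os.makedirs("data", exist_ok=True)
df.to_csv("data/utah_housing.csv", index=False)
# Reload from CSV (faster than re-fetching); read GEOID as text so tract IDs are not parsed as integers
df = pd.read_csv("data/utah_housing.csv", dtype={"GEOID": str})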
df.head()
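
The save-then-reload steps can be folded into one cache-aware helper. This is a sketch only: load_or_fetch is a hypothetical name, and it takes the fetch step as a callable so it makes no assumptions about the package's API.

```python
from pathlib import Path

import pandas as pd

def load_or_fetch(path, fetch_fn, **read_kwargs):
    """Read the cached CSV if it exists; otherwise call fetch_fn() and cache the result.

    Hypothetical helper for illustration; not part of the utah-housing package.
    """
    path = Path(path)
    if path.exists():
        return pd.read_csv(path, **read_kwargs)
    df = fetch_fn()
    path.parent.mkdir(parents=True, exist_ok=True)  # create the folder if needed
    df.to_csv(path, index=False)
    return df
```

With the names used in this tutorial, a call might look like load_or_fetch("data/utah_housing.csv", lambda: fetch_all_years(years=range(2009, 2024)), dtype={"GEOID": str}). Note that the first call returns the freshly fetched frame, so column types can differ slightly from a later reload.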

Initial Exploration

print("Shape:", df.shape)
print("\nOutcome variable:", OUTCOME)
print("Base predictors:", BASE_PREDICTORS)
print("Complex predictors:", COMPLEX_PREDICTORS)
# Check for missing values in key columns
analysis_cols = ["year", "GEOID"] + [OUTCOME] + COMPLEX_PREDICTORS
df[analysis_cols].isnull().sum()
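
A single total can hide missingness that is concentrated in a few years. The sketch below breaks the missing share down by year on a toy frame. The column names rent_burden and median_income are placeholders, not the package's actual variables:

```python
import numpy as np
import pandas as pd

# Toy stand-in for df; column names are placeholders for illustration only
toy = pd.DataFrame({
    "year": [2009, 2009, 2023, 2023],
    "rent_burden": [0.31, np.nan, 0.42, 0.38],
    "median_income": [52000.0, 48000.0, np.nan, np.nan],
})

# Share of missing values per column within each year (0.0 = complete, 1.0 = all missing)
missing_by_year = toy.drop(columns="year").isnull().groupby(toy["year"]).mean()
print(missing_by_year)
```

On the real data, the same expression applies with toy replaced by df[analysis_cols].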

Cleaning the Data (optional when analyzing with the package)

# Drop rows missing the outcome variable
df_clean = df.dropna(subset=[OUTCOME]).copy()

# Extract county name from the NAME field for easier filtering.
# Accept "," or ";" as the separator: some ACS releases write tract names with semicolons.
df_clean["county_name"] = df_clean["NAME"].str.extract(r"[,;]\s*(.+?)\s+County", expand=False)

print(f"Rows after cleaning: {len(df_clean):,}")
df_clean.head()
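
It is worth checking the county pattern against a few sample strings before trusting it on every row. The sketch below assumes NAME values look like the samples shown. The semicolon form is an assumption about how newer ACS releases format tract names:

```python
import pandas as pd

# Sample NAME strings; the formats are assumptions for illustration
names = pd.Series([
    "Census Tract 1001, Salt Lake County, Utah",
    "Census Tract 1001; Salt Lake County; Utah",
    "Census Tract 9501, Box Elder County, Utah",
])

# Capture the text between the first separator and the word "County"
county = names.str.extract(r"[,;]\s*(.+?)\s+County", expand=False)
print(county.tolist())        # ['Salt Lake', 'Salt Lake', 'Box Elder']
print(county.isna().sum())    # number of names the pattern failed to match
```

Running the isna().sum() check on df_clean["county_name"] shows whether any real rows slipped past the pattern.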

Subsetting by County or Year

# Filter to a specific county
salt_lake = df_clean[df_clean["county_name"] == "Salt Lake"]
print(salt_lake.shape)

# Filter to a specific year
df_2023 = df_clean[df_clean["year"] == 2023]
print(df_2023.shape)
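
County and year filters can be combined in one boolean mask. A small sketch on toy data follows; the GEOID values are made up for illustration:

```python
import pandas as pd

# Toy stand-in for df_clean; values are invented for illustration
toy = pd.DataFrame({
    "county_name": ["Salt Lake", "Salt Lake", "Utah", "Davis"],
    "year": [2022, 2023, 2023, 2023],
    "GEOID": ["49035000001", "49035000001", "49049000001", "49011000001"],
})

# isin() matches several counties at once; & combines it with the year condition
mask = toy["county_name"].isin(["Salt Lake", "Utah"]) & (toy["year"] == 2023)
subset = toy[mask]
print(subset.shape)
```

The parentheses around the year comparison are required because & binds more tightly than == in Python.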

Basic Summary Statistics

summary_cols = [OUTCOME] + COMPLEX_PREDICTORS
df_clean[summary_cols].describe().round(2)
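
Because the data pools many survey years, a per-year summary is often more informative than one overall describe(). A sketch on toy data, where rent_burden is a placeholder column name rather than the package's outcome variable:

```python
import pandas as pd

# Toy stand-in for df_clean; "rent_burden" is a placeholder column name
toy = pd.DataFrame({
    "year": [2009, 2009, 2009, 2023, 2023, 2023],
    "rent_burden": [0.28, 0.31, 0.35, 0.36, 0.40, 0.47],
})

# One row per year with the tract count and basic spread of the outcome
by_year = toy.groupby("year")["rent_burden"].agg(["count", "median", "min", "max"])
print(by_year)
```

On the real data, the equivalent call would group df_clean by "year" and aggregate the OUTCOME column.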

The data is now ready for exploratory data analysis (EDA) and modeling (see the Technical Report).