Understanding and predicting household income distribution is fundamental to urban planning, policy development, and socioeconomic research. Our work presents a novel approach that leverages machine learning techniques to predict household income by incorporating diverse data sources. We combine traditional demographic indicators with innovative cultural metrics to create a comprehensive prediction model.
Adding cultural infrastructure metrics to traditional demographic data offers new insight into the relationship between cultural access and economic wellbeing, and into the socioeconomic disparities that planners and policymakers need to address.
Our research uses a dataset combining cultural infrastructure, geographic coordinates, and demographic variables to predict household income across New York City. Below we outline our data collection and processing methodology.
We implemented a spatial sampling approach using a 500x500 grid system spanning the geographic bounds of NYC. This high-resolution grid enables detailed spatial analysis and captures local variations in both cultural and demographic characteristics.
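A regular sampling grid of this kind can be sketched in a few lines. The snippet below is only an illustration: the bounding-box coordinates are approximate values chosen for the example, the function name `make_grid` is hypothetical, and it assumes "500x500" means 500 sample points along each axis with the edges of the bounding box included.

```python
def make_grid(min_lon, min_lat, max_lon, max_lat, n=500):
    """Return n * n (lon, lat) points evenly spaced over a bounding box.

    Points are ordered row by row from south to north, and within each
    row from west to east. Both edges of the box are included.
    """
    if n < 2:
        raise ValueError("n must be at least 2")
    lon_step = (max_lon - min_lon) / (n - 1)
    lat_step = (max_lat - min_lat) / (n - 1)
    return [
        (min_lon + i * lon_step, min_lat + j * lat_step)
        for j in range(n)   # rows (latitude)
        for i in range(n)   # columns (longitude)
    ]


# Approximate NYC bounding box: an assumption for illustration only,
# not the bounds used in the study.
NYC_BOUNDS = (-74.26, 40.49, -73.70, 40.92)

grid = make_grid(*NYC_BOUNDS, n=500)
```

Each grid point then serves as a location at which cultural and demographic features are evaluated.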
We developed an innovative approach to quantify cultural accessibility using an exponential decay function. This formula incorporates distance-based influence where cultural site weights diminish with distance from grid points.
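One plausible form of such a score is a weighted sum over cultural sites, with each site's weight multiplied by exp(-d / λ), where d is the distance from the grid point to the site. The sketch below assumes this form. The decay length `decay_km`, the use of great-circle (haversine) distance, and the function names are all assumptions made for the example and are not taken from the study.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometres between two lon/lat points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    # Clamp to guard against tiny floating-point overshoot above 1.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def cultural_score(point, sites, decay_km=1.0):
    """Sum of site weights, each damped by exp(-distance / decay_km).

    point: (lon, lat) of a grid point.
    sites: iterable of (lon, lat, weight) tuples.
    decay_km: hypothetical decay length; larger values let distant
              sites contribute more.
    """
    lon, lat = point
    return sum(
        weight * math.exp(-haversine_km(lon, lat, s_lon, s_lat) / decay_km)
        for s_lon, s_lat, weight in sites
    )
```

Under this form, a site located exactly at the grid point contributes its full weight, and the contribution falls to about 37% of the weight at a distance of one decay length.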
Demographic features were integrated using U.S. Census data, including population density, racial composition statistics, and socioeconomic indicators. Spatial joins matched grid points with census polygons.
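The core of a point-to-polygon spatial join is a point-in-polygon test. The following is a minimal, dependency-free sketch using ray casting on simple polygons without holes. It is not the study's pipeline: a real workflow would more likely rely on a GIS library such as GeoPandas, and the function names and toy tract attributes here are hypothetical.

```python
def point_in_polygon(point, polygon):
    """Ray-casting test for a simple polygon given as a list of (x, y)."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point
        # can be crossed by a ray cast to the right.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def join_points_to_tracts(points, tracts):
    """Attach the attributes of the first containing tract to each point.

    tracts: list of (polygon, attributes) pairs.
    Points that fall in no tract get None.
    """
    rows = []
    for point in points:
        attrs = None
        for polygon, tract_attrs in tracts:
            if point_in_polygon(point, polygon):
                attrs = tract_attrs
                break
        rows.append({"point": point, "tract": attrs})
    return rows


# Toy data: two adjacent unit squares standing in for census polygons.
tracts = [
    ([(0, 0), (1, 0), (1, 1), (0, 1)], {"tract_id": "A", "pop_density": 100}),
    ([(1, 0), (2, 0), (2, 1), (1, 1)], {"tract_id": "B", "pop_density": 250}),
]
joined = join_points_to_tracts([(0.5, 0.5), (1.5, 0.5), (5.0, 5.0)], tracts)
```

This brute-force loop checks every polygon for every point, which is fine for a toy example but slow for a dense grid. A spatial index is the usual remedy at that scale.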
Feature correlation matrix highlighting relationships between variables in the dataset.
Our analysis reveals several findings regarding the prediction of household income. The Random Forest model was the strongest performer, reaching an R² of 0.9127 compared with 0.7611 for the next-best model (FCNN) and 0.4083 for Linear Regression. This performance likely reflects the model's ability to capture complex, non-linear relationships between features.
Our Random Forest model achieved exceptional explanatory power, accounting for over 91% of the variance in household income.
The Random Forest model's Root Mean Squared Error of 11,566.72 is the lowest of the five models evaluated, well below the 19,136.01 to 30,118.41 range of the others.
Its Mean Absolute Error of 7,771.45 is likewise the lowest, compared with 13,580.19 to 22,650.40 for the other models.
| Model | R² Score | RMSE | MAE | Training Time (s) | Parameter Count |
|---|---|---|---|---|---|
| Linear Regression | 0.4083 | 30,118.41 | 22,650.40 | <0.01 | 17 |
| Random Forest | 0.9127 | 11,566.72 | 7,771.45 | 6.76 | 4,800 |
| Gradient Boosting | 0.6702 | 22,484.62 | 16,798.52 | <0.01 | 5,040 |
| XGBoost | 0.7517 | 19,511.74 | 14,481.06 | 0.09 | 10,800 |
| FCNN | 0.7611 | 19,136.01 | 13,580.19 | 32.86 | 3,331 |
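For reference, the three error metrics in the table follow their standard definitions, which can be computed as in the sketch below. The income values in the example are made up for illustration and are not data from the study, and `regression_metrics` is a hypothetical helper name.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R², RMSE, and MAE for paired actual and predicted values."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")

    mean_true = sum(y_true) / n
    residuals = [t - p for t, p in zip(y_true, y_pred)]

    ss_res = sum(r * r for r in residuals)              # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    if ss_tot == 0:
        raise ValueError("R² is undefined when all actual values are equal")

    return {
        "r2": 1 - ss_res / ss_tot,
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(r) for r in residuals) / n,
    }


# Illustrative values only (not from the study).
actual = [50_000, 60_000, 70_000]
predicted = [52_000, 60_000, 67_000]
metrics = regression_metrics(actual, predicted)
```

RMSE penalizes large errors more heavily than MAE, so the gap between the two gives a rough sense of how much a model's error is driven by a few large misses.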
SHAP summary plot showing feature importance for the Random Forest model.
Actual vs predicted household incomes for the Random Forest model.
R² score vs number of parameters for all models evaluated in this study (log scale).
Our interdisciplinary research team combines expertise in urban planning, machine learning, and data science to understand the complex relationships between cultural infrastructure and socioeconomic factors.
PhD Candidate
Civil and Environmental Engineering
Arizona State University
mislam23@asu.edu
PhD Student
School of Computing and Augmented Intelligence
Arizona State University
rsaha8@asu.edu
PhD Student
School of Computing and Augmented Intelligence
Arizona State University
ppranto@asu.edu