The Problem of Data Aggregation in People Metrics

This analysis is a reaction to the Kansas City Star article, “Asian-Americans narrow wealth gap, new studies show,” which oversimplifies income and race trends. It aggregates “Asian-Americans” into a single group and tells a story of averages. This is not uncommon in major coverage of demographics and Asian Americans. To demonstrate why disaggregation matters, data from the U.S. Census dataset in the UCI Machine Learning Repository, here and here, are compared with the findings from a St. Louis (STL) Federal Reserve paper on The Demographics of Wealth. Demographic data aggregation tells the wrong story of income and race in the United States. There are cases where metrics should be aggregated, but in those cases the advantages must be laid out.

An Obama White House paper nicely lays out the “Significance of Data Disaggregation to the Asian American Pacific Islander Community,” read here. In the context of data analysis, demographic aggregation should be questioned in the following ways:

  • How important are race and ethnicity when selecting features from a dataset?
    • Is there a more sophisticated data mining technique that selects the features that most impact income (or another trend)?
    • When race is significant, when should data be broken down further into ethnicity?
    • Where race and ethnicity are not significant, what other features are more influential?
  • What is the distribution of data within aggregated groups?
    • If there is large variation within groups, then aggregation is not useful or explanatory. The average or median may not reflect the variation within the grouped data. Does disaggregation add more information to the analysis?
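
The distribution question above can be checked with a few lines of code. The following is a minimal sketch, assuming made-up subgroup labels and made-up income figures (none of these numbers come from the Census, the UCI dataset, or the STL paper). It compares the pooled median with each subgroup's median and spread.

```python
from statistics import median, pstdev

# Hypothetical household incomes (USD) keyed by made-up subgroup labels.
# These figures are illustrative only and do not come from any dataset.
incomes_by_subgroup = {
    "subgroup_a": [22_000, 28_000, 31_000, 35_000],
    "subgroup_b": [48_000, 52_000, 55_000, 61_000],
    "subgroup_c": [90_000, 110_000, 125_000, 140_000],
}


def summarize(values):
    """Return the median and the (rounded) population standard deviation."""
    return {"median": median(values), "stdev": round(pstdev(values))}


# Aggregated view: pool every household into one group.
pooled = [v for values in incomes_by_subgroup.values() for v in values]
aggregate = summarize(pooled)

# Disaggregated view: summarize each subgroup on its own.
by_subgroup = {
    name: summarize(values) for name, values in incomes_by_subgroup.items()
}

print("aggregate:", aggregate)
for name, stats in by_subgroup.items():
    print(name, stats)
```

In this toy data the pooled median sits between the subgroups and describes none of them well, and the pooled spread is much larger than the spread inside any one subgroup. That pattern is the signal that disaggregation adds information.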

The STL paper, Demographics of Wealth, unapologetically aggregates household race and ethnicity into four groups: white, Asian, Hispanic, and Black. The paper glosses over the history, culture, geography, and resources tied to the four groups and their subgroups. Oversimplifying race and ethnicity within the distribution of wealth and resources is a disservice to lower-income and less-resourced households and to communities of color. Households within each racial group are distributed across the income spectrum from poor to rich, poverty to surplus, and struggle to success. It is disingenuous to provide a sweeping analysis of national wealth, especially when there is plenty of research that highlights nuances in race and ethnicity, income, and wealth.

Demographics of Wealth uses the Federal Reserve’s triennial Survey of Consumer Finances. I used U.S. Census data, specifically the cleaned Census Income dataset from the UCI Machine Learning Repository, to review feature selection and the distribution of income by racial and ethnic group. The UCI dataset, which records income alongside race, is drawn from U.S. Census data, here. A summary of each data visualization:

  • Bar charts of the variety of income groups by race and ethnicity, gender, and country of origin. If continuous data by race were available (or if I could find it with the tools the U.S. Census provides), then box plots would be a good way to visualize the income distribution within each aggregated race group. As the bar plots show, education and income vary within each ethnicity; however, the STL Federal Reserve paper says Asian Americans are doing the “best,” which quickly reapplies an oversimplification to demography and trend analysis.


  • Feature (or metric) selection by decision tree, to view the weight that each feature or metric has in determining the target. In this case, the target in the cleaned UCI dataset is whether yearly income is above or below $50,000. The metrics submitted to the classification tree model are age, sex, education, and race. The resulting trees show the impact of each metric on determining the target: the closer a metric is to the root (top) of the tree, the more important it is in determining the target. The two examples below use different tree sizes (depth and width of the tree); race appears after the third level in the tree. Age and education are more significant features.
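
The idea that features nearer the root matter more can be sketched without fitting a full tree. A classification tree picks the most informative feature for its first split, and one common criterion for that choice is information gain (reduction in entropy). The post does not state which criterion or library produced the trees, so treating information gain as the criterion is an assumption here. The rows are made-up stand-ins for the cleaned UCI records, use only three of the four metrics, and are deliberately constructed so that race carries no information, purely to show how the ranking works.

```python
from collections import Counter, defaultdict
from math import log2

# Made-up rows: (age_band, education, race, income_label).
# Illustrative only; these are NOT records from the UCI dataset.
rows = [
    ("young", "hs", "A", "<=50K"),
    ("young", "hs", "B", "<=50K"),
    ("young", "college", "A", ">50K"),
    ("young", "college", "B", "<=50K"),
    ("older", "hs", "A", "<=50K"),
    ("older", "hs", "B", ">50K"),
    ("older", "college", "A", ">50K"),
    ("older", "college", "B", ">50K"),
]
FEATURES = {"age_band": 0, "education": 1, "race": 2}


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, feature_index):
    """Entropy of the target minus the weighted entropy after splitting."""
    labels = [row[-1] for row in rows]
    groups = defaultdict(list)
    for row in rows:
        groups[row[feature_index]].append(row[-1])
    weighted = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(labels) - weighted


gains = {name: information_gain(rows, idx) for name, idx in FEATURES.items()}
# Highest gain first: the feature a tree would be most likely to split on early.
ranking = sorted(gains, key=gains.get, reverse=True)

for name in ranking:
    print(f"{name}: {gains[name]:.3f}")
```

On real data the gains would come from the records themselves rather than from a constructed example. A library tree, such as scikit-learn's DecisionTreeClassifier, exposes a comparable ranking through its feature_importances_ attribute after fitting.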


Disaggregation in demographics is a continuing issue in public policy, especially public education. In the United States, the political amalgamation of people with ancestry from across Asia, Southeast Asia, the Pacific Islands, and East Asia falsely implies “similarity” between different ethnic groups. The political group itself has been rebranded many times: Asian American, Asian Pacific American, Asian American and Pacific Islander, Asian Pacific Islander American, and so on. There is an obvious disconnect between how people identify and the categorization of over 43 different ethnicities as one. The difficulty is that Asian Americans make up about 5% of the U.S. population, so on one hand, breaking out subcategories dilutes political power. On the other hand, aggregating Asian Americans in policy decisions causes more harm. For instance, medical research and health policy do not break out Asian American ethnicities, so trends, behaviors, languages, and other factors important to health and public health are missed; for more information, see here: National Center for Biotechnical Education and the infographic on medical research dedicated to U.S. ethnic groups.

Socioeconomic and demographic research attempts to explain the world and decipher how people succeed and fail. Research can facilitate discovery and highlight potential solutions, but it can also create harmful bias against communities and suppress them. The problem lies in the perspective of the researcher and in the data collection vehicle, such as surveys and other means of collection. Just as private companies cannot respond to consumer needs and develop opportunities with erroneous and unsophisticated analysis, public sector leaders cannot make the right policy decisions with generalized information about the public and its communities.
