Optimize Data Science Models with Feature Engineering: Cluster Analysis, Metrics Development, and PCA with Baby Names Data

While baby name articles are mandatory reading for soon-to-be parents, the U.S. Social Security Administration’s (SSA’s) Baby Names data set should be required reading for budding data scientists. The data set can be sliced and diced in many different ways, including language- and time-based methods, to answer creative questions. Other examples of frequency data sets are everywhere. For example, website analytics tools count the number of visits by unique users, retail point-of-sale systems count products sold by color, bankers track the number of loans defaulted in a month, and marketers survey customer satisfaction. This article is a feature engineering tutorial on a frequency data set. Here, the goal of feature engineering is to distill the characteristics of each name and capture relationships between name trends in a matrix of values.

I started exploring name trends to find ways to make name recommendations based on a few preferences. As a starting point, I assumed that trends capture qualities and societal attitudes. Interestingly, baby name trends can explain a lot more about societal preferences; see the NCBI article Cross-Correlations of Baby Names. Also, when you know someone’s name, you likely know that person’s age; see the FiveThirtyEight article How to Tell Someone’s Age When All You Know Is Her Name.

What U.S. names are most similar to my own? My name (“Pauline”) is old-fashioned without much room for abbreviation. I assumed the following list of features based on time and manually derived them from the SSA data, aka “manual” features. Then I applied principal component analysis (PCA) to create a set of “automated” features. Finally, cluster analysis results are compared using these automated, manual, and combined features. Clustering is a way to group together sets of objects with similar attributes. Data science methods have a solid track record of pooling together similar things. For example, product recommendation algorithms identify people like you by purchasing history and/or demographics, or determine products that are most commonly purchased with each other (e.g., ketchup with fries).

List of features to find names similar to “Pauline”:

  1. Similar names will not be at or near their peak of popularity (peak detection),
  2. Similar names are relatively obscure in the United States (quantity and acceleration),
  3. Similar names are not too unique (quantity), and
  4. Similar names are subjectively pleasant (not quantifiable).

The data set, Python code, and analysis are available in a public and interactive Kaggle notebook: https://www.kaggle.com/paulinechow/baby-names-optimize-w-feature-engineering

What features will generate the “best” short list of baby names?

Baby Name Metrics: peak detection, acceleration, and rank

SSA’s name data tracks the frequency of names used from the years 1910 to 2017. Name frequencies are aggregated and grouped by year and gender. In the notebook, sections 1 to 3 are general checks of data, including sampling of the data, test statistics, size, and shape. The notebook walks through creating categorical metrics based on peak popularity, year over year change, and appearance in the top 500 most popular names in the last 3 years. Section 4 of the notebook goes through the steps to create, combine, and analyze these metrics.

(1) Peak popularity detection

Each name has highs (“peaks”) and lows (“valleys”), both relative to its own history and relative to all names, and these provide important information. For instance, the names “Bertha” and “Jenny” reached peak popularity in the 1920s and 1970s, respectively, and have since decreased in popularity. Bertha has had a steady decline since its peak, while Jenny and Jennifer were strong contenders from the 1940s to the 1970s before their steady descent.

for names of her peers that will be popular in the lifetime of my child. A hypothesis is that a name is unlikely to become popular again if it peaked significantly, relative to the population, before the current cohort and is currently on a decelerating trend. Peak detection combined with knowing the current acceleration or YoY (year over year) change can help narrow down names that meet the current requirements.

Peak detection also returns information to create metrics.

Figures: Bertha name trend line; Jenny name trend line

Peak detection is used in digital signal processing and speech recognition to find local minima and maxima both in fixed and real time data. With names, peak detection can contextualize names with respect to events, people, and culture. Here, peak detection is leveraged to determine any peak(s) within the last 5, 10, 15, 20, and 25 years, which are saved as categorical features in the dataset.

In this notebook, peaks and valleys are detected with a simple method and a more complex one.

(a) The most straightforward approach to peak detection is to calculate sign changes between consecutive periods. A sign change between two periods from positive to negative denotes a decrease from a peak. The simple algorithm counts each decrease that follows an increase. The input to the peak_detection_simple function is a list of values, such as yearly counts or 5-year rolling averages. These values are calculated before the list is passed to the function.

The simple peak detection function below computes differences between consecutive time periods and returns the number of sign changes. A sign change is defined as a movement from positive to negative; the function does not account for the magnitude of a peak or the length of time spent at it.

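def peak_detection_simple(trend):
    # differences between consecutive periods, skipping flat stretches
    ch = [x1 - x0 for x0, x1 in zip(trend, trend[1:]) if x1 != x0]
    # count sign changes from positive (rising) to negative (falling)
    return sum(1 for c0, c1 in zip(ch, ch[1:]) if c0 > 0 and c1 < 0)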

This simple method lacks the ability to look at the big-picture trend of a name. Questions arise from the results of the simple function: What if a name spends more time at a “peak”? Should we aggregate similar peaks that are close together? Which fluctuations from positive to negative are significant enough to count as a peak? Is there a threshold for the slope of an incline to or descent from the peak?
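
One way to see these limitations is to run the same sign-change count on raw yearly values and on a 5-year rolling average. The sketch below is illustrative only: the counts are made up (not SSA data), and `count_peaks` is a hypothetical helper name that mirrors the simple approach described above.

```python
import pandas as pd

def count_peaks(values):
    """Count rise-then-fall turning points in a sequence of numbers."""
    # Differences between consecutive periods, ignoring flat stretches
    diffs = [b - a for a, b in zip(values, values[1:]) if b != a]
    # A peak is a positive change followed directly by a negative one
    return sum(1 for d0, d1 in zip(diffs, diffs[1:]) if d0 > 0 and d1 < 0)

# Hypothetical yearly counts for one name (not SSA data)
counts = pd.Series([10, 14, 12, 15, 13, 18, 25, 24, 30, 22, 15, 16, 9, 8])

# A centered 5-year rolling average smooths out short-lived wiggles
smoothed = counts.rolling(window=5, center=True).mean().dropna()

raw_peaks = count_peaks(counts.tolist())
smooth_peaks = count_peaks(smoothed.tolist())
print(raw_peaks, smooth_peaks)  # 5 peaks in the raw series, 1 after smoothing
```

The raw series reports every small wiggle as a peak, while the smoothed series keeps only the broad rise and fall. Smoothing is one possible answer to the aggregation question, at the cost of blurring exactly when the peak happened.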

(b) SciPy is an open-source scientific computing package that provides built-in functions for mathematics, science, and engineering. It provides functions for identifying peaks, with additional parameters to differentiate between them; see scipy.signal.find_peaks. The find_peaks function provides options to require a minimum and maximum peak height (height), set a minimum vertical distance to neighboring samples (threshold) and a minimum horizontal distance between peaks (distance), and require a relative strength of the peak (prominence).

import numpy as np
import scipy.signal

# example function that leverages the scipy find_peaks function
def get_peaks(df, name, d, verbose=False):
    '''
    Function takes a dataframe, name (string), d (minimum distance between peaks)
    Returns positional indexes of the peaks in the name's trend
    '''
    df_filter = df.loc[:, name]
    tvec = np.array(df_filter)
    # only peaks at or above the name's mean count are kept
    indexes, _ = scipy.signal.find_peaks(tvec, height=float(tvec.mean()), distance=d)
    if verbose:
        print('Peaks are: %s' % (indexes))
        print(tvec[indexes])
    return indexes
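
The peak positions can then be turned into the categorical features mentioned earlier (a peak within the last 5, 10, 15, 20, or 25 years). The sketch below is one possible way to do that, not necessarily how the notebook does it: `recent_peak_flags`, the flag names, and the toy series are all hypothetical.

```python
import pandas as pd
from scipy.signal import find_peaks

def recent_peak_flags(series, windows=(5, 10, 15, 20, 25), distance=3):
    """Return {flag_name: 0/1} for a peak within the last `window` years.

    `series` is indexed by year; only peaks above the series mean count.
    """
    values = series.to_numpy(dtype=float)
    idx, _ = find_peaks(values, height=values.mean(), distance=distance)
    peak_years = series.index[idx]
    last_year = series.index.max()
    return {
        f"peak_last_{w}": int((peak_years > last_year - w).any())
        for w in windows
    }

# Toy trend (not SSA data): rises to a single peak in 2005, then declines
years = range(1990, 2018)
toy = pd.Series([100 - 5 * abs(y - 2005) for y in years], index=years)

flags = recent_peak_flags(toy)
print(flags)
```

With the last year at 2017 and the only peak in 2005, the 5- and 10-year flags are 0 and the 15-, 20-, and 25-year flags are 1.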

(2) Acceleration or YoY change

Year over year (or over any time period) metrics are standard in analytics and reporting contexts. The longer the time period compared, the more seasonality factors are normalized in the outputs. In Python’s pandas library, calculating the percentage change over x years creates a proxy for the acceleration rate over the last x years. A number of features are created for the names over various periods of time.

# set the shift to any number of years to compare growth over x years
# percentage change is measured relative to the earlier value
yoy_female = df_female_pivot.apply(lambda x: (x - x.shift(1)) / x.shift(1), axis=0)
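
To build several such features at once, the same calculation can be repeated for different shifts and the most recent value kept for each name. This is a minimal sketch on made-up counts; `pct_change_over` and the `chg_*` column names are hypothetical, and the pivot layout (years as rows, names as columns) is assumed to match the snippet above.

```python
import pandas as pd

# Hypothetical counts: rows are years, columns are names (not SSA data)
pivot = pd.DataFrame(
    {"Ada": [100, 110, 121, 133, 120], "Bea": [50, 40, 40, 44, 55]},
    index=[2013, 2014, 2015, 2016, 2017],
)

def pct_change_over(df, years):
    """Percentage change relative to the value `years` rows earlier."""
    previous = df.shift(years)
    return (df - previous) / previous

# One feature per period, taken from the most recent year in the data
features = pd.DataFrame(
    {f"chg_{n}yr": pct_change_over(pivot, n).iloc[-1] for n in (1, 3)}
)
print(features.round(3))
```

Each row of `features` is a name and each column is the change over one period, which keeps the output in the same one-row-per-name shape as the other metrics.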

(3) Top 500 ranked name indicator

A categorical variable is created to flag whether a name was ranked in the top 500 over the last 3 years. Using an indicator instead of the actual rank avoids giving the rank too much weight, which could happen even if the rank were scaled. The indicator could also be changed to aggregate over more or fewer years, collect all top X names for every year, or rank names grouped by state.

import datetime
import pandas as pd

# Create metric with reference to today's date
# (the selected years must exist in the pivot's index; the SSA data used here ends in 2017)
now = datetime.datetime.now()
last_3_yrs = list(range(now.year - 3, now.year))
count_3_yrs = df_female_pivot.loc[last_3_yrs, :]
df_3_yrs = pd.DataFrame(count_3_yrs.sum().reset_index())
df_3_yrs.columns = ['names', 'count']
df_3_yrs.set_index('names', inplace=True)
df_3_yrs['rank_3yr'] = df_3_yrs['count'].rank(ascending=False)

A list of the top 500 names over the last 3 years will come in handy later for filtering names from the final list.

top_500_list = df_3_yrs[df_3_yrs.loc[:, 'rank_3yr'] <= 500].index
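
If the data stops before the current year, the window can be anchored to the latest year present in the data instead of today's date. The sketch below shows that variant on a toy pivot and returns the 0/1 indicator directly; `top_n_flag` and its parameters are hypothetical, and the toy uses a top 2 rather than a top 500 so the result is easy to check by eye.

```python
import pandas as pd

# Hypothetical pivot: rows are years, columns are names (not SSA data)
pivot = pd.DataFrame(
    {"Ada": [5, 6, 7, 8], "Bea": [90, 80, 70, 60],
     "Cleo": [20, 30, 40, 50], "Dot": [1, 1, 2, 1]},
    index=[2014, 2015, 2016, 2017],
)

def top_n_flag(df, n_years=3, top_n=2):
    """Flag names ranked in the top `top_n` by total count over the last `n_years`."""
    latest = df.index.max()
    # Label-based slicing on a sorted year index includes both endpoints
    window = df.loc[latest - n_years + 1 : latest]
    totals = window.sum()
    ranks = totals.rank(ascending=False, method="min")
    return (ranks <= top_n).astype(int)

flags = top_n_flag(pivot)
print(flags.to_dict())
```

Here the window is 2015 to 2017, so Bea and Cleo are flagged with 1 and the other two names with 0.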

Features from Principal Component Analysis (PCA)

Name popularity is hard to predict solely based on frequency. A pattern is not necessarily discernible because inspiration is random. Parents may be influenced by Disney movies, public figures, or private events in their lives. A study that uses baby names as indicators of cultural traits in the United States shows that new names are being invented rather than reused from past generations (see Cross-Correlations of Baby Names). Instead of manually extracting metrics, entire trends can be decomposed into features that explain the variance of each trend. PCA transforms the trends into a smaller set of uncorrelated components ordered by how much variance each explains. The math behind PCA is explained here.

Disney Princess Names with Corresponding Release Years

Running PCA with 25 components, the results show that 3 and 10 components cumulatively explain 80% and 99% of the variance in the data, respectively. A threshold for cumulative variance can be set before running PCA, especially if the dataset is very large. Computation time can be saved by leveraging PCA as a dimension reduction technique. Alternatively, the number of components chosen for subsequent analysis can depend on the outcome of the final model. This means that a model can be further optimized through its data inputs, in this case the number of components.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# tmp is a dataframe
# sc is not defined in the original snippet; a StandardScaler is assumed from the import above
sc = StandardScaler()
X_pca = np.array(sc.fit_transform(tmp))
pca_check = PCA(n_components=25)
pca_check.fit(X_pca)

# examine the cumulative variance explained by components
print(np.cumsum(pca_check.explained_variance_ratio_))
# output:
# [0.42887293 0.66402468 0.80200918 0.88461254 0.93056296 0.95413504
#  0.96839219 0.97788769 0.98364119 0.98681891 0.98958087 0.99171796
#  0.99302033 0.99415808 0.99509189 0.99571205 0.99626306 0.99665273
#  0.9970096  0.99730653 0.99758265 0.99781979 0.99798878 0.99813573
#  0.99825983]
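
Picking the number of components from a cumulative-variance threshold can also be automated. The sketch below runs on synthetic data (200 rows by 30 columns driven by three hidden trends), so its numbers are unrelated to the SSA results above; `components_for` is a hypothetical helper.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in: 200 "names" x 30 "years", 3 hidden trends plus noise
latent = rng.normal(size=(200, 3))
loadings = rng.normal(size=(3, 30))
X = latent @ loadings + 0.1 * rng.normal(size=(200, 30))

X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=25).fit(X_scaled)
cum_var = np.cumsum(pca.explained_variance_ratio_)

def components_for(cum_var, threshold):
    """Smallest number of components whose cumulative variance reaches threshold."""
    # searchsorted returns the first position where cum_var >= threshold
    return int(np.searchsorted(cum_var, threshold) + 1)

k80 = components_for(cum_var, 0.80)
k99 = components_for(cum_var, 0.99)
print(k80, k99)
```

scikit-learn's PCA also accepts a fraction between 0 and 1 for n_components (for example 0.80) with the full SVD solver, which selects the component count for that share of variance in a single step.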

Section 5 of the notebook transforms the features and implements cluster analysis for 4 and 10 components. In this analysis, accounting for more variance between names partitions the names into clusters more cleanly: the more components used, the better the final cluster silhouette scores. Optimizing for the quality of clustering is most aligned with the desired outcome.

Finding clusters with Kmeans

I chose Kmeans cluster analysis as a more generalized method of grouping names. Kmeans clustering is used to group names (observations) into n clusters, where each name is allocated to the cluster with the nearest mean. The quality of clusters can be measured with the silhouette score, ranging from -1 to +1, which compares the cohesion of observations within their own cluster with their separation from other clusters.

from sklearn.metrics import silhouette_score

def get_cluster_score(data):
    # prints the silhouette score for 2 to 10 clusters
    for n in range(2, 11):
        kmeans = KMeans(n_clusters=n).fit(data)
        label = kmeans.labels_
        sil_coeff = silhouette_score(data, label, metric='euclidean')
        print("For n_clusters={}, The Silhouette Coefficient is {}".format(n, sil_coeff))

The number of clusters for the manual, automatic, and combined datasets was selected based on the count of names in the same cluster as Pauline.
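
Pulling out the names that share a cluster with a target name takes only a few lines once the model is fitted. The sketch below uses a tiny made-up feature matrix, and `names_in_same_cluster` is a hypothetical helper rather than code from the notebook.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical feature matrix: one row per name (not SSA data)
names = ["Pauline", "Agnes", "Esther", "Mia", "Sophia", "Emily"]
features = pd.DataFrame(
    [[0.1, 0.2], [0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]],
    index=names,
)

def names_in_same_cluster(features, target, n_clusters, seed=0):
    """Fit K-means and return the other names sharing the target's cluster."""
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = pd.Series(model.fit_predict(features.to_numpy()), index=features.index)
    same = labels[labels == labels[target]].index
    return sorted(n for n in same if n != target)

shortlist = names_in_same_cluster(features, "Pauline", n_clusters=2)
print(shortlist)  # ['Agnes', 'Esther']
```

Running this for several values of n_clusters and comparing the length of the shortlist is one way to apply the selection rule described above.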

Results: Baby Name Lists

The baby name list results were promising, since we are able to go from thousands of names to fewer than 200. Lists generated by clustering with manual and automatic features contain 155 and 57 names, respectively. There are 37 names shared by both lists. PCA features identified a shorter list of names that, from the graph below, follows the assumptions laid out originally. At the same time, using only automatically created features subjectively “misses” potential names.
A blended dataset, aggregating manual and automatic features, produced the shortest list with 47 names.
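
A blended feature matrix of this kind can be built by joining the manual flags and the PCA components on the name index. The sketch below shows one way to do it on random stand-in data; the column names are hypothetical, and rescaling the joined matrix is an assumption made here so that neither block dominates the K-means distances.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
names = [f"name_{i}" for i in range(40)]

# Hypothetical inputs: a yearly trend matrix and hand-built flags (not SSA data)
trends = pd.DataFrame(rng.poisson(50, size=(40, 20)), index=names)
manual = pd.DataFrame(
    {"peak_last_10": rng.integers(0, 2, 40), "top_500": rng.integers(0, 2, 40)},
    index=names,
)

# "Automatic" features: PCA components of the scaled trends
scaled_trends = StandardScaler().fit_transform(trends.to_numpy())
components = PCA(n_components=4).fit_transform(scaled_trends)
automatic = pd.DataFrame(
    components, index=names, columns=[f"pc_{i + 1}" for i in range(4)]
)

# Blend by aligning on name, then rescale the joined matrix (an assumption)
combined = manual.join(automatic)
combined_scaled = pd.DataFrame(
    StandardScaler().fit_transform(combined),
    index=names,
    columns=combined.columns,
)
print(combined_scaled.shape)  # (40, 6)
```

The resulting matrix has one row per name and can be passed to the same clustering and silhouette steps as the manual-only and automatic-only matrices.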

A lesson from this comparison is that feature engineering can be both art and science. Cluster analysis will produce groupings that meet requirements, and the more requirements there are, the more restrictive the groupings. Automatic features are able to mirror trend similarities over time more accurately. Manual features capture internalized rules or assumptions but are not guaranteed to remove noisy results. When relying solely on automatic features, creativity may be lost. Blending the data and adding more features filters names further instead of infusing creativity into the list. The blended data produced a list with less noise, while the manual features alone did not do enough to meet the requirements. Below are samples of names generated using the different sets of features.

The full lists from both sets of features can be downloaded from the Jupyter notebook. Below are 10 randomly selected names from the lists:

# Manual features only
['Christina',
 'Kelsey',
 'Sophia',
 'Mia',
 'Sharon',
 'Hazel',
 'Rachel',
 'Maria',
 'Brooklyn',
 'Emily']
# Automatic features only
['Virginia',
 'Gloria',
 'Marjorie',
 'Esther',
 'Lucille',
 'Thelma',
 'Agnes',
 'Norma',
 'Rose',
 'Ruby']
#Names appearing in both manual and automatic
['Marie',
 'Jane',
 'Thelma',
 'Lillian',
 'Geraldine',
 'Catherine',
 'Mildred',
 'Jean',
 'Florence',
 'Ruby']
#names appearing in combined feature list
['Maria',
 'Kim',
 'Diane',
 'Cheryl',
 'Jill',
 'Janice',
 'Laura',
 'Brenda',
 'Rhonda',
 'Kathy']
#names appearing in combined and manual
#there were no intersections between combined and automatic
['Jacqueline', 'Laurie', 'Sheila', 'Sherry', 'Suzanne', 'Wendy']

Further questions that you can ask about the data:

  • What other metrics can we derive from the SSA dataset?
  • Do similar trends and insights hold for male names in the SSA dataset?
  • What attributes of cluster analysis can be optimized to find similar names? For example, what changes if connectivity-based or distribution-based clustering is used instead of centroid-based clustering?
