While baby name articles are mandatory reading for soon-to-be parents, the U.S. Social Security Administration’s (SSA’s) Baby Names data set should be required reading for budding data scientists. The data set can be sliced and diced in many different ways, including language-based and time-based methods, to answer creative questions. Other examples of frequency data sets are everywhere. For example, website analytics tools count the number of visits by unique users, retail point-of-sale systems count products sold by color, bankers track the number of loans defaulted in a month, and marketers survey customer satisfaction. This article is a feature engineering tutorial on frequency data sets. Here, the goal of feature engineering is to distill the characteristics of each name and capture relationships between name trends in a matrix of values.
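To make the idea of a "matrix of values" concrete, here is a minimal sketch of turning raw frequency records into one feature row per name. The record layout `(name, year, count)`, the helper `build_feature_matrix`, the three features chosen, and the toy numbers are all assumptions made for illustration; they are not real SSA figures and not necessarily the features this article builds.

```python
from collections import defaultdict

# Hypothetical toy records in (name, year, count) form. The numbers are
# invented for illustration and are NOT real SSA figures.
records = [
    ("Ava", 2000, 10), ("Ava", 2001, 20), ("Ava", 2002, 30),
    ("Max", 2000, 40), ("Max", 2001, 20), ("Max", 2002, 20),
]


def build_feature_matrix(rows):
    """Return {name: [total, peak_year, peak_share]} from frequency rows."""
    # Aggregate counts per name and year.
    by_name = defaultdict(dict)
    for name, year, count in rows:
        by_name[name][year] = by_name[name].get(year, 0) + count

    features = {}
    for name, counts in by_name.items():
        total = sum(counts.values())             # overall popularity
        peak_year = max(counts, key=counts.get)  # year with the highest count
        peak_share = counts[peak_year] / total   # how concentrated the trend is
        features[name] = [total, peak_year, peak_share]
    return features


feature_matrix = build_feature_matrix(records)
for name, row in sorted(feature_matrix.items()):
    print(name, row)
```

Each row summarizes one name's trend with a handful of numbers, which is the shape that downstream steps such as clustering or PCA generally expect.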