While baby name articles are mandatory reading for soon-to-be parents, the U.S. Social Security Administration’s (SSA’s) Baby Names data set should be required reading for budding data scientists. The data set can be sliced and diced in many different ways, including language- and time-based methods, to answer creative questions. Other examples of frequency data sets are everywhere. For example, website analytics tools count the number of visits by unique users, retail point-of-sale systems count products sold by color, bankers track the number of loans defaulted in a month, and marketers survey customer satisfaction. This article is a feature engineering tutorial on frequency data sets. Here, the goal of feature engineering is to distill characteristics of each name and capture relationships between name trends into a matrix of values. Continue reading “Optimize Data Science Models with Feature Engineering: Cluster Analysis, Metrics Development, and PCA with Baby Names Data”
Common threads exist among successful students who have transitioned to data science as a career after a bootcamp program.
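To make the idea of "a matrix of values" concrete, here is a minimal sketch of how frequency records could be turned into a name-by-year matrix. The `(name, year, count)` record shape, the sample values, and the `build_matrix` helper are assumptions for illustration only; they are not taken from the SSA files or from the full tutorial.

```python
from collections import defaultdict

# Hypothetical frequency records in the shape (name, year, count).
# The counts are invented for illustration, not real SSA figures.
records = [
    ("Emma", 2000, 10), ("Emma", 2001, 20), ("Emma", 2002, 30),
    ("Liam", 2000, 30), ("Liam", 2001, 20), ("Liam", 2002, 10),
]

def build_matrix(records):
    """Return (names, years, matrix), one row per name and one column per year.

    Each row is divided by the name's total count, so it describes the
    shape of the trend rather than the overall popularity of the name.
    """
    years = sorted({year for _, year, _ in records})
    names = sorted({name for name, _, _ in records})

    # Sum counts per (name, year) in case a pair appears more than once.
    counts = defaultdict(int)
    for name, year, count in records:
        counts[(name, year)] += count

    matrix = []
    for name in names:
        row = [counts[(name, year)] for year in years]
        total = sum(row)
        matrix.append([c / total if total else 0.0 for c in row])
    return names, years, matrix

names, years, matrix = build_matrix(records)
```

Rows built this way can be compared with each other, which is one plausible starting point for the clustering and PCA steps named in the post title.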
This blog is currently migrating from When There Is Data. Old posts will be republished, as well as new content added in the coming weeks. Stay tuned and thanks for your patience!
This post is a brief ode to the spreadsheet, which paved the way for many to learn how to organize information, collaborate, and analyze data. Spreadsheets played, and often still play, a substantial part in our analytical lives. Data scientists can get a little smug about their technical tools and are often inclined to discuss the latest and greatest. However, spreadsheets most likely still rule part of the workflow. They may be used to inspect the data quickly, as the best way to share information with non-technical people, or as an accessible way to check results.
Data science is able to detect anomalies, such as fraud and security breaches. Can it also be used to find sneaky political practices? Quantifying and featurizing the “messy” world of politics can elucidate order and truth. Politics is one of the most important yet least interesting areas to the public, who have everything to gain and lose from everyday decisions. Note: This post was written before Trump was elected to office…
We all should be sitting on the edge of our seats over the next couple of months. Change is inevitable, but the change agent may be questionable. To get psyched for this last stretch before the elections, I apply natural language processing (NLP) to this week’s first presidential debate, with a focus on polarity in sentiment. Visualizations in this post include an interactive candidate polarity graph and word clouds. Continue reading “Clinton v. Trump: Candidate Sentiment from the 1st Presidential Debate”
The popularity of data science in the media makes the combination of established areas of study more accessible and interesting to everyone.
There is no denying that data science helps with online content, but for many content publishers it is often unattainable with their current data sets. Vanity metrics may be readily available but are less flexible. For instance, vanity metrics may not be able to tell you about unique users. Aggregated pageview counts are generally not enough to demonstrate growth and stickiness. Beware of Vanity Metrics (HBR) also points to the other pitfalls of relying on counting beans on the surface. Most likely, medium to large content publishers are moving away from, or adding metrics to, WordPress plugins or Google Analytics tools.
There is no denying that the next best thing for content analytics is data science. Yet for many content publishers this is often unattainable with their current data sets. Vanity metrics may be readily available but have less flexibility, especially when unique users and pageviews are not enough to demonstrate growth and stickiness. Beware of Vanity Metrics (HBR) also points to the other pitfalls of relying on counting beans on the surface. Most likely, medium to large content publishers are moving away from, or adding metrics to, WordPress plugins or Google Analytics tools.
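As a rough illustration of why aggregated counts fall short, the sketch below contrasts total pageviews with unique and returning users. The `(user_id, day)` log format, the sample rows, and the `summarize` helper are all hypothetical; real exports from analytics tools will look different.

```python
# Hypothetical pageview log: one (user_id, day) pair per pageview.
pageviews = [
    ("u1", "Mon"), ("u1", "Mon"), ("u1", "Tue"),
    ("u2", "Mon"),
    ("u3", "Tue"),
]

def summarize(pageviews):
    """Compare the aggregated pageview count with user-level metrics."""
    # Collect the distinct days on which each user was seen.
    days_per_user = {}
    for user, day in pageviews:
        days_per_user.setdefault(user, set()).add(day)

    # A user seen on more than one day is treated as "returning",
    # a simple stand-in for stickiness in this sketch.
    returning = sum(1 for days in days_per_user.values() if len(days) > 1)

    return {
        "pageviews": len(pageviews),
        "unique_users": len(days_per_user),
        "returning_users": returning,
    }

summary = summarize(pageviews)
```

With this sample log, five pageviews come from only three users, and just one of them came back on a second day, which the aggregated count alone would not reveal.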
The most important role of the leading data scientist and analytics person is to lay a solid foundation for the rest of the team.
There is high demand for enterprising data scientists and data professionals to pave the way for data in all types of organizations. Businesses have FOMO, or fear of missing out, and the frenzy for talent is exacerbated by the increasing number of data sources, third-party tools, and success stories. As businesses orient toward data-driven cultures, professionals have opportunities to reframe their experience, skills, and abilities to meet those needs.