While baby name articles are mandatory reading for soon-to-be parents, the U.S. Social Security Administration’s (SSA’s) Baby Names data set should be required reading for budding data scientists. The data set can be sliced and diced in many different ways, including language- and time-based methods, to answer creative questions. Other examples of frequency data sets are everywhere. For example, website analytics tools count the number of visits by unique users, retail point-of-sale systems count products sold by color, bankers track the number of loans defaulted in a month, and marketers survey customer satisfaction. This article is a feature engineering tutorial on frequency data sets. Here, the goal of feature engineering is to distill the characteristics of each name and capture relationships between name trends in a matrix of values. Continue reading “Optimize Data Science Models with Feature Engineering: Cluster Analysis, Metrics Development, and PCA with Baby Names Data”
Data science can detect anomalies, such as fraud and security breaches. Can it also be used to find sneaky political practices? Quantifying and featurizing the “messy” world of politics can elucidate order and truth. Politics is one of the most important yet least interesting places to the public, who have everything to gain and lose from everyday decisions. Note: This post was written before Trump was elected to office…
We should all be sitting on the edge of our seats for the next couple of months. Change is inevitable, but the change agent may be questionable. To get psyched for this last stretch before the elections, I apply natural language processing (NLP) to this week’s first presidential debate, with a focus on polarity in sentiment. Visualizations in this post include an interactive candidate polarity graph and word clouds. Continue reading “Clinton v. Trump: Candidate Sentiment from the 1st Presidential Debate”
There is no denying that data science helps with online content, but for most content publishers it is often unattainable with their current data sets. Vanity metrics may be readily available, but they are less flexible. For instance, vanity metrics may not be able to tell you about unique users, and aggregated pageview counts are generally not enough to demonstrate growth and stickiness. Beware of Vanity Metrics (HBR) points to the other pitfalls of relying on counting beans on the surface. Most likely, medium to large content publishers are moving from, or adding metrics to, WordPress plugins or Google Analytics tools.
There is no denying that the next best thing for content analytics is data science. Yet for many content publishers this is often unattainable with their current data sets. Vanity metrics may be readily available but have less flexibility, especially when unique users and pageviews are not enough to demonstrate growth and stickiness. Beware of Vanity Metrics (HBR) points to the other pitfalls of relying on counting beans on the surface. Most likely, medium to large content publishers are moving from, or adding metrics to, WordPress plugins or Google Analytics tools.
It’s not news that there has been a nationwide hike in crime across the United States, including Los Angeles (see the NPR episode on LAPD). Inequality and crimes against fellow humans are disheartening and can often seem impossible to resolve. Open data provides one resource for viewing seemingly impossible problems and collaborating on solutions. In this analysis, LA city domestic violence counts are viewed as a time series and compared by area of the city.
This analysis is a reaction to the Kansas City Star article “Asian-Americans narrow wealth gap, new studies show,” which oversimplifies income and race trends. It aggregates “Asian-Americans” into a single group and tells the story of averages. This is not uncommon in major coverage of demographics and Asian Americans. To demonstrate the issues around disaggregation, data from the U.S. Census dataset in the UCI Machine Learning Repository, here and here, are compared with the findings of a St. Louis (STL) Federal Reserve paper on The Demographics of Wealth. Demographic data aggregation tells the wrong story of income and race in the United States. There are cases where metrics should be aggregated, but in those cases the advantages must be laid out.
Money can’t buy love, but it improves your bargaining position – Christopher Marlowe
This quote from the Elizabethan tragedian and playwright Christopher Marlowe was probably a comment on the state of love affairs in the 1500s. Money is a constant factor in politics and will forever complicate accountability. Money, power, and love (aka the popularity contest portion) push contenders to the top. Without the right balance of resources and publicly perceived integrity, no one wins any races, and if you don’t win races, the “friends” disappear. Ethics guidelines and reporting for elections and public officials attempt to light up the “influence exchange” (an analogy to a stock exchange).
As Open Data continues to flow from governmental offices, raising the promise of transparency and engagement, the public increasingly has to roll up its sleeves to review and evaluate the information. Here, every road to accountability is paved with good intentions and a significant number of (wo)man-hours. What? Did you think insights would be handed to you?
This post adds to the evolution of the City of Los Angeles’ Open Data sets and breathes human readability and actionability into the information. My first post on Los Angeles’ Influence Exchange explored the contributions of clients to lobbying firms by industry, using a mapping from client to industry that I created manually; read more at The City of Los Angeles Influence Exchange (12/2014). It was not especially surprising to discover in the 12/2014 post that Real Estate clients conducted the most influencing activities with local government leaders. AND that is only on paper. In the context of land use and transportation policy, this makes sense, since land use decision-making (such as zoning, variances, and permits) is concentrated in local governments.
In this post, I add the locations, or project geocodes, of where influencing is occurring in the City of Los Angeles. The “CEC City Projects Agencies Lobbied by Registered” dataset provides a well-populated “Location” field, with the local street address of the project or area that the client paid to influence. The Location field was used to pull the latitude and longitude from the Google Geocoding API. Then, I leveraged the folium package (Python and Leaflet) to map the projects by year in the City of Los Angeles. Below is an interactive map of projects influenced in 2014. The labels of the points are project names or, when the project name was blank, the “client last name.” Coming soon: 2013 influenced project data points and a link to the code.
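The geocode-then-map workflow described above can be sketched roughly as follows. This is a minimal illustration, not the code behind the map in this post: the field names `project_name`, `client_last_name`, `lat`, and `lng` are hypothetical, the API key is a placeholder, and the map center is only an approximate location for Los Angeles City Hall.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Public JSON endpoint of the Google Geocoding API (requires an API key).
GEOCODE_URL = "https://maps.googleapis.com/maps/api/geocode/json"


def build_geocode_url(address, api_key):
    """Build the request URL for one street address."""
    return GEOCODE_URL + "?" + urlencode({"address": address, "key": api_key})


def parse_lat_lng(payload):
    """Return (lat, lng) from a Geocoding API response dict, or None if no match."""
    if payload.get("status") != "OK" or not payload.get("results"):
        return None
    location = payload["results"][0]["geometry"]["location"]
    return location["lat"], location["lng"]


def geocode(address, api_key):
    """Look up one address over the network (not exercised offline)."""
    with urlopen(build_geocode_url(address, api_key)) as response:
        return parse_lat_lng(json.load(response))


def label_for(row):
    """Use the project name, falling back to the client last name when blank."""
    return row.get("project_name") or row.get("client_last_name", "")


def build_map(rows, center=(34.0537, -118.2427)):
    """Plot geocoded rows with folium; center is an approximate City Hall location."""
    import folium  # third-party package: pip install folium

    project_map = folium.Map(location=list(center), zoom_start=10)
    for row in rows:
        folium.Marker([row["lat"], row["lng"]], popup=label_for(row)).add_to(project_map)
    return project_map
```

Calling `build_map(rows).save("projects_2014.html")` would write a standalone interactive page, where `rows` is a list of dicts that already carry the geocoded `lat` and `lng` values. Rows for which `geocode` returns `None` would need to be dropped or reviewed by hand before mapping.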
Notice that the project “location” defaults to Los Angeles City Hall in downtown Los Angeles (zoom into downtown Los Angeles to see for yourself) and that, upon zooming out, there is one project location in Florida. It is not uncommon for outside individuals, companies, and organizations to spend money on city lobbying for a specific cause.
From the 2014 projects map, does there appear to be an imbalance of projects by geography? What is happening in your neighborhood? Location is a very important factor in influence, since the Real Estate industry is pouring significantly more money into interactions with public leaders. With just a little more dedicated digging, Open Data and visualization can be turned into a cause and a statistic or visual for advocacy, helping the public and decision-makers pinpoint outliers and patterns more quickly.
Here are the next steps for this data analysis, as a note to self and to other Open Data wranglers:
- Describe projects by department (this field in the dataset is extra messy, since all relevant city departments lobbied are separated by |); for example, count the times each city department is lobbied, then compare the counts to department budgets and decisions
- Show how the projects being lobbied change over time (LA Open Data only provides 2 years of data)
- Map projects by category and influence contribution (this breakdown can lead to interesting metrics, such as influence by square mile or entitlements/permits by influence exchanged)
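For the first step above, counting how often each department is lobbied mostly comes down to splitting the pipe-separated field. Here is a minimal sketch using only the standard library; the field name `departments` and the sample values are invented for illustration and are not taken from the actual dataset.

```python
from collections import Counter


def count_departments(rows, field="departments", sep="|"):
    """Count how many rows (projects) mention each department."""
    counts = Counter()
    for row in rows:
        raw = row.get(field) or ""
        # Strip whitespace, skip empty pieces left by stray separators, and
        # use a set so a department repeated within one row counts once.
        names = {part.strip() for part in raw.split(sep) if part.strip()}
        counts.update(names)
    return counts


# Invented sample rows, including a trailing separator and a blank field.
sample_rows = [
    {"departments": "City Council | Planning"},
    {"departments": "Planning|"},
    {"departments": ""},
]
department_counts = count_departments(sample_rows)
```

Counting each department once per row is a design choice. If repeated mentions within a single project should count separately, swap the set for a list. The resulting counts could then be joined to department budgets by name, once the names are cleaned to match.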
Additional considerations for the LA: A Well Run City datasets, as they relate to readability and accessibility for the public:
- Create common fields between related data sets so that the public may merge information within the Open Data portal
- Either clean each dataset or release a statement about it with data cleansing suggestions; this is the first step anyone needs to take when evaluating and reviewing any dataset
- Add “cleaned” dimensions and connections in the data that have value for the public
Read the first post in the LA Influence Exchange Series, here.
*Social media* has many layers and, in the business and data senses, it is growing up nicely. Social sharing platforms provide developer access to their data, such as membership interactions and status updates, which can come as emotional outpourings, diatribes, celebrations, and affirmations. Armed with time (the most difficult thing to come by on this list), Stack Overflow, reference materials, and an open source coding tool, anyone can quickly #oneup their *social media listening* skills. Not a bad skill to flaunt, since positions managing and creating content on social media are increasing and relevant in every sector and job function. Now, adding the third word, listening, gives social media scouring, participating, and downloading another lift in professionalism.