While baby name articles are mandatory reading for soon-to-be parents, the U.S. Social Security Administration’s (SSA’s) Baby Names data set should be required reading for budding data scientists. The data set can be sliced and diced in many different ways, including language- and time-based methods, to answer creative questions. Other examples of frequency data sets are everywhere. For example, website analytics tools count the number of visits by unique users, retail point-of-sale systems count products sold by color, bankers track the number of loans defaulted in a month, and marketers survey customer satisfaction. This article is a feature engineering tutorial on frequency data sets. Here, the goal of feature engineering is to distill the characteristics of each name and capture the relationships between name trends in a matrix of values. Continue reading “Optimize Data Science Models with Feature Engineering: Cluster Analysis, Metrics Development, and PCA with Baby Names Data”
Money can’t buy love, but it improves your bargaining position – Christopher Marlowe
The quote from Elizabethan playwright and tragedian Christopher Marlowe was probably a comment on the state of love affairs in the 1500s. Money is a constant, elusive factor in politics and will forever complicate accountability. Money, power, and love (aka the popularity contest portion) push contenders to the top. Without the right balance of resources and publicly perceived integrity, no one wins any races, and if you don’t win races, the “friends” disappear. Ethics guidelines and reporting for elections and public officials attempt to light up the “influence exchange” (an analogy to a stock exchange).
As Open Data continues to flow from government offices, raising the promise of transparency and engagement, the public increasingly has to roll up its sleeves to review and evaluate the information. Here, every road to accountability is paved with good intentions and a significant number of (wo)man-hours. What, did you think insights would be handed to you?
This post adds to the evolution of the City of Los Angeles’ Open Data sets and breathes human readability and actionability into the information. My first post on Los Angeles’ Influence Exchange explored the contributions of clients to lobbying firms by industry, using a client-to-industry mapping that I created manually; read more at The City of Los Angeles Influence Exchange (12/2014). It was not especially surprising to discover in the 12/2014 post that Real Estate clients conducted the most influencing activities with local government leaders, and that is only on paper. In the context of land use and transportation policy, this makes sense, since land use decision-making (such as zoning, variances, and permits) is concentrated in local governments.
In this post, I added the locations, or project geocodes, of where influencing is occurring in the City of Los Angeles. The “CEC City Projects Agencies Lobbied by Registered” data set provides a well-populated “Location” field with the local street address of the project or area that the client paid to influence. I used the Location field to pull the latitude and longitude from the Google Geocoding API. Then, I leveraged the folium package (Python and Leaflet) to map the projects by year in the City of Los Angeles. Below is an interactive map of projects influenced in 2014. The point labels are project names or, when the project name was blank, the client last name. Coming soon: 2013 influenced project data points and a link to the code.
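The geocoding and mapping steps described above could be sketched roughly as follows. This is not the code behind the map in this post, only a minimal illustration. The column names “Project Name” and “Client Last Name”, the helper function names, and the approximate downtown map center are assumptions made for the example; only the “Location” field comes from the dataset description. The sketch builds the Geocoding API request URL and parses a response without making a network call, and the folium import sits inside `build_map` so the rest runs without folium installed.

```python
import json
from urllib.parse import urlencode

# Public JSON endpoint of the Google Geocoding API (an API key is required).
GEOCODE_ENDPOINT = "https://maps.googleapis.com/maps/api/geocode/json"


def geocode_url(address, api_key):
    """Build the request URL for one street address from the Location field."""
    return GEOCODE_ENDPOINT + "?" + urlencode({"address": address, "key": api_key})


def parse_lat_lng(response_text):
    """Return (lat, lng) from a Geocoding API JSON response, or None if no match."""
    payload = json.loads(response_text)
    if payload.get("status") != "OK" or not payload.get("results"):
        return None
    location = payload["results"][0]["geometry"]["location"]
    return location["lat"], location["lng"]


def point_label(row):
    """Label a point with the project name, or the client last name when blank.

    "Project Name" and "Client Last Name" are hypothetical column names.
    """
    name = (row.get("Project Name") or "").strip()
    return name if name else (row.get("Client Last Name") or "").strip()


def build_map(points, center=(34.05, -118.24)):
    """Plot (lat, lng, label) tuples on a Leaflet map; needs `pip install folium`.

    The default center is an approximate point in downtown Los Angeles.
    """
    import folium

    city_map = folium.Map(location=list(center), zoom_start=10)
    for lat, lng, label in points:
        folium.Marker(location=[lat, lng], popup=label).add_to(city_map)
    return city_map
```

In practice, each URL from `geocode_url` would be fetched with an HTTP client, the response passed to `parse_lat_lng`, and the resulting points handed to `build_map`, whose return value can be written to an HTML file with its `save` method. Rows that return `None` are worth reviewing by hand rather than dropping silently.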
Notice that project “locations” default to Los Angeles City Hall in downtown Los Angeles (zoom into downtown Los Angeles to see for yourself) and that, upon zooming out, there is one project location in Florida. It is not uncommon for outside individuals, companies, and organizations to spend money on city lobbying for a specific cause.
From the 2014 projects map, does there appear to be an imbalance of projects by geography? What is happening in your neighborhood? Location is a very important factor in influence, since the Real Estate industry is pouring significantly more money into interactions with public leaders. With just a little more dedicated digging, Open Data and visualization can be turned into a cause and a statistic or visual for advocacy, helping the public and decision-makers pinpoint outliers and patterns more quickly.
Here are the next steps for this data analysis, both as a note to self and for other Open Data wranglers:
- Describe projects by department (this field in the dataset is extra messy, since all relevant city departments lobbied are separated by |), for example by counting the times each city department is lobbied and then comparing the counts to department budgets and decisions
- Show how the projects being lobbied change over time (LA Open Data only provides 2 years of data)
- Map projects by category and influence contribution (this breakdown can lead to interesting metrics, such as influence by square mile or entitlements/permits by influence exchanged)
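The first next step above, counting how often each city department is lobbied, could start from a sketch like this one. The field name “Departments” and the sample values are made up for illustration; the only detail taken from the dataset is that departments are separated by |.

```python
from collections import Counter


def count_departments(rows, field="Departments"):
    """Count how many projects list each department in a |-separated field.

    `field` is a hypothetical column name; adjust it to the real dataset.
    """
    counts = Counter()
    for row in rows:
        raw = row.get(field) or ""
        # Split on the pipe, trim whitespace, and drop empty pieces.
        # A set ensures a department repeated within one project counts once.
        departments = {part.strip() for part in raw.split("|") if part.strip()}
        counts.update(departments)
    return counts


# Made-up sample rows, including a trailing pipe and a blank field.
sample_rows = [
    {"Departments": "Planning | Transportation"},
    {"Departments": "Planning|"},
    {"Departments": ""},
]
department_counts = count_departments(sample_rows)
```

The resulting counts can then be joined to department budgets by department name, which will likely require normalizing spelling variants first.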
Additional considerations for the LA: A Well Run City datasets as they relate to readability and accessibility for the public:
- Create common fields between related data sets so that the public may merge information within the Open Data portal
- Either clean each dataset or release a statement with data cleansing suggestions; this is the first step anyone needs to take when evaluating and reviewing any dataset
- Add “cleaned” dimensions and connections in the data that have value for the public
Read the first post in the LA Influence Exchange Series, here.
Rideshare services are changing in 2015. The high-demand service has also been high drama for the public, users, and the public sector. It doesn’t help that rideshare industry leaders are overly careless when sharing their views about women and journalism, and overly silent on liability in tragic accidents with drivers tied to their services. The public sector has started to respond to issues with the rideshare services, which are not covered under the current taxi regulations. It is almost certain that rideshare will be pulled under more regulation in 2015; how much is the question. As the services currently exist, there are pros and cons to both, no debate here.