Data science can detect anomalies such as fraud and security breaches. Can it also be used to find sneaky political practices? Quantifying and featurizing the “messy” world of politics can help reveal order and truth. Politics is one of the most important yet least interesting arenas to the public, who have everything to gain and lose from everyday decisions. Note: This post was written before Trump was elected to office…
Legislative bills help shape public policy, so when and how laws change affects government functions such as spending, grant making, and programming. Bill success is not the best or only way to keep score; however, bills are one of the few tangible and public ways to observe the moving and shaking in politics. How else can the public keep track of elected officials and civil servants? What analytics and metrics do citizens have to evaluate politicians' work? This post explores topics in California legislative bills.
A major obstacle to people becoming interested in politics, laws, and bills is the massive effort required to stay current and maintain a good working knowledge of history. Using models to predict legislative success is an attempt to distill complicated information into digestible pieces and to see what the models say about political behavior. A complete presentation on the what, when, where, and why of this experiment is available on SlideShare. The presentation also links to specific bills by topic from various sessions between 2009 and 2014.
The steps for creating legislative predictions:
- Collect data from the Sunlight Foundation API and other open data sources
- Scrape and clean text from legislative bills, including removing HTML and stop words and separating out the target variable (i.e. bill passage)
- Extract features from text in Python
- Build topics from text using Latent Dirichlet Allocation (LDA), a probabilistic approach
- Implement supervised learning models
- Analyze results
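As a rough illustration of the cleaning step in the list above, here is a minimal sketch that strips HTML, lowercases the text, and drops stop words using only the Python standard library. The function name `clean_bill_text`, the tiny `STOP_WORDS` set, and the sample sentence are all made up for illustration; the original pipeline may well have used different tools and a much longer stop-word list.

```python
import re
from html import unescape

# Hypothetical, deliberately tiny stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "be", "shall", "for", "or", "by"}

TAG_RE = re.compile(r"<[^>]+>")    # matches HTML tags such as <p> or </b>
TOKEN_RE = re.compile(r"[a-z]+")   # keeps runs of letters only


def clean_bill_text(raw_html):
    """Strip HTML tags, lowercase, tokenize, and drop stop words."""
    text = TAG_RE.sub(" ", raw_html)   # remove tags first
    text = unescape(text).lower()      # then decode entities such as &amp;
    tokens = TOKEN_RE.findall(text)
    return [t for t in tokens if t not in STOP_WORDS]


# Made-up sample text, not taken from a real bill.
sample = "<p>The Department of Education shall fund <b>local</b> schools &amp; libraries.</p>"
tokens = clean_bill_text(sample)
print(tokens)
```

The resulting token lists are the kind of input a topic model expects: one list of words per bill.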
California bills from 2009 to 2014 were loaded from the Sunlight Foundation's Open States API into MongoDB, including details and actions for 13,569 bills. The count of total topics appearing in any legislative document is shown below, with the 10 most frequent topics listed to the right. Some topics stand out; topic 48, for example, seems to appear in every bill. Additional breakdown of CA bills and topics:
- Bills have an average of 6.57 topics, ranging from 2 to 16.
- Passage rate by topic ranged from 18% to 36%, averaging 28% for all bills in the database
- The most frequent topics of legislation relate to local government funding/taxes/leadership initiatives, health care, education, budget and taxes, and the court system
- The highest and lowest passage rates by topic are shown for the top 10 and bottom 10 topics in the following tables, along with odds ratios from the logistic regression model:
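For readers who want to see how summary numbers like these can be derived, here is a small sketch using hypothetical data. The `bills` records, the topic numbers in them, and the helper names are invented for illustration and are not the post's actual data or code. The one fixed fact is that a logistic regression coefficient is a log-odds value, so exponentiating it gives an odds ratio.

```python
import math
from collections import defaultdict

# Hypothetical records: (set of topics present in the bill, whether it passed).
bills = [
    ({3, 48}, True),
    ({3, 12, 48}, False),
    ({12, 48}, False),
    ({3, 48}, True),
    ({12, 27, 48}, False),
]


def passage_rate_by_topic(records):
    """Return {topic: share of bills containing that topic that passed}."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for topics, did_pass in records:
        for topic in topics:
            total[topic] += 1
            passed[topic] += int(did_pass)
    return {t: passed[t] / total[t] for t in total}


def odds_ratio(coefficient):
    """Convert a logistic regression coefficient (log-odds) to an odds ratio."""
    return math.exp(coefficient)


rates = passage_rate_by_topic(bills)
avg_topics = sum(len(topics) for topics, _ in bills) / len(bills)

for topic, rate in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
    print(f"topic {topic}: passage rate {rate:.0%}")
print(f"average topics per bill: {avg_topics}")
```

An odds ratio above 1 means a topic is associated with higher odds of passage, and one below 1 with lower odds.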
Overall, predicting the success or failure of California bills based solely on topics does better than guessing, with an average accuracy of 63%. Broken down by outcome, accuracy is higher for predicting failures than for predicting successes. With LDA topic models and logistic regression, we can guess with 70% accuracy which bills will fail and with 47% accuracy which bills will succeed. Given that passing legislation is difficult, this suggests that any model needs more features to capture this complex process.
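To make the relationship between the per-outcome and overall figures concrete, here is a small sketch. The label lists are hypothetical, and the helper names `per_class_accuracy` and `blended_accuracy` are invented for illustration. The last calculation assumes the evaluation set had the same 28% passage rate as the full database, which the post does not state; under that assumption, 0.72 × 0.70 + 0.28 × 0.47 comes out to about 0.64, close to the reported 63%.

```python
def per_class_accuracy(y_true, y_pred):
    """Return (overall, accuracy on failed bills, accuracy on passed bills).

    Labels: 1 = bill passed, 0 = bill failed.
    """
    def accuracy(pairs):
        pairs = list(pairs)
        if not pairs:
            return None  # no bills of this class to score
        return sum(t == p for t, p in pairs) / len(pairs)

    pairs = list(zip(y_true, y_pred))
    overall = accuracy(pairs)
    fail_acc = accuracy((t, p) for t, p in pairs if t == 0)
    pass_acc = accuracy((t, p) for t, p in pairs if t == 1)
    return overall, fail_acc, pass_acc


def blended_accuracy(fail_acc, pass_acc, pass_rate):
    """Overall accuracy implied by per-class accuracies and the share of passed bills."""
    return (1 - pass_rate) * fail_acc + pass_rate * pass_acc


# Hypothetical labels and predictions for ten bills.
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]
overall, fail_acc, pass_acc = per_class_accuracy(y_true, y_pred)

# Figures from the post, with the 28% passage rate assumed to hold for the test set.
implied = blended_accuracy(0.70, 0.47, 0.28)
print(overall, fail_acc, pass_acc, round(implied, 4))
```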
First, a Gensim topic model is trained on the entire corpus of legislation. Each bill is then assigned a probability for each topic, and these probabilities are used as input features in a logistic regression model. Lists of topics that strengthen and weaken the chances of CA bill passage are provided below. Each numbered LDA topic is represented by a combination of words and phrases. It is up to the users of this data to assign names or types to the topics.
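The second stage can be sketched without any external libraries. The example below is not the code used for this analysis: it swaps in a hand-written gradient-descent logistic regression for whatever library was actually used, and the topic-probability rows in `X` are hypothetical stand-ins for the per-bill output of a trained LDA model. It only shows how per-topic probabilities can serve as features for predicting passage.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n, n_features = len(X), len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for row, label in zip(X, y):
            # Prediction error for this bill.
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, row)) + b) - label
            for j, xj in enumerate(row):
                grad_w[j] += err * xj
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_proba(w, b, row):
    """Estimated probability that a bill with this topic mix passes."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, row)) + b)


# Hypothetical topic probabilities for six bills over three topics (rows sum to 1).
X = [
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.6, 0.1, 0.3],
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.6, 0.3],
]
y = [1, 1, 1, 0, 0, 0]  # 1 = passed, 0 = failed

w, b = fit_logistic(X, y)
print([round(wi, 2) for wi in w])
print(round(predict_proba(w, b, [0.9, 0.05, 0.05]), 2))
```

In this toy data the first topic ends up with a positive weight and the second with a negative one, which is the sense in which a topic "strengthens" or "weakens" passage.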
This is a decent start toward a robust model for predicting legislative outcomes. There are many more features available to visualize, test, and clean. Next steps for this analysis are:
- Add time context for bills in terms of legislative sessions, chambers, and major political events
- Add features about the bill, sponsors, districts, political context, duration, committees, and public comments
- Include exploratory data analysis from bill and legislator data
- Tune the model to apply predictions to current bills