One day on Twitter I saw a tweet linking to the Movement Advancement Project, which provides a tally of how supportive each state is of LGBT rights. I haven’t dug into the methodology behind their index, but it seemed like a good jumping-off point for some data analysis.
So, here’s what I wanted to do.
Work on my scraping skills, especially from websites.
Understand the best way to combine datasets that are formatted slightly differently.
Build a Shiny app to display my results.
I thought that would be a nice way to build my skills incrementally.
Data Cleaning
Try as I might, I really couldn’t find a good way to scrape the LGBT data from their website. Instead I did it the old-fashioned way and just typed it into an Excel sheet. It didn’t take too long, only about 150 entries. I also had some other data that I wanted to throw in there. One was a dataset from the ARDA, which provides county-level religious demography information. The other was the set of state-level political ideology scores that Richard Fording provides on his website.
Because the religious census data is at the county level, I need to aggregate to the state level instead.
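Here is a minimal sketch of that aggregation in base R. The data frame and its column names (`state`, `evangelical`, `population`) are made up for illustration; the real ARDA file uses its own variable names.

```r
# Toy county-level rows; values and column names are placeholders,
# not the actual ARDA data.
county <- data.frame(
  state       = c("AL", "AL", "AK", "AK"),
  evangelical = c(100, 250, 30, 20),
  population  = c(1000, 2000, 500, 500)
)

# Sum the county counts within each state.
state_totals <- aggregate(
  cbind(evangelical, population) ~ state,
  data = county,
  FUN  = sum
)

# Turn the summed counts into a state-level percentage.
state_totals$evangelical_pct <-
  100 * state_totals$evangelical / state_totals$population

print(state_totals)
```

Summing the raw counts first and computing the percentage afterwards avoids averaging county percentages, which would weight a tiny county the same as a huge one.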
Next I want to merge that with my LGBT data. Merging means that I have to create a “key” variable so that R can match the proper rows. Unfortunately that takes a little work: one dataset has the full state name, while the other only has the abbreviation. Luckily I found a function online that makes the necessary conversion.
Then I need to convert the other dataset to lower case letters and make the merge.
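A rough sketch of the key-building and merge steps follows. It is not the function I found online; it swaps in base R's built-in `state.abb` and `state.name` vectors for the abbreviation lookup. The data frames and their values are placeholders.

```r
# Placeholder data: one frame keyed by full name, one by abbreviation.
lgbt <- data.frame(
  state = c("Alabama", "Alaska", "Arizona"),
  tally = c(1, 2, 3)
)
relig <- data.frame(
  abbr            = c("AZ", "AL", "AK"),
  evangelical_pct = c(10, 20, 30)
)

# Look up each abbreviation's position in state.abb and pull the
# matching full name from state.name, then lower-case it.
relig$state <- tolower(state.name[match(relig$abbr, state.abb)])

# Lower-case the key in the other frame so both sides match exactly.
lgbt$state <- tolower(lgbt$state)

# merge() lines rows up by the shared key, whatever their order.
merged <- merge(lgbt, relig, by = "state")
print(merged)
```

Note that `state.abb` covers only the fifty states, so D.C. and Puerto Rico would come back as `NA` and need separate handling.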
Then I realized something. This data isn’t out of order. It’s just the states listed in alphabetical order. Using cbind is much simpler because it doesn’t need any of the conversion functions. The problem is that some of the datasets also include Washington, D.C. and/or Puerto Rico, so I need to make sure that each one contains just the fifty states.
What I did then was remove D.C. from both the main dataset (df) and the dataset that contains mainline Protestants, and then just used cbind. That’s much easier. Then I’m going to do the same with ideology.
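The drop-and-bind step might look something like this. The column names and values are invented for the example, and I've added a `stopifnot()` check because cbind matches rows purely by position, so the two frames must be in the same order.

```r
# Placeholder frames, both alphabetical and both containing D.C.
df <- data.frame(
  state = c("Alabama", "Alaska", "District of Columbia", "Wyoming"),
  tally = c(1, 2, 3, 4)
)
mainline <- data.frame(
  state        = c("Alabama", "Alaska", "District of Columbia", "Wyoming"),
  mainline_pct = c(10, 11, 12, 13)
)

# Drop D.C. from both so only states remain.
df       <- df[df$state != "District of Columbia", ]
mainline <- mainline[mainline$state != "District of Columbia", ]

# cbind() has no key, so confirm the rows line up before binding.
stopifnot(identical(df$state, mainline$state))

combined <- cbind(df, mainline_pct = mainline$mainline_pct)
print(combined)
```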
So, that’s a good start. Now, I want to add some election results. I found the 2012 presidential election results from the FEC website. Problem is, that data is a mess.
There are commas in there and it’s not numeric. I gotta strip all that out and convert it to numeric before I can cbind it to my main dataframe.
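The cleanup boils down to something like the following. The vote strings here are dummy values, not actual FEC figures.

```r
# Dummy vote totals stored as text with thousands separators.
votes_raw <- c("1,234,567", "89,012", "3,456")

# Remove every comma, then convert the remaining digits to numeric.
votes <- as.numeric(gsub(",", "", votes_raw))

print(votes)
print(is.numeric(votes))
```

If any entry contains something other than digits and commas, `as.numeric()` returns `NA` with a warning, which is a handy signal that more cleaning is needed.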
One last thing. I wanted to grab some state demographic information. I searched for a while before I realized that the best data just comes from Wikipedia tables. So I had to learn some webscraping.
To be honest, I just had to play around with the html_table command and change the table index until it actually grabbed the correct table. Luckily there weren’t too many tables, so it only took a few minutes. I cleaned the data and then used cbind to add it to my main dataframe.
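To show the trial-and-error with the table index, here is a small sketch using the rvest package. It parses an inline HTML snippet that I made up rather than the actual Wikipedia page, so it runs without a network connection; with a real page you would pass the URL to `read_html()` instead.

```r
library(rvest)

# A made-up page with two tables; only the second is the one we want.
html <- '<html><body>
<table><tr><th>Note</th></tr><tr><td>not the table we want</td></tr></table>
<table>
<tr><th>State</th><th>Population</th></tr>
<tr><td>Alabama</td><td>1,234,567</td></tr>
<tr><td>Alaska</td><td>89,012</td></tr>
</table>
</body></html>'

page <- read_html(html)

# html_table() on a set of <table> nodes returns one data frame per table.
tables <- html_table(html_elements(page, "table"))
length(tables)

# Change the index until the right table comes back.
demo <- as.data.frame(tables[[2]])

# Scraped numbers usually arrive as text with commas.
demo$Population <- as.numeric(gsub(",", "", demo$Population))
print(demo)
```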
So, now we have our data in a manageable format. Let’s do some mapping.
The South is not the best place to live if you are LGBT. The Pacific coast states are the most supportive, followed by New England. There is a surprising amount of support in the upper Midwest, including Illinois, Wisconsin, Iowa, and Minnesota.
Let’s do a little regressing to look for relationships.
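For anyone following along, the model is an ordinary linear regression fit with `lm()`. The sketch below uses simulated data and variable names I picked for illustration, so its coefficients say nothing about the real results.

```r
set.seed(1)
n <- 50

# Simulated stand-ins for the state-level predictors.
states <- data.frame(
  ideology        = runif(n, 0, 100),
  evangelical_pct = runif(n, 5, 50),
  hispanic_pct    = runif(n, 1, 40)
)

# Simulated outcome, built only so the example has something to fit.
states$tally <- 0.2 * states$ideology -
  0.1 * states$evangelical_pct + rnorm(n)

# Regress the tally on all three predictors at once.
fit <- lm(tally ~ ideology + evangelical_pct + hispanic_pct, data = states)
print(summary(fit))
```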
Just two things predict higher LGBT support. One is state ideology, which makes sense. The other is percent Hispanic. That may not be because of Hispanics specifically, but because Hispanics tend to migrate to states that are already liberal, such as California. One negative predictor is the number of evangelicals.
I wanted to make this into an interactive Shiny app. I had to make some conversions in my data and build a ui.R file along with a server.R file. Both can be found in my GitHub account.
The Shiny app can be accessed at https://ryanburge.shinyapps.io/Predicting_LGBT_Rights/. It can draw a regression line as well as report the correlation between the two variables chosen. Possible X variables include the overall tally, the sexual orientation tally, and the gender identity tally. The Y axis options include all the variables in the regression that I just displayed.
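To give a feel for how an app like this fits together, here is a stripped-down, single-file sketch. It is not the code from my ui.R and server.R files; the data are simulated, and the input IDs and variable names are placeholders.

```r
library(shiny)

set.seed(1)
# Simulated stand-in for the merged state-level data.
states <- data.frame(
  overall_tally   = rnorm(50),
  ideology        = runif(50, 0, 100),
  evangelical_pct = runif(50, 5, 50)
)

ui <- fluidPage(
  selectInput("x", "X variable", choices = "overall_tally"),
  selectInput("y", "Y variable", choices = c("ideology", "evangelical_pct")),
  plotOutput("scatter"),
  textOutput("correlation")
)

server <- function(input, output) {
  # Scatterplot of the chosen pair with a fitted regression line.
  output$scatter <- renderPlot({
    x <- states[[input$x]]
    y <- states[[input$y]]
    plot(x, y, xlab = input$x, ylab = input$y)
    abline(lm(y ~ x))
  })

  # Pearson correlation between the same two variables.
  output$correlation <- renderText({
    r <- cor(states[[input$x]], states[[input$y]])
    paste("Correlation:", round(r, 2))
  })
}

app <- shinyApp(ui, server)
# To launch it locally: runApp(app)
```

Splitting the same `ui` and `server` objects into ui.R and server.R gives the two-file layout that shinyapps.io also accepts.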