Introduction
One day on Twitter I saw a tweet linking to the Movement Advancement Project, which tallies how supportive each state is of LGBT rights. I haven't dug deeply into the methodology behind their index construction, but it seemed like a good jumping-off point for some data analysis.
So, here’s what I wanted to do.
- Work on my web scraping skills.
- Figure out the best way to combine datasets that are formatted slightly differently.
- Build a Shiny app to display my results.
I thought that would be a nice way to incrementally increase my ability.
Data Cleaning
Try as I might, I couldn't find a good way to scrape the LGBT data from their website. Instead I did it the old-fashioned way: plug and chug into an Excel sheet. It didn't take too long, only about 150 entries. I also had some other data I wanted to throw in. One is a dataset from the ARDA, which provides county-level religious demography; the other is Richard Fording's state-level political ideology scores, available on his website.
# foreign provides read.dta(); the other packages are used further down.
library(foreign); library(dplyr); library(rvest)
library(ggplot2); library(choroplethr); library(viridis)
ideo <- read.csv("D:/state_ideo.csv", stringsAsFactors = FALSE)
lgbt <- read.csv("D:/lgbt/lgbt.csv", stringsAsFactors = FALSE)
census <- read.dta("D:/relcensus.dta", convert.factors = FALSE)
Because the religious census data is at the county level, I need to aggregate to the state level instead.
evan <- aggregate(census$evanrate, list(census$stname), mean, na.rm = TRUE)
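Since dplyr's select() gets used further down anyway, the same state-level aggregation can also be written with group_by()/summarise(). A sketch with a toy stand-in for the ARDA county file (the real one is loaded above as census):

```r
library(dplyr)

# Toy stand-in for the county-level ARDA file: two counties per state.
census_toy <- data.frame(
  stname   = c("Alabama", "Alabama", "Alaska", "Alaska"),
  evanrate = c(400, 450, 100, NA)
)

# Same idea as the aggregate() call: mean evangelical adherence rate
# per state, ignoring counties with missing values.
evan_toy <- census_toy %>%
  group_by(stname) %>%
  summarise(evanrate = mean(evanrate, na.rm = TRUE))
```

Either form works; the dplyr version just keeps the column names readable instead of producing `Group.1` and `x`.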
So I want to merge that with my LGBT data. Merging requires a "key" variable so that R can match the proper rows. Unfortunately, that takes a little work: one dataset has full state names, while the other only has abbreviations. Luckily, I found a function online that makes the necessary conversion.
stateFromLower <- function(x) {
  # Read 52 state codes into a local variable (includes DC and PR)
  st.codes <- data.frame(
    state = as.factor(c("AK", "AL", "AR", "AZ", "CA", "CO", "CT", "DC", "DE", "FL", "GA",
                        "HI", "IA", "ID", "IL", "IN", "KS", "KY", "LA", "MA", "MD", "ME",
                        "MI", "MN", "MO", "MS", "MT", "NC", "ND", "NE", "NH", "NJ", "NM",
                        "NV", "NY", "OH", "OK", "OR", "PA", "PR", "RI", "SC", "SD", "TN",
                        "TX", "UT", "VA", "VT", "WA", "WI", "WV", "WY")),
    full = as.factor(c("alaska", "alabama", "arkansas", "arizona", "california", "colorado",
                       "connecticut", "district of columbia", "delaware", "florida", "georgia",
                       "hawaii", "iowa", "idaho", "illinois", "indiana", "kansas", "kentucky",
                       "louisiana", "massachusetts", "maryland", "maine", "michigan", "minnesota",
                       "missouri", "mississippi", "montana", "north carolina", "north dakota",
                       "nebraska", "new hampshire", "new jersey", "new mexico", "nevada",
                       "new york", "ohio", "oklahoma", "oregon", "pennsylvania", "puerto rico",
                       "rhode island", "south carolina", "south dakota", "tennessee", "texas",
                       "utah", "virginia", "vermont", "washington", "wisconsin",
                       "west virginia", "wyoming"))
  )
  # Create an nx1 data.frame of state codes from the source column
  st.x <- data.frame(state = x)
  # Match the source codes against 'st.codes' and look up the full state name
  refac.x <- st.codes$full[match(st.x$state, st.codes$state)]
  # Return the full state names in the same order as the original input
  return(refac.x)
}
Then I need to convert the other dataset to lowercase and do the merge.
lgbt$statename<-stateFromLower(lgbt$state)
evan$statename <- tolower(evan$Group.1)
df <- merge(lgbt, evan, by=c("statename"))
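One thing worth knowing about merge(): by default it silently drops rows whose keys fail to match, so a typo in a single state name quietly shrinks the data. A toy illustration (the data frames here are hypothetical, not the real lgbt/evan objects):

```r
# Hypothetical mini versions of the two data frames being merged.
lgbt_toy <- data.frame(statename = c("alabama", "alaska", "arizona"), overall = 1:3)
evan_toy <- data.frame(statename = c("alabama", "arizona"), evanrate = c(400, 90))

merged <- merge(lgbt_toy, evan_toy, by = "statename")
nrow(merged)                                      # 2: alaska was silently dropped
setdiff(lgbt_toy$statename, evan_toy$statename)   # reveals the unmatched key
```

Checking `nrow()` before and after the real merge (and `setdiff()` on the key columns) is a cheap way to catch this.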
Then I realized something: this data isn't out of order, it's just the states listed alphabetically. cbind is much simpler because it doesn't require all the conversion functions. The catch is that some datasets also include Washington, D.C. and/or Puerto Rico, so I need to make sure each one contains just the fifty states.
df <- df[-c(9),]  # drop row 9, the District of Columbia
mainline <- aggregate(census$mprtrate, list(census$stname), mean, na.rm = TRUE)
mainline <- mainline[-c(9),]  # drop D.C. here too so the rows line up
mainline$mainline <- mainline$x
mainline <- select(mainline, mainline)
df <- cbind(df, mainline)
What I did there was remove D.C. from both the main dataset (df) and the dataset that contains mainline Protestants, then just cbind them. Much easier. Now I'll do the same with ideology.
ideo <- subset(ideo, year==2014)
ideo$ideology <- ideo$inst6014_nom
ideo <- select(ideo, ideology)
df <- cbind(df, ideo)
df <- select(df, state, orientation, identity, overall, totalpop, popper, x, mainline, ideology)
df$evanrate <- df$x
df$x <- NULL
head(df)
## state orientation identity overall totalpop popper mainline ideology
## 1 AL 2.50 -2.50 0.00 103533 2.8 68.46564 19.058900
## 2 AK 4.25 0.75 5.00 18505 3.4 76.97126 35.443470
## 3 AZ 5.00 -1.75 3.25 192362 3.9 19.38156 3.017494
## 4 AR 4.00 -1.50 2.50 78339 3.5 59.32496 48.899120
## 5 CA 16.00 14.25 30.25 1152048 4.0 23.81355 87.428050
## 6 CO 13.50 9.25 22.75 126599 3.2 63.32743 72.440340
## evanrate
## 1 428.40486
## 2 106.61226
## 3 87.28141
## 4 406.57587
## 5 88.83111
## 6 138.46477
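A caveat on the cbind approach: it matches rows purely by position, so if either data frame is sorted differently the wrong state's numbers get attached with no warning. One defensive pattern is a small wrapper that checks the key columns first; safe_cbind below is a hypothetical helper, not something from the original scripts:

```r
# Bind the columns of b onto a, but only if their key columns list the
# same values in the same order (case-insensitive). Stops with an error
# instead of silently misaligning rows.
safe_cbind <- function(a, b, key_a, key_b) {
  stopifnot(identical(tolower(as.character(a[[key_a]])),
                      tolower(as.character(b[[key_b]]))))
  cbind(a, b[setdiff(names(b), key_b)])
}
```

Something like `safe_cbind(df, mainline, "statename", "Group.1")` would then fail loudly if the row orders ever diverged.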
So, that's a good start. Now I want to add some election results. I found the 2012 presidential election results on the FEC website. The problem is that the data is a mess.
elect <- read.csv("D:/lgbt/elec12.csv", stringsAsFactors = FALSE)
head(elect)
## X obama romney others total
## 1 AL 795,696 1,255,925 22,717 2,074,338
## 2 AK 122,640 164,676 13,179 300,495
## 3 AZ 1,025,232 1,233,654 40,368 2,299,254
## 4 AR 394,409 647,744 27,315 1,069,468
## 5 CA 7,854,285 4,839,958 344,304 13,038,547
## 6 CO 1,323,102 1,185,243 61,177 2,569,522
There are commas in the numbers, so the columns aren't numeric. I have to strip those out and convert before I can cbind it to my main data frame.
# Strip the thousands separators and convert each vote column to numeric.
for (col in c("obama", "romney", "others", "total")) {
  elect[[col]] <- as.numeric(gsub(",", "", elect[[col]]))
}
elect$obama_share <- elect$obama/elect$total
elect$romney_share <- elect$romney/elect$total
elect <- elect[-c(9),]  # drop D.C. again so the rows line up with df
elect <- select(elect, obama_share, romney_share)
df <- cbind(df, elect)
One last thing: I wanted to grab some state demographic information. I searched for a while before realizing that the best data just comes from Wikipedia tables, so I had to learn some web scraping.
html <- read_html("https://en.wikipedia.org/wiki/List_of_U.S._states_by_African-American_population")
aa <- html_table(html_nodes(html, "table")[[3]])  # the third table holds the state-level figures
aa$percent_AA <- as.numeric(sub("%", "", aa$`% African-American`))
aa <- select(aa, percent_AA)
df <- cbind(df, aa)
html <- read_html("https://en.wikipedia.org/wiki/List_of_U.S._states_by_Hispanic_and_Latino_population")
hisp <- html_table(html_nodes(html, "table")[[4]])
hisp$percent_hisp <- hisp$`2012`
# Trim the non-state rows (national total, D.C., Puerto Rico) so only
# the fifty states remain.
hisp <- hisp[-c(1),]
hisp <- hisp[-c(9),]
hisp <- hisp[-c(51),]
hisp <- select(hisp, percent_hisp)
hisp$percent_hisp <- as.numeric(sub("%", "", hisp$percent_hisp))
df <- cbind(df, hisp)
head(df)
## state orientation identity overall totalpop popper mainline ideology
## 1 AL 2.50 -2.50 0.00 103533 2.8 68.46564 19.058900
## 2 AK 4.25 0.75 5.00 18505 3.4 76.97126 35.443470
## 3 AZ 5.00 -1.75 3.25 192362 3.9 19.38156 3.017494
## 4 AR 4.00 -1.50 2.50 78339 3.5 59.32496 48.899120
## 5 CA 16.00 14.25 30.25 1152048 4.0 23.81355 87.428050
## 6 CO 13.50 9.25 22.75 126599 3.2 63.32743 72.440340
## evanrate obama_share romney_share percent_AA percent_hisp
## 1 428.40486 0.3835903 0.6054582 37.30 4.1
## 2 106.61226 0.4081266 0.5480158 32.40 6.1
## 3 87.28141 0.4458977 0.5365453 31.40 30.2
## 4 406.57587 0.3687899 0.6056694 30.10 6.8
## 5 88.83111 0.6023896 0.3712038 28.48 38.2
## 6 138.46477 0.5149215 0.4612698 26.38 21.0
To be honest, I just played around with the html_table index and changed the number until it grabbed the correct table. Luckily there weren't too many tables, and it only took a few minutes. I cleaned the data and then used cbind to get it into the correct format.
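A slightly less trial-and-error approach is to pull every table on the page and inspect their dimensions before picking one. The snippet below sketches the idea on a tiny inline page rather than the live Wikipedia article, so the markup is illustrative only:

```r
library(rvest)

# Tiny stand-in for a Wikipedia page containing several tables.
page <- read_html('
  <table><tr><th>Note</th></tr><tr><td>not the one we want</td></tr></table>
  <table><tr><th>State</th><th>Percent</th></tr>
         <tr><td>Alabama</td><td>26.8%</td></tr>
         <tr><td>Alaska</td><td>3.9%</td></tr></table>')

tables <- html_table(html_nodes(page, "table"))
sapply(tables, dim)                         # rows x cols for each candidate
idx <- which(sapply(tables, ncol) == 2)[1]  # pick the first two-column table
tab <- tables[[idx]]
```

On the real pages, filtering for tables with about 50 rows would home in on the state-level table without guessing indices.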
So, now we have our data in a manageable format. Let's do some mapping.
# choroplethr's StateChoropleth expects a 'region' column of full state
# names and a 'value' column to plot.
df$region <- stateFromLower(df$state)
df$value <- df$overall
choro <- StateChoropleth$new(df)
choro$title <- "Support for LGBT Rights"
choro$set_num_colors(1)
choro$ggplot_polygon <- geom_polygon(aes(fill = value), color = NA)
choro$ggplot_scale <- scale_fill_gradientn(name = "Overall Tally", colours = viridis(32))
choro$render()
The South is not the best place to live if you are LGBT. The Pacific coast states are the most supportive, followed by New England. There is a surprising amount of support in the Midwest, including Illinois, Wisconsin, Iowa, and Minnesota.
Let’s do a little regressing to look for relationships.
df$region <- NULL
df$value <- NULL
reg1 <- lm(overall ~ totalpop + popper + ideology + evanrate + mainline + obama_share + romney_share + percent_AA + percent_hisp, data = df)
summary(reg1)
##
## Call:
## lm(formula = overall ~ totalpop + popper + ideology + evanrate +
##     mainline + obama_share + romney_share + percent_AA + percent_hisp,
##     data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.8023 -2.4612 0.2235 2.4692 6.7007
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.377e+01 9.576e+01 0.144 0.8864
## totalpop 1.370e-06 4.082e-06 0.336 0.7389
## popper 1.204e+00 1.212e+00 0.994 0.3263
## ideology 1.569e-01 3.119e-02 5.031 1.07e-05 ***
## evanrate -1.636e-02 6.836e-03 -2.393 0.0215 *
## mainline 6.294e-04 1.020e-02 0.062 0.9511
## obama_share -3.001e+00 9.479e+01 -0.032 0.9749
## romney_share -2.349e+01 9.946e+01 -0.236 0.8145
## percent_AA -5.460e-02 6.923e-02 -0.789 0.4350
## percent_hisp 2.126e-01 8.548e-02 2.487 0.0172 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.414 on 40 degrees of freedom
## Multiple R-squared: 0.832, Adjusted R-squared: 0.7943
## F-statistic: 22.02 on 9 and 40 DF, p-value: 7.719e-13
Just two things predict higher LGBT support: state ideology, which makes sense, and percent Hispanic. The latter may not be about Hispanics specifically but rather that Hispanics tend to migrate to states that are already liberal, such as California. The one negative predictor is the evangelical adherence rate.
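One caution when reading the coefficient table: the predictors live on wildly different scales (totalpop is in persons, obama_share runs from 0 to 1), so the raw estimates aren't comparable in magnitude. Refitting on standardized predictors is one common fix; standardized_fit is a hypothetical helper sketched here, not part of the original analysis:

```r
# Refit a model after scaling every numeric column to mean 0, sd 1, so
# the coefficients are in standard-deviation units and roughly comparable.
standardized_fit <- function(data, formula) {
  num <- sapply(data, is.numeric)
  data[num] <- lapply(data[num], function(x) as.numeric(scale(x)))
  lm(formula, data = data)
}
```

Something like `summary(standardized_fit(df, overall ~ ideology + evanrate + percent_hisp))` would then show which of the significant predictors moves the tally most per standard deviation.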
I wanted to make this into an interactive Shiny app. I had to make some conversions in my data and build a ui.R file, along with a server.R file. Those can be found in my Github account.
The Shiny app can be accessed at https://ryanburge.shinyapps.io/Predicting_LGBT_Rights/. It can draw a regression line as well as report the correlation between the two chosen variables. Possible X variables include the overall tally, the sexual orientation tally, and the gender identity tally; the Y axis offers all the variables from the regression I just displayed.
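The real ui.R and server.R live in the Github repo, but the core pattern is compact. Here is a minimal single-file sketch of the same idea; the input choices and column names are assumed to match the data frame built above:

```r
library(shiny)
library(ggplot2)

# Pick an X and a Y column from df, draw a scatterplot with a fitted
# regression line, and report the correlation between the two.
ui <- fluidPage(
  selectInput("x", "X variable", c("overall", "orientation", "identity")),
  selectInput("y", "Y variable", c("ideology", "evanrate", "obama_share")),
  plotOutput("scatter"),
  textOutput("corr")
)

server <- function(input, output) {
  output$scatter <- renderPlot({
    ggplot(df, aes(.data[[input$x]], .data[[input$y]])) +
      geom_point() +
      geom_smooth(method = "lm")
  })
  output$corr <- renderText({
    paste("Correlation:", round(cor(df[[input$x]], df[[input$y]]), 3))
  })
}

# shinyApp(ui, server)  # launch the app interactively
```

Splitting this into ui.R and server.R, as the deployed app does, is purely organizational; the objects are the same.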