We are going to run an OLS regression to predict support for Governor Bruce Rauner; that is our dependent variable. Then we are going to have a number of independent variables, including party identification, gender, union membership, region of the state, age, and income.
Before we get to the regression, we have to do some preliminary steps. One of the most important things that you must do before you run a regression is to clean your data.
I am basically recoding a lot of the “don’t know” (DK) responses to zero so that they don’t mess up our analysis. It’s important to be consistent with how you deal with DK responses. Make sure to write down how you handled them so that you can remember later on.
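Here is a rough sketch of that kind of recode. The column name `union_raw` and the codes (1 = yes, 2 = no, 9 = don’t know) are made up for illustration; check the actual codebook for the real ones.

```r
# Hypothetical raw survey column: 1 = union member, 2 = not a member, 9 = don't know.
# These codes are assumptions for illustration, not the real codebook values.
survey <- data.frame(union_raw = c(1, 2, 9, 2, 1, 9, 2))

# The approach described above: anything that is not a clear "yes" becomes 0,
# so DK responses are folded into the zero category.
survey$union <- ifelse(survey$union_raw == 1, 1, 0)

# An alternative to consider: mark DK as missing so lm() drops those rows.
survey$union_na <- ifelse(survey$union_raw == 9, NA,
                          ifelse(survey$union_raw == 1, 1, 0))

# Check the result of each recode against the raw values.
table(raw = survey$union_raw, recoded = survey$union)
table(survey$union_na, useNA = "ifany")
```

Whichever option you pick, use the same one for every variable and note it in your script.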
So, let’s get to the regression part. We are going to use the lm() function from R to construct a linear model. It is going to create an object called reg1. When you run the first command, notice that nothing is printed, because the result is just stored in reg1. If you want to actually see the output in table format, use the summary() command.
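A minimal sketch of those two steps, run on simulated data. The data frame `toy` and its column names are stand-ins I made up, not the real survey file, and only a few of the independent variables are included.

```r
set.seed(42)
n <- 200

# Simulated stand-in for the survey; every column name here is hypothetical.
toy <- data.frame(
  partyid = sample(0:7, n, replace = TRUE),   # 0 to 7 party ID scale
  male    = rbinom(n, 1, 0.5),                # 1 = male, 0 = otherwise
  union   = rbinom(n, 1, 0.2),                # 1 = union member
  age     = sample(18:90, n, replace = TRUE)
)

# Fake approval score, rounded and clamped to the 1 to 5 range.
latent <- 2 + 0.25 * toy$partyid - 0.5 * toy$union + rnorm(n)
toy$rauner <- pmin(pmax(round(latent), 1), 5)

# Fitting the model prints nothing; the result is stored in reg1.
reg1 <- lm(rauner ~ partyid + male + union + age, data = toy)

# summary() prints the coefficient table with estimates and significance stars.
summary(reg1)
```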
So, what you see is a regression output. Each of the independent variables is listed in the far left column. Then there is the estimate in the second column. Finally, look at the far right. Those are the statistical significance stars. If a row doesn’t have stars, then that relationship is not statistically significant. For example, our DV is Rauner’s approval rating. Look at the age variable. You see how there are no stars there? That means that we cannot say there is a statistically significant relationship between age and approval.
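The stars are shorthand for the p-value in the last column of the table. Here is a small sketch, on simulated data with made-up variable names, that pulls those p-values out so you can check them directly.

```r
set.seed(1)
n <- 300

# Simulated data: x_signal really affects y, x_noise does not.
toy <- data.frame(x_signal = rnorm(n), x_noise = rnorm(n))
toy$y <- 1 + 0.8 * toy$x_signal + rnorm(n)

fit <- lm(y ~ x_signal + x_noise, data = toy)

# coef(summary(fit)) is a matrix with columns:
# Estimate, Std. Error, t value, Pr(>|t|)
coefs    <- coef(summary(fit))
p_values <- coefs[, "Pr(>|t|)"]

# R's default star legend: *** below 0.001, ** below 0.01, * below 0.05.
data.frame(
  estimate = round(coefs[, "Estimate"], 3),
  p_value  = round(p_values, 4),
  starred  = p_values < 0.05
)
```

A row with no stars just means the p-value is above the usual 0.05 cutoff.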
One note on interpretation. I had to reverse code the Rauner variable so that higher values mean more approval and lower values mean less approval. The codebook has it backwards.
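Reverse coding a five-point scale is one line of arithmetic. This sketch assumes the raw scale runs from 1 (strongly approve) to 5 (strongly disapprove); the vector name is hypothetical.

```r
# Hypothetical raw responses, assumed to be coded 1 = strongly approve
# through 5 = strongly disapprove.
rauner_raw <- c(1, 2, 3, 4, 5, 2, 5)

# Subtracting from (max + min) = 6 flips the scale:
# 1 becomes 5, 2 becomes 4, 3 stays 3, and so on.
rauner_rev <- 6 - rauner_raw

# Cross-tabulate to confirm every value mapped to its mirror image.
table(raw = rauner_raw, reversed = rauner_rev)
```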
Look at the dotwhisker plot I made. Here’s how you interpret that. If the dot or its horizontal red line overlaps the vertical dashed line at zero, then that variable is not statistically significant. If the dot is to the right and its line doesn’t overlap zero, then the variable is positively related to approval; if it’s to the left and doesn’t overlap zero, then the relationship is negative. So, we can see that being a member of a labor union has a strong negative impact on Rauner’s approval rating. In fact, we can say that being a member of a union decreases his approval by about ten percent.
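The plot is drawing each coefficient with its confidence interval, so the same reading can be done on the numbers. This is a sketch with base R's `confint()` on simulated data; the variable names and effect sizes are invented.

```r
set.seed(7)
n <- 400

# Simulated stand-in data with a built-in negative union effect.
toy <- data.frame(
  union = rbinom(n, 1, 0.3),
  age   = sample(18:90, n, replace = TRUE)
)
toy$approval <- 3 - 0.8 * toy$union + rnorm(n)

fit <- lm(approval ~ union + age, data = toy)

# 95% confidence intervals: these are the "whiskers" in the plot.
ci <- confint(fit, level = 0.95)
ci <- ci[rownames(ci) != "(Intercept)", , drop = FALSE]

# Apply the reading rule: an interval that crosses zero is not significant.
reading <- ifelse(ci[, 1] > 0, "positive",
                  ifelse(ci[, 2] < 0, "negative", "not significant"))

data.frame(lower = round(ci[, 1], 3),
           upper = round(ci[, 2], 3),
           reading = reading)
```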
Let’s add a racial component. I will create a dichotomous variable for black and then add it to the regression I just ran.
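Roughly, that looks like the sketch below. The `race` column, its categories, and the simulated effects are all assumptions for illustration.

```r
set.seed(3)
n <- 300

# Simulated stand-in data; the race categories here are hypothetical.
toy <- data.frame(
  race  = sample(c("White", "Black", "Hispanic", "Other"), n, replace = TRUE),
  union = rbinom(n, 1, 0.25)
)

# Dichotomous (dummy) variable: 1 if the respondent is black, 0 otherwise.
toy$black <- ifelse(toy$race == "Black", 1, 0)

toy$approval <- 3 - 0.5 * toy$union - 0.5 * toy$black + rnorm(n)

# Original model, then the same model with the new dummy added.
reg1 <- lm(approval ~ union, data = toy)
reg2 <- update(reg1, . ~ . + black)

summary(reg2)
```

Using `update()` keeps everything from the first model and only adds the new term, which makes it harder to accidentally drop a variable.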
What changes here? What stays the same? Being black is negatively correlated with supporting Rauner. However, we cannot say whether being in a union is more or less powerful than being black when it comes to supporting the Governor, because their confidence intervals overlap each other. The same can be said for partyid and male gender.
We used the lm() command because our dependent variable (Governor Rauner’s support) has five possible values: 1, 2, 3, 4, and 5. However, what happens if your dependent variable is dichotomous and therefore has only two values? You would have to run a logit. The DV we are looking at here is support for recreational marijuana. I have recoded it so that 1 is supporting the proposal and every other response is a zero in the dataset. Then I will run a logit regression with the glm() command and the same IVs as the previous analysis.
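A sketch of the recode and the logit on simulated data. The raw codes (1 = support, 2 = oppose, 9 = don’t know) and all of the column names are hypothetical, and only two IVs are included to keep it short.

```r
set.seed(11)
n <- 500

# Simulated stand-in data; names and codes are assumptions.
toy <- data.frame(
  partyid = sample(0:7, n, replace = TRUE),
  male    = rbinom(n, 1, 0.5)
)

# Fake raw survey answer: 1 = support, 2 = oppose, 9 = don't know.
supports    <- rbinom(n, 1, plogis(0.8 - 0.3 * toy$partyid + 0.4 * toy$male))
toy$pot_raw <- ifelse(supports == 1, 1, sample(c(2, 9), n, replace = TRUE))

# Recode: 1 = supports the proposal, every other response = 0.
toy$pot <- ifelse(toy$pot_raw == 1, 1, 0)

# Logit model: glm() with a binomial family and the logit link.
logit1 <- glm(pot ~ partyid + male, data = toy,
              family = binomial(link = "logit"))

summary(logit1)
```

The only changes from the lm() call are the function name and the `family` argument.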
We can see here that there are only a few statistically significant variables: age, partyid, income, and male. The partyid variable ranges from zero to 7. That’s eight total values. Each step up that scale (toward Republican ID) makes someone 3.3% less likely to support recreational marijuana. Being male makes you 10% more likely to support the proposal, however.
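One caution when reading logit output: the raw coefficients from glm() are in log-odds, so statements about how much more or less likely someone is need a conversion step. Below is one possible way to do that on simulated data, using odds ratios and average changes in predicted probability. The data and names are invented, and this is not necessarily how the numbers above were produced.

```r
set.seed(21)
n <- 500

# Simulated stand-in data; names and effect sizes are assumptions.
toy <- data.frame(
  partyid = sample(0:7, n, replace = TRUE),
  male    = rbinom(n, 1, 0.5)
)
toy$support <- rbinom(n, 1, plogis(0.8 - 0.3 * toy$partyid + 0.4 * toy$male))

fit <- glm(support ~ partyid + male, data = toy, family = binomial)

# Odds ratios: exponentiate the log-odds coefficients.
odds_ratios <- exp(coef(fit))

# Average marginal effect of partyid: the logit slope times the
# average of p * (1 - p), which dlogis() gives on the link scale.
eta         <- predict(fit, type = "link")
ame_partyid <- mean(dlogis(eta)) * coef(fit)[["partyid"]]

# For the male dummy: average difference in predicted probability
# when everyone is set to male = 1 versus male = 0.
p_male1   <- predict(fit, newdata = transform(toy, male = 1), type = "response")
p_male0   <- predict(fit, newdata = transform(toy, male = 0), type = "response")
diff_male <- mean(p_male1 - p_male0)

round(c(ame_partyid = ame_partyid, diff_male = diff_male), 3)
```

Both results are on the probability scale, so multiplying by 100 gives a change in percentage points.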