ggplot2 worksheet

library(ggplot2)
table <- read.csv("D:/table.csv", stringsAsFactors = FALSE)

All I did there was load in ggplot2 and then load in my data into a dataset called “table”

There’s a link to the data here

head(table)
##   Code                    Agency     Appropriated     Expended.YTD
## 1  676    UNIVERSITY OF ILLINOIS $350,599,000.00  $278,180,288.41 
## 2  664    SOUTHERN IL UNIVERSITY $106,156,000.00  $106,123,726.80 
## 3  644    NORTHERN IL UNIVERSITY  $48,293,000.00   $48,187,561.01 
## 4  636 ILLINOIS STATE UNIVERSITY  $38,291,000.00   $38,232,065.81 
## 5  628     WESTERN IL UNIVERSITY  $31,389,000.00   $31,388,984.95 
## 6  612     EASTERN IL UNIVERSITY  $26,222,000.00   $13,625,751.68 
##               Jul             Aug              Sep    Oct    Nov    Dec
## 1 $52,107,615.06     $138,264.19  $225,934,409.16  $0.00  $0.00  $0.00 
## 2 $12,865,996.99  $93,063,842.33      $193,887.48  $0.00  $0.00  $0.00 
## 3 $48,158,000.00      $11,948.93       $17,612.08  $0.00  $0.00  $0.00 
## 4 $38,215,378.90       $5,448.37       $11,238.54  $0.00  $0.00  $0.00 
## 5 $14,994,121.25  $15,477,361.79      $917,501.91  $0.00  $0.00  $0.00 
## 6  $6,125,882.69   $3,007,007.49    $4,492,861.50  $0.00  $0.00  $0.00 
##      Jan    Feb    Mar    Apr    May    Jun      Unexpended
## 1 $0.00  $0.00  $0.00  $0.00  $0.00  $0.00  $72,418,711.59 
## 2 $0.00  $0.00  $0.00  $0.00  $0.00  $0.00      $32,273.20 
## 3 $0.00  $0.00  $0.00  $0.00  $0.00  $0.00     $105,438.99 
## 4 $0.00  $0.00  $0.00  $0.00  $0.00  $0.00      $58,934.19 
## 5 $0.00  $0.00  $0.00  $0.00  $0.00  $0.00          $15.05 
## 6 $0.00  $0.00  $0.00  $0.00  $0.00  $0.00  $12,596,248.32

Now, we’ve got a problem. Both the appropriated column and the unexpended column have quotation marks around them. That’s because R doesn’t seem them as numbers, it sees them as a character string. A big reason why is because there is a dollar sign and commmas in each of those entries. I need to take those out.

table <- read.csv("D:/table.csv", stringsAsFactors = FALSE)
table$Appropriated <- gsub('\\$','', table$Appropriated)
table$Appropriated <- gsub(',','', table$Appropriated)
table$Unexpended <- gsub('\\$','', table$Unexpended)
table$Unexpended <- gsub(',','', table$Unexpended)
head(table)
##   Code                    Agency  Appropriated     Expended.YTD
## 1  676    UNIVERSITY OF ILLINOIS 350599000.00  $278,180,288.41 
## 2  664    SOUTHERN IL UNIVERSITY 106156000.00  $106,123,726.80 
## 3  644    NORTHERN IL UNIVERSITY  48293000.00   $48,187,561.01 
## 4  636 ILLINOIS STATE UNIVERSITY  38291000.00   $38,232,065.81 
## 5  628     WESTERN IL UNIVERSITY  31389000.00   $31,388,984.95 
## 6  612     EASTERN IL UNIVERSITY  26222000.00   $13,625,751.68 
##               Jul             Aug              Sep    Oct    Nov    Dec
## 1 $52,107,615.06     $138,264.19  $225,934,409.16  $0.00  $0.00  $0.00 
## 2 $12,865,996.99  $93,063,842.33      $193,887.48  $0.00  $0.00  $0.00 
## 3 $48,158,000.00      $11,948.93       $17,612.08  $0.00  $0.00  $0.00 
## 4 $38,215,378.90       $5,448.37       $11,238.54  $0.00  $0.00  $0.00 
## 5 $14,994,121.25  $15,477,361.79      $917,501.91  $0.00  $0.00  $0.00 
## 6  $6,125,882.69   $3,007,007.49    $4,492,861.50  $0.00  $0.00  $0.00 
##      Jan    Feb    Mar    Apr    May    Jun   Unexpended
## 1 $0.00  $0.00  $0.00  $0.00  $0.00  $0.00  72418711.59 
## 2 $0.00  $0.00  $0.00  $0.00  $0.00  $0.00     32273.20 
## 3 $0.00  $0.00  $0.00  $0.00  $0.00  $0.00    105438.99 
## 4 $0.00  $0.00  $0.00  $0.00  $0.00  $0.00     58934.19 
## 5 $0.00  $0.00  $0.00  $0.00  $0.00  $0.00        15.05 
## 6 $0.00  $0.00  $0.00  $0.00  $0.00  $0.00  12596248.32

Now they are cleaned up, I can convert them to numbers

table$Unexpended <- as.numeric(table$Unexpended)
table$Appropriated <- as.numeric(table$Appropriated)

I want to know what percentage of appropriated funds have not been expended by the state. I will create a new variable called “new” which will be unexpended funds divided by appropriated funds to give me a percentage. I also need to divide it by a thousand to get me to the proper percentage as well as subtract it from 1.

table$new <- table$Unexpended/table$Appropriated
table$new/1000
##  [1] 2.065571e-04 3.040167e-07 2.183318e-06 1.539113e-06 4.794673e-10
##  [6] 4.803695e-04 1.586633e-04 3.647201e-06 8.475272e-04 7.645645e-06
## [11] 1.588562e-12 4.773820e-06 8.987647e-04 0.000000e+00 8.099158e-04
## [16] 8.396239e-04
table$new <- round(table$new, digits =2)
table$new <- 1 - table$new
table$new
##  [1] 0.79 1.00 1.00 1.00 1.00 0.52 0.84 1.00 0.15 0.99 1.00 1.00 0.10 1.00
## [15] 0.19 0.16

Well, look at that. I had to add a rounding command to trim the digits up a little bit.

Let’s take a crack at plotting.

ggplot(table, aes(x= Agency, y= new)) + geom_bar(stat="identity")

center

Here’s what that command did. From left to right.

“table” is the name of my dataset.

aes(x= Agency, y= new) tells ggplot to make my x axis the “Agency” variable and my y axis to be the “new” variable we created above.

And then, geom_bar(stat=”identity”) tells ggplot to do a bar chart and the “stat=identity” part tells R to plot the values exactly as they exist in the dataset, otherwise it would give you a count, which is not helpful.

However, I want the chart to be easy to read. One way to do that is to make the x axis sorted in a way that makes sense. Here’s how to do that.

ggplot(table, aes(x=reorder(Agency, -new), y = new)) + geom_bar(stat="identity")

center

The “(x=reorder(Agency, -new)” is essentially this: my X axis is going to be the names of the agencies but I don’t want ggplot to just plot them in the same order as they exist in the dataset, I want it to plot them in order of highest percentage to least percentage, remember that’s my “new” variable that I created above. I put the negative sign to do that in descending order. If you left that off it would plot them order from least to most, not most to least.

That’s way easier to read.

So, we’re in the ballpark now.

First thing is to make the names of the agency to be readable.

ggplot(table, aes(x=reorder(Agency, -new), y = new)) + geom_bar(stat="identity") + theme(axis.text.x = element_text(angle = 90))

center

All I did was add this bit: + theme(axis.text.x = element_text(angle = 90)). That rotates the variable labels to 90 degrees. Now they are readable.

Now, I’ve got another problem that I want to fix. I’ve got a couple columns in there that aren’t state universities. I want to dump them. Let’s take a look at what rows they are.

table$Agency
##  [1] "UNIVERSITY OF ILLINOIS"         "SOUTHERN IL UNIVERSITY"        
##  [3] "NORTHERN IL UNIVERSITY"         "ILLINOIS STATE UNIVERSITY"     
##  [5] "WESTERN IL UNIVERSITY"          "EASTERN IL UNIVERSITY"         
##  [7] "IL COMMUNITY COLLEGE BOARD"     "NORTHEASTERN IL UNIVERSITY"    
##  [9] "IL MATH AND SCIENCE ACADEMY"    "GOVERNORS STATE UNIVERSITY"    
## [11] "CHICAGO STATE UNIVERSITY"       "IL STUDENT ASSISTANCE COMM"    
## [13] "BOARD OF HIGHER EDUCATION"      "STATUTORY TRANSFERS"           
## [15] "TEACHERS' RETIREMENT SYSTEM"    "STATE UNIV CIVIL SERV MERIT BD"

You can count the rows pretty easily here. I want to dump rows 7, 9, 12, 13, 14, 15, 16.

table <- table[-c(7, 9, 12, 13, 14, 15, 16), ]
table$Agency
## [1] "UNIVERSITY OF ILLINOIS"     "SOUTHERN IL UNIVERSITY"    
## [3] "NORTHERN IL UNIVERSITY"     "ILLINOIS STATE UNIVERSITY" 
## [5] "WESTERN IL UNIVERSITY"      "EASTERN IL UNIVERSITY"     
## [7] "NORTHEASTERN IL UNIVERSITY" "GOVERNORS STATE UNIVERSITY"
## [9] "CHICAGO STATE UNIVERSITY"

Don’t forget the negative sign. That deletes those rows.

Let’s look at the plot again.

ggplot(table, aes(x=reorder(Agency, -new), y = new)) + geom_bar(stat="identity") + theme(axis.text.x = element_text(angle = 90))

center

Now, we are pretty darn close. I really want to use a pop of color here to make EIU stand out. So, here’s how I do that. If I look at the table$Agency row I see that EIU is the sixth row. So I want to create a new column where EIU is different somehow. That will allow me to color it differently in the final chart. Here’s how I did that.

table$area.color <- c("one", "one", "one", "one", "one", "two" , "one", "one", "one")
table
##    Code                     Agency Appropriated     Expended.YTD
## 1   676     UNIVERSITY OF ILLINOIS    350599000 $278,180,288.41 
## 2   664     SOUTHERN IL UNIVERSITY    106156000 $106,123,726.80 
## 3   644     NORTHERN IL UNIVERSITY     48293000  $48,187,561.01 
## 4   636  ILLINOIS STATE UNIVERSITY     38291000  $38,232,065.81 
## 5   628      WESTERN IL UNIVERSITY     31389000  $31,388,984.95 
## 6   612      EASTERN IL UNIVERSITY     26222000  $13,625,751.68 
## 8   620 NORTHEASTERN IL UNIVERSITY     19562000  $19,490,653.45 
## 10  616 GOVERNORS STATE UNIVERSITY     12757000  $12,659,464.51 
## 11  608   CHICAGO STATE UNIVERSITY     12590000  $12,589,999.98 
##                Jul             Aug              Sep    Oct    Nov    Dec
## 1  $52,107,615.06     $138,264.19  $225,934,409.16  $0.00  $0.00  $0.00 
## 2  $12,865,996.99  $93,063,842.33      $193,887.48  $0.00  $0.00  $0.00 
## 3  $48,158,000.00      $11,948.93       $17,612.08  $0.00  $0.00  $0.00 
## 4  $38,215,378.90       $5,448.37       $11,238.54  $0.00  $0.00  $0.00 
## 5  $14,994,121.25  $15,477,361.79      $917,501.91  $0.00  $0.00  $0.00 
## 6   $6,125,882.69   $3,007,007.49    $4,492,861.50  $0.00  $0.00  $0.00 
## 8  $19,562,000.00     ($81,475.14)      $10,128.59  $0.00  $0.00  $0.00 
## 10  $3,600,731.90   $9,057,168.37        $1,564.24  $0.00  $0.00  $0.00 
## 11 $12,589,999.98           $0.00            $0.00  $0.00  $0.00  $0.00 
##       Jan    Feb    Mar    Apr    May    Jun  Unexpended  new area.color
## 1  $0.00  $0.00  $0.00  $0.00  $0.00  $0.00  72418711.59 0.79        one
## 2  $0.00  $0.00  $0.00  $0.00  $0.00  $0.00     32273.20 1.00        one
## 3  $0.00  $0.00  $0.00  $0.00  $0.00  $0.00    105438.99 1.00        one
## 4  $0.00  $0.00  $0.00  $0.00  $0.00  $0.00     58934.19 1.00        one
## 5  $0.00  $0.00  $0.00  $0.00  $0.00  $0.00        15.05 1.00        one
## 6  $0.00  $0.00  $0.00  $0.00  $0.00  $0.00  12596248.32 0.52        two
## 8  $0.00  $0.00  $0.00  $0.00  $0.00  $0.00     71346.55 1.00        one
## 10 $0.00  $0.00  $0.00  $0.00  $0.00  $0.00     97535.49 0.99        one
## 11 $0.00  $0.00  $0.00  $0.00  $0.00  $0.00         0.02 1.00        one

So, now we have a new column called area.color that has EIU listed as two and everyone else is listed as one. I can use that to my advantage now.

ggplot(table, aes(x=reorder(Agency, -new), y = new, fill=area.color)) + geom_bar(stat="identity") + theme(axis.text.x = element_text(angle = 90)) +scale_fill_manual(values = c("gray", "blue"))

center

I added one new command here: +scale_fill_manual(values = c(“gray”, “blue”)) as well as adding “fill=area.color”

What that does is tell R to add a fill color based on the “area.color” column. The scale_fill_manaul command tells R which colors to pick for that fill command. All “one” will get gray, all “two” will get blue. That’s a pretty advanced task, but it’s easier the second time. Now, just some clean up stuff.

ggplot(table, aes(x=reorder(Agency, -new), y = new, fill=area.color)) + geom_bar(stat="identity") + theme(axis.text.x = element_text(angle = 90)) +scale_fill_manual(values = c("gray", "blue")) + xlab("University") + ylab("Percentage of Appropriation Received") + theme(legend.position="none") + theme(text=element_text(size=16, family="Roboto Medium")) + ggtitle("Percentage of Appropriation Received by Each University in Illinois")

center

I added axis labels with the xlab and ylab commands. I added a title with the ggtitle command. I also changed my font with the theme(text=element_text(size=12, family=”Roboto Medium”)) command. I also got rid of the legend, because I don’t really need it and it doesn’t tell me much.