Our objective today is to review basic R syntax so that you are comfortable manipulating data and working with loops and functions. We are not working with any real political science data or any fancy statistical algorithms; we are just trying to get into a headspace where we can understand code and think like programmers.

We are going to cover three main technical topics today:

  1. Subsetting variables in R
  2. For loops
  3. Functions

The first thing we should all focus on is remembering to be curious. Programming is about repeatedly trying things, letting things break, looking up help, and learning. It is not about memorization. Don’t be afraid of errors. Getting good at programming is mostly a matter of learning to see a big, complex problem, intuitively break it down into a bunch of small steps, and then implement them one by one.

Let’s begin by loading a dataset that actually comes with R:

# The datasets package ships with R and contains many interesting datasets.
library(datasets)

# Take the 1977 states dataset (a matrix) and convert it into a data frame.
df <- as.data.frame(state.x77)

We can take a quick peek at the top few rows of the data frame:

head(df)
##            Population Income Illiteracy Life Exp Murder HS Grad Frost   Area
## Alabama          3615   3624        2.1    69.05   15.1    41.3    20  50708
## Alaska            365   6315        1.5    69.31   11.3    66.7   152 566432
## Arizona          2212   4530        1.8    70.55    7.8    58.1    15 113417
## Arkansas         2110   3378        1.9    70.66   10.1    39.9    65  51945
## California      21198   5114        1.1    71.71   10.3    62.6    20 156361
## Colorado         2541   4884        0.7    72.06    6.8    63.9   166 103766

We can probably guess that our data has 50 rows. A row is an “observation”, and in this case an observation is a state. We can also see that the data has eight columns. We’re often interested in taking a “slice” of this data: certain rows, certain columns, or both. There are many ways to take slices in R.

Taking Slices

First, we know that we can use the $ operator on a data frame to take out a single column. When we use the $ operator, we go from a 50x8 data frame to a vector of length 50. We can think of a vector as a one-dimensional list of numbers or text, left to right.

df$Murder
##  [1] 15.1 11.3  7.8 10.1 10.3  6.8  3.1  6.2 10.7 13.9  6.2  5.3 10.3  7.1
## [15]  2.3  4.5 10.6 13.2  2.7  8.5  3.3 11.1  2.3 12.5  9.3  5.0  2.9 11.5
## [29]  3.3  5.2  9.7 10.9 11.1  1.4  7.4  6.4  4.2  6.1  2.4 11.6  1.7 11.0
## [43] 12.2  4.5  5.5  9.5  4.3  6.7  3.0  6.9

Converting it to a vector will also remove all the “row names”. Although the states are still in order, we have no easy way of checking which murder rate is connected to each state once we convert the data to a vector.
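If we want to keep track of which state each value belongs to, one option is to attach the row names ourselves. This is a minimal sketch; the variable names murder_named and murder_df are only for illustration:

```r
library(datasets)
df <- as.data.frame(state.x77)

# The vector returned by $ carries no names
names(df$Murder)   # NULL

# Attach the state names from the data frame's row names
murder_named <- setNames(df$Murder, rownames(df))
murder_named["Alaska"]

# Alternatively, stay in the data frame: drop = FALSE keeps a
# one-column data frame (with row names) instead of a bare vector
murder_df <- df[, "Murder", drop = FALSE]
head(murder_df, 3)
```

Either approach keeps the state labels next to the numbers, at the cost of a slightly longer expression.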

We also have the [] operator. The square brackets let you take a slice of any type of data. The number of ways you can take a slice using the square brackets depends on your type of data. A vector has ONE dimension, so you can take a slice along that dimension. A data frame has TWO dimensions, so you can take slices by row or by column.

Let’s look at a simple one-dimensional example:

df$Murder[2]
## [1] 11.3

This will take the second item out of our vector. The number associated with the order of the items is called an “index”, and you can always take slices by index. You can also provide more than one index:

df$Murder[1:10]
##  [1] 15.1 11.3  7.8 10.1 10.3  6.8  3.1  6.2 10.7 13.9

Here, I asked for the first 10 indices. What R is doing internally is converting the code “1:10” to 1, 2, 3, 4, 5, 6, 7, 8, 9, 10. What looked like a range actually gets expanded to a vector of indices in R. We can also provide our own vector of indices:

df$Murder[c(1, 3, 5, 7, 9)]
## [1] 15.1  7.8 10.3  3.1 10.7
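A quick way to convince ourselves that a range is just a vector of indices is to store it in a variable first. This is a small sketch; the names idx and odd_idx are only for illustration:

```r
library(datasets)
df <- as.data.frame(state.x77)

# The colon operator builds an ordinary integer vector
idx <- 1:10
idx
length(idx)

# Subsetting with the stored vector gives the same result as the range
identical(df$Murder[1:10], df$Murder[idx])

# seq() can build the odd indices without typing each one out
odd_idx <- seq(1, 9, by = 2)
df$Murder[odd_idx]
```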

Let’s read our code left to right:

  1. Take the data frame df.
  2. Using the $ operator, select the column Murder and turn it into a vector.
  3. Using the [] operator, get the first, third, fifth, seventh, and ninth values from that vector.
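The same three steps can be written out one at a time, which is sometimes easier to debug. A sketch, with murder, wanted, step_by_step, and one_line as illustrative names:

```r
library(datasets)
df <- as.data.frame(state.x77)

# Steps 1 and 2: take the data frame and pull out one column as a vector
murder <- df$Murder

# Step 3: pick the first, third, fifth, seventh, and ninth values
wanted <- c(1, 3, 5, 7, 9)
step_by_step <- murder[wanted]

# The one-line version does exactly the same work
one_line <- df$Murder[c(1, 3, 5, 7, 9)]
identical(step_by_step, one_line)
```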

An animation might make this a little clearer:

plot_vector_subset(df$Murder[c(1, 3, 5, 7, 9)], animate=1)

Besides indices, R also allows us to provide a vector of TRUE and FALSE values: TRUE meaning that we want to include that index, and FALSE meaning that we don’t. This is what is going on behind the scenes when you feed R an expression inside a subset operation.

Let’s look at an example where we want to get the murder rate, but only for states with more than 5 million people:

# Population is in thousands, so 5,000,000 people = 5,000 on this variable
more_than_five_million <- df$Population > 5000

more_than_five_million
##  [1] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE
## [12] FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE
## [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE  TRUE
## [34] FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE
## [45] FALSE FALSE FALSE FALSE FALSE FALSE

What happened here? R took the vector df$Population and, for each value, checked whether it was above 5,000 (that is, 5,000,000 people). If it was, R returned TRUE; if not, FALSE. So we have a series of 50 TRUE/FALSE values, in the same order as the states in the original data frame.
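A few base R helpers make it easier to inspect a TRUE/FALSE vector like this one. The sketch below only uses sum(), which(), and rownames():

```r
library(datasets)
df <- as.data.frame(state.x77)

more_than_five_million <- df$Population > 5000

# TRUE counts as 1 and FALSE as 0, so sum() counts the matching states
sum(more_than_five_million)

# which() converts the TRUE positions back into ordinary indices
which(more_than_five_million)

# The same positions, labelled with the state names
rownames(df)[more_than_five_million]
```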

Now, let’s take all the murder rates and keep just the ones where the matching state has a population above 5 million:

df$Murder[more_than_five_million]
##  [1] 10.3 10.7 10.3  7.1  3.3 11.1  5.2 10.9 11.1  7.4  6.1 12.2
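Storing the TRUE/FALSE vector in its own variable is optional. As a sketch, the comparison can also go directly inside the brackets (two_step and one_step are illustrative names):

```r
library(datasets)
df <- as.data.frame(state.x77)

# Two-step version: build the filter, then apply it
more_than_five_million <- df$Population > 5000
two_step <- df$Murder[more_than_five_million]

# One-step version: R evaluates the comparison first,
# then uses the resulting TRUE/FALSE vector as the filter
one_step <- df$Murder[df$Population > 5000]

identical(two_step, one_step)
```

The named variable tends to be easier to read and reuse; the inline form is shorter for one-off subsets.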

And animated…

plot_vector_subset(df$Murder[more_than_five_million], animate=1)

So far, we’ve taken a single variable out of the data frame, and then subsetted to just a few values from the vector. But often, we want to keep working directly in the data frame, taking whole rows at a time.

Let’s try slicing a different way:

df[c(20:25), ]
##               Population Income Illiteracy Life Exp Murder HS Grad Frost  Area
## Maryland            4122   5299        0.9    70.22    8.5    52.3   101  9891
## Massachusetts       5814   4755        1.1    71.83    3.3    58.5   103  7826
## Michigan            9111   4751        0.9    70.63   11.1    52.8   125 56817
## Minnesota           3921   4675        0.6    72.96    2.3    57.6   160 79289
## Mississippi         2341   3098        2.4    68.09   12.5    41.0    50 47296
## Missouri            4767   4254        0.8    70.69    9.3    48.8   108 68995

What does the comma inside the brackets mean? Remember that we are working with a data frame, which has two dimensions. The first dimension is rows: everything before the comma is a filter for the rows. The second dimension is columns: everything after the comma is a filter for the columns. If you don’t provide a filter, you are asking for everything along that dimension.

Let’s read the code left to right again:

  1. Take the data frame df.
  2. Using the [] operator, subset to the rows with indices 20 through 25.
  3. Don’t filter the columns; return the full rows.
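To illustrate both sides of the comma, here is a small sketch that filters rows only, columns only, and then both. It selects columns by name, which [] also accepts; the variable names rows_only, cols_only, and both are just for illustration:

```r
library(datasets)
df <- as.data.frame(state.x77)

# Rows only: a filter before the comma, nothing after it
rows_only <- df[20:25, ]
dim(rows_only)

# Columns only: nothing before the comma, a filter after it
cols_only <- df[, c("Murder", "Income")]
dim(cols_only)

# Both at once: six rows and two columns
both <- df[20:25, c("Murder", "Income")]
both
```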

Writing the appropriate subsets can take some getting used to. I find that the best way to get used to programming is to think about what you are asking the computer to do, and how you might do it by hand. Thinking like a programmer just means thinking like a human: how would a human do this?

df[df$Population > 5000, c(1:3)]

As a human, you might process this by looking at each state one by one. For each state, you ask whether the population is above 5,000,000. If it isn’t, you might use a marker to cross off the state. Then, when you’re done with all the states, you might use the marker to cross off all the columns you don’t care about anymore. What’s left is exactly the subset we wanted.
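The two "crossing off" passes can be written as two separate decisions before the final subset. A sketch, with keep_rows, keep_cols, and by_hand as illustrative names:

```r
library(datasets)
df <- as.data.frame(state.x77)

# Pass 1: go through the states and decide which ones to keep
keep_rows <- df$Population > 5000

# Pass 2: decide which columns to keep (the first three)
keep_cols <- 1:3

# Whatever has not been crossed off is the subset
by_hand <- df[keep_rows, keep_cols]

# This matches the one-line version exactly
identical(by_hand, df[df$Population > 5000, c(1:3)])
dim(by_hand)
```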

Let’s see how the computer subsets:

plot_subset(df[df$Population > 10000, c(1:3)], animate=1)