There is a topic I have been skirting around for some time, and I think it is time we had a rather important conversation. It is one that is almost never fun but is quite necessary, because without it there may be many painful lessons in the future. We’re going to spend this chapter talking about data structures, but not all of them! We’ll cover only the three most common and, by the end, I hope you will have a much stronger idea of what you are working with and why it behaves the way it does.
We will cover vectors, data frames (rather briefly), and lists. We’ll talk about some of their defining characteristics and how we can interact with them. Often the theory behind these object types is omitted, but I am of the mind that learning this early on will pay dividends. Take a deep breath before we dive in and remind yourself that it ain’t nothin’ but a thing.
This section is undoubtedly the most theoretically dense, from a software perspective, of this entire book. These concepts may be a little difficult to grasp at the first go-around, particularly if you do not have a programming background. But do not be discouraged! This is tough and there is no way around it, so we might as well go through it. If you can grasp this chapter, programming in R will become so much easier. You will develop an intuition for why certain things happen to your data and how to interact with other data structures.
I like to think of the atomic vector much like the atom—that is as the building block of any R object. You’ve actually been working with atomic vectors this entire time. But we haven’t been very explicit about this yet. Up until this point we have been working mainly with tibbles. And here is the secret: each column of a tibble is actually an atomic vector.
What makes a vector atomic is that it can only hold a single data type and that it is one-dimensional, as opposed to tibbles, which are two-dimensional. You may have noticed that every value of a column is of the same data type. This means that vectors are rather strict to work with, and for good reason. Imagine you wanted to multiply a column by 10. What would happen if a few of the values in the column were actually written out as text? Let’s try exploring this idea.
The most common way to create a vector in R is to use the c() function. This stands for combine. We can combine as many elements as we want into a single vector using c(). Each element of the vector is its own argument (separated by a comma).
For example, if we wanted to create a vector of Boston’s unemployment rate for each month in 2019 that we have data for (until October, as of this writing on Dec. 18th, 2019), we could write the below. We will save it in a vector called unemp.
unemp <- c(3.2, 2.8, 2.8, 2.4, 2.8, 2.9, 2.7, 2.6, 2.7, 2.3)
unemp
#>  [1] 3.2 2.8 2.8 2.4 2.8 2.9 2.7 2.6 2.7 2.3
What is really great about vectors is that we can perform any number of operations on them: for example, finding the sum of all the values, taking the average, or adding a value to each element.
If we wanted to find the average unemployment rate for Boston for Jan - Oct. 2019, we can supply the vector to the mean() function.
mean(unemp)
#> [1] 2.72
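The other operations mentioned above work the same way. Here is a minimal sketch using the same ten values; the names total, shifted, and unemp_prop are made up for this illustration and are not used elsewhere in the chapter.

```r
# Boston unemployment rates, Jan - Oct 2019 (same values as above)
unemp <- c(3.2, 2.8, 2.8, 2.4, 2.8, 2.9, 2.7, 2.6, 2.7, 2.3)

# sum of all the values
total <- sum(unemp)
total
#> [1] 27.2

# add a value to each element; the result has the same length as unemp
shifted <- unemp + 1
length(shifted)
#> [1] 10

# divide each element, e.g. to turn percentages into proportions
unemp_prop <- unemp / 100
```

Notice that the arithmetic is applied to every element at once, with no need to loop over the vector.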
However, you may be thinking “there are 12 months in a year, not 10, and that should be represented,” and if you are, I totally agree with you. Since the data for November and December are missing, we should denote that and update unemp accordingly. R uses NA to represent missing data. To represent this we can append two NAs to the vector we have. There are two ways we can do this. We can either combine unemp with two NAs, or rewrite the above vector.
# combining existing with 2 NAs
c(unemp, NA, NA)
#>  [1] 3.2 2.8 2.8 2.4 2.8 2.9 2.7 2.6 2.7 2.3  NA  NA
This works, but since we will be saving this to unemp again, it is not best practice to use the variable you are changing in that object’s own assignment.
# for example
unemp <- c(unemp, NA, NA)
The above is rather unclear and might confuse someone that will have to read your code at a later time—that person may even be you. For this reason we will redefine it.
unemp <- c(3.2, 2.8, 2.8, 2.4, 2.8, 2.9, 2.7, 2.6, 2.7, 2.3, NA, NA)
unemp
#>  [1] 3.2 2.8 2.8 2.4 2.8 2.9 2.7 2.6 2.7 2.3  NA  NA
We know that there are 12 elements in this vector, but sometimes it is quite nice to sanity check oneself. We can always find out how long a vector is (that is, how many elements are in it) by supplying the vector to the length() function.
# how many observations are in `unemp`?
length(unemp)
#> [1] 12
There are a total of six types of vectors. Fortunately, only four of these really matter to us. These are integer, double, character, and logical.
Integers represent whole numbers. To specify an integer we append an L after the number, such as 20L. Doubles are numbers that require decimal precision. You can specify doubles in a number of formats, such as scientific notation. Generally the easiest way to do this, though, is using a decimal. Together, integers and doubles are lumped into the category of numeric. This is because, well, they are numbers.
As you learned previously, character vectors are created with the use of quotation marks, either single or double.
We’ve already created a vector of type double, unemp. You can check what type of vector unemp is with typeof().
typeof(unemp)
#> [1] "double"
Say we create another vector called month with the numbers 1 through 12.
Notice that since we didn’t specify the L after the numbers, R defaulted to treating month as a double. When possible it is good to make the distinction between integer and double.
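A minimal sketch of the difference the L suffix makes. The names month_dbl and month_int are hypothetical and shortened to three values to keep the example small.

```r
# hypothetical names, used only for this illustration
month_dbl <- c(1, 2, 3)      # no L suffix: stored as double
month_int <- c(1L, 2L, 3L)   # L suffix: stored as integer

typeof(month_dbl)
#> [1] "double"
typeof(month_int)
#> [1] "integer"

# the colon operator produces integers when both endpoints are whole numbers
typeof(1:12)
#> [1] "integer"

# both types still count as numeric
is.numeric(month_dbl)
#> [1] TRUE
is.numeric(month_int)
#> [1] TRUE
```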
R has a number of built-in vectors, among them the letters of the alphabet (letters and LETTERS, for lower and upper case respectively) and the names of the months (month.name). Since month.name is already available to us, let’s not recreate it.
month.name
#>  [1] "January"   "February"  "March"     "April"     "May"       "June"
#>  [7] "July"      "August"    "September" "October"   "November"  "December"
typeof(month.name)
#> [1] "character"
Notice the quotes around each vector element. This is how we identify character vectors.
Logical vectors are the last kind of vector we need to go over. Logical vectors are represented as the values TRUE and FALSE. Simple enough. Onward!
Recall that vectors are atomic, meaning that there can only be one type per vector and we cannot mix and match. When a character is in the presence of an element of a different type, that other value is coerced into a character. Coercion is the process of implicitly or contextually changing an object from one type to another.
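A minimal sketch of this coercion, using a hypothetical vector called mixed that is not part of the unemployment data:

```r
# a character, a number, and a logical value combined into one vector
mixed <- c("a", 1, TRUE)

# every element has been turned into a character string
mixed
#> [1] "a"    "1"    "TRUE"
typeof(mixed)
#> [1] "character"
```

The quotes around 1 and TRUE in the output are the giveaway that they are no longer a number and a logical value.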
Something similar happens when a logical value is in the presence of a numeric value.
c(TRUE, 1, FALSE)
#> [1] 1 1 0
In the presence of a numeric value, TRUE becomes equal to 1L and FALSE equal to 0L. This behavior exists whenever a logical value is presented where a numeric is expected, such as in the function call below.
sum(TRUE, FALSE, FALSE)
#> [1] 1
While coercion happens implicitly as a side effect of other operations, like combining values in a vector, casting is the process of intentionally changing an object’s class. There are a number of casting functions, which generally take the shape of as.class(). Each of the vector types covered has its own casting function.
as.integer(TRUE)
#> [1] 1
as.character(123)
#> [1] "123"
as.double("2.331")
#> [1] 2.331
as.logical(0)
#> [1] FALSE
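Casting does not always succeed, and it is worth seeing what that looks like. The inputs below are made up for illustration. suppressWarnings() is used only to keep the example quiet, since R warns when a cast introduces an NA.

```r
# casting a double to an integer drops the decimal part (it does not round)
as.integer(3.9)
#> [1] 3

# text that does not look like a number becomes NA (with a warning)
suppressWarnings(as.integer("abc"))
#> [1] NA

# as.logical() understands strings such as "TRUE", but not "yes"
as.logical("TRUE")
#> [1] TRUE
as.logical("yes")
#> [1] NA
```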
As you progress in your R journey you will find scenarios in which you need to cast objects from one class to another, and these functions will do the trick.
You now have a strong understanding of the underbellies of R vectors. One thing that is missing is an understanding of how we can select subsets from vectors. To extract a value from a vector we append square brackets to the end of the vector, as in vec[index]. We supply an index value to the square brackets to receive the value at that position.
To select the month of January from the unemp vector, the first element, we provide the value of 1 to the brackets.
unemp[1]
#> [1] 3.2
To extract more than one value, we provide a vector of the indexes we desire.
unemp[c(1, 3)]
#> [1] 3.2 2.8
There is yet another way to extract values from these vectors. We can provide a logical vector to our square brackets. For example, we can identify every value of unemp that is above the average rate.
# find average removing missing values
avg_unemp <- mean(unemp, na.rm = TRUE)
# identify which values are above average
index <- unemp > avg_unemp
index
#>  [1]  TRUE  TRUE  TRUE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE    NA    NA
Notice the NAs? They can be pesky. Hadley Wickham writes in Advanced R that “missing values tend to be infectious: most computations involving a missing value will return another missing value.”
unemp[index]
#> [1] 3.2 2.8 2.8 2.8 2.9  NA  NA
How annoying those NAs can be! To prevent these NAs from showing up, we can add another condition to our index line to remove them. Just as there are as.*() functions for casting, there are also is.*() functions for testing. The * is called a wildcard; as in SQL, it stands in for anything, and here it means that any string can follow. is.*() is intended to indicate any possible testing function, such as is.na(), which returns a logical vector of the same length as the provided vector.
is.na(unemp)
#>  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE
index <- unemp > avg_unemp & !is.na(unemp)
unemp[index]
#> [1] 3.2 2.8 2.8 2.8 2.9
There is one last thing to keep in mind when subsetting a vector using a logical vector of a different length. When the two are of differing lengths, the logical vector will be recycled for the remaining values of the vector being subset. As always, an example will be the best.
Say we have an object called x, which holds the values from 0 to 10, and an index to subset with. If index is a logical vector of length two with the values TRUE and FALSE, every other observation will be returned. This is because, come the third value in x, R has run out of values in index to use, so it goes back to the beginning.
x <- 0:10
x
#>  [1]  0  1  2  3  4  5  6  7  8  9 10
index <- c(TRUE, FALSE)
x[index]
#> [1]  0  2  4  6  8 10
And what happens when the only value is a single logical value?
x[TRUE]
#>  [1]  0  1  2  3  4  5  6  7  8  9 10
x[FALSE]
#> integer(0)
In this latter case, see how the output says integer(0). This is informing you that the vector contains 0 elements.
The entirety of the work in this book so far has been with tibbles. Tibbles are actually a special type of data frame. Data frames are R’s native way of storing rectangular data. Rectangles are two-dimensional, and so are data frames.
Data frames are secretly just a bunch of vectors squished together. The important thing is that all vectors are of the same length. This ensures that each observation (row) has one value from each vector. Because of the nature of a data frame, each column must adhere to the rules of vectors.
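The equal-length rule can be seen in base R. The sketch below uses data.frame() rather than tibble() so that it runs without any packages; the names rate, month_nm, small_df, and bad are made up for illustration.

```r
# hypothetical vectors of equal length
rate <- c(3.2, 2.8, 2.8)
month_nm <- c("January", "February", "March")

# each vector becomes a column; stringsAsFactors = FALSE keeps text as character
small_df <- data.frame(rate = rate, month = month_nm, stringsAsFactors = FALSE)
dim(small_df)
#> [1] 3 2

# every column is still an atomic vector with a single type
typeof(small_df$rate)
#> [1] "double"
typeof(small_df$month)
#> [1] "character"

# vectors of length 3 and 2 cannot be lined up into rows, so this fails
bad <- try(data.frame(rate = rate, month = month_nm[1:2]), silent = TRUE)
inherits(bad, "try-error")
#> [1] TRUE
```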
Let’s create a tibble using the unemp vector and the tibble() function. tibble() works in a somewhat similar manner to mutate(), where the arguments we provide are name-value pairs. In the case of tibble(), the arguments take the form of col_name = vector.
We create a tibble with the unemployment rate below.
We can add the month name and create a new column to indicate if that month has a higher than average unemployment rate.
unemp_tbl <- tibble(
  unemp_rate = unemp,
  month = month.name
) %>%
  mutate(above_avg = unemp_rate > avg_unemp)

unemp_tbl
#> # A tibble: 12 x 3
#>    unemp_rate month     above_avg
#>         <dbl> <chr>     <lgl>
#>  1        3.2 January   TRUE
#>  2        2.8 February  TRUE
#>  3        2.8 March     TRUE
#>  4        2.4 April     FALSE
#>  5        2.8 May       TRUE
#>  6        2.9 June      TRUE
#>  7        2.7 July      FALSE
#>  8        2.6 August    FALSE
#>  9        2.7 September FALSE
#> 10        2.3 October   FALSE
#> 11       NA   November  NA
#> 12       NA   December  NA
To interact with the underlying vector of a data frame we can use the dollar sign $ operator. This takes the form of tbl$col. For example, extracting the unemp_rate column looks like:
unemp_tbl$unemp_rate
#>  [1] 3.2 2.8 2.8 2.4 2.8 2.9 2.7 2.6 2.7 2.3  NA  NA
Note the difference between select(tbl, col) and tbl$col.
select(unemp_tbl, unemp_rate)
#> # A tibble: 12 x 1
#>    unemp_rate
#>         <dbl>
#>  1        3.2
#>  2        2.8
#>  3        2.8
#>  4        2.4
#>  5        2.8
#>  6        2.9
#>  7        2.7
#>  8        2.6
#>  9        2.7
#> 10        2.3
#> 11       NA
#> 12       NA
The difference is that $ returns the underlying vector, whereas select() will always return another data frame. You now have the ability to both filter data and grab a subset of a vector. But we have yet to visit how to grab a single value from a data frame.
You could try something like
To grab the 10th value of the first column. But again, you still have a tibble and you are not able to use that directly like a standalone number.
We can again use brackets to subset our R object. But data frames are two-dimensional, so we need to specify the indexes in two dimensions. If you have ever drawn a graph by hand on a Cartesian plane, which I assume you all have, this is the same idea. With a Cartesian plane we can identify any point with a combination of two values: x and y. x refers to the horizontal axis and y to the vertical axis. When we put the Cartesian plane in the same frame of reference as the rectangular data frame, we envision our rows as the x and our columns as the y.
In specifying our index, we are able to select all rows or all columns by leaving the x or y spot empty respectively.
unemp_tbl[,1]
#> # A tibble: 12 x 1
#>    unemp_rate
#>         <dbl>
#>  1        3.2
#>  2        2.8
#>  3        2.8
#>  4        2.4
#>  5        2.8
#>  6        2.9
#>  7        2.7
#>  8        2.6
#>  9        2.7
#> 10        2.3
#> 11       NA
#> 12       NA
unemp_tbl[10,]
#> # A tibble: 1 x 3
#>   unemp_rate month   above_avg
#>        <dbl> <chr>   <lgl>
#> 1        2.3 October FALSE
To replicate the above tidyverse example we would provide the indexes 10 and 1 respectively.
unemp_tbl[10,1]
#> # A tibble: 1 x 1
#>   unemp_rate
#>        <dbl>
#> 1        2.3
This is great: we’ve rewritten our tidyverse code in base R. But, just like the tidyverse code, we maintain the tibble data structure. This is because a single bracket maintains the data structure of the object we are selecting from. If we wrap our brackets in another set of brackets, we are returned an object of the same class as the underlying object.
unemp_tbl[[10,1]]
#> [1] 2.3
What that code is doing is narrowing the tibble down to a single column with a single row index and then extracting the underlying vector (the second bracket). To extract the underlying vector using the tidyverse, we can use the pull() function from dplyr.
Now this brings us to the second-most fundamental structure in R: the list. Yes, second-most fundamental. I’ve been keeping a secret from you. Data frames are actually just lists in disguise. To prove it, I will remove the class from unemp_tbl and return the class of that unclassed object.
That is right, data frames are actually just lists disguised as rectangles.
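The same check can be sketched in base R. To keep the example runnable without any packages, it uses a small hypothetical data frame called small_df rather than unemp_tbl.

```r
# a tiny data frame standing in for unemp_tbl (hypothetical values)
small_df <- data.frame(
  rate = c(3.2, 2.8),
  month = c("January", "February"),
  stringsAsFactors = FALSE
)

class(small_df)
#> [1] "data.frame"

# strip the class and ask again: what is left is a plain list
class(unclass(small_df))
#> [1] "list"

# typeof() looks past the class and reports the underlying storage
typeof(small_df)
#> [1] "list"

# the length of the list is the number of columns
length(small_df)
#> [1] 2
```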
There is a good chance that you will not have to interact with lists too often. That doesn’t mean you shouldn’t know how to when that time comes.
Lists are generally the most flexible object type in R. Unlike vectors and data frames, lists do not impose any structure on the storage of our data.
The simplest lists may resemble something like a vector.
list("Jan", "Feb", "Mar")
#> [[1]]
#> [1] "Jan"
#>
#> [[2]]
#> [1] "Feb"
#>
#> [[3]]
#> [1] "Mar"
Notice how this prints differently than the equivalent character vector.
c("Jan", "Feb", "Mar")
#> [1] "Jan" "Feb" "Mar"
Each element of a list is self-contained. I think of lists somewhat like shipping containers, where each element is its own container and all components of each element are kept together. We can include any type of R object in a list. For example, we can include unemp_tbl and the associated vectors.
l <- list(unemp_tbl, unemp, month.name)
We can view the structure of the list to get an idea of what is actually contained by that list.
str(l)
#> List of 3
#>  $ : tibble [12 × 3] (S3: tbl_df/tbl/data.frame)
#>   ..$ unemp_rate: num [1:12] 3.2 2.8 2.8 2.4 2.8 2.9 2.7 2.6 2.7 2.3 ...
#>   ..$ month     : chr [1:12] "January" "February" "March" "April" ...
#>   ..$ above_avg : logi [1:12] TRUE TRUE TRUE FALSE TRUE TRUE ...
#>  $ : num [1:12] 3.2 2.8 2.8 2.4 2.8 2.9 2.7 2.6 2.7 2.3 ...
#>  $ : chr [1:12] "January" "February" "March" "April" ...
The structure of l shows us that the first element is a tibble (has class tbl_df), and the other elements are numeric and character vectors respectively.
Because of this flexibility there are no predetermined dimensions that we can specify to our brackets. As with extracting the underlying vector value from a data frame, we have to use [[ for indexing. I like to think of [ as walking up to the storage container and [[ as actually opening it up and going inside. To get a sense of the difference, let’s look at the second element of l.
l[2]
#> [[1]]
#>  [1] 3.2 2.8 2.8 2.4 2.8 2.9 2.7 2.6 2.7 2.3  NA  NA
#>
class(l[2])
#> [1] "list"
When using the single bracket we are just selecting an element of the list while leaving it inside its container, which is why we are returned another list.
l[[2]]
#>  [1] 3.2 2.8 2.8 2.4 2.8 2.9 2.7 2.6 2.7 2.3  NA  NA
class(l[[2]])
#> [1] "numeric"
When we use the double bracket we are going inside of the container and actually plucking that element out of the list. Once you have plucked out that element, you can again use another set of brackets to subset that item. To grab the tenth row and first column of the unemp_tbl inside of l we can write:
l[[1]][[10,1]]
#> [1] 2.3
# subsetting the data frame
l[[1]][[10,1]]
#> [1] 2.3
# grabbing the first vector then position
l[[1]][[1]][10]
#> [1] 2.3
# grabbing the vector by name then position
l[[1]]$unemp_rate[10]
#> [1] 2.3
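The same container idea can be sketched with a small named list in base R. The list containers and its element names rates and months are hypothetical and exist only for this illustration.

```r
# a small named list (hypothetical values)
containers <- list(
  rates = c(3.2, 2.8, 2.8),
  months = c("Jan", "Feb", "Mar")
)

# single bracket: we walk up to the container and get a list back
class(containers[1])
#> [1] "list"

# double bracket: we open the container and get the vector inside
class(containers[[1]])
#> [1] "numeric"

# elements can also be plucked by name, then subset by position
containers[["months"]][2]
#> [1] "Feb"
containers$rates[3]
#> [1] 2.8
```

Naming the elements is optional, but it tends to make the indexing easier to read than bare positions.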
Frankly, all of these brackets can get a little messy. The tidyverse package purrr has a super handy function called pluck() which handles all of these brackets for us.
purrr::pluck() is meant for flexible indexing into data structures (see its documentation).
pluck() works by first providing the object that you’d like to index (again, notice the data-first emphasis) and then providing the position of the element you would like to pluck out of the object. Generally, I will use pluck() when possible. By doing so the code becomes more readable and adheres to a single style more thoroughly.
purrr::pluck(l, 1, 1, 10)
#> [1] 2.3
Congratulations! You made it to the end of this exceptionally dense chapter. You may feel a little overwhelmed, and that is to be expected. Nonetheless you should be proud! I have a few more asks of you before you move on.
- Drink some water
- Move around a bit and shake it out
- Create a list with the vectors unemp, month.name, and avg_unemp
- Recreate the unemp_tbl, but referencing the list elements
library(purrr)

unemp_l <- list(unemp, month.name, avg_unemp)

tibble(
  unemp_rate = pluck(unemp_l, 1),
  month = pluck(unemp_l, 2)
) %>%
  mutate(above_avg = unemp_rate > pluck(unemp_l, 3))
#> # A tibble: 12 x 3
#>    unemp_rate month     above_avg
#>         <dbl> <chr>     <lgl>
#>  1        3.2 January   TRUE
#>  2        2.8 February  TRUE
#>  3        2.8 March     TRUE
#>  4        2.4 April     FALSE
#>  5        2.8 May       TRUE
#>  6        2.9 June      TRUE
#>  7        2.7 July      FALSE
#>  8        2.6 August    FALSE
#>  9        2.7 September FALSE
#> 10        2.3 October   FALSE
#> 11       NA   November  NA
#> 12       NA   December  NA