14 Data Structures
There is a topic I have been skirting around for some time now, and I think it is time we had a rather important conversation. It’s one that is almost never fun but is quite necessary, because without it there may be many painful lessons in the future. We’re going to spend this chapter talking about data structures, but not all of them! We’ll only cover the three most common and, by the end, it is my hope you will have a much stronger idea of what you are working with and why it behaves the way it does.
We will cover vectors, data frames (rather briefly), and lists. We’ll talk about some of their defining characteristics and how we can interact with them. Often the theory behind these object types is omitted, but I am of the mind that learning this early on will pay dividends. Take a deep breath before we dive in and remind yourself that it ain’t nothin’ but a thing.
This section is undoubtedly the most theoretically dense, from a software perspective, of this entire book. These concepts may be a little difficult to grasp on the first go-around, particularly if you do not have a programming background. But do not be discouraged! This is tough and there is no way around it, so we might as well go through it. If you can grasp this chapter, programming in R will become so much easier. You will develop an intuition for why certain things happen to your data and how to interact with other data structures.
14.1 Atomic Vectors
I like to think of the atomic vector much like the atom—that is as the building block of any R object. You’ve actually been working with atomic vectors this entire time. But we haven’t been very explicit about this yet. Up until this point we have been working mainly with tibbles. And here is the secret: each column of a tibble is actually an atomic vector.
What makes a vector atomic is that it can hold only a single data type. Vectors are also one-dimensional, as opposed to tibbles, which are two-dimensional. You may have noticed that every value of a column is of the same data type. This means that vectors are rather strict to work with, and for good reason. Imagine you wanted to multiply a column by 10: what would happen if a few of the values in the column were actually written out as text? Let’s try exploring this idea.
The most common way to create a vector in R is to use the c() function. This stands for combine. We can combine as many elements as we want into a single vector using c(). Each element of the vector is its own argument (separated by a comma).
For example, if we wanted to create a vector of Boston’s unemployment rate for each month in 2019 that we have data for (until October, as of this writing on Dec. 18th, 2019), we could write the below. We will save it in a vector called unemp.
unemp <- c(3.2, 2.8, 2.8, 2.4, 2.8, 2.9, 2.7, 2.6, 2.7, 2.3)
unemp
#> [1] 3.2 2.8 2.8 2.4 2.8 2.9 2.7 2.6 2.7 2.3
What is really great about vectors is that we can perform any number of operations on them—i.e. find the sum of all the values, the average, add a value to each element, etc.
If we wanted to find the average unemployment rate for Boston for Jan. - Oct. 2019, we can supply the vector to the function mean().
mean(unemp)
#> [1] 2.72
However, you may be thinking “there are 12 months in a year, not 10, and that should be represented” and if you are, I totally agree with you. Since the data for November and December are missing, we should denote that and update unemp accordingly. R uses NA to represent missing data. To represent this we can append two NAs to the vector we have. There are two ways we can do this. We can either combine unemp with two NAs, or rewrite the above vector.
# combining existing with 2 NAs
c(unemp, NA, NA)
#> [1] 3.2 2.8 2.8 2.4 2.8 2.9 2.7 2.6 2.7 2.3 NA NA
This works, but since we will be saving the result to unemp again, it is not best practice to use the variable you are changing in that object’s own assignment. Running such a line a second time would append two more NAs.
# for example
unemp <- c(unemp, NA, NA)
The above is rather unclear and might confuse someone that will have to read your code at a later time—that person may even be you. For this reason we will redefine it.
unemp <- c(3.2, 2.8, 2.8, 2.4, 2.8, 2.9, 2.7, 2.6, 2.7, 2.3, NA, NA)
unemp
#> [1] 3.2 2.8 2.8 2.4 2.8 2.9 2.7 2.6 2.7 2.3 NA NA
We know that there are 12 elements in this vector, but sometimes it is quite nice to sanity check oneself. We can always find out how long a vector is (that is, how many elements it contains) by supplying the vector to the length() function.
# how many observations are in `unemp`?
length(unemp)
#> [1] 12
There are a total of six types of atomic vectors. Fortunately, only four of these really matter to us. These are integer, double, character, and logical.
Integers represent whole numbers. To specify an integer we append an L after the number, such as 20L. Doubles are numbers that can carry a fractional part, i.e. decimal places. You can specify doubles in a number of formats, such as scientific notation. Generally the easiest way to do this, though, is using a decimal. Together, integers and doubles are lumped into the category of numeric. This is because, well, they are numbers.
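Here is a quick sketch of the integer and double distinction. The names whole, dec, and sci are made up for illustration, and typeof() is a base R function that reports how a value is stored.

```r
# Hypothetical example values (not from the text above)
whole <- 20L    # the L suffix makes this an integer
dec   <- 20.5   # a decimal point makes this a double
sci   <- 2e1    # scientific notation also gives a double

typeof(whole)
#> [1] "integer"
typeof(dec)
#> [1] "double"
typeof(sci)
#> [1] "double"

# both integers and doubles count as numeric
is.numeric(whole)
#> [1] TRUE
is.numeric(dec)
#> [1] TRUE
```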
As you learned previously, character vectors are created with the use of quotation marks: either " or '.
We’ve already created a vector of type double, unemp. You can check what type of vector unemp is with typeof().
Note: typeof() reports the internal storage type of an R object, such as a list or one of the vector types. In most cases you will want to use class() to return the class of an object.
typeof(unemp)
#> [1] "double"
Say we create another vector called month with the numbers 1 through 12.
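A minimal sketch of that step, assuming the numbers are typed out one by one inside c(). The second vector, month_int, is a hypothetical name added only to show the contrast.

```r
# typing the numbers out without the L suffix
month <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
typeof(month)
#> [1] "double"

# with the L suffix the same values are stored as integers
month_int <- c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L)
typeof(month_int)
#> [1] "integer"
```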
Notice that since we didn’t specify the L after the numbers, R defaulted to treating month as a double. When possible it is good to make the distinction between integer and double.
R has a number of built-in vectors: the letters of the alphabet in lower and upper case (letters and LETTERS respectively), as well as month.abb, month.name, and pi. month.name is already available to us, so let’s not recreate it.
month.name
#> [1] "January" "February" "March" "April" "May" "June"
#> [7] "July" "August" "September" "October" "November" "December"
typeof(month.name)
#> [1] "character"
Notice the quotes around each vector element. This is how we identify character vectors.
Logical vectors are the last kind of vector we need to go over. Logical vectors are represented as the values TRUE and FALSE. Simple enough. Onward!
Recall that vectors are atomic, meaning that there can only be one type per vector and we cannot mix and match. When a character is in the presence of an element of a different type, that other value is coerced into a character. Coercion is the process of implicitly or contextually changing an object from one type to another. For example, c(1, "a") yields the character vector "1" "a".
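A small sketch of coercion to character. The vector name mixed and its values are my own, chosen only for illustration.

```r
# mixing a number, a string, and a logical: everything becomes character
mixed <- c(1, "two", TRUE)
mixed
#> [1] "1"    "two"  "TRUE"
typeof(mixed)
#> [1] "character"
```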
Something similar happens when a logical value is in the presence of a numeric value:
c(TRUE, 1, FALSE)
#> [1] 1 1 0
In the presence of a numeric value, TRUE becomes 1 and FALSE becomes 0. This behavior occurs whenever a logical value is presented where a numeric is expected, such as in the function call below.
sum(TRUE, FALSE, FALSE)
#> [1] 1
While coercion occurs as a side effect of other processes, like combining values in a vector, casting is the process of intentionally changing an object’s class. There are a number of casting functions, which generally take the shape of as.class() or as_class(). Each of the vector types covered has its own casting function.
as.integer(TRUE)
#> [1] 1
as.character(123)
#> [1] "123"
as.double("2.331")
#> [1] 2.331
as.logical(0)
#> [1] FALSE
As you progress in your R journey you will find scenarios in which you need to cast objects from one class to another, and these functions do the trick.
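Casting does not always succeed. As a hedged sketch using only base R, here is what happens when the text cannot be read as a number. The name not_a_number is hypothetical, and suppressWarnings() is used only to keep the output quiet.

```r
# a string that does not look like a number becomes NA
# (R also raises a warning, silenced here with suppressWarnings())
not_a_number <- suppressWarnings(as.double("two"))
not_a_number
#> [1] NA

# casting numbers to logical: zero is FALSE, any other number is TRUE
as.logical(c(0, 1, 5))
#> [1] FALSE  TRUE  TRUE
```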
You now have a strong understanding of the underbelly of R vectors. One thing that is missing is an understanding of how we can select subsets from vectors. To extract a value from a vector we append square brackets to the end of the vector: vec[]. We supply an index value to the square brackets to receive the value at that position.
To select the month of January from the unemp vector, the first element, we provide the value of 1 to the brackets.
unemp[1]
#> [1] 3.2
To extract more than one value, we provide a vector of the indexes we desire.
unemp[c(1, 3)]
#> [1] 3.2 2.8
There is yet another way to extract values from these vectors. We can provide a logical vector to our square brackets. For example, we can identify every value of unemp that is above the average rate.
# find average removing missing values
avg_unemp <- mean(unemp, na.rm = TRUE)
# identify which values are above average
index <- unemp > avg_unemp
index
#> [1] TRUE TRUE TRUE FALSE TRUE TRUE FALSE FALSE FALSE FALSE NA NA
Notice that the NAs stayed NA? They can be pesky. Hadley writes in Advanced R “missing values tend to be infectious: most computations involving a missing value will return another missing value.”
unemp[index]
#> [1] 3.2 2.8 2.8 2.8 2.9 NA NA
How annoying those NAs can be! To prevent these NAs from showing up we can add another condition to our index line to remove NAs. Just as there are as.*() functions for casting, there are also is.*() functions for testing. One of them, is.na(), returns a logical vector of the same length as the provided vector.
Note: the * is called a wildcard. Wildcards are common in pattern matching, such as file-name patterns, and mean that any string can take their place. is.*() is intended to indicate any possible testing function, such as is.numeric(), is_tibble(), etc.
is.na(unemp)
#> [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE
As we learned, we can negate logical vectors with an !. We can negate the test results and include an & condition to identify only the unemployment values that are above average and are not missing.
index <- unemp > avg_unemp & !is.na(unemp)
unemp[index]
#> [1] 3.2 2.8 2.8 2.8 2.9
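To see why the missing values behave this way, here is a small sketch with a made-up vector called vals. It also shows which(), a base R function not used elsewhere in this chapter, as one alternative that drops NAs while subsetting.

```r
vals <- c(3.2, 2.4, NA)   # hypothetical values for illustration

# most computations that touch an NA return NA
NA + 1
#> [1] NA
NA > 2
#> [1] NA
mean(vals)
#> [1] NA

# na.rm = TRUE drops the missing values first
mean(vals, na.rm = TRUE)
#> [1] 2.8

# which() returns only the positions where the test is TRUE, so NAs fall away
vals[which(vals > 3)]
#> [1] 3.2
```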
There is one last thing to keep in mind when subsetting a vector with a logical vector of a different length. When the logical vector is shorter than the vector being subset, it is recycled to cover the remaining values. As always, an example will be best.
Say we have an object called x which holds the values from 0 to 10, and an index to subset with. If index is a logical vector of length two with the values TRUE and FALSE, every other observation will be returned. This is because, come the third value in x, R has run out of values in index to use, so it goes back to the beginning.
x <- 0:10
x
#> [1] 0 1 2 3 4 5 6 7 8 9 10
index <- c(TRUE, FALSE)
x[index]
#> [1] 0 2 4 6 8 10
And what happens when the only value is a single logical value?
x[TRUE]
#> [1] 0 1 2 3 4 5 6 7 8 9 10
x[FALSE]
#> integer(0)
In this latter case, see how the output says integer(0). This is informing you that the result is an integer vector containing 0 elements.
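The same recycling applies to an index of any length. As a quick sketch, a length-three index keeps every third value, and length() confirms that the empty result really has no elements.

```r
x <- 0:10

# TRUE, FALSE, FALSE is recycled across all 11 values
x[c(TRUE, FALSE, FALSE)]
#> [1] 0 3 6 9

# an empty result has length zero
length(x[FALSE])
#> [1] 0
```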
14.2 Data Frames
The entirety of the work in this book so far has been with tibbles. Tibbles are actually a special type of data frame. Data frames are R’s native way of storing rectangular data. Rectangles are two-dimensional, and so are data frames.
Data frames are secretly just a bunch of vectors squished together. The important thing is that all vectors are of the same length. This ensures that each observation (row) has one value from each vector. Because of the nature of a data frame, each column must adhere to the rules of vectors.
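A rough sketch of both rules, using base R’s data.frame() and made-up values. The names df and result are hypothetical.

```r
# hypothetical two-column data frame
df <- data.frame(
  rate  = c(3.2, 2.8, 2.8),
  month = c("Jan", "Feb", "Mar"),
  stringsAsFactors = FALSE
)

# each column is an atomic vector with its own type:
# rate is a double and month is a character
sapply(df, typeof)

# columns of mismatched length (3 and 2) are rejected with an error
result <- tryCatch(
  data.frame(rate = c(3.2, 2.8, 2.8), month = c("Jan", "Feb")),
  error = function(e) "error: columns differ in length"
)
result
#> [1] "error: columns differ in length"
```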
Let’s create a tibble using the unemp vector and the tibble() function. tibble() works in a somewhat similar manner to mutate(), in that the arguments we provide are name-value pairs. In the case of tibble(), the arguments take the form of col_name = vector.
We create a tibble with the unemployment rate below.
library(dplyr)
tibble(
  unemp_rate = unemp
)
#> # A tibble: 12 x 1
#> unemp_rate
#> <dbl>
#> 1 3.2
#> 2 2.8
#> 3 2.8
#> 4 2.4
#> 5 2.8
#> 6 2.9
#> 7 2.7
#> 8 2.6
#> 9 2.7
#> 10 2.3
#> 11 NA
#> 12 NA
We can add the month name and create a new column to indicate if that month has a higher than average unemployment rate.
unemp_tbl <- tibble(
  unemp_rate = unemp,
  month = month.name
) %>%
  mutate(above_avg = unemp_rate > avg_unemp)
unemp_tbl
#> # A tibble: 12 x 3
#> unemp_rate month above_avg
#> <dbl> <chr> <lgl>
#> 1 3.2 January TRUE
#> 2 2.8 February TRUE
#> 3 2.8 March TRUE
#> 4 2.4 April FALSE
#> 5 2.8 May TRUE
#> 6 2.9 June TRUE
#> 7 2.7 July FALSE
#> 8 2.6 August FALSE
#> 9 2.7 September FALSE
#> 10 2.3 October FALSE
#> 11 NA November NA
#> 12 NA December NA
To interact with the underlying vector of a data frame we can use the dollar sign operator, $. This takes the form of tbl$col_name.
For example, extracting the unemp_rate column looks like:
unemp_tbl$unemp_rate
#> [1] 3.2 2.8 2.8 2.4 2.8 2.9 2.7 2.6 2.7 2.3 NA NA
Note the difference between select(tbl, col) and tbl$col.
select(unemp_tbl, unemp_rate)
#> # A tibble: 12 x 1
#> unemp_rate
#> <dbl>
#> 1 3.2
#> 2 2.8
#> 3 2.8
#> 4 2.4
#> 5 2.8
#> 6 2.9
#> 7 2.7
#> 8 2.6
#> 9 2.7
#> 10 2.3
#> 11 NA
#> 12 NA
The difference is that $ returns the underlying vector whereas select() will always return another data frame. You now have the ability to both filter data and grab a subset of a vector. But we have yet to visit how to grab a single value from a data frame.
You could try a tidyverse approach, such as chaining slice() and select(), to grab the 10th value of the first column. But again, you would still have a tibble, and you are not able to use that directly like a standalone number.
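One possible tidyverse attempt, sketched under the assumption that slice() and select() are the tools used. The object name single is mine, and the tibble is rebuilt here so the block runs on its own.

```r
library(dplyr)

unemp <- c(3.2, 2.8, 2.8, 2.4, 2.8, 2.9, 2.7, 2.6, 2.7, 2.3, NA, NA)
unemp_tbl <- tibble(unemp_rate = unemp, month = month.name)

# keep only row 10, then keep only column 1
single <- unemp_tbl %>%
  slice(10) %>%
  select(1)

# the result holds one value but is still a data frame
is.data.frame(single)
#> [1] TRUE
dim(single)
#> [1] 1 1
```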
We can again use brackets to subset our R object. But data frames are two-dimensional, so we need to specify the indexes in two dimensions. If you have ever hand-drawn a graph on a Cartesian plane, which I assume you all have, this is the same idea. With a Cartesian plane we can identify any point with a combination of two values: x and y. x refers to the horizontal axis and y to the vertical axis. When we put the Cartesian plane in the same frame of reference as the rectangular data frame, we envision our rows as the x and our columns as the y, so an index is written as [row, column].
In specifying our index, we are able to select all rows or all columns by leaving the x or y spot empty respectively.
unemp_tbl[,1]
#> # A tibble: 12 x 1
#> unemp_rate
#> <dbl>
#> 1 3.2
#> 2 2.8
#> 3 2.8
#> 4 2.4
#> 5 2.8
#> 6 2.9
#> 7 2.7
#> 8 2.6
#> 9 2.7
#> 10 2.3
#> 11 NA
#> 12 NA
unemp_tbl[10,]
#> # A tibble: 1 x 3
#> unemp_rate month above_avg
#> <dbl> <chr> <lgl>
#> 1 2.3 October FALSE
To replicate the above tidyverse example we would provide the indexes 10 and 1 respectively.
unemp_tbl[10,1]
#> # A tibble: 1 x 1
#> unemp_rate
#> <dbl>
#> 1 2.3
This is great: we’ve rewritten our tidyverse code in base R. But, just like the tidyverse code, we maintain the tibble data structure. This is because a single bracket maintains the data structure of the object we are selecting from. If we use double brackets instead, we are returned an object of the same class as the underlying column vector.
unemp_tbl[[10,1]]
#> [1] 2.3
What that code is doing is narrowing the tibble down to a single column and a single row index, and then extracting the value from the underlying vector (the second bracket). To extract the underlying vector using the tidyverse, we can use the function dplyr::pull().
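A short sketch of pull() in action. The tibble is rebuilt here so the block runs on its own, and rates is a hypothetical name.

```r
library(dplyr)

unemp <- c(3.2, 2.8, 2.8, 2.4, 2.8, 2.9, 2.7, 2.6, 2.7, 2.3, NA, NA)
unemp_tbl <- tibble(unemp_rate = unemp, month = month.name)

# pull() returns the column as a plain vector
rates <- pull(unemp_tbl, unemp_rate)
typeof(rates)
#> [1] "double"

# which we can then subset like any other vector
rates[10]
#> [1] 2.3
```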
Now this brings us to the second-most fundamental structure in R: the list. Yes, second-most fundamental. I’ve been keeping a secret from you. Data frames are actually just lists in disguise. To prove it, I will remove the class from unemp_tbl with unclass() and return the class of that unclassed object.
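A minimal sketch of that proof, assuming base R’s unclass() is what strips the class. The tibble is rebuilt here so the block runs on its own.

```r
library(dplyr)

unemp <- c(3.2, 2.8, 2.8, 2.4, 2.8, 2.9, 2.7, 2.6, 2.7, 2.3, NA, NA)
unemp_tbl <- tibble(unemp_rate = unemp, month = month.name)

# with its class intact, this is a tibble
class(unemp_tbl)
#> [1] "tbl_df"     "tbl"        "data.frame"

# strip the class and look again
class(unclass(unemp_tbl))
#> [1] "list"
```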
That is right, data frames are actually just lists disguised as rectangles.
14.3 Lists
There is a good chance that you will not have to interact with lists too often. That doesn’t mean you shouldn’t know how to when that time comes.
Lists are generally the most flexible object type in R. Unlike vectors and data frames, lists do not impose any structure on the storage of our data.
The simplest lists may resemble something like a vector.
list("Jan", "Feb", "Mar")
#> [[1]]
#> [1] "Jan"
#>
#> [[2]]
#> [1] "Feb"
#>
#> [[3]]
#> [1] "Mar"
Notice how this prints differently than
c("Jan", "Feb", "Mar")
#> [1] "Jan" "Feb" "Mar"
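The difference becomes clearer once the elements have different types. In this sketch, v and lst are made-up names: c() coerces everything to one type, while list() lets each element keep its own.

```r
# c() coerces all elements to a common type
v <- c(1L, "a", TRUE)
typeof(v)
#> [1] "character"

# list() keeps each element's own type
lst <- list(1L, "a", TRUE)
sapply(lst, typeof)
#> [1] "integer"   "character" "logical"
```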
Each element of a list is self-contained. I think of lists somewhat like shipping containers where each element is its own container and all components of each element are together. We can include any type of R object in a list. For example, we can include the unemp_tbl and associated vectors.
l <- list(unemp_tbl, unemp, month.name)
We can view the structure of the list to get an idea of what is actually contained by that list.
str(l)
#> List of 3
#> $ : tibble [12 × 3] (S3: tbl_df/tbl/data.frame)
#> ..$ unemp_rate: num [1:12] 3.2 2.8 2.8 2.4 2.8 2.9 2.7 2.6 2.7 2.3 ...
#> ..$ month : chr [1:12] "January" "February" "March" "April" ...
#> ..$ above_avg : logi [1:12] TRUE TRUE TRUE FALSE TRUE TRUE ...
#> $ : num [1:12] 3.2 2.8 2.8 2.4 2.8 2.9 2.7 2.6 2.7 2.3 ...
#> $ : chr [1:12] "January" "February" "March" "April" ...
The structure of l shows us that the first element is a tibble (has class tbl_df), and the other elements are numeric and character vectors respectively.
Because of this flexibility, there are no predetermined dimensions that we can specify to our brackets. As when extracting the underlying vector value from a data frame, we have to use [[ for indexing. I like to think of [ as walking up to the storage container and [[ as actually opening it up and going inside. To get a sense of the difference, let’s look at the unemp vector.
l[2]
#> [[1]]
#> [1] 3.2 2.8 2.8 2.4 2.8 2.9 2.7 2.6 2.7 2.3 NA NA
class(l[2])
#> [1] "list"
When using the single bracket we are just selecting the second element of the list while keeping its container, which is why we are returned another list.
l[[2]]
#> [1] 3.2 2.8 2.8 2.4 2.8 2.9 2.7 2.6 2.7 2.3 NA NA
class(l[[2]])
#> [1] "numeric"
When we use the double bracket we are going inside of the container and actually plucking that element out of the list. Once you have plucked out that element, you can use another set of brackets to subset that item. To grab the tenth row and first column of the unemp_tbl inside of l we can write:
l[[1]][[10,1]]
#> [1] 2.3
Now that we know that data frames are lists, we can actually extract the underlying vectors using [[ as well as $. We can get the tenth row and first column a number of ways.
# subsetting the data frame
l[[1]][[10,1]]
#> [1] 2.3
# grabbing the first vector then position
l[[1]][[1]][10]
#> [1] 2.3
# grabbing the vector by name then position
l[[1]]$unemp_rate[10]
#> [1] 2.3
Frankly, all of these brackets can get a little messy. The tidyverse package purrr has a super handy function called pluck() which handles all of these brackets for us. purrr::pluck() is meant for flexible indexing into data structures (see its documentation).
pluck() works by first providing the object that you’d like to index (again, notice the data-first emphasis) and then providing the position of the element you would like to pluck out of the object. Generally, I will use pluck() when possible. By doing so the code becomes more readable and adheres to a single style more thoroughly.
purrr::pluck(l, 1, 1, 10)
#> [1] 2.3
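pluck() also accepts names in place of positions. Here is a small sketch with a shortened, hypothetical stand-in for l (a base data frame with three made-up rows), which also shows that pluck() returns NULL for a position that does not exist.

```r
library(purrr)

# hypothetical stand-in for the list `l` used above
small_l <- list(
  data.frame(unemp_rate = c(3.2, 2.8, 2.4)),
  c(3.2, 2.8, 2.4)
)

# first element, then the column by name, then the third value
pluck(small_l, 1, "unemp_rate", 3)
#> [1] 2.4

# the bracket equivalent
small_l[[1]][["unemp_rate"]][3]
#> [1] 2.4

# a position that does not exist gives NULL
pluck(small_l, 5)
#> NULL
```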
Congratulations! You made it to the end of this exceptionally dense chapter. You may feel a little overwhelmed and that is to be expected. Nonetheless you should be proud! I have a few more asks of you before you move on.
14.3.1 Exercises
- Drink some water
- Move around a bit and shake it out
- Create a list with the vectors unemp, month.name, and avg_unemp.
- Recreate the unemp_tbl, but referencing the list elements.
library(purrr)
unemp_l <- list(unemp, month.name, avg_unemp)
tibble(
  unemp_rate = pluck(unemp_l, 1),
  month = pluck(unemp_l, 2)
) %>%
  mutate(above_avg = unemp_rate > pluck(unemp_l, 3))
#> # A tibble: 12 x 3
#> unemp_rate month above_avg
#> <dbl> <chr> <lgl>
#> 1 3.2 January TRUE
#> 2 2.8 February TRUE
#> 3 2.8 March TRUE
#> 4 2.4 April FALSE
#> 5 2.8 May TRUE
#> 6 2.9 June TRUE
#> 7 2.7 July FALSE
#> 8 2.6 August FALSE
#> 9 2.7 September FALSE
#> 10 2.3 October FALSE
#> 11 NA November NA
#> 12 NA December NA