Now that we have a basic understanding of how to manipulate our dataset, summarising the dataset into a few useful metrics is important. When we have massive datasets with many subgroups, summary statistics will be very important for distilling all of that information into something consumable. Aggregation will also be very important for visualization purposes.
We have already reviewed what constitutes a summary statistic and how to create them working with vectors. But we have not done so within the context of the tidyverse. We have figured out how to select, filter, mutate and all within a chain of functions. But we have not followed this to its natural next step, the
dplyr includes a wonderful helper function called
count(). It does just what it says it does. It counts the number of observations in a tibble. Let’s explore this!
We can also count by groups in a data set. For example, we can count how many observations there are per county.
count(commute, county) #> # A tibble: 3 x 2 #> county n #> <chr> <int> #> 1 MIDDLESEX 318 #> 2 NORFOLK 130 #> 3 SUFFOLK 200
count() is actually a wrapper around the function
summarise() which is a much more flexible function.
summarise() is the aggregate analog to
mutate(). The difference between
summarise() is that the result of an expression in
mutate() must have the same number of values as there are rows—unless of course you are specifying a scalar value like
summarise() requires the result to be one an element of length one.
Notes: - A wrapper is function that executes another function. - A scalar is a vector of length one.
We can recreate the first above
count() call with
summarise() and the handy
n() function we learned a while ago. Here we follow the same pattern of assigning column names to expressions as we do with.
mutate() there is no restriction on the number of new columns we can create. Previously we calculated the min, max, mean, and standard deviation of the
commute3060 variable. This is done rather neatly with
commute %>% summarise( min_commute = min(commute3060, na.rm = TRUE), max_commute = max(commute3060, na.rm = TRUE), avg_commute = mean(commute3060, na.rm = TRUE), sd_commute = sd(commute3060, na.rm = TRUE) ) #> # A tibble: 1 x 4 #> min_commute max_commute avg_commute sd_commute #> <dbl> <dbl> <dbl> <dbl> #> 1 0 0.633 0.390 0.0862
Frankly this alone is somewhat unimpressive. The power of
summarise() comes from incorporating
group_by() into the function chain.
group_by() allows us to explicitly identify groups within a tibble as defined by a given variable. The resulting tibble from a
group_by() call is seemingly unchanged.
commute %>% group_by(county) #> # A tibble: 648 x 14 #> # Groups: county  #> county hs_grad bach master commute_less10 commute1030 commute3060 #> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> #> 1 MIDDL… 0.389 0.188 0.100 0.0916 0.357 0.375 #> 2 MIDDL… 0.167 0.400 0.130 0.0948 0.445 0.344 #> 3 MIDDL… 0.184 0.317 0.139 0.0720 0.404 0.382 #> 4 MIDDL… 0.258 0.322 0.144 0.0983 0.390 0.379 #> 5 MIDDL… 0.301 0.177 0.0742 0.0670 0.379 0.365 #> 6 MIDDL… 0.159 0.310 0.207 0.0573 0.453 0.352 #> 7 MIDDL… 0.268 0.247 0.149 0.0791 0.475 0.368 #> 8 MIDDL… 0.261 0.300 0.126 0.137 0.450 0.337 #> 9 MIDDL… 0.315 0.198 0.140 0.0752 0.478 0.329 #> 10 MIDDL… 0.151 0.348 0.151 0.0830 0.474 0.322 #> # … with 638 more rows, and 7 more variables: commute6090 <dbl>, #> # commute_over90 <dbl>, by_auto <dbl>, by_pub_trans <dbl>, by_bike <dbl>, #> # by_walk <dbl>, med_house_income <dbl>
However, if we look at comments above the tibble, we see something new:
# Groups: county . This tells us a couple of things. First that the groups were created using the
county column, that there are fifteen groups, and that the data frame is now grouped implying that any future
summarise() calls will be performed on the specified groups. If we then look at the class of that grouped tibble we see that there is a new class introduced which is
Note: a tibble has the classes
tbl_dfon top of the Base R class
When a tibble has this object class, dplyr knows that operations should be grouped. For example if you were to calculate the mean, this would be the mean for the specified groups rather than the mean for the entire dataset.
One function that is extremely useful is the
n() function to identify how many observations there are per group inside of a mutate call.
I am including the
commute3060column to illustrate that the new
ncolumn will be the same for each group value.
commute %>% group_by(county) %>% mutate(n = n()) %>% select(county, commute3060, n) #> # A tibble: 648 x 3 #> # Groups: county  #> county commute3060 n #> <chr> <dbl> <int> #> 1 MIDDLESEX 0.375 318 #> 2 MIDDLESEX 0.344 318 #> 3 MIDDLESEX 0.382 318 #> 4 MIDDLESEX 0.379 318 #> 5 MIDDLESEX 0.365 318 #> 6 MIDDLESEX 0.352 318 #> 7 MIDDLESEX 0.368 318 #> 8 MIDDLESEX 0.337 318 #> 9 MIDDLESEX 0.329 318 #> 10 MIDDLESEX 0.322 318 #> # … with 638 more rows
Here each group only has one unique value for
n. As discussed previously, when we want to calculate aggregate measures, there ought to only value per-group. This ability to perform grouped calculation within
mutate()can be extremely powerful, but does not create a proper aggregated dataset. For this, we can again use
Let’s recreate the grouped count from before.
We can also include the summary statistic calculations from before.
commute %>% group_by(county) %>% summarise( n = n(), min_commute = min(commute3060, na.rm = TRUE), max_commute = max(commute3060, na.rm = TRUE), avg_commute = mean(commute3060, na.rm = TRUE), sd_commute = sd(commute3060, na.rm = TRUE) ) #> `summarise()` ungrouping output (override with `.groups` argument) #> # A tibble: 3 x 6 #> county n min_commute max_commute avg_commute sd_commute #> <chr> <int> <dbl> <dbl> <dbl> <dbl> #> 1 MIDDLESEX 318 0.104 0.612 0.383 0.0825 #> 2 NORFOLK 130 0.186 0.613 0.392 0.0761 #> 3 SUFFOLK 200 0 0.633 0.400 0.0972