# 16 Summarizing the tidy way

Now that we have a basic understanding of how to manipulate our dataset, summarising the dataset into a few useful metrics is important. When we have massive datasets with many subgroups, summary statistics will be very important for distilling all of that information into something consumable. Aggregation will also be very important for visualization purposes.

We have already reviewed what constitutes a summary statistic and how to create them working with vectors. But we have not done so within the context of the tidyverse. We have figured out how to select, filter, mutate and all within a chain of functions. But we have not followed this to its natural next step, the `group_by()`

and `summarise()`

functions.

dplyr includes a wonderful helper function called `count()`

. It does just what it says it does. It counts the number of observations in a tibble. Let’s explore this!

We can also count by groups in a data set. For example, we can count how many observations there are per county.

```
count(commute, county)
#> # A tibble: 3 x 2
#> county n
#> <chr> <int>
#> 1 MIDDLESEX 318
#> 2 NORFOLK 130
#> 3 SUFFOLK 200
```

`count()`

is actually a wrapper around the function `summarise()`

which is a much more flexible function. `summarise()`

is the aggregate analog to `mutate()`

. The difference between `mutate()`

and `summarise()`

is that the result of an expression in `mutate()`

must have the same number of values as there are rows—unless of course you are specifying a scalar value like `TRUE`

—whereas `summarise()`

requires the result to be one an element of length one.

Notes: - A wrapper is function that executes another function. - A scalar is a vector of length one.

We can recreate the first above `count()`

call with `summarise()`

and the handy `n()`

function we learned a while ago. Here we follow the same pattern of assigning column names to expressions as we do with.

Like `mutate()`

there is no restriction on the number of new columns we can create. Previously we calculated the min, max, mean, and standard deviation of the `commute3060`

variable. This is done rather neatly with `summarise()`

.

```
commute %>%
summarise(
min_commute = min(commute3060, na.rm = TRUE),
max_commute = max(commute3060, na.rm = TRUE),
avg_commute = mean(commute3060, na.rm = TRUE),
sd_commute = sd(commute3060, na.rm = TRUE)
)
#> # A tibble: 1 x 4
#> min_commute max_commute avg_commute sd_commute
#> <dbl> <dbl> <dbl> <dbl>
#> 1 0 0.633 0.390 0.0862
```

Frankly this alone is somewhat unimpressive. The power of `summarise()`

comes from incorporating `group_by()`

into the function chain. `group_by()`

allows us to explicitly identify groups within a tibble as defined by a given variable. The resulting tibble from a `group_by()`

call is seemingly unchanged.

```
commute %>%
group_by(county)
#> # A tibble: 648 x 14
#> # Groups: county [3]
#> county hs_grad bach master commute_less10 commute1030 commute3060
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 MIDDL… 0.389 0.188 0.100 0.0916 0.357 0.375
#> 2 MIDDL… 0.167 0.400 0.130 0.0948 0.445 0.344
#> 3 MIDDL… 0.184 0.317 0.139 0.0720 0.404 0.382
#> 4 MIDDL… 0.258 0.322 0.144 0.0983 0.390 0.379
#> 5 MIDDL… 0.301 0.177 0.0742 0.0670 0.379 0.365
#> 6 MIDDL… 0.159 0.310 0.207 0.0573 0.453 0.352
#> 7 MIDDL… 0.268 0.247 0.149 0.0791 0.475 0.368
#> 8 MIDDL… 0.261 0.300 0.126 0.137 0.450 0.337
#> 9 MIDDL… 0.315 0.198 0.140 0.0752 0.478 0.329
#> 10 MIDDL… 0.151 0.348 0.151 0.0830 0.474 0.322
#> # … with 638 more rows, and 7 more variables: commute6090 <dbl>,
#> # commute_over90 <dbl>, by_auto <dbl>, by_pub_trans <dbl>, by_bike <dbl>,
#> # by_walk <dbl>, med_house_income <dbl>
```

However, if we look at comments above the tibble, we see something new: `# Groups: county [3]`

. This tells us a couple of things. First that the groups were created using the `county`

column, that there are fifteen groups, and that the data frame is now grouped implying that any future `mutate()`

or `summarise()`

calls will be performed on the specified groups. If we then look at the class of that grouped tibble we see that there is a new class introduced which is `grouped_df`

.

Note: a tibble has the classes

`tbl`

and`tbl_df`

on top of the Base R class`data.frame`

.

When a tibble has this object class, dplyr knows that operations should be grouped. For example if you were to calculate the mean, this would be the mean for the specified groups rather than the mean for the entire dataset.
One function that is extremely useful is the `n()`

function to identify how many observations there are per group inside of a mutate call.

I am including the

`commute3060`

column to illustrate that the new`n`

column will be the same for each group value.

```
commute %>%
group_by(county) %>%
mutate(n = n()) %>%
select(county, commute3060, n)
#> # A tibble: 648 x 3
#> # Groups: county [3]
#> county commute3060 n
#> <chr> <dbl> <int>
#> 1 MIDDLESEX 0.375 318
#> 2 MIDDLESEX 0.344 318
#> 3 MIDDLESEX 0.382 318
#> 4 MIDDLESEX 0.379 318
#> 5 MIDDLESEX 0.365 318
#> 6 MIDDLESEX 0.352 318
#> 7 MIDDLESEX 0.368 318
#> 8 MIDDLESEX 0.337 318
#> 9 MIDDLESEX 0.329 318
#> 10 MIDDLESEX 0.322 318
#> # … with 638 more rows
```

Here each group only has one unique value for `n`

. As discussed previously, when we want to calculate aggregate measures, there ought to only value per-group. This ability to perform grouped calculation within `mutate()`

can be extremely powerful, but does not create a proper aggregated dataset. For this, we can again use `summarise()`

Let’s recreate the grouped count from before.

```
commute %>%
group_by(county) %>%
summarise(n = n())
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 3 x 2
#> county n
#> <chr> <int>
#> 1 MIDDLESEX 318
#> 2 NORFOLK 130
#> 3 SUFFOLK 200
```

We can also include the summary statistic calculations from before.

```
commute %>%
group_by(county) %>%
summarise(
n = n(),
min_commute = min(commute3060, na.rm = TRUE),
max_commute = max(commute3060, na.rm = TRUE),
avg_commute = mean(commute3060, na.rm = TRUE),
sd_commute = sd(commute3060, na.rm = TRUE)
)
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 3 x 6
#> county n min_commute max_commute avg_commute sd_commute
#> <chr> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 MIDDLESEX 318 0.104 0.612 0.383 0.0825
#> 2 NORFOLK 130 0.186 0.613 0.392 0.0761
#> 3 SUFFOLK 200 0 0.633 0.400 0.0972
```