Announcement

Acknowledgement

Dr. Hua Zhou’s slides

rm(list = ls()) # clean-up workspace

A typical data science project:

nycflights13 data

dplyr basics

in conjunction with group_by() changes the scope of each function from operating on the entire dataset to operating on it group-by-group.

Manipulate rows (cases)

Filter rows with filter()

  • Flights on Jan 1st:

    # same as filter(flights, month == 1 & day == 1)
    filter(flights, month == 1, day == 1)
    ## # A tibble: 842 × 19
    ##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
    ##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
    ##  1  2013     1     1      517            515         2      830            819
    ##  2  2013     1     1      533            529         4      850            830
    ##  3  2013     1     1      542            540         2      923            850
    ##  4  2013     1     1      544            545        -1     1004           1022
    ##  5  2013     1     1      554            600        -6      812            837
    ##  6  2013     1     1      554            558        -4      740            728
    ##  7  2013     1     1      555            600        -5      913            854
    ##  8  2013     1     1      557            600        -3      709            723
    ##  9  2013     1     1      557            600        -3      838            846
    ## 10  2013     1     1      558            600        -2      753            745
    ## # ℹ 832 more rows
    ## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
    ## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
    ## #   hour <dbl>, minute <dbl>, time_hour <dttm>

  • Flights in Nov or Dec:

    filter(flights, month == 11 | month == 12)
    ## # A tibble: 55,403 × 19
    ##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
    ##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
    ##  1  2013    11     1        5           2359         6      352            345
    ##  2  2013    11     1       35           2250       105      123           2356
    ##  3  2013    11     1      455            500        -5      641            651
    ##  4  2013    11     1      539            545        -6      856            827
    ##  5  2013    11     1      542            545        -3      831            855
    ##  6  2013    11     1      549            600       -11      912            923
    ##  7  2013    11     1      550            600       -10      705            659
    ##  8  2013    11     1      554            600        -6      659            701
    ##  9  2013    11     1      554            600        -6      826            827
    ## 10  2013    11     1      554            600        -6      749            751
    ## # ℹ 55,393 more rows
    ## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
    ## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
    ## #   hour <dbl>, minute <dbl>, time_hour <dttm>

    equavalent as

    filter(flights, month %in% c(11, 12))
    ## # A tibble: 55,403 × 19
    ##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
    ##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
    ##  1  2013    11     1        5           2359         6      352            345
    ##  2  2013    11     1       35           2250       105      123           2356
    ##  3  2013    11     1      455            500        -5      641            651
    ##  4  2013    11     1      539            545        -6      856            827
    ##  5  2013    11     1      542            545        -3      831            855
    ##  6  2013    11     1      549            600       -11      912            923
    ##  7  2013    11     1      550            600       -10      705            659
    ##  8  2013    11     1      554            600        -6      659            701
    ##  9  2013    11     1      554            600        -6      826            827
    ## 10  2013    11     1      554            600        -6      749            751
    ## # ℹ 55,393 more rows
    ## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
    ## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
    ## #   hour <dbl>, minute <dbl>, time_hour <dttm>

Missing values

filter() only includes rows where the condition is TRUE; it excludes both FALSE and NA values.

If you want to preserve missing values, ask for them explicitly

df <- tibble(x = c(1, NA, 3))
filter(df, x > 1)
## # A tibble: 1 × 1
##       x
##   <dbl>
## 1     3
filter(df, is.na(x) | x > 1)
## # A tibble: 2 × 1
##       x
##   <dbl>
## 1    NA
## 2     3

Remove rows with duplicate values

  • One row from each month:

    distinct(flights, month, .keep_all = TRUE)
    ## # A tibble: 12 × 19
    ##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
    ##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
    ##  1  2013     1     1      517            515         2      830            819
    ##  2  2013    10     1      447            500       -13      614            648
    ##  3  2013    11     1        5           2359         6      352            345
    ##  4  2013    12     1       13           2359        14      446            445
    ##  5  2013     2     1      456            500        -4      652            648
    ##  6  2013     3     1        4           2159       125      318             56
    ##  7  2013     4     1      454            500        -6      636            640
    ##  8  2013     5     1        9           1655       434      308           2020
    ##  9  2013     6     1        2           2359         3      341            350
    ## 10  2013     7     1        1           2029       212      236           2359
    ## 11  2013     8     1       12           2130       162      257             14
    ## 12  2013     9     1        9           2359        10      343            340
    ## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
    ## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
    ## #   hour <dbl>, minute <dbl>, time_hour <dttm>
  • With .keep_all = FALSE, only distinct values of the variable are selected:

    distinct(flights, month)
    ## # A tibble: 12 × 1
    ##    month
    ##    <int>
    ##  1     1
    ##  2    10
    ##  3    11
    ##  4    12
    ##  5     2
    ##  6     3
    ##  7     4
    ##  8     5
    ##  9     6
    ## 10     7
    ## 11     8
    ## 12     9

Sample rows

  • Randomly select n rows:

    slice_sample(flights, n = 10, replace = TRUE)
    ## # A tibble: 10 × 19
    ##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
    ##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
    ##  1  2013     2    11     1842           1815        27     2116           2125
    ##  2  2013    12     2     1539           1520        19     1858           1839
    ##  3  2013     3    26      814            815        -1      948           1010
    ##  4  2013     5    13     1202           1210        -8     1433           1510
    ##  5  2013     3     6       NA            608        NA       NA            724
    ##  6  2013     5    15     2011           2000        11     2127           2112
    ##  7  2013     1    22     1344           1345        -1     1620           1649
    ##  8  2013     3     3     1839           1835         4     2022           2040
    ##  9  2013     9    26     1808           1715        53     2056           2015
    ## 10  2013     2    21     1055           1059        -4     1402           1441
    ## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
    ## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
    ## #   hour <dbl>, minute <dbl>, time_hour <dttm>

  • Randomly select fraction of rows:

    slice_sample(flights, prop = 0.1, replace = TRUE)
    ## # A tibble: 33,677 × 19
    ##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
    ##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
    ##  1  2013     9     4      700            705        -5      953            955
    ##  2  2013     3    18     1026           1030        -4     1414           1355
    ##  3  2013     6     3     1802           1805        -3     2115           2128
    ##  4  2013    10     7     2049           1910        99     2317           2149
    ##  5  2013     4    28     1653           1659        -6     1933           1959
    ##  6  2013     6    20     1627           1630        -3     1907           1933
    ##  7  2013     6    18     1446           1355        51     1729           1643
    ##  8  2013     3    27     1916           1925        -9     2116           2128
    ##  9  2013     3    14      551            600        -9      840            910
    ## 10  2013    12    28      656            700        -4      918            944
    ## # ℹ 33,667 more rows
    ## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
    ## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
    ## #   hour <dbl>, minute <dbl>, time_hour <dttm>

Select rows by position

  • Select rows by position:

    slice(flights, 1:5)
    ## # A tibble: 5 × 19
    ##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
    ##   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
    ## 1  2013     1     1      517            515         2      830            819
    ## 2  2013     1     1      533            529         4      850            830
    ## 3  2013     1     1      542            540         2      923            850
    ## 4  2013     1     1      544            545        -1     1004           1022
    ## 5  2013     1     1      554            600        -6      812            837
    ## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
    ## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
    ## #   hour <dbl>, minute <dbl>, time_hour <dttm>

  • Top n rows:

    slice_head(flights, n = 5)
    ## # A tibble: 5 × 19
    ##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
    ##   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
    ## 1  2013     1     1      517            515         2      830            819
    ## 2  2013     1     1      533            529         4      850            830
    ## 3  2013     1     1      542            540         2      923            850
    ## 4  2013     1     1      544            545        -1     1004           1022
    ## 5  2013     1     1      554            600        -6      812            837
    ## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
    ## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
    ## #   hour <dbl>, minute <dbl>, time_hour <dttm>

    Use prop argument to specify proportions instead

  • Bottom n rows:

    slice_tail(flights, n = 5)
    ## # A tibble: 5 × 19
    ##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
    ##   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
    ## 1  2013     9    30       NA           1455        NA       NA           1634
    ## 2  2013     9    30       NA           2200        NA       NA           2312
    ## 3  2013     9    30       NA           1210        NA       NA           1330
    ## 4  2013     9    30       NA           1159        NA       NA           1344
    ## 5  2013     9    30       NA            840        NA       NA           1020
    ## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
    ## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
    ## #   hour <dbl>, minute <dbl>, time_hour <dttm>

Arrange rows with arrange()

  • Sort in ascending order:

    arrange(flights, year, month, day)
    ## # A tibble: 336,776 × 19
    ##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
    ##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
    ##  1  2013     1     1      517            515         2      830            819
    ##  2  2013     1     1      533            529         4      850            830
    ##  3  2013     1     1      542            540         2      923            850
    ##  4  2013     1     1      544            545        -1     1004           1022
    ##  5  2013     1     1      554            600        -6      812            837
    ##  6  2013     1     1      554            558        -4      740            728
    ##  7  2013     1     1      555            600        -5      913            854
    ##  8  2013     1     1      557            600        -3      709            723
    ##  9  2013     1     1      557            600        -3      838            846
    ## 10  2013     1     1      558            600        -2      753            745
    ## # ℹ 336,766 more rows
    ## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
    ## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
    ## #   hour <dbl>, minute <dbl>, time_hour <dttm>

  • Sort in descending order:

    arrange(flights, desc(arr_delay))
    ## # A tibble: 336,776 × 19
    ##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
    ##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
    ##  1  2013     1     9      641            900      1301     1242           1530
    ##  2  2013     6    15     1432           1935      1137     1607           2120
    ##  3  2013     1    10     1121           1635      1126     1239           1810
    ##  4  2013     9    20     1139           1845      1014     1457           2210
    ##  5  2013     7    22      845           1600      1005     1044           1815
    ##  6  2013     4    10     1100           1900       960     1342           2211
    ##  7  2013     3    17     2321            810       911      135           1020
    ##  8  2013     7    22     2257            759       898      121           1026
    ##  9  2013    12     5      756           1700       896     1058           2020
    ## 10  2013     5     3     1133           2055       878     1250           2215
    ## # ℹ 336,766 more rows
    ## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
    ## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
    ## #   hour <dbl>, minute <dbl>, time_hour <dttm>

Missing values are always sorted at the end.

Manipulate columns (variables)

Select columns with select()

  • Select columns by variable name:

    select(flights, year, month, day)
    ## # A tibble: 336,776 × 3
    ##     year month   day
    ##    <int> <int> <int>
    ##  1  2013     1     1
    ##  2  2013     1     1
    ##  3  2013     1     1
    ##  4  2013     1     1
    ##  5  2013     1     1
    ##  6  2013     1     1
    ##  7  2013     1     1
    ##  8  2013     1     1
    ##  9  2013     1     1
    ## 10  2013     1     1
    ## # ℹ 336,766 more rows
  • Pull values of one column as a vector:

    pull(flights, year)

  • Select all columns between two variables (inclusive):

    select(flights, year:day)
    ## # A tibble: 336,776 × 3
    ##     year month   day
    ##    <int> <int> <int>
    ##  1  2013     1     1
    ##  2  2013     1     1
    ##  3  2013     1     1
    ##  4  2013     1     1
    ##  5  2013     1     1
    ##  6  2013     1     1
    ##  7  2013     1     1
    ##  8  2013     1     1
    ##  9  2013     1     1
    ## 10  2013     1     1
    ## # ℹ 336,766 more rows

  • Select all columns except those between two variables:

    select(flights, -(year:day))
    ## # A tibble: 336,776 × 16
    ##    dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
    ##       <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>  
    ##  1      517            515         2      830            819        11 UA     
    ##  2      533            529         4      850            830        20 UA     
    ##  3      542            540         2      923            850        33 AA     
    ##  4      544            545        -1     1004           1022       -18 B6     
    ##  5      554            600        -6      812            837       -25 DL     
    ##  6      554            558        -4      740            728        12 UA     
    ##  7      555            600        -5      913            854        19 B6     
    ##  8      557            600        -3      709            723       -14 EV     
    ##  9      557            600        -3      838            846        -8 B6     
    ## 10      558            600        -2      753            745         8 AA     
    ## # ℹ 336,766 more rows
    ## # ℹ 9 more variables: flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
    ## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

    Read the documentation of select to learn more about allowed operations:

    • : for selecting a range of consecutive variables

    • ! for taking the complement of a set of variables

    • & and | for selecting the intersection or the union of two sets of variables

    • c() for combining selections


  • Select columns by positions:

    select(flights, seq(1, 10, by = 2))
    ## # A tibble: 336,776 × 5
    ##     year   day sched_dep_time arr_time arr_delay
    ##    <int> <int>          <int>    <int>     <dbl>
    ##  1  2013     1            515      830        11
    ##  2  2013     1            529      850        20
    ##  3  2013     1            540      923        33
    ##  4  2013     1            545     1004       -18
    ##  5  2013     1            600      812       -25
    ##  6  2013     1            558      740        12
    ##  7  2013     1            600      913        19
    ##  8  2013     1            600      709       -14
    ##  9  2013     1            600      838        -8
    ## 10  2013     1            600      753         8
    ## # ℹ 336,766 more rows

  • Move variables to the start of data frame:

    select(flights, time_hour, air_time, everything())
    ## # A tibble: 336,776 × 19
    ##    time_hour           air_time  year month   day dep_time sched_dep_time
    ##    <dttm>                 <dbl> <int> <int> <int>    <int>          <int>
    ##  1 2013-01-01 05:00:00      227  2013     1     1      517            515
    ##  2 2013-01-01 05:00:00      227  2013     1     1      533            529
    ##  3 2013-01-01 05:00:00      160  2013     1     1      542            540
    ##  4 2013-01-01 05:00:00      183  2013     1     1      544            545
    ##  5 2013-01-01 06:00:00      116  2013     1     1      554            600
    ##  6 2013-01-01 05:00:00      150  2013     1     1      554            558
    ##  7 2013-01-01 06:00:00      158  2013     1     1      555            600
    ##  8 2013-01-01 06:00:00       53  2013     1     1      557            600
    ##  9 2013-01-01 06:00:00      140  2013     1     1      557            600
    ## 10 2013-01-01 06:00:00      138  2013     1     1      558            600
    ## # ℹ 336,766 more rows
    ## # ℹ 12 more variables: dep_delay <dbl>, arr_time <int>, sched_arr_time <int>,
    ## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>, origin <chr>,
    ## #   dest <chr>, distance <dbl>, hour <dbl>, minute <dbl>

  • Helper functions.

    • starts_with("abc"): matches names that begin with “abc”.

    • ends_with("xyz"): matches names that end with “xyz”.

    • contains("ijk"): matches names that contain “ijk”.

    • matches("(.)\\1"): selects variables that match a regular expression.

    • num_range("x", 1:3): matches x1, x2 and x3.

    • one_of()

Add new variables with mutate()

  • Add variables gain and speed:

    flights_sml <- 
      select(flights, year:day, ends_with("delay"), distance, air_time)
    flights_sml
    ## # A tibble: 336,776 × 7
    ##     year month   day dep_delay arr_delay distance air_time
    ##    <int> <int> <int>     <dbl>     <dbl>    <dbl>    <dbl>
    ##  1  2013     1     1         2        11     1400      227
    ##  2  2013     1     1         4        20     1416      227
    ##  3  2013     1     1         2        33     1089      160
    ##  4  2013     1     1        -1       -18     1576      183
    ##  5  2013     1     1        -6       -25      762      116
    ##  6  2013     1     1        -4        12      719      150
    ##  7  2013     1     1        -5        19     1065      158
    ##  8  2013     1     1        -3       -14      229       53
    ##  9  2013     1     1        -3        -8      944      140
    ## 10  2013     1     1        -2         8      733      138
    ## # ℹ 336,766 more rows
    mutate(flights_sml,
      gain = arr_delay - dep_delay,
      speed = distance / air_time * 60
    )
    ## # A tibble: 336,776 × 9
    ##     year month   day dep_delay arr_delay distance air_time  gain speed
    ##    <int> <int> <int>     <dbl>     <dbl>    <dbl>    <dbl> <dbl> <dbl>
    ##  1  2013     1     1         2        11     1400      227     9  370.
    ##  2  2013     1     1         4        20     1416      227    16  374.
    ##  3  2013     1     1         2        33     1089      160    31  408.
    ##  4  2013     1     1        -1       -18     1576      183   -17  517.
    ##  5  2013     1     1        -6       -25      762      116   -19  394.
    ##  6  2013     1     1        -4        12      719      150    16  288.
    ##  7  2013     1     1        -5        19     1065      158    24  404.
    ##  8  2013     1     1        -3       -14      229       53   -11  259.
    ##  9  2013     1     1        -3        -8      944      140    -5  405.
    ## 10  2013     1     1        -2         8      733      138    10  319.
    ## # ℹ 336,766 more rows

  • Refer to columns that you’ve just created:

    mutate(flights_sml,
      gain = arr_delay - dep_delay,
      hours = air_time %/% 60,
      minute = air_time %% 60,
      gain_per_hour = gain / (hours + minute / 60)
    )
    ## # A tibble: 336,776 × 11
    ##     year month   day dep_delay arr_delay distance air_time  gain hours minute
    ##    <int> <int> <int>     <dbl>     <dbl>    <dbl>    <dbl> <dbl> <dbl>  <dbl>
    ##  1  2013     1     1         2        11     1400      227     9     3     47
    ##  2  2013     1     1         4        20     1416      227    16     3     47
    ##  3  2013     1     1         2        33     1089      160    31     2     40
    ##  4  2013     1     1        -1       -18     1576      183   -17     3      3
    ##  5  2013     1     1        -6       -25      762      116   -19     1     56
    ##  6  2013     1     1        -4        12      719      150    16     2     30
    ##  7  2013     1     1        -5        19     1065      158    24     2     38
    ##  8  2013     1     1        -3       -14      229       53   -11     0     53
    ##  9  2013     1     1        -3        -8      944      140    -5     2     20
    ## 10  2013     1     1        -2         8      733      138    10     2     18
    ## # ℹ 336,766 more rows
    ## # ℹ 1 more variable: gain_per_hour <dbl>

Modular arithmetic: %/% (integer division) and %% (remainder).


  • Only keep the new variables by transmute():

    transmute(flights,
      gain = arr_delay - dep_delay,
      hours = air_time %/% 60,
      minute = air_time %% 60,
      gain_per_hour = gain / (hours + minute / 60)
    )
    ## # A tibble: 336,776 × 4
    ##     gain hours minute gain_per_hour
    ##    <dbl> <dbl>  <dbl>         <dbl>
    ##  1     9     3     47          2.38
    ##  2    16     3     47          4.23
    ##  3    31     2     40         11.6 
    ##  4   -17     3      3         -5.57
    ##  5   -19     1     56         -9.83
    ##  6    16     2     30          6.4 
    ##  7    24     2     38          9.11
    ##  8   -11     0     53        -12.5 
    ##  9    -5     2     20         -2.14
    ## 10    10     2     18          4.35
    ## # ℹ 336,766 more rows

  • across(): apply a function (or a set of functions) to a set of columns

    flights %>%
      transmute(across(1:4, list(log = log, log2 = log2)))
    ## # A tibble: 336,776 × 8
    ##    year_log year_log2 month_log month_log2 day_log day_log2 dep_time_log
    ##       <dbl>     <dbl>     <dbl>      <dbl>   <dbl>    <dbl>        <dbl>
    ##  1     7.61      11.0         0          0       0        0         6.25
    ##  2     7.61      11.0         0          0       0        0         6.28
    ##  3     7.61      11.0         0          0       0        0         6.30
    ##  4     7.61      11.0         0          0       0        0         6.30
    ##  5     7.61      11.0         0          0       0        0         6.32
    ##  6     7.61      11.0         0          0       0        0         6.32
    ##  7     7.61      11.0         0          0       0        0         6.32
    ##  8     7.61      11.0         0          0       0        0         6.32
    ##  9     7.61      11.0         0          0       0        0         6.32
    ## 10     7.61      11.0         0          0       0        0         6.32
    ## # ℹ 336,766 more rows
    ## # ℹ 1 more variable: dep_time_log2 <dbl>
  • apply funs of one type

    flights %>%
      filter(dep_delay > 0 & arr_delay > 0) %>%
      transmute(across(where(is.numeric), log))
    ## # A tibble: 92,303 × 14
    ##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
    ##    <dbl> <dbl> <dbl>    <dbl>          <dbl>     <dbl>    <dbl>          <dbl>
    ##  1  7.61     0     0     6.25           6.24     0.693     6.72           6.71
    ##  2  7.61     0     0     6.28           6.27     1.39      6.75           6.72
    ##  3  7.61     0     0     6.30           6.29     0.693     6.83           6.75
    ##  4  7.61     0     0     6.41           6.40     2.08      6.69           6.60
    ##  5  7.61     0     0     6.42           6.40     2.40      6.85           6.84
    ##  6  7.61     0     0     6.42           6.41     1.10      6.83           6.83
    ##  7  7.61     0     0     6.43           6.41     2.56      6.82           6.82
    ##  8  7.61     0     0     6.45           6.41     3.18      6.61           6.59
    ##  9  7.61     0     0     6.55           6.55     0.693     6.96           6.92
    ## 10  7.61     0     0     6.56           6.55     2.20      6.75           6.72
    ## # ℹ 92,293 more rows
    ## # ℹ 6 more variables: arr_delay <dbl>, flight <dbl>, air_time <dbl>,
    ## #   distance <dbl>, hour <dbl>, minute <dbl>

Summaries

summarise() collapses a data frame to a single row.



Pipe

  • Consider following analysis:

    by_dest <- group_by(flights, dest)
    delay <- summarise(by_dest, count = n(),
      dist = mean(distance, na.rm = TRUE),
      delay = mean(arr_delay, na.rm = TRUE)
    )
    delay <- filter(delay, count > 20, dest != "HNL")
    delay
    ## # A tibble: 96 × 4
    ##    dest  count  dist delay
    ##    <chr> <int> <dbl> <dbl>
    ##  1 ABQ     254 1826   4.38
    ##  2 ACK     265  199   4.85
    ##  3 ALB     439  143  14.4 
    ##  4 ATL   17215  757. 11.3 
    ##  5 AUS    2439 1514.  6.02
    ##  6 AVL     275  584.  8.00
    ##  7 BDL     443  116   7.05
    ##  8 BGR     375  378   8.03
    ##  9 BHM     297  866. 16.9 
    ## 10 BNA    6333  758. 11.8 
    ## # ℹ 86 more rows

  • Cleaner code using pipe %>%:

    delays <- flights %>% 
      group_by(dest) %>% 
      summarise(
        count = n(),
        dist = mean(distance, na.rm = TRUE),
        delay = mean(arr_delay, na.rm = TRUE)
      ) %>% 
      filter(count > 20, dest != "HNL")
    delays
    ## # A tibble: 96 × 4
    ##    dest  count  dist delay
    ##    <chr> <int> <dbl> <dbl>
    ##  1 ABQ     254 1826   4.38
    ##  2 ACK     265  199   4.85
    ##  3 ALB     439  143  14.4 
    ##  4 ATL   17215  757. 11.3 
    ##  5 AUS    2439 1514.  6.02
    ##  6 AVL     275  584.  8.00
    ##  7 BDL     443  116   7.05
    ##  8 BGR     375  378   8.03
    ##  9 BHM     297  866. 16.9 
    ## 10 BNA    6333  758. 11.8 
    ## # ℹ 86 more rows

  • ggplot2 accepts pipe too.

    delays %>%
      ggplot(mapping = aes(x = dist, y = delay)) +
      geom_point(aes(size = count), alpha = 1/3) + 
      geom_smooth(se = FALSE)
    ## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Other summary functions

  • Location: mean(x), median(x).

    not_cancelled <- flights %>% filter(!is.na(dep_delay), !is.na(arr_delay))
    not_cancelled
    ## # A tibble: 327,346 × 19
    ##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
    ##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
    ##  1  2013     1     1      517            515         2      830            819
    ##  2  2013     1     1      533            529         4      850            830
    ##  3  2013     1     1      542            540         2      923            850
    ##  4  2013     1     1      544            545        -1     1004           1022
    ##  5  2013     1     1      554            600        -6      812            837
    ##  6  2013     1     1      554            558        -4      740            728
    ##  7  2013     1     1      555            600        -5      913            854
    ##  8  2013     1     1      557            600        -3      709            723
    ##  9  2013     1     1      557            600        -3      838            846
    ## 10  2013     1     1      558            600        -2      753            745
    ## # ℹ 327,336 more rows
    ## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
    ## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
    ## #   hour <dbl>, minute <dbl>, time_hour <dttm>
    not_cancelled %>% 
      group_by(year, month, day) %>% 
      summarise(
        avg_delay1 = mean(arr_delay),
        avg_delay2 = mean(arr_delay[arr_delay > 0]) # the average positive delay
      )
    ## `summarise()` has grouped output by 'year', 'month'. You can override using the
    ## `.groups` argument.
    ## # A tibble: 365 × 5
    ## # Groups:   year, month [12]
    ##     year month   day avg_delay1 avg_delay2
    ##    <int> <int> <int>      <dbl>      <dbl>
    ##  1  2013     1     1     12.7         32.5
    ##  2  2013     1     2     12.7         32.0
    ##  3  2013     1     3      5.73        27.7
    ##  4  2013     1     4     -1.93        28.3
    ##  5  2013     1     5     -1.53        22.6
    ##  6  2013     1     6      4.24        24.4
    ##  7  2013     1     7     -4.95        27.8
    ##  8  2013     1     8     -3.23        20.8
    ##  9  2013     1     9     -0.264       25.6
    ## 10  2013     1    10     -5.90        27.3
    ## # ℹ 355 more rows

  • Spread: sd(x), IQR(x), mad(x).

    not_cancelled %>% 
      group_by(dest) %>% 
      summarise(distance_sd = sd(distance)) %>% 
      arrange(desc(distance_sd))
    ## # A tibble: 104 × 2
    ##    dest  distance_sd
    ##    <chr>       <dbl>
    ##  1 EGE         10.5 
    ##  2 SAN         10.4 
    ##  3 SFO         10.2 
    ##  4 HNL         10.0 
    ##  5 SEA          9.98
    ##  6 LAS          9.91
    ##  7 PDX          9.87
    ##  8 PHX          9.86
    ##  9 LAX          9.66
    ## 10 IND          9.46
    ## # ℹ 94 more rows

  • Rank: min(x), quantile(x, 0.25), max(x).

    # When do the first and last flights leave each day?
    not_cancelled %>% 
      group_by(year, month, day) %>% 
      summarise(
        first = min(dep_time),
        last = max(dep_time)
      )
    ## `summarise()` has grouped output by 'year', 'month'. You can override using the
    ## `.groups` argument.
    ## # A tibble: 365 × 5
    ## # Groups:   year, month [12]
    ##     year month   day first  last
    ##    <int> <int> <int> <int> <int>
    ##  1  2013     1     1   517  2356
    ##  2  2013     1     2    42  2354
    ##  3  2013     1     3    32  2349
    ##  4  2013     1     4    25  2358
    ##  5  2013     1     5    14  2357
    ##  6  2013     1     6    16  2355
    ##  7  2013     1     7    49  2359
    ##  8  2013     1     8   454  2351
    ##  9  2013     1     9     2  2252
    ## 10  2013     1    10     3  2320
    ## # ℹ 355 more rows

  • Position: first(x), nth(x, 2), last(x).

    not_cancelled %>% 
      group_by(year, month, day) %>% 
      summarise(
        first_dep = first(dep_time), 
        last_dep = last(dep_time)
      )
    ## `summarise()` has grouped output by 'year', 'month'. You can override using the
    ## `.groups` argument.
    ## # A tibble: 365 × 5
    ## # Groups:   year, month [12]
    ##     year month   day first_dep last_dep
    ##    <int> <int> <int>     <int>    <int>
    ##  1  2013     1     1       517     2356
    ##  2  2013     1     2        42     2354
    ##  3  2013     1     3        32     2349
    ##  4  2013     1     4        25     2358
    ##  5  2013     1     5        14     2357
    ##  6  2013     1     6        16     2355
    ##  7  2013     1     7        49     2359
    ##  8  2013     1     8       454     2351
    ##  9  2013     1     9         2     2252
    ## 10  2013     1    10         3     2320
    ## # ℹ 355 more rows

  • Count: n(x), sum(!is.na(x)), n_distinct(x).

    # Which destinations have the most carriers?
    not_cancelled %>% 
      group_by(dest) %>% 
      summarise(carriers = n_distinct(carrier)) %>% 
      arrange(desc(carriers))
    ## # A tibble: 104 × 2
    ##    dest  carriers
    ##    <chr>    <int>
    ##  1 ATL          7
    ##  2 BOS          7
    ##  3 CLT          7
    ##  4 ORD          7
    ##  5 TPA          7
    ##  6 AUS          6
    ##  7 DCA          6
    ##  8 DTW          6
    ##  9 IAD          6
    ## 10 MSP          6
    ## # ℹ 94 more rows

  • not_cancelled %>% 
      count(dest)
    ## # A tibble: 104 × 2
    ##    dest      n
    ##    <chr> <int>
    ##  1 ABQ     254
    ##  2 ACK     264
    ##  3 ALB     418
    ##  4 ANC       8
    ##  5 ATL   16837
    ##  6 AUS    2411
    ##  7 AVL     261
    ##  8 BDL     412
    ##  9 BGR     358
    ## 10 BHM     269
    ## # ℹ 94 more rows

  • not_cancelled %>% 
      count(tailnum, wt = distance)
    ## # A tibble: 4,037 × 2
    ##    tailnum      n
    ##    <chr>    <dbl>
    ##  1 D942DN    3418
    ##  2 N0EGMQ  239143
    ##  3 N10156  109664
    ##  4 N102UW   25722
    ##  5 N103US   24619
    ##  6 N104UW   24616
    ##  7 N10575  139903
    ##  8 N105UW   23618
    ##  9 N107US   21677
    ## 10 N108UW   32070
    ## # ℹ 4,027 more rows

  • # How many flights left before 5am? (these usually indicate delayed
    # flights from the previous day)
    not_cancelled %>% 
      group_by(year, month, day) %>% 
      summarise(n_early = sum(dep_time < 500))
    ## `summarise()` has grouped output by 'year', 'month'. You can override using the
    ## `.groups` argument.
    ## # A tibble: 365 × 4
    ## # Groups:   year, month [12]
    ##     year month   day n_early
    ##    <int> <int> <int>   <int>
    ##  1  2013     1     1       0
    ##  2  2013     1     2       3
    ##  3  2013     1     3       4
    ##  4  2013     1     4       3
    ##  5  2013     1     5       3
    ##  6  2013     1     6       2
    ##  7  2013     1     7       2
    ##  8  2013     1     8       1
    ##  9  2013     1     9       3
    ## 10  2013     1    10       3
    ## # ℹ 355 more rows

  • # What proportion of flights are delayed by more than an hour?
    not_cancelled %>% 
      group_by(year, month, day) %>% 
      summarise(hour_perc = mean(arr_delay > 60))
    ## `summarise()` has grouped output by 'year', 'month'. You can override using the
    ## `.groups` argument.
    ## # A tibble: 365 × 4
    ## # Groups:   year, month [12]
    ##     year month   day hour_perc
    ##    <int> <int> <int>     <dbl>
    ##  1  2013     1     1    0.0722
    ##  2  2013     1     2    0.0851
    ##  3  2013     1     3    0.0567
    ##  4  2013     1     4    0.0396
    ##  5  2013     1     5    0.0349
    ##  6  2013     1     6    0.0470
    ##  7  2013     1     7    0.0333
    ##  8  2013     1     8    0.0213
    ##  9  2013     1     9    0.0202
    ## 10  2013     1    10    0.0183
    ## # ℹ 355 more rows

Grouped mutates (and filters)

  • Recall the flights_sml tibble created earlier:

    flights_sml
    ## # A tibble: 336,776 × 7
    ##     year month   day dep_delay arr_delay distance air_time
    ##    <int> <int> <int>     <dbl>     <dbl>    <dbl>    <dbl>
    ##  1  2013     1     1         2        11     1400      227
    ##  2  2013     1     1         4        20     1416      227
    ##  3  2013     1     1         2        33     1089      160
    ##  4  2013     1     1        -1       -18     1576      183
    ##  5  2013     1     1        -6       -25      762      116
    ##  6  2013     1     1        -4        12      719      150
    ##  7  2013     1     1        -5        19     1065      158
    ##  8  2013     1     1        -3       -14      229       53
    ##  9  2013     1     1        -3        -8      944      140
    ## 10  2013     1     1        -2         8      733      138
    ## # ℹ 336,766 more rows
  • Find the worst members of each group:

    flights_sml %>% 
      group_by(year, month, day) %>%
      filter(rank(desc(arr_delay)) < 10)
    ## # A tibble: 3,306 × 7
    ## # Groups:   year, month, day [365]
    ##     year month   day dep_delay arr_delay distance air_time
    ##    <int> <int> <int>     <dbl>     <dbl>    <dbl>    <dbl>
    ##  1  2013     1     1       853       851      184       41
    ##  2  2013     1     1       290       338     1134      213
    ##  3  2013     1     1       260       263      266       46
    ##  4  2013     1     1       157       174      213       60
    ##  5  2013     1     1       216       222      708      121
    ##  6  2013     1     1       255       250      589      115
    ##  7  2013     1     1       285       246     1085      146
    ##  8  2013     1     1       192       191      199       44
    ##  9  2013     1     1       379       456     1092      222
    ## 10  2013     1     2       224       207      550       94
    ## # ℹ 3,296 more rows

  • Find all groups bigger than a threshold:

    (popular_dests <- flights %>% 
      group_by(dest) %>% 
      filter(n() > 365))
    ## # A tibble: 332,577 × 19
    ## # Groups:   dest [77]
    ##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
    ##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
    ##  1  2013     1     1      517            515         2      830            819
    ##  2  2013     1     1      533            529         4      850            830
    ##  3  2013     1     1      542            540         2      923            850
    ##  4  2013     1     1      544            545        -1     1004           1022
    ##  5  2013     1     1      554            600        -6      812            837
    ##  6  2013     1     1      554            558        -4      740            728
    ##  7  2013     1     1      555            600        -5      913            854
    ##  8  2013     1     1      557            600        -3      709            723
    ##  9  2013     1     1      557            600        -3      838            846
    ## 10  2013     1     1      558            600        -2      753            745
    ## # ℹ 332,567 more rows
    ## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
    ## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
    ## #   hour <dbl>, minute <dbl>, time_hour <dttm>

  • Standardize to compute per group metrics:

    popular_dests %>% 
      filter(arr_delay > 0) %>% 
      mutate(prop_delay = arr_delay / sum(arr_delay)) %>% 
      select(year:day, dest, arr_delay, prop_delay)
    ## # A tibble: 131,106 × 6
    ## # Groups:   dest [77]
    ##     year month   day dest  arr_delay prop_delay
    ##    <int> <int> <int> <chr>     <dbl>      <dbl>
    ##  1  2013     1     1 IAH          11  0.000111 
    ##  2  2013     1     1 IAH          20  0.000201 
    ##  3  2013     1     1 MIA          33  0.000235 
    ##  4  2013     1     1 ORD          12  0.0000424
    ##  5  2013     1     1 FLL          19  0.0000938
    ##  6  2013     1     1 ORD           8  0.0000283
    ##  7  2013     1     1 LAX           7  0.0000344
    ##  8  2013     1     1 DFW          31  0.000282 
    ##  9  2013     1     1 ATL          12  0.0000400
    ## 10  2013     1     1 DTW          16  0.000116 
    ## # ℹ 131,096 more rows

Combine tables | r4ds chapter 13

nycflights13 package has >1 tables:





Relational data

Keys

  • A primary key uniquely identifies an observation in its own table.

    • single variable, e.g., tailnum for each plane

    • multiple variables, e.g., year, month, day, hour, and origin to identify an observation in weather

  • A foreign key uniquely identifies an observation in another table.

Good practice: verify the primary keys by using count()

planes %>% 
  count(tailnum) %>% 
  filter(n > 1)
## # A tibble: 0 × 2
## # ℹ 2 variables: tailnum <chr>, n <int>
weather %>% 
  count(year, month, day, hour, origin) %>% 
  filter(n > 1)
## # A tibble: 3 × 6
##    year month   day  hour origin     n
##   <int> <int> <int> <int> <chr>  <int>
## 1  2013    11     3     1 EWR        2
## 2  2013    11     3     1 JFK        2
## 3  2013    11     3     1 LGA        2

Combine variables (columns)

Demo tables

  • (x <- tribble(
      ~key, ~val_x,
         1, "x1",
         2, "x2",
         3, "x3"
    ))
    ## # A tibble: 3 × 2
    ##     key val_x
    ##   <dbl> <chr>
    ## 1     1 x1   
    ## 2     2 x2   
    ## 3     3 x3
    (y <- tribble(
      ~key, ~val_y,
         1, "y1",
         2, "y2",
         4, "y3"
    ))
    ## # A tibble: 3 × 2
    ##     key val_y
    ##   <dbl> <chr>
    ## 1     1 y1   
    ## 2     2 y2   
    ## 3     4 y3

Inner join

  • An inner join matches pairs of observations whenever their keys are equal:

  • inner_join(x, y, by = "key")
    ## # A tibble: 2 × 3
    ##     key val_x val_y
    ##   <dbl> <chr> <chr>
    ## 1     1 x1    y1   
    ## 2     2 x2    y2

    Same as

    x %>% inner_join(y, by = "key")

Generally not appropriate for use because of loss of observations

Outer join

  • An outer join keeps observations that appear in at least one of the tables.

  • Three types of outer joins:

    • A left join keeps all observations in x.
    left_join(x, y, by = "key")
    ## # A tibble: 3 × 3
    ##     key val_x val_y
    ##   <dbl> <chr> <chr>
    ## 1     1 x1    y1   
    ## 2     2 x2    y2   
    ## 3     3 x3    <NA>
    • A right join keeps all observations in y.
    right_join(x, y, by = "key")
    ## # A tibble: 3 × 3
    ##     key val_x val_y
    ##   <dbl> <chr> <chr>
    ## 1     1 x1    y1   
    ## 2     2 x2    y2   
    ## 3     4 <NA>  y3
    • A full join keeps all observations in x or y.
    full_join(x, y, by = "key")
    ## # A tibble: 4 × 3
    ##     key val_x val_y
    ##   <dbl> <chr> <chr>
    ## 1     1 x1    y1   
    ## 2     2 x2    y2   
    ## 3     3 x3    <NA> 
    ## 4     4 <NA>  y3

Duplicate keys

  • One table has duplicate keys.


  • x <- tribble(
      ~key, ~val_x,
         1, "x1",
         2, "x2",
         2, "x3",
         1, "x4"
    )
    y <- tribble(
      ~key, ~val_y,
         1, "y1",
         2, "y2"
    )
    left_join(x, y, by = "key")
    ## # A tibble: 4 × 3
    ##     key val_x val_y
    ##   <dbl> <chr> <chr>
    ## 1     1 x1    y1   
    ## 2     2 x2    y2   
    ## 3     2 x3    y2   
    ## 4     1 x4    y1

  • Both tables have duplicate keys. You get all possible combinations, the Cartesian product:


  • x <- tribble(
      ~key, ~val_x,
         1, "x1",
         2, "x2",
         2, "x3",
         3, "x4"
    )
    y <- tribble(
      ~key, ~val_y,
         1, "y1",
         2, "y2",
         2, "y3",
         3, "y4"
    )
    left_join(x, y, by = "key")
    ## Warning in left_join(x, y, by = "key"): Detected an unexpected many-to-many relationship between `x` and `y`.
    ## ℹ Row 2 of `x` matches multiple rows in `y`.
    ## ℹ Row 2 of `y` matches multiple rows in `x`.
    ## ℹ If a many-to-many relationship is expected, set `relationship =
    ##   "many-to-many"` to silence this warning.
    ## # A tibble: 6 × 3
    ##     key val_x val_y
    ##   <dbl> <chr> <chr>
    ## 1     1 x1    y1   
    ## 2     2 x2    y2   
    ## 3     2 x2    y3   
    ## 4     2 x3    y2   
    ## 5     2 x3    y3   
    ## 6     3 x4    y4

  • Let’s create a narrower table from the flights data:

    flights2 <- flights %>% 
      select(year:day, hour, origin, dest, tailnum, carrier)
    flights2
    ## # A tibble: 336,776 × 8
    ##     year month   day  hour origin dest  tailnum carrier
    ##    <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>  
    ##  1  2013     1     1     5 EWR    IAH   N14228  UA     
    ##  2  2013     1     1     5 LGA    IAH   N24211  UA     
    ##  3  2013     1     1     5 JFK    MIA   N619AA  AA     
    ##  4  2013     1     1     5 JFK    BQN   N804JB  B6     
    ##  5  2013     1     1     6 LGA    ATL   N668DN  DL     
    ##  6  2013     1     1     5 EWR    ORD   N39463  UA     
    ##  7  2013     1     1     6 EWR    FLL   N516JB  B6     
    ##  8  2013     1     1     6 LGA    IAD   N829AS  EV     
    ##  9  2013     1     1     6 JFK    MCO   N593JB  B6     
    ## 10  2013     1     1     6 LGA    ORD   N3ALAA  AA     
    ## # ℹ 336,766 more rows
  • We want to merge with the weather table:

    weather
    ## # A tibble: 26,115 × 15
    ##    origin  year month   day  hour  temp  dewp humid wind_dir wind_speed
    ##    <chr>  <int> <int> <int> <int> <dbl> <dbl> <dbl>    <dbl>      <dbl>
    ##  1 EWR     2013     1     1     1  39.0  26.1  59.4      270      10.4 
    ##  2 EWR     2013     1     1     2  39.0  27.0  61.6      250       8.06
    ##  3 EWR     2013     1     1     3  39.0  28.0  64.4      240      11.5 
    ##  4 EWR     2013     1     1     4  39.9  28.0  62.2      250      12.7 
    ##  5 EWR     2013     1     1     5  39.0  28.0  64.4      260      12.7 
    ##  6 EWR     2013     1     1     6  37.9  28.0  67.2      240      11.5 
    ##  7 EWR     2013     1     1     7  39.0  28.0  64.4      240      15.0 
    ##  8 EWR     2013     1     1     8  39.9  28.0  62.2      250      10.4 
    ##  9 EWR     2013     1     1     9  39.9  28.0  62.2      260      15.0 
    ## 10 EWR     2013     1     1    10  41    28.0  59.6      260      13.8 
    ## # ℹ 26,105 more rows
    ## # ℹ 5 more variables: wind_gust <dbl>, precip <dbl>, pressure <dbl>,
    ## #   visib <dbl>, time_hour <dttm>

Defining the key columns

  • by = NULL (default): use all variables that appear in both tables:

    # same as: flights2 %>% left_join(weather)
    left_join(flights2, weather)
    ## Joining with `by = join_by(year, month, day, hour, origin)`
    ## # A tibble: 336,776 × 18
    ##     year month   day  hour origin dest  tailnum carrier  temp  dewp humid
    ##    <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>   <dbl> <dbl> <dbl>
    ##  1  2013     1     1     5 EWR    IAH   N14228  UA       39.0  28.0  64.4
    ##  2  2013     1     1     5 LGA    IAH   N24211  UA       39.9  25.0  54.8
    ##  3  2013     1     1     5 JFK    MIA   N619AA  AA       39.0  27.0  61.6
    ##  4  2013     1     1     5 JFK    BQN   N804JB  B6       39.0  27.0  61.6
    ##  5  2013     1     1     6 LGA    ATL   N668DN  DL       39.9  25.0  54.8
    ##  6  2013     1     1     5 EWR    ORD   N39463  UA       39.0  28.0  64.4
    ##  7  2013     1     1     6 EWR    FLL   N516JB  B6       37.9  28.0  67.2
    ##  8  2013     1     1     6 LGA    IAD   N829AS  EV       39.9  25.0  54.8
    ##  9  2013     1     1     6 JFK    MCO   N593JB  B6       37.9  27.0  64.3
    ## 10  2013     1     1     6 LGA    ORD   N3ALAA  AA       39.9  25.0  54.8
    ## # ℹ 336,766 more rows
    ## # ℹ 7 more variables: wind_dir <dbl>, wind_speed <dbl>, wind_gust <dbl>,
    ## #   precip <dbl>, pressure <dbl>, visib <dbl>, time_hour <dttm>

  • by = "x": use the common variable x:

    # same as: flights2 %>% left_join(weather)
    left_join(flights2, planes, by = "tailnum")
    ## # A tibble: 336,776 × 16
    ##    year.x month   day  hour origin dest  tailnum carrier year.y type            
    ##     <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>    <int> <chr>           
    ##  1   2013     1     1     5 EWR    IAH   N14228  UA        1999 Fixed wing mult…
    ##  2   2013     1     1     5 LGA    IAH   N24211  UA        1998 Fixed wing mult…
    ##  3   2013     1     1     5 JFK    MIA   N619AA  AA        1990 Fixed wing mult…
    ##  4   2013     1     1     5 JFK    BQN   N804JB  B6        2012 Fixed wing mult…
    ##  5   2013     1     1     6 LGA    ATL   N668DN  DL        1991 Fixed wing mult…
    ##  6   2013     1     1     5 EWR    ORD   N39463  UA        2012 Fixed wing mult…
    ##  7   2013     1     1     6 EWR    FLL   N516JB  B6        2000 Fixed wing mult…
    ##  8   2013     1     1     6 LGA    IAD   N829AS  EV        1998 Fixed wing mult…
    ##  9   2013     1     1     6 JFK    MCO   N593JB  B6        2004 Fixed wing mult…
    ## 10   2013     1     1     6 LGA    ORD   N3ALAA  AA          NA <NA>            
    ## # ℹ 336,766 more rows
    ## # ℹ 6 more variables: manufacturer <chr>, model <chr>, engines <int>,
    ## #   seats <int>, speed <int>, engine <chr>

  • by = c("a" = "b"): match variable a in table x to the variable b in table y.

    # same as: flights2 %>% left_join(weather)
    left_join(flights2, airports, by = c("dest" = "faa"))
    ## # A tibble: 336,776 × 15
    ##     year month   day  hour origin dest  tailnum carrier name     lat   lon   alt
    ##    <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>   <chr>  <dbl> <dbl> <dbl>
    ##  1  2013     1     1     5 EWR    IAH   N14228  UA      Georg…  30.0 -95.3    97
    ##  2  2013     1     1     5 LGA    IAH   N24211  UA      Georg…  30.0 -95.3    97
    ##  3  2013     1     1     5 JFK    MIA   N619AA  AA      Miami…  25.8 -80.3     8
    ##  4  2013     1     1     5 JFK    BQN   N804JB  B6      <NA>    NA    NA      NA
    ##  5  2013     1     1     6 LGA    ATL   N668DN  DL      Harts…  33.6 -84.4  1026
    ##  6  2013     1     1     5 EWR    ORD   N39463  UA      Chica…  42.0 -87.9   668
    ##  7  2013     1     1     6 EWR    FLL   N516JB  B6      Fort …  26.1 -80.2     9
    ##  8  2013     1     1     6 LGA    IAD   N829AS  EV      Washi…  38.9 -77.5   313
    ##  9  2013     1     1     6 JFK    MCO   N593JB  B6      Orlan…  28.4 -81.3    96
    ## 10  2013     1     1     6 LGA    ORD   N3ALAA  AA      Chica…  42.0 -87.9   668
    ## # ℹ 336,766 more rows
    ## # ℹ 3 more variables: tz <dbl>, dst <chr>, tzone <chr>

Combine cases (rows)


Semi-join

  • semi_join(x, y) keeps the rows in x that have a match in y.

  • Useful to see what will be joined.


  • semi_join(flights, top_dest)
    ## Joining with `by = join_by(dest)`
    ## # A tibble: 141,145 × 19
    ##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
    ##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
    ##  1  2013     1     1      542            540         2      923            850
    ##  2  2013     1     1      554            600        -6      812            837
    ##  3  2013     1     1      554            558        -4      740            728
    ##  4  2013     1     1      555            600        -5      913            854
    ##  5  2013     1     1      557            600        -3      838            846
    ##  6  2013     1     1      558            600        -2      753            745
    ##  7  2013     1     1      558            600        -2      924            917
    ##  8  2013     1     1      558            600        -2      923            937
    ##  9  2013     1     1      559            559         0      702            706
    ## 10  2013     1     1      600            600         0      851            858
    ## # ℹ 141,135 more rows
    ## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
    ## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
    ## #   hour <dbl>, minute <dbl>, time_hour <dttm>

Anti-join

  • anti_join(x, y) keeps the rows that don’t have a match.

  • Useful to see what will not be joined.


  • flights %>% 
      anti_join(planes, by = "tailnum") %>%
      count(tailnum, sort = TRUE)
    ## # A tibble: 722 × 2
    ##    tailnum     n
    ##    <chr>   <int>
    ##  1 <NA>     2512
    ##  2 N725MQ    575
    ##  3 N722MQ    513
    ##  4 N723MQ    507
    ##  5 N713MQ    483
    ##  6 N735MQ    396
    ##  7 N0EGMQ    371
    ##  8 N534MQ    364
    ##  9 N542MQ    363
    ## 10 N531MQ    349
    ## # ℹ 712 more rows

Set operations

  1. All these operations work with a complete row, comparing the values of every variable.

  2. These operations expect the x and y inputs to have the same variables.

  • Generate two tables:

    (df1 <- tribble(
      ~x, ~y,
       1,  1,
       2,  1
    ))
    ## # A tibble: 2 × 2
    ##       x     y
    ##   <dbl> <dbl>
    ## 1     1     1
    ## 2     2     1
    (df2 <- tribble(
      ~x, ~y,
       1,  1,
       1,  2
    ))
    ## # A tibble: 2 × 2
    ##       x     y
    ##   <dbl> <dbl>
    ## 1     1     1
    ## 2     1     2

  • bind_rows(x, y) stacks table x on top of y.

    bind_rows(df1, df2)
    ## # A tibble: 4 × 2
    ##       x     y
    ##   <dbl> <dbl>
    ## 1     1     1
    ## 2     2     1
    ## 3     1     1
    ## 4     1     2
  • intersect(x, y) returns rows that appear in both x and y.

    intersect(df1, df2)
    ## # A tibble: 1 × 2
    ##       x     y
    ##   <dbl> <dbl>
    ## 1     1     1

  • union(x, y) returns unique observations in x and y.

    union(df1, df2)
    ## # A tibble: 3 × 2
    ##       x     y
    ##   <dbl> <dbl>
    ## 1     1     1
    ## 2     2     1
    ## 3     1     2

  • setdiff(x, y) returns rows that appear in x but not in y.

    setdiff(df1, df2)
    ## # A tibble: 1 × 2
    ##       x     y
    ##   <dbl> <dbl>
    ## 1     2     1
    setdiff(df2, df1)
    ## # A tibble: 1 × 2
    ##       x     y
    ##   <dbl> <dbl>
    ## 1     1     2

Cheat sheet

RStudio cheat sheet is extremely helpful.