1 Introduction

Welcome to the Program in Statistics and Methodology’s (PRISM) Data Manipulation in the Tidyverse Workshop. This workshop was delivered on 25 September, 2018 as part of PRISM’s regularly scheduled programming and Data Analytics Month @ Ohio State.

1.1 What is the Tidyverse?

The Tidyverse is a suite of packages and tools designed to help data scientists import, clean, manipulate, and visualize their data. The Tidyverse packages are unified by a shared philosophy of data structure (“tidy data”) and programming. The Tidyverse suite of packages consists of readr, tibble, tidyr, dplyr, purrr, ggplot2, stringr, and forcats. The Tidyverse, itself, is also a package that can be installed and accessed.

set.seed(1234)
install.packages(c("tidyverse", "testit", "microbenchmark", "RMySQL"), repos = "http://cran.us.r-project.org")

## Installing packages into '/Users/benjamincampbell/Library/R/3.4/library'
## (as 'lib' is unspecified)

## 
## The downloaded binary packages are in
##  /var/folders/vk/31y3cgkd3l52_0ynqgx9lrxr0000gn/T//RtmpcKo351/downloaded_packages

library("tidyverse")

## ── Attaching packages ────────────────────────────────── tidyverse 1.2.1 ──

## ✔ ggplot2 3.0.0     ✔ purrr   0.2.4
## ✔ tibble  1.4.2     ✔ dplyr   0.7.4
## ✔ tidyr   0.8.0     ✔ stringr 1.3.1
## ✔ readr   1.1.1     ✔ forcats 0.3.0

## ── Conflicts ───────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library("testit")
library("microbenchmark")
library("RMySQL")

## Loading required package: DBI

1.2 Why use the Tidyverse?

There are significant differences between how Base R and the Tidyverse handle particular tasks. Typically, proponents of the Tidyverse argue that there are several reasons it should be used:

Consistency across stages of data analysis, from data importation to tidying, manipulation, and visualization.
- Common grammar across stages.
- Pipe operator (%>%).
Parsimonious, vectorized code.
Very computationally efficient, often quicker than Base R.
Has reached critical mass.
- More downstream dependencies.
- Many collaborators may use it (100,000+ installs).

There are of course some drawbacks for Base R users:

The Tidyverse packages may not behave well with other packages, it is designed to be self-contained.
New syntax is introduced.
New classes (tbl) may not behave well with other functions, unconventional row labeling behavior.

1.3 What is our goal for today?

By the end of this workshop, it is my hope that you will feel comfortable using the Tidyverse to tidy and transform data. You will be familiar with the basic Tidyverse packages for tidying data (tibble, tidyr) and manipulating data (dplyr).

As this workshop is primarily focused on wrangling data, we will punt on data importation (haven, readr) and data visualization (ggplot2) in the Tidyverse.

2 Tidy Data and Tidying Data

There are two packages for producing “tidy” data within the Tidyverse framework. The first, tibble, is a package introducing a new class of data frames, the tbl. The second, tidyr, is a package designed to assist in producing “tidy” data. Tidy data is described as data where each row is a different observation and each column is a different feature measured on each observation.

2.1 `tibble`

2.1.1 Features

The tibble package introduces a new class of data frame, the tbl, which has a lot of cool features:

It never changes an input’s type! For example, character data are not automatically converted to factors.
You can produce data frames with list or vector columns.
- tibble(x = 1:3, y = list(1:5, 1:10, 1:20))
Column names are preserved (periods aren’t inserted in whitespace).
- names(data.frame('ben likes cats' = 1)) v. names(tibble('ben likes cats' = 1))
Arguments are evaluated sequentially.
- tibble(x = 1:5, y = x - mean(x))

There are a few things, however, that trip people up:

Tibbles do not have a row name feature
Tibbles only recycle vectors of length 1.

Here are some examples:

# Input types retain type (no factor switching)
  # Base R data.frame
df <- data.frame(A = letters)
class(letters)

## [1] "character"

class(df$A)

## [1] "factor"

  # tibble tibble
tib <- tibble(A = letters)
class(tib$A)

## [1] "character"

  # tibble data_frame
tib_df <- data_frame(A = letters)
class(tib_df$A)

## [1] "character"

# Data frames with list-columns
tibble(x = 1:3, y = list(1:5, 1:10, 1:20))

## # A tibble: 3 x 2
##       x y         
##   <int> <list>    
## 1     1 <int [5]> 
## 2     2 <int [10]>
## 3     3 <int [20]>

# Column naming
names(data.frame('ben likes cats' = 1))

## [1] "ben.likes.cats"

names(tibble('ben likes cats' = 1))

## [1] "ben likes cats"

# Sequential naming
tibble(x = 1:5, y = x - mean(x))

## # A tibble: 5 x 2
##       x     y
##   <int> <dbl>
## 1     1    -2
## 2     2    -1
## 3     3     0
## 4     4     1
## 5     5     2

2.1.2 Efficient Coercion

You can also coerce data to a tibble using the as_tibble() command. Such a command is written explicitly to be much more efficient than the base as.data.frame() command.

if (requireNamespace("microbenchmark")) {
  l <- replicate(26, sample(100), simplify = FALSE)
  names(l) <- letters

  microbenchmark::microbenchmark(
    as_tibble(l),
    as.data.frame(l)
  )
}

## Unit: microseconds
##              expr      min       lq      mean   median        uq      max
##      as_tibble(l)  367.012  401.799  565.8195  443.115  603.5645 2341.569
##  as.data.frame(l) 1180.468 1227.057 1469.4458 1299.943 1513.2675 3359.543
##  neval cld
##    100  a 
##    100   b

2.1.3 Mangable Printing

When printed, tibbles will be much more manageable, printing only the first ten rows and all the columns that will fit on one screen. This of course can be adjusted with a series of options:

options(tibble.print_max = n, tibble.print_min = m): if there are more than n rows, print only the first m rows. Use options(tibble.print_max = Inf) to show all rows, similar to the default print method for data frames.
options(tibble.width = Inf) will always print all columns, regardless of the width of the screen.

tibble(alphabet = LETTERS)

## # A tibble: 26 x 1
##    alphabet
##    <chr>   
##  1 A       
##  2 B       
##  3 C       
##  4 D       
##  5 E       
##  6 F       
##  7 G       
##  8 H       
##  9 I       
## 10 J       
## # ... with 16 more rows

2.1.4 Subsetting and Indexing

Tibbles have more consistent behavior when subsetting. When subsetting or indexing data frames, using square brackets [,] may return a vector or a data frame. However, when using square bracket indexing [,] for tibbles, a tibble will always be returned. When you want a single column vector from a tibble you may use $ or [[]] indexing.

# Indexing and subsetting using square brackets
df1 <- data.frame(x = letters, y = 1:length(letters))
class(df1[1:10,]) # indexing > 1 rows

## [1] "data.frame"

class(df1[,1]) # indexing 1 column

## [1] "factor"

df2 <- tibble(x = letters, y = 1:length(letters))
class(df2[1:10,]) # indexing > 1 rows

## [1] "tbl_df"     "tbl"        "data.frame"

class(df2[,1]) # indexing 1 column

## [1] "tbl_df"     "tbl"        "data.frame"

# Indexing columns to vectors
class(df2[[1]])

## [1] "character"

class(df2$x)

## [1] "character"

Data frames allow for partial matching for column names using the $ indexing method. This does not work for tibbles.

df <- data.frame(let = letters)
df$l

##  [1] a b c d e f g h i j k l m n o p q r s t u v w x y z
## Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z

df2 <- tibble(let = letters)
df2$l

## Warning: Unknown or uninitialised column: 'l'.

## NULL

2.1.5 Recycling

Finally, tibbles are more conservative in how they handle data recycling relative to data frames. In data frames, recycling occurs when full recycling is allowed. For example, if a data frame is to have 20 rows and for one column only 5 values are supplied, those 5 values will be repeated 4 times to populate all 20 rows. This will not work if a data frame is to have 20 rows and for one column only 6 values are supplied, as 20 divided by 6 does not produce an integer. Tibbles will only recycle values of length 1. In other words, you cannot truly recycle values with tibbles, but you can broadcast.

# Tibbles and data frames allow for broadcasting, expanding b to match the length of letters
data.frame(a = letters[1:4], b = 1)

##   a b
## 1 a 1
## 2 b 1
## 3 c 1
## 4 d 1

tibble(a = letters[1:4], b = 1)

## # A tibble: 4 x 2
##   a         b
##   <chr> <dbl>
## 1 a         1
## 2 b         1
## 3 c         1
## 4 d         1

# Data frames will allow for full recycling
data.frame(a = letters[1:4], b = 1:2)

##   a b
## 1 a 1
## 2 b 2
## 3 c 1
## 4 d 2

# but not partial recycling
testit::has_error(data.frame(a = letters[1:4], b = 1:3))

## [1] TRUE

# Neither of these will work for tibbles
testit::has_error(tibble(a = letters[1:4], b = 1:2))

## [1] TRUE

testit::has_error(tibble(a = letters[1:4], b = 1:5))

## [1] TRUE

2.2 `tidyr`

2.2.1 Tidy Data

Central to data analysis, and even data preparation, is structuring datasets to the format necessary for analysis. Hadley Wickham refers to this step as data tidying. There are several principles of tidy data laid out in Wickham’s The Journal of Statistical Software piece “Tidy Data.” These include:

A dataset that is a collection of values, either stored as numbers (if quantitative) or strings (if qualitative).
Each value belongs to a variable, stored column-wise, or an observation, stored row-wise.
A variable contains all values that measure a common attribute across observations (such as GDP).
An observation contains a unit’s collection of values measured across variables.
Data can be relational across tables.
- Observational units of the same type should be stored in the same table.
- Observational units of different types should be stored in different tables.

Tidy data is designed to work well with vectorized languages like R because it ensures that observations’ variable values are always paired. Data is not tidy when:

Column headers are not variable names.
Multiple variables are stored in the same column.
Variables are stored in rows in addition to columns.
Observational units stored in a table are of different types.
A single observation is stored across multiple tables.

Wickham’s tidyr package is designed to help analysts deal with untidy data and introduces three tools: gathering, separating, and spreading.

2.2.2 Gathering Data

The gather() command will take multiple columns and gather them into tidy data, making landscape data into portrait data. If we have columns that are not variables, then gather is useful. Here is an example of how gather() is useful. In this case, we have heart rates for 3 people over 2 treatment conditions.

messy <- tibble(
  name = c("Bob", "Peter", "Jan"),
  a = c(67, 130, 150),
  b = c(56, 121, 20)
)
messy

## # A tibble: 3 x 3
##   name      a     b
##   <chr> <dbl> <dbl>
## 1 Bob      67    56
## 2 Peter   130   121
## 3 Jan     150    20

In this case we have three variables but currently we only have one of them in a column (name). There are two other variables that we need to create: treatment and heart rate. We can use gather to produce these variables and tidy the data:

messy %>%
  gather(drug, heartrate, a:b)

## # A tibble: 6 x 3
##   name  drug  heartrate
##   <chr> <chr>     <dbl>
## 1 Bob   a            67
## 2 Peter a           130
## 3 Jan   a           150
## 4 Bob   b            56
## 5 Peter b           121
## 6 Jan   b            20

2.2.3 Separating Data

If variables are grouped together into one, we may be interested in using the separate() function to disentangle them. This function is related to extract(), which uses regular expressions for splitting instead of pattern. Here we have some fake data examining two outcomes observed over two different time periods according to a treatment or control condition.

messy <- data.frame(
  id = 1:4,
  trt = sample(rep(c('control', 'treatment'), each = 2)),
  outcome1.before = runif(4),
  outcome2.before = runif(4),
  outcome1.after = runif(4),
  outcome2.after = runif(4)
)

messy

##   id       trt outcome1.before outcome2.before outcome1.after
## 1  1   control       0.9751596       0.9127080      0.5475543
## 2  2   control       0.8759398       0.7397774      0.8516214
## 3  3 treatment       0.4292365       0.9504254      0.5014444
## 4  4 treatment       0.3547238       0.5261304      0.9361675
##   outcome2.after
## 1      0.5778685
## 2      0.8627265
## 3      0.7516653
## 4      0.6981145

We will begin tidying this messy data by using the gather() command to turn the treatment columns into key-value paired data according to the key id and time:

# Here we are using the pipe operator which takes an argument or object on the left and passess it as the first argument to the function on the right hand side.  
# We are specifying two new columns for key and value and specifying that the variables other than id and trt should be gathered
tidier <- messy %>%
  gather(time, value, -id, -trt)
tidier %>% head(8)

##   id       trt            time     value
## 1  1   control outcome1.before 0.9751596
## 2  2   control outcome1.before 0.8759398
## 3  3 treatment outcome1.before 0.4292365
## 4  4 treatment outcome1.before 0.3547238
## 5  1   control outcome2.before 0.9127080
## 6  2   control outcome2.before 0.7397774
## 7  3 treatment outcome2.before 0.9504254
## 8  4 treatment outcome2.before 0.5261304

Then, using separate() we will split the key into the outcome number and the time period using regular expressions:

tidy <- tidier %>%
  separate(time, into = c("outcome", "time"), sep = "\\.")
tidy %>% head(8)

##   id       trt  outcome   time     value
## 1  1   control outcome1 before 0.9751596
## 2  2   control outcome1 before 0.8759398
## 3  3 treatment outcome1 before 0.4292365
## 4  4 treatment outcome1 before 0.3547238
## 5  1   control outcome2 before 0.9127080
## 6  2   control outcome2 before 0.7397774
## 7  3 treatment outcome2 before 0.9504254
## 8  4 treatment outcome2 before 0.5261304

2.2.4 Spreading Data

The spread() function takes two columns, a key and a value, and then spreads them into multiple columns. This takes portrait data and makes it into landscape data. If you have variables that are in rows instead of columns, this function is particularly useful. Here is an example:

# Generate fake stock price data
stocks <- data.frame(
  time = as.Date('2009-01-01') + 0:9,
  X = rnorm(10, 100, 10),
  Y = rnorm(10, 125, 20),
  Z = rnorm(10, 150, 4)
)

head(stocks)

##         time         X         Y        Z
## 1 2009-01-01 108.01125  93.08611 152.3230
## 2 2009-01-02  79.84799 121.25669 153.8985
## 3 2009-01-03 100.47396 136.35269 152.3498
## 4 2009-01-04 106.13097 128.25358 150.9811
## 5 2009-01-05  98.29982  97.56034 143.7360
## 6 2009-01-06 102.25760  87.32095 147.9675

# Gather to make these into long data
stocks_long <- stocks %>% gather(stock, price, -time)
head(stocks_long)

##         time stock     price
## 1 2009-01-01     X 108.01125
## 2 2009-01-02     X  79.84799
## 3 2009-01-03     X 100.47396
## 4 2009-01-04     X 106.13097
## 5 2009-01-05     X  98.29982
## 6 2009-01-06     X 102.25760

# Make this back into wide data, which is the same as the original
stocks_wide <- stocks_long %>% spread(stock, price)
head(stocks_wide)

##         time         X         Y        Z
## 1 2009-01-01 108.01125  93.08611 152.3230
## 2 2009-01-02  79.84799 121.25669 153.8985
## 3 2009-01-03 100.47396 136.35269 152.3498
## 4 2009-01-04 106.13097 128.25358 150.9811
## 5 2009-01-05  98.29982  97.56034 143.7360
## 6 2009-01-06 102.25760  87.32095 147.9675

setequal(stocks, stocks_wide)

## TRUE

# We can make this even wider where the row is the stock
stocks_wider <- stocks_long %>% spread(time, price)
head(stocks_wider)

##   stock 2009-01-01 2009-01-02 2009-01-03 2009-01-04 2009-01-05 2009-01-06
## 1     X  108.01125   79.84799   100.4740   106.1310   98.29982  102.25760
## 2     Y   93.08611  121.25669   136.3527   128.2536   97.56034   87.32095
## 3     Z  152.32303  153.89850   152.3498   150.9811  143.73599  147.96749
##   2009-01-07 2009-01-08 2009-01-09 2009-01-10
## 1   108.9240   98.52387   115.3718   104.5262
## 2   123.6391  138.74614   110.8400   132.5535
## 3   153.4331  151.80747   146.4084   151.2760

2.2.5 Note on `reshape2`

When you’re interested in the general reshaping of data, the reshape2 package may be useful. Given that most analysis requires tidy data, the tidyr package should suffice for most cases.

3 Data Manipulation

Data manipulation is a core task in data analysis. Data manipulation refers to the process of changing or modifying data so you can efficiently extract, filter, and transform it. In the Tidyverse, the dplyr package consists of five “verbs” that are useful for data manipulation: select(), mutate(), filter(), arrange(), and summarize(). dplyr also assists in binding, grouping, and merging data, as well as accessing external databases.

3.1 Selecting Columns

Often we will want to select columns from a data frame. To do this in dplyr there is the select() function which takes a table and a number of variable names. You can also use - to exclude some variables, or index variables using column numbers.

data(iris)
reduced <- iris %>% select(Species, Sepal.Width)
head(reduced)

##   Species Sepal.Width
## 1  setosa         3.5
## 2  setosa         3.0
## 3  setosa         3.2
## 4  setosa         3.1
## 5  setosa         3.6
## 6  setosa         3.9

We can also link this with a series of helper functions to assist in selection. This makes select() particularly helpful:

starts_with("X"): every name that starts with “X”,
ends_with("X"): every name that ends with “X”,
contains("X"): every name that contains “X”,
matches("X"): every name that matches “X”, where “X” can be a regular expression,
num_range("x", 1:5): the variables named x01, x02, x03, x04 and x05,
one_of(x): every name that appears in x, which should be a character vector.

# Helper functions are useful for selecting columns
s_cols <- iris %>% select(starts_with("S"))
colnames(s_cols)

## [1] "Sepal.Length" "Sepal.Width"  "Species"

dot_cols <- iris %>% select(contains("."))
colnames(dot_cols)

## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"

3.2 Mutating Data

mutate() does the opposite of select(), it allows you to add new columns to a tibble. The syntax of mutate() is similar to select(). The first argument is a tibble, then the second is a statement about how to define the new column.

# You can add multiple columns at once
iris <- iris %>% mutate(Sepal.Area = Sepal.Length * Sepal.Width,
                        Petal.Area = Petal.Length * Petal.Width)
summary(iris$Sepal.Area)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.00   15.66   17.66   17.82   20.32   30.02

summary(iris$Petal.Area)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.110   0.420   5.615   5.794   9.690  15.870

3.3 Filtering Data

filter() filters out rows based upon some expression. Think about it as the row-wise function of select(). We can narrow observations based upon Boolean logicals. filter() will return all rows that pass the logical test that is specified. This returns a new dataset that must be saved.

setosa_df <- iris %>% filter(Species == "setosa")
large_sepals <- iris %>% filter(Sepal.Area > mean(Sepal.Area))

# Let's show how we can create combine these
large_irises <- iris %>%
  mutate(Sepal.Area = Sepal.Length * Sepal.Width, Petal.Area = Petal.Length * Petal.Width) %>%
    filter(Sepal.Area > mean(Sepal.Area) & Petal.Area > mean(Petal.Area))
head(large_irises)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species Sepal.Area
## 1          7.0         3.2          4.7         1.4 versicolor      22.40
## 2          6.4         3.2          4.5         1.5 versicolor      20.48
## 3          6.9         3.1          4.9         1.5 versicolor      21.39
## 4          6.5         2.8          4.6         1.5 versicolor      18.20
## 5          6.3         3.3          4.7         1.6 versicolor      20.79
## 6          6.6         2.9          4.6         1.3 versicolor      19.14
##   Petal.Area
## 1       6.58
## 2       6.75
## 3       7.35
## 4       6.90
## 5       7.52
## 6       5.98

3.4 Arranging Data

arrange() reorders rows based upon certain operators. The analyst tells the function which columns to arrange the rows by. This will default to increasing/ascending order, but you can specify descending order using desc(). You can also list a second variable as a tie breaker which will sort ties among the first variable. For a character variable, rows will be arranged in alphabetical order. For factors, they’ll be arranged according to the factor ordering. Let’s build on the prior data frame and arrange it in descending order of sepal size using desc().

head(large_irises)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species Sepal.Area
## 1          7.0         3.2          4.7         1.4 versicolor      22.40
## 2          6.4         3.2          4.5         1.5 versicolor      20.48
## 3          6.9         3.1          4.9         1.5 versicolor      21.39
## 4          6.5         2.8          4.6         1.5 versicolor      18.20
## 5          6.3         3.3          4.7         1.6 versicolor      20.79
## 6          6.6         2.9          4.6         1.3 versicolor      19.14
##   Petal.Area
## 1       6.58
## 2       6.75
## 3       7.35
## 4       6.90
## 5       7.52
## 6       5.98

sorted_large_irises <- iris %>%
  mutate(Sepal.Area = Sepal.Length * Sepal.Width, Petal.Area = Petal.Length * Petal.Width) %>%
    filter(Sepal.Area > mean(Sepal.Area) & Petal.Area > mean(Petal.Area)) %>%
      arrange(desc(Sepal.Area)) # removing desc() produces ascending order

head(sorted_large_irises)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width   Species Sepal.Area
## 1          7.9         3.8          6.4         2.0 virginica      30.02
## 2          7.7         3.8          6.7         2.2 virginica      29.26
## 3          7.2         3.6          6.1         2.5 virginica      25.92
## 4          7.7         3.0          6.1         2.3 virginica      23.10
## 5          7.2         3.2          6.0         1.8 virginica      23.04
## 6          7.6         3.0          6.6         2.1 virginica      22.80
##   Petal.Area
## 1      12.80
## 2      14.74
## 3      15.25
## 4      14.03
## 5      10.80
## 6      13.86

3.5 Summarizing Data

We can also use the summarise() or summarize() function to create a new dataset of summary statistics. These functions include the same syntax as the prior commands and allows you to choose summary statistics to include in a new aggregated dataframe. Aggregate functions should be used, these are functions that take on a vector of values and return a single number.

dplyr also comes with a variety of aggregate functions that are in addition to those defined in R:

first(x): The first element of vector x.
last(x): The last element of vector x.
nth(x, n): The nth element of vector x.
n(): The number of rows in the data.frame or group of observations that summarise() describes.
n_distinct(x): The number of unique values in vector x.

iris_summary <- iris %>%
  mutate(Sepal.Area = Sepal.Length * Sepal.Width, Petal.Area = Petal.Length * Petal.Width) %>%
    summarize(n_obs = n(),
              n_species = n_distinct(Species),
              average_sepal_area = mean(Sepal.Area),
              average_petal_area = mean(Petal.Area))
iris_summary

##   n_obs n_species average_sepal_area average_petal_area
## 1   150         3           17.82287           5.794067

3.6 Grouping Data

dplyr also has built-in means to group data. summarize() returns aggregated measures for a vector or for an entire dataset. But, what if we are interested in doing aggregate measures for groups of rows? We can look at group-based differences by using the group_by() function. This function takes a tibble and some group-indicator. The dataset will look the same, but then when you summarise() a grouped dataset, the cool part comes in, you get grouped aggregate statistics.

iris_summary <- iris %>%
  mutate(Sepal.Area = Sepal.Length * Sepal.Width, Petal.Area = Petal.Length * Petal.Width) %>%
    group_by(Species) %>% 
      summarize(n_obs = n(),
                average_sepal_area = mean(Sepal.Area),
                average_petal_area = mean(Petal.Area))
iris_summary

## # A tibble: 3 x 4
##   Species    n_obs average_sepal_area average_petal_area
##   <fct>      <int>              <dbl>              <dbl>
## 1 setosa        50               17.3              0.366
## 2 versicolor    50               16.5              5.72 
## 3 virginica     50               19.7             11.3

3.7 Combining Data

3.7.1 Binding Data

The easiest way to combine datasets is through binding. Binding requires that dataframes have the same columns in the same order and meaning (row binding), or when they have the same rows and just different columns (column binding). In base R you would typically use rbind() or cbind(). dplyr has an alternative option, bind_rows() and bind_cols().

bind_rows() combines two datasets with the same columns in the same orders to merge a dataset on top of one another. bind_cols() combines two datasets with the same rows in the same order to merge a dataset side by side. This assumes that order is matching for rows or columns. Joins solve this problem in binding by using keys, but binds require correct orders. These are quicker and return a tibble. In addition, the syntax is a little more flexible as you can pass it lists of dataframes. You can also add the .id argument that will give a new column that can help with grouping.

# Column binding is easy!
new_vars <- tibble(variable_1 = rnorm(nrow(iris), 0, 1),
                   variable_2 = rnorm(nrow(iris), 1, 1))

iris <- bind_cols(iris, new_vars)

# Subset into list
iris_subsetted <- list(filter(iris, Species == "setosa"),
                       filter(iris, Species == "versicolor"),
                       filter(iris, Species == "virginica"))

# See length of different data frames
unlist(lapply(iris_subsetted, nrow))

## [1] 50 50 50

# Bind rows together
iris_whole <- bind_rows(iris_subsetted)

# We can also use use purrr when we want to feed multiple objects torepeat some pipeline over an object
# This is useful when you are merging or joining multiple data frames together
library(purrr)
iris_whole_purr <- iris_subsetted %>% reduce(bind_rows)

# Objects are the same
setequal(iris_whole, iris_whole_purr)

## TRUE

3.8 Joining Data

Often our data in R exist across multiple datasets. As such, we will want to join datasets so that assembling datasets will be much easier and more effective. In base R there is the merge() function. dplyr’s join functions are better than base R’s merge() in that it preserves row order, has better syntax, and can be used for different databases. The grammar of dplyr’s data joins are similar to PostgreSQL’s grammar.

3.8.1 Keys

A join combines columns of two or more dataframe according to matching values in some identifier variables. To do this, analysts identify a set of keys that exist in both dataframes and combine tables according to matching key values in both dataframes. The first table’s key is the primary key and the second table’s key is the foreign key. The primary key should uniquely identify each row in a dataframe. The foreign key only has to match with a primary key. Sometimes you will have multiple keys.

3.8.2 Joins

The dplyr join functions appear mirror the PostgreSQL join function. The primary join function in dplyr is the left_join which takes a primary table, a secondary table containing information that will be added to the primary table, and a by argument which takes the key that is common across data tables. If a row in the first dataframe doesn’t have a match then an NA will be added. If an ID is in the secondary table but not in the first, then it will be dropped. We can also specify a vector of keys.

right_join treats the second table taken as the primary data frame, and ultimately does the opposite of the left_join. These are designed to work for data frames, but they also works with tibbles and references. Here’s an example:

tb1 <- tibble(key = letters,
              variable1 = rnorm(key, 0, 1),
              variable2 = rnorm(key, 1, 1))
tb2 <- tibble(key = letters[1:10],
               variable3 = rnorm(key, 2, 1))

# Left Join: Preserve all the first table
left_joined <- left_join(tb1, tb2, by = c("key"))
head(left_joined)

## # A tibble: 6 x 4
##   key   variable1 variable2 variable3
##   <chr>     <dbl>     <dbl>     <dbl>
## 1 a         2.08      0.953     2.75 
## 2 b        -0.944    -0.768     1.74 
## 3 c        -0.675     0.390     2.62 
## 4 d         1.02      1.59      4.05 
## 5 e        -0.818     2.20      0.844
## 6 f        -0.957     1.21      3.35

nrow(left_joined)

## [1] 26

left_joined$key

##  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q"
## [18] "r" "s" "t" "u" "v" "w" "x" "y" "z"

# Right Join: Preserve all the second table
right_joined <- right_join(tb1, tb2, by = c("key"))
head(right_joined)

## # A tibble: 6 x 4
##   key   variable1 variable2 variable3
##   <chr>     <dbl>     <dbl>     <dbl>
## 1 a         2.08      0.953     2.75 
## 2 b        -0.944    -0.768     1.74 
## 3 c        -0.675     0.390     2.62 
## 4 d         1.02      1.59      4.05 
## 5 e        -0.818     2.20      0.844
## 6 f        -0.957     1.21      3.35

nrow(right_joined)

## [1] 10

right_joined$key

##  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"

There are also some alternatives. inner_join is the most exclusive as it returns the rows present in both datasets. full_join is the most inclusive, returning all rows that appear in either dataset. All joins rely upon the same syntax.

tb1 <- tibble(key = c("A", "B", "C", "D", "E"),
              variable1 = rnorm(key, 0, 1),
              variable2 = rnorm(key, 1, 1))
tb2 <- tibble(key = c("B", "C", "D", "X", "Y", "Z"),
               variable3 = rnorm(key, 2, 1)) 

# Inner join: only overlapping rows (B, C, D)
inner_joined <- inner_join(tb1, tb2, by = c("key"))
head(inner_joined)

## # A tibble: 3 x 4
##   key   variable1 variable2 variable3
##   <chr>     <dbl>     <dbl>     <dbl>
## 1 B         0.126     1.41       2.59
## 2 C         0.473     0.279      1.33
## 3 D        -0.187     0.410      5.17

nrow(inner_joined)

## [1] 3

inner_joined$key

## [1] "B" "C" "D"

# Full join: All rows included
full_joined <- full_join(tb1, tb2, by = c("key"))
head(full_joined)

## # A tibble: 6 x 4
##   key   variable1 variable2 variable3
##   <chr>     <dbl>     <dbl>     <dbl>
## 1 A        -1.15      0.597     NA   
## 2 B         0.126     1.41       2.59
## 3 C         0.473     0.279      1.33
## 4 D        -0.187     0.410      5.17
## 5 E        -1.62     -0.170     NA   
## 6 X        NA        NA          3.01

nrow(full_joined)

## [1] 8

full_joined$key

## [1] "A" "B" "C" "D" "E" "X" "Y" "Z"

You use filter() to return a subset of rows. But you can also use filtering joins to take two dataframes and then return a copy of the primary data frame that has been filtered, not augmented. Using semi_join(), we can filter the first dataframe by whether there is a match in the second dataframe. This is useful for filtering on complex criteria and may be easier than creating a set of filter statements. In other words, semi joins are much more efficient:

# Semi join: Filter first data frame by matches in the second
# This will not include the new column
semijoined <- semi_join(tb1, tb2, by = c("key"))

There are also anti-joins, these are the opposite of semi-joins and return the rows in table 1 that do not match in table 2:

# Anti join: Records in the first table that are not in the second
# This will not include the new column
antijoined <- anti_join(tb1, tb2, by = c("key"))

3.9 External Database Access

In addition to being a grammar of data manipulation, dplyr offers an opportunity to engage with databases that are not local. Tabular data can be in a data table or a database, and these verbs can be used to deal with all of these different data formats. This universality means you can store large datasets on a database and then manipulate them. For example, if someone uses data.table you can still use functions on it. You can also communicate with a foreign SQL database:

# Set up a connection to the mysql database
my_db <- src_mysql(dbname = "dplyr", 
                   host = "courses.csrrinzqubik.us-east-1.rds.amazonaws.com", 
                   port = 3306, 
                   user = "student",
                   password = "datacamp")

# Reference a table within that source: nycflights
nycflights <- tbl(my_db, "dplyr")

# glimpse at nycflights
glimpse(nycflights)

## Observations: ??
## Variables: 17
## $ id        <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1...
## $ year      <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013...
## $ month     <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ day       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ dep_time  <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 55...
## $ dep_delay <int> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2,...
## $ arr_time  <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 8...
## $ arr_delay <int> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7,...
## $ carrier   <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6"...
## $ tailnum   <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N...
## $ flight    <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301...
## $ origin    <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LG...
## $ dest      <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IA...
## $ air_time  <int> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149...
## $ distance  <int> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 73...
## $ hour      <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6...
## $ minute    <int> 17, 33, 42, 44, 54, 54, 55, 57, 57, 58, 58, 58, 58, ...

# Ordered, grouped summary of nycflights
flight_summaries <- nycflights %>%
  group_by(carrier) %>%
    summarise(n_flights = n(), avg_delay = mean(arr_delay, na.rm = TRUE)) %>%
      arrange(avg_delay)

head(flight_summaries)

## Warning in .local(conn, statement, ...): Decimal MySQL column 2 imported as
## numeric

## # Source:     lazy query [?? x 3]
## # Database:   mysql 5.6.34-log
## #   [student@courses.csrrinzqubik.us-east-1.rds.amazonaws.com:/dplyr]
## # Ordered by: avg_delay
##   carrier n_flights avg_delay
##   <chr>       <dbl>     <dbl>
## 1 AS            714    -9.86 
## 2 HA            342    -6.92 
## 3 AA          32729     0.356
## 4 DL          48110     1.63 
## 5 VX           5162     1.75 
## 6 US          20536     2.06

4 Concluding Thoughts

I hope I’ve convinced you that the Tidyverse can be useful for data processing and manipulation! There is much more to these packages, their functionality, and the broader Tidyverse than I am able to present. For example, check out the other packages that I haven’t discussed, including ggplot2.

If you have any questions, please feel free to contact me at campbell.1721@osu.edu! You can also access other workshops I’ve conducted through my website or my GitHub. Thank you!

Data Manipulation in the Tidyverse

Benjamin W. Campbell, Senior Fellow, Program in Statistics and Methodology (PRISM)

25 September, 2018