Welcome to the Program in Statistics and Methodology’s (PRISM) Data Manipulation in the Tidyverse Workshop. This workshop was delivered on 25 September, 2018 as part of PRISM’s regularly scheduled programming and Data Analytics Month @ Ohio State.
The Tidyverse is a suite of packages and tools designed to help data scientists import, clean, manipulate, and visualize their data. The Tidyverse packages are unified by a shared philosophy of data structure (“tidy data”) and programming. The Tidyverse suite of packages consists of readr
, tibble
, tidyr
, dplyr
, purrr
, ggplot2
, stringr
, and forcats
. The Tidyverse, itself, is also a package that can be installed and accessed.
set.seed(1234)
install.packages(c("tidyverse", "testit", "microbenchmark", "RMySQL"), repos = "http://cran.us.r-project.org")
## Installing packages into '/Users/benjamincampbell/Library/R/3.4/library'
## (as 'lib' is unspecified)
##
## The downloaded binary packages are in
## /var/folders/vk/31y3cgkd3l52_0ynqgx9lrxr0000gn/T//RtmpcKo351/downloaded_packages
library("tidyverse")
## ── Attaching packages ────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.0.0 ✔ purrr 0.2.4
## ✔ tibble 1.4.2 ✔ dplyr 0.7.4
## ✔ tidyr 0.8.0 ✔ stringr 1.3.1
## ✔ readr 1.1.1 ✔ forcats 0.3.0
## ── Conflicts ───────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library("testit")
library("microbenchmark")
library("RMySQL")
## Loading required package: DBI
There are significant differences between how Base R and the Tidyverse handle particular tasks. Typically, proponents of the Tidyverse argue that there are several reasons it should be used:
%>%
).There are of course some drawbacks for Base R users:
tbl
) may not behave well with other functions, unconventional row labeling behavior.By the end of this workshop, it is my hope that you will feel comfortable using the Tidyverse to tidy and transform data. You will be familiar with the basic Tidyverse packages for tidying data (tibble
, tidyr
) and manipulating data (dplyr
).
As this workshop is primarily focused on wrangling data, we will punt on data importation (haven
, readr
) and data visualization (ggplot2
) in the Tidyverse.
There are two packages for producing “tidy” data within the Tidyverse framework. The first, tibble
, is a package introducing a new class of data frames, the tbl
. The second, tidyr
, is a package designed to assist in producing “tidy” data. Tidy data is described as data where each row is a different observation and each column is a different feature measured on each observation.
tibble
The tibble
package introduces a new class of data frame, the tbl
, which has a lot of cool features:
tibble(x = 1:3, y = list(1:5, 1:10, 1:20))
names(data.frame('ben likes cats' = 1))
v. names(tibble('ben likes cats' = 1))
tibble(x = 1:5, y = x - mean(x))
There are a few things, however, that trip people up:
Here are some examples:
# Input types retain type (no factor switching)
# Base R data.frame
df <- data.frame(A = letters)
class(letters)
## [1] "character"
class(df$A)
## [1] "factor"
# tibble tibble
tib <- tibble(A = letters)
class(tib$A)
## [1] "character"
# tibble data_frame
tib_df <- data_frame(A = letters)
class(tib_df$A)
## [1] "character"
# Data frames with list-columns
tibble(x = 1:3, y = list(1:5, 1:10, 1:20))
## # A tibble: 3 x 2
## x y
## <int> <list>
## 1 1 <int [5]>
## 2 2 <int [10]>
## 3 3 <int [20]>
# Column naming
names(data.frame('ben likes cats' = 1))
## [1] "ben.likes.cats"
names(tibble('ben likes cats' = 1))
## [1] "ben likes cats"
# Sequential naming
tibble(x = 1:5, y = x - mean(x))
## # A tibble: 5 x 2
## x y
## <int> <dbl>
## 1 1 -2
## 2 2 -1
## 3 3 0
## 4 4 1
## 5 5 2
You can also coerce data to a tibble using the as_tibble()
command. Such a command is written explicitly to be much more efficient than the base as.data.frame()
command.
if (requireNamespace("microbenchmark")) {
l <- replicate(26, sample(100), simplify = FALSE)
names(l) <- letters
microbenchmark::microbenchmark(
as_tibble(l),
as.data.frame(l)
)
}
## Unit: microseconds
## expr min lq mean median uq max
## as_tibble(l) 367.012 401.799 565.8195 443.115 603.5645 2341.569
## as.data.frame(l) 1180.468 1227.057 1469.4458 1299.943 1513.2675 3359.543
## neval cld
## 100 a
## 100 b
When printed, tibbles will be much more manageable, printing only the first ten rows and all the columns that will fit on one screen. This of course can be adjusted with a series of options:
options(tibble.print_max = n, tibble.print_min = m)
: if there are more than n rows, print only the first m rows. Use options(tibble.print_max = Inf)
to show all rows, similar to the default print method for data frames.options(tibble.width = Inf)
will always print all columns, regardless of the width of the screen.tibble(alphabet = LETTERS)
## # A tibble: 26 x 1
## alphabet
## <chr>
## 1 A
## 2 B
## 3 C
## 4 D
## 5 E
## 6 F
## 7 G
## 8 H
## 9 I
## 10 J
## # ... with 16 more rows
Tibbles have more consistent behavior when subsetting. When subsetting or indexing data frames, using square brackets [,]
may return a vector or a data frame. However, when using square bracket indexing [,]
for tibbles, a tibble will always be returned. When you want a single column vector from a tibble you may use $
or [[]]
indexing.
# Indexing and subsetting using square brackets
df1 <- data.frame(x = letters, y = 1:length(letters))
class(df1[1:10,]) # indexing > 1 rows
## [1] "data.frame"
class(df1[,1]) # indexing 1 column
## [1] "factor"
df2 <- tibble(x = letters, y = 1:length(letters))
class(df2[1:10,]) # indexing > 1 rows
## [1] "tbl_df" "tbl" "data.frame"
class(df2[,1]) # indexing 1 column
## [1] "tbl_df" "tbl" "data.frame"
# Indexing columns to vectors
class(df2[[1]])
## [1] "character"
class(df2$x)
## [1] "character"
Data frames allow for partial matching for column names using the $
indexing method. This does not work for tibbles.
df <- data.frame(let = letters)
df$l
## [1] a b c d e f g h i j k l m n o p q r s t u v w x y z
## Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z
df2 <- tibble(let = letters)
df2$l
## Warning: Unknown or uninitialised column: 'l'.
## NULL
Finally, tibbles are more conservative in how they handle data recycling relative to data frames. In data frames, recycling occurs when full recycling is allowed. For example, if a data frame is to have 20 rows and for one column only 5 values are supplied, those 5 values will be repeated 4 times to populate all 20 rows. This will not work if a data frame is to have 20 rows and for one column only 6 values are supplied, as 20 divided by 6 does not produce an integer. Tibbles will only recycle values of length 1. In other words, you cannot truly recycle values with tibbles, but you can broadcast.
# Tibbles and data frames allow for broadcasting, expanding b to match the length of letters
data.frame(a = letters[1:4], b = 1)
## a b
## 1 a 1
## 2 b 1
## 3 c 1
## 4 d 1
tibble(a = letters[1:4], b = 1)
## # A tibble: 4 x 2
## a b
## <chr> <dbl>
## 1 a 1
## 2 b 1
## 3 c 1
## 4 d 1
# Data frames will allow for full recycling
data.frame(a = letters[1:4], b = 1:2)
## a b
## 1 a 1
## 2 b 2
## 3 c 1
## 4 d 2
# but not partial recycling
testit::has_error(data.frame(a = letters[1:4], b = 1:3))
## [1] TRUE
# Neither of these will work for tibbles
testit::has_error(tibble(a = letters[1:4], b = 1:2))
## [1] TRUE
testit::has_error(tibble(a = letters[1:4], b = 1:5))
## [1] TRUE
tidyr
Central to data analysis, and even data preparation, is structuring datasets to the format necessary for analysis. Hadley Wickham refers to this step as data tidying. There are several principles of tidy data laid out in Wickham’s The Journal of Statistical Software piece “Tidy Data.” These include:
Tidy data is designed to work well with vectorized languages like R because it ensures that observations’ variable values are always paired. Data is not tidy when:
Wickham’s tidyr
package is designed to help analysts deal with untidy data and introduces three tools: gathering, separating, and spreading.
The gather()
command will take multiple columns and gather them into tidy data, making landscape data into portrait data. If we have columns that are not variables, then gather is useful. Here is an example of how gather()
is useful. In this case, we have heart rates for 3 people over 2 treatment conditions.
messy <- tibble(
name = c("Bob", "Peter", "Jan"),
a = c(67, 130, 150),
b = c(56, 121, 20)
)
messy
## # A tibble: 3 x 3
## name a b
## <chr> <dbl> <dbl>
## 1 Bob 67 56
## 2 Peter 130 121
## 3 Jan 150 20
In this case we have three variables but currently we only have one of them in a column (name). There are two other variables that we need to create: treatment and heart rate. We can use gather
to produce these variables and tidy the data:
messy %>%
gather(drug, heartrate, a:b)
## # A tibble: 6 x 3
## name drug heartrate
## <chr> <chr> <dbl>
## 1 Bob a 67
## 2 Peter a 130
## 3 Jan a 150
## 4 Bob b 56
## 5 Peter b 121
## 6 Jan b 20
If variables are grouped together into one, we may be interested in using the separate()
function to disentangle them. This function is related to extract()
, which uses regular expressions for splitting instead of pattern. Here we have some fake data examining two outcomes observed over two different time periods according to a treatment or control condition.
messy <- data.frame(
id = 1:4,
trt = sample(rep(c('control', 'treatment'), each = 2)),
outcome1.before = runif(4),
outcome2.before = runif(4),
outcome1.after = runif(4),
outcome2.after = runif(4)
)
messy
## id trt outcome1.before outcome2.before outcome1.after
## 1 1 control 0.9751596 0.9127080 0.5475543
## 2 2 control 0.8759398 0.7397774 0.8516214
## 3 3 treatment 0.4292365 0.9504254 0.5014444
## 4 4 treatment 0.3547238 0.5261304 0.9361675
## outcome2.after
## 1 0.5778685
## 2 0.8627265
## 3 0.7516653
## 4 0.6981145
We will begin tidying this messy data by using the gather()
command to turn the treatment columns into key-value paired data according to the key id
and time
:
# Here we are using the pipe operator which takes an argument or object on the left and passess it as the first argument to the function on the right hand side.
# We are specifying two new columns for key and value and specifying that the variables other than id and trt should be gathered
tidier <- messy %>%
gather(time, value, -id, -trt)
tidier %>% head(8)
## id trt time value
## 1 1 control outcome1.before 0.9751596
## 2 2 control outcome1.before 0.8759398
## 3 3 treatment outcome1.before 0.4292365
## 4 4 treatment outcome1.before 0.3547238
## 5 1 control outcome2.before 0.9127080
## 6 2 control outcome2.before 0.7397774
## 7 3 treatment outcome2.before 0.9504254
## 8 4 treatment outcome2.before 0.5261304
Then, using separate()
we will split the key into the outcome number and the time period using regular expressions:
tidy <- tidier %>%
separate(time, into = c("outcome", "time"), sep = "\\.")
tidy %>% head(8)
## id trt outcome time value
## 1 1 control outcome1 before 0.9751596
## 2 2 control outcome1 before 0.8759398
## 3 3 treatment outcome1 before 0.4292365
## 4 4 treatment outcome1 before 0.3547238
## 5 1 control outcome2 before 0.9127080
## 6 2 control outcome2 before 0.7397774
## 7 3 treatment outcome2 before 0.9504254
## 8 4 treatment outcome2 before 0.5261304
The spread()
function takes two columns, a key and a value, and then spreads them into multiple columns. This takes portrait data and makes it into landscape data. If you have variables that are in rows instead of columns, this function is particularly useful. Here is an example:
# Generate fake stock price data
stocks <- data.frame(
time = as.Date('2009-01-01') + 0:9,
X = rnorm(10, 100, 10),
Y = rnorm(10, 125, 20),
Z = rnorm(10, 150, 4)
)
head(stocks)
## time X Y Z
## 1 2009-01-01 108.01125 93.08611 152.3230
## 2 2009-01-02 79.84799 121.25669 153.8985
## 3 2009-01-03 100.47396 136.35269 152.3498
## 4 2009-01-04 106.13097 128.25358 150.9811
## 5 2009-01-05 98.29982 97.56034 143.7360
## 6 2009-01-06 102.25760 87.32095 147.9675
# Gather to make these into long data
stocks_long <- stocks %>% gather(stock, price, -time)
head(stocks_long)
## time stock price
## 1 2009-01-01 X 108.01125
## 2 2009-01-02 X 79.84799
## 3 2009-01-03 X 100.47396
## 4 2009-01-04 X 106.13097
## 5 2009-01-05 X 98.29982
## 6 2009-01-06 X 102.25760
# Make this back into wide data, which is the same as the original
stocks_wide <- stocks_long %>% spread(stock, price)
head(stocks_wide)
## time X Y Z
## 1 2009-01-01 108.01125 93.08611 152.3230
## 2 2009-01-02 79.84799 121.25669 153.8985
## 3 2009-01-03 100.47396 136.35269 152.3498
## 4 2009-01-04 106.13097 128.25358 150.9811
## 5 2009-01-05 98.29982 97.56034 143.7360
## 6 2009-01-06 102.25760 87.32095 147.9675
setequal(stocks, stocks_wide)
## TRUE
# We can make this even wider where the row is the stock
stocks_wider <- stocks_long %>% spread(time, price)
head(stocks_wider)
## stock 2009-01-01 2009-01-02 2009-01-03 2009-01-04 2009-01-05 2009-01-06
## 1 X 108.01125 79.84799 100.4740 106.1310 98.29982 102.25760
## 2 Y 93.08611 121.25669 136.3527 128.2536 97.56034 87.32095
## 3 Z 152.32303 153.89850 152.3498 150.9811 143.73599 147.96749
## 2009-01-07 2009-01-08 2009-01-09 2009-01-10
## 1 108.9240 98.52387 115.3718 104.5262
## 2 123.6391 138.74614 110.8400 132.5535
## 3 153.4331 151.80747 146.4084 151.2760
reshape2
When you’re interested in the general reshaping of data, the reshape2
package may be useful. Given that most analysis requires tidy data, the tidyr
package should suffice for most cases.
Data manipulation is a core task in data analysis. Data manipulation refers to the process of changing or modifying data so you can efficiently extract, filter, and transform it. In the Tidyverse, the dplyr
package consists of five “verbs” that are useful for data manipulation: select()
, mutate()
, filter()
, arrange()
, and summarize()
. dplyr
also assists in binding, grouping, and merging data, as well as accessing external databases.
Often we will want to select columns from a data frame. To do this in dplyr
there is the select()
function which takes a table and a number of variable names. You can also use -
to exclude some variables, or index variables using column numbers.
data(iris)
reduced <- iris %>% select(Species, Sepal.Width)
head(reduced)
## Species Sepal.Width
## 1 setosa 3.5
## 2 setosa 3.0
## 3 setosa 3.2
## 4 setosa 3.1
## 5 setosa 3.6
## 6 setosa 3.9
We can also link this with a series of helper functions to assist in selection. This makes select()
particularly helpful:
starts_with("X")
: every name that starts with “X”,ends_with("X")
: every name that ends with “X”,contains("X")
: every name that contains “X”,matches("X")
: every name that matches “X”, where “X” can be a regular expression,num_range("x", 1:5)
: the variables named x01, x02, x03, x04 and x05,one_of(x)
: every name that appears in x, which should be a character vector.# Helper functions are useful for selecting columns
s_cols <- iris %>% select(starts_with("S"))
colnames(s_cols)
## [1] "Sepal.Length" "Sepal.Width" "Species"
dot_cols <- iris %>% select(contains("."))
colnames(dot_cols)
## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
mutate()
does the opposite of select()
, it allows you to add new columns to a tibble. The syntax of mutate()
is similar to select()
. The first argument is a tibble, then the second is a statement about how to define the new column.
# You can add multiple columns at once
iris <- iris %>% mutate(Sepal.Area = Sepal.Length * Sepal.Width,
Petal.Area = Petal.Length * Petal.Width)
summary(iris$Sepal.Area)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.00 15.66 17.66 17.82 20.32 30.02
summary(iris$Petal.Area)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.110 0.420 5.615 5.794 9.690 15.870
filter()
filters out rows based upon some expression. Think about it as the row-wise function of select()
. We can narrow observations based upon Boolean logicals. filter()
will return all rows that pass the logical test that is specified. This returns a new dataset that must be saved.
setosa_df <- iris %>% filter(Species == "setosa")
large_sepals <- iris %>% filter(Sepal.Area > mean(Sepal.Area))
# Let's show how we can create combine these
large_irises <- iris %>%
mutate(Sepal.Area = Sepal.Length * Sepal.Width, Petal.Area = Petal.Length * Petal.Width) %>%
filter(Sepal.Area > mean(Sepal.Area) & Petal.Area > mean(Petal.Area))
head(large_irises)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Area
## 1 7.0 3.2 4.7 1.4 versicolor 22.40
## 2 6.4 3.2 4.5 1.5 versicolor 20.48
## 3 6.9 3.1 4.9 1.5 versicolor 21.39
## 4 6.5 2.8 4.6 1.5 versicolor 18.20
## 5 6.3 3.3 4.7 1.6 versicolor 20.79
## 6 6.6 2.9 4.6 1.3 versicolor 19.14
## Petal.Area
## 1 6.58
## 2 6.75
## 3 7.35
## 4 6.90
## 5 7.52
## 6 5.98
arrange()
reorders rows based upon certain operators. The analyst tells the function which columns to arrange the rows by. This will default to increasing/ascending order, but you can specify descending order using desc()
. You can also list a second variable as a tie breaker which will sort ties among the first variable. For a character variable, rows will be arranged in alphabetical order. For factors, they’ll be arranged according to the factor ordering. Let’s build on the prior data frame and arrange it in descending order of sepal size using desc()
.
head(large_irises)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Area
## 1 7.0 3.2 4.7 1.4 versicolor 22.40
## 2 6.4 3.2 4.5 1.5 versicolor 20.48
## 3 6.9 3.1 4.9 1.5 versicolor 21.39
## 4 6.5 2.8 4.6 1.5 versicolor 18.20
## 5 6.3 3.3 4.7 1.6 versicolor 20.79
## 6 6.6 2.9 4.6 1.3 versicolor 19.14
## Petal.Area
## 1 6.58
## 2 6.75
## 3 7.35
## 4 6.90
## 5 7.52
## 6 5.98
sorted_large_irises <- iris %>%
mutate(Sepal.Area = Sepal.Length * Sepal.Width, Petal.Area = Petal.Length * Petal.Width) %>%
filter(Sepal.Area > mean(Sepal.Area) & Petal.Area > mean(Petal.Area)) %>%
arrange(desc(Sepal.Area)) # removing desc() produces ascending order
head(sorted_large_irises)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Area
## 1 7.9 3.8 6.4 2.0 virginica 30.02
## 2 7.7 3.8 6.7 2.2 virginica 29.26
## 3 7.2 3.6 6.1 2.5 virginica 25.92
## 4 7.7 3.0 6.1 2.3 virginica 23.10
## 5 7.2 3.2 6.0 1.8 virginica 23.04
## 6 7.6 3.0 6.6 2.1 virginica 22.80
## Petal.Area
## 1 12.80
## 2 14.74
## 3 15.25
## 4 14.03
## 5 10.80
## 6 13.86
We can also use the summarise()
or summarize()
function to create a new dataset of summary statistics. These functions include the same syntax as the prior commands and allows you to choose summary statistics to include in a new aggregated dataframe. Aggregate functions should be used, these are functions that take on a vector of values and return a single number.
dplyr
also comes with a variety of aggregate functions that are in addition to those defined in R:
first(x)
: The first element of vector x.last(x)
: The last element of vector x.nth(x, n)
: The nth element of vector x.n()
: The number of rows in the data.frame or group of observations that summarise()
describes.n_distinct(x)
: The number of unique values in vector x.iris_summary <- iris %>%
mutate(Sepal.Area = Sepal.Length * Sepal.Width, Petal.Area = Petal.Length * Petal.Width) %>%
summarize(n_obs = n(),
n_species = n_distinct(Species),
average_sepal_area = mean(Sepal.Area),
average_petal_area = mean(Petal.Area))
iris_summary
## n_obs n_species average_sepal_area average_petal_area
## 1 150 3 17.82287 5.794067
dplyr
also has built-in means to group data. summarize()
returns aggregated measures for a vector or for an entire dataset. But, what if we are interested in doing aggregate measures for groups of rows? We can look at group-based differences by using the group_by()
function. This function takes a tibble and some group-indicator. The dataset will look the same, but then when you summarise()
a grouped dataset, the cool part comes in, you get grouped aggregate statistics.
iris_summary <- iris %>%
mutate(Sepal.Area = Sepal.Length * Sepal.Width, Petal.Area = Petal.Length * Petal.Width) %>%
group_by(Species) %>%
summarize(n_obs = n(),
average_sepal_area = mean(Sepal.Area),
average_petal_area = mean(Petal.Area))
iris_summary
## # A tibble: 3 x 4
## Species n_obs average_sepal_area average_petal_area
## <fct> <int> <dbl> <dbl>
## 1 setosa 50 17.3 0.366
## 2 versicolor 50 16.5 5.72
## 3 virginica 50 19.7 11.3
The easiest way to combine datasets is through binding. Binding requires that dataframes have the same columns in the same order and meaning (row binding), or when they have the same rows and just different columns (column binding). In base R you would typically use rbind()
or cbind()
. dplyr
has an alternative option, bind_rows()
and bind_cols()
.
bind_rows()
combines two datasets with the same columns in the same orders to merge a dataset on top of one another. bind_cols()
combines two datasets with the same rows in the same order to merge a dataset side by side. This assumes that order is matching for rows or columns. Joins solve this problem in binding by using keys, but binds require correct orders. These are quicker and return a tibble. In addition, the syntax is a little more flexible as you can pass it lists of dataframes. You can also add the .id
argument that will give a new column that can help with grouping.
# Column binding is easy!
new_vars <- tibble(variable_1 = rnorm(nrow(iris), 0, 1),
variable_2 = rnorm(nrow(iris), 1, 1))
iris <- bind_cols(iris, new_vars)
# Subset into list
iris_subsetted <- list(filter(iris, Species == "setosa"),
filter(iris, Species == "versicolor"),
filter(iris, Species == "virginica"))
# See length of different data frames
unlist(lapply(iris_subsetted, nrow))
## [1] 50 50 50
# Bind rows together
iris_whole <- bind_rows(iris_subsetted)
# We can also use use purrr when we want to feed multiple objects torepeat some pipeline over an object
# This is useful when you are merging or joining multiple data frames together
library(purrr)
iris_whole_purr <- iris_subsetted %>% reduce(bind_rows)
# Objects are the same
setequal(iris_whole, iris_whole_purr)
## TRUE
Often our data in R exist across multiple datasets. As such, we will want to join datasets so that assembling datasets will be much easier and more effective. In base R there is the merge()
function. dplyr
’s join functions are better than base R’s merge()
in that it preserves row order, has better syntax, and can be used for different databases. The grammar of dplyr
’s data joins are similar to PostgreSQL’s grammar.
A join combines columns of two or more dataframe according to matching values in some identifier variables. To do this, analysts identify a set of keys that exist in both dataframes and combine tables according to matching key values in both dataframes. The first table’s key is the primary key and the second table’s key is the foreign key. The primary key should uniquely identify each row in a dataframe. The foreign key only has to match with a primary key. Sometimes you will have multiple keys.
The dplyr
join functions appear mirror the PostgreSQL join function. The primary join function in dplyr
is the left_join
which takes a primary table, a secondary table containing information that will be added to the primary table, and a by
argument which takes the key that is common across data tables. If a row in the first dataframe doesn’t have a match then an NA will be added. If an ID is in the secondary table but not in the first, then it will be dropped. We can also specify a vector of keys.
right_join
treats the second table taken as the primary data frame, and ultimately does the opposite of the left_join
. These are designed to work for data frames, but they also works with tibbles and references. Here’s an example:
tb1 <- tibble(key = letters,
variable1 = rnorm(key, 0, 1),
variable2 = rnorm(key, 1, 1))
tb2 <- tibble(key = letters[1:10],
variable3 = rnorm(key, 2, 1))
# Left Join: Preserve all the first table
left_joined <- left_join(tb1, tb2, by = c("key"))
head(left_joined)
## # A tibble: 6 x 4
## key variable1 variable2 variable3
## <chr> <dbl> <dbl> <dbl>
## 1 a 2.08 0.953 2.75
## 2 b -0.944 -0.768 1.74
## 3 c -0.675 0.390 2.62
## 4 d 1.02 1.59 4.05
## 5 e -0.818 2.20 0.844
## 6 f -0.957 1.21 3.35
nrow(left_joined)
## [1] 26
left_joined$key
## [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q"
## [18] "r" "s" "t" "u" "v" "w" "x" "y" "z"
# Right Join: Preserve all the second table
right_joined <- right_join(tb1, tb2, by = c("key"))
head(right_joined)
## # A tibble: 6 x 4
## key variable1 variable2 variable3
## <chr> <dbl> <dbl> <dbl>
## 1 a 2.08 0.953 2.75
## 2 b -0.944 -0.768 1.74
## 3 c -0.675 0.390 2.62
## 4 d 1.02 1.59 4.05
## 5 e -0.818 2.20 0.844
## 6 f -0.957 1.21 3.35
nrow(right_joined)
## [1] 10
right_joined$key
## [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
There are also some alternatives. inner_join
is the most exclusive as it returns the rows present in both datasets. full_join
is the most inclusive, returning all rows that appear in either dataset. All joins rely upon the same syntax.
tb1 <- tibble(key = c("A", "B", "C", "D", "E"),
variable1 = rnorm(key, 0, 1),
variable2 = rnorm(key, 1, 1))
tb2 <- tibble(key = c("B", "C", "D", "X", "Y", "Z"),
variable3 = rnorm(key, 2, 1))
# Inner join: only overlapping rows (B, C, D)
inner_joined <- inner_join(tb1, tb2, by = c("key"))
head(inner_joined)
## # A tibble: 3 x 4
## key variable1 variable2 variable3
## <chr> <dbl> <dbl> <dbl>
## 1 B 0.126 1.41 2.59
## 2 C 0.473 0.279 1.33
## 3 D -0.187 0.410 5.17
nrow(inner_joined)
## [1] 3
inner_joined$key
## [1] "B" "C" "D"
# Full join: All rows included
full_joined <- full_join(tb1, tb2, by = c("key"))
head(full_joined)
## # A tibble: 6 x 4
## key variable1 variable2 variable3
## <chr> <dbl> <dbl> <dbl>
## 1 A -1.15 0.597 NA
## 2 B 0.126 1.41 2.59
## 3 C 0.473 0.279 1.33
## 4 D -0.187 0.410 5.17
## 5 E -1.62 -0.170 NA
## 6 X NA NA 3.01
nrow(full_joined)
## [1] 8
full_joined$key
## [1] "A" "B" "C" "D" "E" "X" "Y" "Z"
You use filter()
to return a subset of rows. But you can also use filtering joins to take two dataframes and then return a copy of the primary data frame that has been filtered, not augmented. Using semi_join()
, we can filter the first dataframe by whether there is a match in the second dataframe. This is useful for filtering on complex criteria and may be easier than creating a set of filter statements. In other words, semi joins are much more efficient:
# Semi join: Filter first data frame by matches in the second
# This will not include the new column
semijoined <- semi_join(tb1, tb2, by = c("key"))
There are also anti-joins, these are the opposite of semi-joins and return the rows in table 1 that do not match in table 2:
# Anti join: Records in the first table that are not in the second
# This will not include the new column
antijoined <- anti_join(tb1, tb2, by = c("key"))
In addition to being a grammar of data manipulation, dplyr
offers an opportunity to engage with databases that are not local. Tabular data can be in a data table or a database, and these verbs can be used to deal with all of these different data formats. This universality means you can store large datasets on a database and then manipulate them. For example, if someone uses data.table
you can still use functions on it. You can also communicate with a foreign SQL database:
# Set up a connection to the mysql database
my_db <- src_mysql(dbname = "dplyr",
host = "courses.csrrinzqubik.us-east-1.rds.amazonaws.com",
port = 3306,
user = "student",
password = "datacamp")
# Reference a table within that source: nycflights
nycflights <- tbl(my_db, "dplyr")
# glimpse at nycflights
glimpse(nycflights)
## Observations: ??
## Variables: 17
## $ id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1...
## $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013...
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ dep_time <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 55...
## $ dep_delay <int> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2,...
## $ arr_time <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 8...
## $ arr_delay <int> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7,...
## $ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6"...
## $ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N...
## $ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301...
## $ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LG...
## $ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IA...
## $ air_time <int> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149...
## $ distance <int> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 73...
## $ hour <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6...
## $ minute <int> 17, 33, 42, 44, 54, 54, 55, 57, 57, 58, 58, 58, 58, ...
# Ordered, grouped summary of nycflights
flight_summaries <- nycflights %>%
group_by(carrier) %>%
summarise(n_flights = n(), avg_delay = mean(arr_delay, na.rm = TRUE)) %>%
arrange(avg_delay)
head(flight_summaries)
## Warning in .local(conn, statement, ...): Decimal MySQL column 2 imported as
## numeric
## # Source: lazy query [?? x 3]
## # Database: mysql 5.6.34-log
## # [student@courses.csrrinzqubik.us-east-1.rds.amazonaws.com:/dplyr]
## # Ordered by: avg_delay
## carrier n_flights avg_delay
## <chr> <dbl> <dbl>
## 1 AS 714 -9.86
## 2 HA 342 -6.92
## 3 AA 32729 0.356
## 4 DL 48110 1.63
## 5 VX 5162 1.75
## 6 US 20536 2.06
I hope I’ve convinced you that the Tidyverse can be useful for data processing and manipulation! There is much more to these packages, their functionality, and the broader Tidyverse than I am able to present. For example, check out the other packages that I haven’t discussed, including ggplot2
.
If you have any questions, please feel free to contact me at campbell.1721@osu.edu! You can also access other workshops I’ve conducted through my website or my GitHub. Thank you!