- Manipulating Data with dplyr Package
- Grouping and Chaining with dplyr package
- Tidying Data with tidyr package
- Type one: column headers are values, not variable names
- Type two: multiple variables are stored in one column
- Type three: variables are stored in both rows and columns
- Type four: multiple types of observational units are stored in the same table
- Type five: a single observational unit is stored in multiple tables
- Bringing the five types together to deal with real data
- Dates and Times with lubridate
More details can be found in the html file here.
Manipulating Data with dplyr Package
dplyr is a fast and powerful R package written by Hadley Wickham and Romain Francois. The dplyr philosophy is to have small functions that each do one thing well.
One unique aspect of dplyr is that the same set of tools allows you to work with tabular data from a variety of sources, including
- data frames
- data tables
- databases
- multidimensional arrays
- Step 1: Download the data and load it into R.
if(!file.exists("data")) dir.create("data")
fileUrl <- "http://cran-logs.rstudio.com/2014/2014-07-08.csv.gz"
download.file(fileUrl, "./data/path2csv.csv")
mydf <- read.csv("./data/path2csv.csv", stringsAsFactors = FALSE)
- Step 2: Inspect the dimensions and the first few rows.
dim(mydf)
head(mydf)
- Step 3: Load the dplyr package.
library(dplyr)
# check version: you need to have version 0.4.0 or later
packageVersion("dplyr")
From now on, we will focus on how to manipulate data with the dplyr package.
- Step 4: The first step of working with data in dplyr is to load the data into what the package author called a “data frame tbl” or “tbl_df”.
# tbl_df() is deprecated in recent dplyr releases; as_tibble(mydf) is the current equivalent
cran <- tbl_df(mydf)
- Step 5: The main advantage of using tbl_df over a regular data frame is the printing. The output of tbl_df is much more informative and compact than what we would get if we printed the original data frame (mydf) to the console. (dplyr shows us the first 10 rows of data and only as many columns as fit neatly in our console. At the bottom, we see the names and classes of any variables that didn’t fit on our screen.)
cran
head(mydf) # use head() so the whole data frame is not printed
Next, we will focus on the five main manipulation verbs:
- `select()`
- `filter()`
- `arrange()`
- `mutate()`
- `summarize()`
- Step 6: Select three variables of cran. (With dplyr we don’t need to write cran$ip_id; the column names can be used directly, so the $ can be omitted.)
select(cran, ip_id, package, country)
- Step 7: Select a sequence of columns. (The : operator selects every column from r_arch through country.)
select(cran, r_arch:country)
- Step 8: Throw away one column. (The negative sign in front of time tells us we don’t want the time column.)
select(cran, -time)
- Step 9: throw away multiple columns.
select(cran, -(date:size))
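The select() variants in Steps 6 to 9 can be tried on a small stand-in table. This is a minimal sketch: the data frame `toy` and its values are hypothetical and only mimic the column order of the CRAN log.

```r
library(dplyr)

# Hypothetical three-row stand-in for the CRAN log; only the column
# names and their order matter for this example.
toy <- data.frame(
  date    = rep("2014-07-08", 3),
  time    = c("00:54:41", "00:59:53", "00:47:13"),
  size    = c(1000, 2000, 3000),
  package = c("pkgA", "pkgB", "pkgC"),
  country = c("US", "US", "IN"),
  ip_id   = 1:3
)

by_name  <- select(toy, ip_id, package, country)  # columns come back in the order given
by_range <- select(toy, size:country)             # every column from size through country
dropped  <- select(toy, -(date:size))             # everything except date, time and size

names(by_name)
names(by_range)
names(dropped)
```

Printing the names is a quick way to confirm which columns each call keeps before running it on the full data.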
- Step 10: Use the filter() function to select all rows for which the package variable is equal to “swirl”.
filter(cran, package == "swirl")
- Step 11: Filter on multiple conditions. (Note that the conditions are separated by commas; a row is kept only if all of them are true.)
filter(cran, r_version == "3.1.1", country == "US")
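- Step 12: Filter rows corresponding to users in “IN” running an R version that is less than or equal to “3.0.2”. (r_version is a character column, so <= compares the strings lexicographically rather than as version numbers.)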
filter(cran, country == "IN", r_version <= "3.0.2")
- Step 13: filter rows corresponding to users in “US” or “IN”.
filter(cran, country == "US" | country == "IN")
- Step 14: filter rows for which size is strictly greater than 100500 and r_os equals “linux-gnu”
filter(cran, size > 100500, r_os == "linux-gnu")
- Step 15: filter the rows for which the r_version is not missing.
filter(cran, !is.na(r_version))
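Two details of the filter() steps above are easy to miss: `%in%` can replace a chain of `|` comparisons, and comparing version strings with `<=` is lexicographic. The sketch below uses a hypothetical data frame `toy` (not the real log); numeric_version() is base R and is shown only as one possible way to compare versions properly.

```r
library(dplyr)

# Hypothetical stand-in data; values are made up for illustration.
toy <- data.frame(
  package   = c("swirl", "dplyr", "swirl", "tidyr"),
  country   = c("US", "IN", "DE", "US"),
  r_version = c("3.1.1", "3.0.2", NA, "3.10.0"),
  stringsAsFactors = FALSE
)

# %in% is a compact alternative to country == "US" | country == "IN"
us_or_in <- filter(toy, country %in% c("US", "IN"))

# Character comparison goes character by character, so "3.10.0"
# sorts before "3.2.0" even though it is the newer version.
lexicographic <- "3.10.0" <= "3.2.0"

# numeric_version() compares component by component instead.
by_component <- numeric_version("3.10.0") <= numeric_version("3.2.0")

# filter() keeps only rows where the condition is TRUE,
# so the row with a missing r_version is dropped as well.
old_r <- filter(toy, r_version <= "3.0.2")
```

For the 2014 log this distinction rarely matters because all versions have single-digit components, but it would for later releases.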
- Step 16: Sometimes we want to reorder the rows of a dataset according to the value of a particular variable. Reorder cran2 such that ip_id is in ascending order.
cran2 <- select(cran, size:ip_id)
arrange(cran2, ip_id)
- Step 17: Reorder cran2 such that ip_id is in descending order.
arrange(cran2, desc(ip_id))
- Step 18: reorder cran2 using multiple variables.
arrange(cran2, package, ip_id)
- Step 19: Reorder cran2 using this order: country (ascending), r_version (descending), ip_id (ascending).
arrange(cran2, country, desc(r_version), ip_id)
- Step 20: Add a column called size_mb that contains the download size in megabytes.
cran3 <- select(cran, ip_id, package, size)
cran3
mutate(cran3, size_mb = size / 2^20)
- Step 21: One very nice feature of mutate() is that you can use the value computed for your second column (size_mb) to create a third column (size_gb).
mutate(cran3, size_mb = size/2^20, size_gb = size_mb/2^10)
- Step 22: add a new variable “correct_size = size+1000”
mutate(cran3, correct_size = size + 1000)
- Step 23: summarize() collapses the dataset to a single row. Calculate the average download size.
summarize(cran, ave_bytes = mean(size))
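To see what mutate() and summarize() do with concrete numbers, here is a small sketch on a hypothetical three-row table `toy`. It also shows that mean() returns NA when a size is missing unless `na.rm = TRUE` is passed, which is worth checking on real data.

```r
library(dplyr)

# Hypothetical data: sizes chosen as exact multiples of 2^20 bytes.
toy <- data.frame(
  package = c("pkgA", "pkgB", "pkgC"),
  size    = c(1048576, 2097152, NA)
)

# size_gb is computed from size_mb, which is created in the same call.
with_mb <- mutate(toy, size_mb = size / 2^20, size_gb = size_mb / 2^10)

# One missing value makes the plain mean NA ...
avg_all <- summarize(toy, avg_bytes = mean(size))

# ... while na.rm = TRUE averages the known sizes only.
avg_known <- summarize(toy, avg_bytes = mean(size, na.rm = TRUE))
```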
Grouping and Chaining with dplyr package
The main idea behind grouping data is that you want to break up your dataset into groups of rows based on the values of one or more variables. The group_by() function is responsible for doing this.
- Step 1: Group cran by the package variable and store the result in a new variable called by_package. (At the top of the output, you’ll see ‘Groups: package’. Everything else looks the same, but now any operation we apply to the grouped data will take place on a per-package basis.)
by_package <- group_by(cran, package)
by_package
- Step 2: Compute mean(size) for each package in by_package.
summarize(by_package, mean(size))
- Step 3: Summarize more on by_package
# Compute four values, in the following order, from
# the grouped data:
#
# 1. count = n()
# 2. unique = n_distinct(ip_id)
# 3. countries = n_distinct(country)
# 4. avg_bytes = mean(size)
#
# A few things to be careful of:
#
# 1. Separate arguments by commas
# 2. Make sure you have a closing parenthesis
# 3. Check your spelling!
# 4. Store the result in pack_sum (for 'package summary')
#
# You should also take a look at ?n and ?n_distinct, so
# that you really understand what is going on.
# n() gives the number of rows in each group (here, downloads per package).
# n_distinct(ip_id) gives the number of distinct ip_id values in each group,
# i.e. the number of unique downloaders of each package.
pack_sum <- summarize(by_package, count = n(), unique = n_distinct(ip_id), countries = n_distinct(country), avg_bytes = mean(size))
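The difference between n() and n_distinct() is easiest to see on a table small enough to count by hand. A minimal sketch with hypothetical data (`toy` and its values are made up):

```r
library(dplyr)

# Hypothetical downloads: ip_id 1 downloads swirl twice.
toy <- data.frame(
  package = c("swirl", "swirl", "swirl", "dplyr", "dplyr"),
  ip_id   = c(1, 1, 2, 3, 4),
  country = c("US", "US", "IN", "US", "DE"),
  size    = c(100, 100, 200, 300, 500)
)

toy_sum <- toy %>%
  group_by(package) %>%
  summarize(count     = n(),                  # rows per package
            unique    = n_distinct(ip_id),    # distinct downloaders per package
            countries = n_distinct(country),  # distinct countries per package
            avg_bytes = mean(size))

toy_sum
```

Here swirl has three downloads but only two distinct ip_id values, which is exactly the gap between the count and unique columns.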
- Step 4: determine the 0.99 quantile.
quantile(pack_sum$count, probs = 0.99)
- Step 5: Filter rows with count larger than the 0.99 quantile (679 in this data set).
top_counts <- filter(pack_sum, count > 679)
top_counts
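The cutoff 679 is specific to one day's log. A possible variation, sketched here on a hypothetical summary table `pack_sum_toy`, is to compute the quantile inside filter() so the threshold follows the data.

```r
library(dplyr)

# Hypothetical package summary with counts 1, 2, ..., 200.
pack_sum_toy <- data.frame(
  package = paste0("pkg", 1:200),
  count   = 1:200
)

# The 0.99 quantile is recomputed from the count column itself,
# so no hard-coded number is needed.
top_counts_toy <- filter(pack_sum_toy, count > quantile(count, probs = 0.99))

top_counts_toy
```

With the default quantile type, the 0.99 quantile of 1:200 is 198.01, so only the two largest counts pass the filter.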
- Step 6: View all rows, since dplyr only shows us the first 10 rows.
View(top_counts)
- Step 7: Arrange the rows of top_counts by “count” in descending order and assign the result to top_counts_sorted.
top_counts_sorted <- arrange(top_counts, desc(count))
- Step 8: View top_counts_sorted.
View(top_counts_sorted)
- Step 9: Find the 0.99 quantile for the “unique” variable.
quantile(pack_sum$unique, probs = 0.99)
- Step 10: filter rows with unique > 465
top_unique <- filter(pack_sum, unique > 465)
- Step 11: View the result
View(top_unique)
- Step 12: Arrange top_unique by unique in descending order.
top_unique_sorted <- arrange(top_unique, desc(unique))
- Step 13: View the sorted data
View(top_unique_sorted)
- Step 14: Chaining allows you to string together multiple function calls in a way that is compact and readable, while still accomplishing the desired result. To make it more concrete, let’s compute our last popularity metric from scratch, starting with our original data.
# We've already done this part, but we're repeating it
# here for clarity.
by_package <- group_by(cran, package)
pack_sum <- summarize(by_package,
                      count = n(),
                      unique = n_distinct(ip_id),
                      countries = n_distinct(country),
                      avg_bytes = mean(size))
# Here's the new bit, but using the same approach we've
# been using this whole time.
top_countries <- filter(pack_sum, countries > 60)
result1 <- arrange(top_countries, desc(countries), avg_bytes)
# Print the results to the console.
print(result1)
This is equivalent to the nested form:
# The same steps written as nested function calls. If you find it
# confusing, you're absolutely right!
result2 <-
  arrange(
    filter(
      summarize(
        group_by(cran,
                 package
        ),
        count = n(),
        unique = n_distinct(ip_id),
        countries = n_distinct(country),
        avg_bytes = mean(size)
      ),
      countries > 60
    ),
    desc(countries),
    avg_bytes
  )
print(result2)
This is equivalent to the chained form:
# The same steps written as a chain. As you read it, you
# can pronounce the %>% operator as the word 'then'.
result3 <-
  cran %>%
  group_by(package) %>%
  summarize(count = n(),
            unique = n_distinct(ip_id),
            countries = n_distinct(country),
            avg_bytes = mean(size)
  ) %>%
  filter(countries > 60) %>%
  arrange(desc(countries), avg_bytes)
# Print result to console
print(result3)
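That the nested and chained forms really give the same answer can be checked on a small table. This is a sketch with hypothetical data (`toy`) and a lower threshold (more than one country) so that some rows survive the filter.

```r
library(dplyr)

# Hypothetical downloads for three packages.
toy <- data.frame(
  package = c("a", "a", "a", "b", "b", "c", "c"),
  country = c("US", "IN", "DE", "US", "US", "US", "IN"),
  size    = c(10, 20, 30, 5, 5, 100, 200)
)

# Nested form: read from the inside out.
nested <- arrange(
  filter(
    summarize(group_by(toy, package),
              countries = n_distinct(country),
              avg_bytes = mean(size)),
    countries > 1
  ),
  desc(countries), avg_bytes
)

# Chained form: read top to bottom, pronouncing %>% as 'then'.
chained <- toy %>%
  group_by(package) %>%
  summarize(countries = n_distinct(country),
            avg_bytes = mean(size)) %>%
  filter(countries > 1) %>%
  arrange(desc(countries), avg_bytes)

# Both forms apply the same operations in the same order.
same_result <- identical(as.data.frame(nested), as.data.frame(chained))
```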
Tidying Data with tidyr package
Tidy data is formatted in a standard way that facilitates exploration and analysis and works seamlessly with other tidy data tools. Specifically, tidy data satisfies three conditions:
- Each variable forms a column.
- Each observation forms a row.
- Each type of observational unit forms a table.
The five types of messy data are:
- Column headers are values, not variable names.
- Multiple variables are stored in one column.
- Variables are stored in both rows and columns.
- Multiple types of observational units are stored in the same table.
- A single observational unit is stored in multiple tables.
Type one: column headers are values not variable names
- Step 1: Create a data frame “students” that satisfies the type one condition. The students dataset actually has three variables: grade, sex and count.
students <- data.frame(grade = toupper(letters[1:5]), male = c(1, 5, 5, 5, 7), female = c(5, 0, 2, 5, 4))
- Step 2: To tidy the students data, we need to have one column for each of these three variables. (gather() in the tidyr package takes multiple columns and collapses them into key-value pairs, duplicating all other columns as needed. You use gather() when you notice that you have columns that are not variables.) (Note the minus sign before grade, which says we want to gather all columns except grade.)
library(tidyr)
gather(students, sex, count, -grade)
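In tidyr 1.0.0 and later, gather() is superseded by pivot_longer(). The sketch below, which assumes such a tidyr version is installed, reshapes the same students table both ways; the two results contain the same rows, only in a different order.

```r
library(tidyr)

# Same students table as in Step 1.
students <- data.frame(grade  = LETTERS[1:5],
                       male   = c(1, 5, 5, 5, 7),
                       female = c(5, 0, 2, 5, 4))

# gather(): key column "sex", value column "count", everything except grade.
long_gather <- gather(students, sex, count, -grade)

# pivot_longer(): the same reshape with named arguments
# (assumes tidyr >= 1.0.0).
long_pivot <- pivot_longer(students, cols = -grade,
                           names_to = "sex", values_to = "count")

long_gather
long_pivot
```

gather() stacks the male column on top of the female column, while pivot_longer() alternates male and female within each grade.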
Type two: multiple variables are stored in one column.
- Step 1: Create data set “students2” to satisfy the type two condition. (The suffixes 1 and 2 denote the class.)
students2 <- data.frame(grade = LETTERS[1:5], male_1 = c(3, 6, 7, 4, 1), female_1 = c(4, 4, 4, 0, 1), male_2 = c(3, 3, 3, 8, 2), female_2 = c(4, 5, 8, 1, 7))
- Step 2: Use gather() to stack the columns of students2, like we just did with students.
res <- gather(students2, sex_class, count, - grade)
- Step 3: The separate() function splits one column into multiple columns.
separate(data = res, col = sex_class, into = c("sex", "class"))
- Step 4: Use the chaining operator %>% to do Steps 2-3 in one pipeline.
library(dplyr)
students2 %>%
gather(sex_class, count, - grade) %>%
separate(col = sex_class, into = c("sex", "class")) %>%
print
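By default separate() splits on any run of non-alphanumeric characters, which is why the underscore works without being named. A minimal sketch on a hypothetical three-row table `res_toy` makes the separator explicit with `sep` and uses `convert = TRUE` so that class becomes an integer rather than a character column.

```r
library(tidyr)

# Hypothetical gathered data in the same shape as res.
res_toy <- data.frame(
  grade     = c("A", "A", "B"),
  sex_class = c("male_1", "female_2", "male_2"),
  count     = c(3, 4, 6)
)

# sep = "_" states the delimiter explicitly;
# convert = TRUE turns "1" and "2" into integers.
tidy_toy <- separate(res_toy, col = sex_class, into = c("sex", "class"),
                     sep = "_", convert = TRUE)

str(tidy_toy)
```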
Type three: Variables are stored in both rows and columns.
- Step 1: Create dataset ‘students3’ satisfying type three condition.
students3 <- data.frame(name = rep(c("Sally", "Jeff", "Roger", "Karen", "Brian"), each = 2), test = rep(c("midterm", "final"), 5), class1 = c("A", "C", rep(NA, 6), "B", "B"), class2 = c(rep(NA, 2), "D", "E", "C", "A", rep(NA, 4)), class3 = c("B", "C", rep(NA, 4), rep("C", 2), rep(NA, 2)), class4 = c(rep(NA, 2), "A", "C", rep(c(NA, "A", NA), each = 2)), class5 = c(rep(NA, 4), "B", "A", rep(NA, 2), "A", "C"))
# Convert the five class columns to factors that share the levels A to E.
# factor() maps each grade to its level by value; assigning attributes directly
# fails on character columns and can mislabel existing factor codes.
class_cols <- paste0("class", 1:5)
students3[class_cols] <- lapply(students3[class_cols], factor, levels = LETTERS[1:5])
- Step 2: Use the gather() function to gather the class columns.
students3 %>%
gather(class, grade, class1:class5, na.rm = TRUE) %>%
print
- Step 3: Use the spread() function to spread the test column. (spread() spreads a key-value pair across multiple columns.)
students3 %>%
gather(class, grade, class1:class5, na.rm = TRUE) %>%
spread(test, grade) %>%
print
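- Step 4: Convert the class column to the numbers 1 to 5. (extract_numeric() is deprecated in newer tidyr versions, which point to readr::parse_number() as the replacement.)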
students3 %>%
gather(class, grade, class1:class5, na.rm = TRUE) %>%
spread(test, grade) %>%
mutate(class = extract_numeric(class)) %>%
print
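If extract_numeric() is not available in the installed tidyr, the same conversion can be sketched in base R by stripping every non-digit character. The vector `class_labels` below is a hypothetical stand-in for the class column.

```r
# Hypothetical labels in the same form as the class column.
class_labels <- c("class1", "class3", "class5")

# Remove everything that is not a digit, then convert to integer.
class_numbers <- as.integer(gsub("[^0-9]", "", class_labels))

class_numbers
```

Inside the pipeline this would take the place of the extract_numeric() call, for example as `mutate(class = as.integer(gsub("[^0-9]", "", class)))`.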
Type four: multiple types of observational units are stored in the same table.
- Step 1: Create a dataset satisfying type four condition.
students4 <- data.frame(id = rep(c(168, 588, 710, 731, 908), each = 2), name = rep(c("Brian", "Sally", "Jeff", "Roger", "Karen"), each = 2), sex = c(rep(c("F", "M"), each = 2), rep(c("M", "F", "M"), each = 2)), class = c(1, 5, 1, 3, 2, 4, 2, 5, 3, 4), midterm = c("B", "A", "A", "B", "D", "A", "C", "B", "C", "A"), final = c("B", "C", "C", "C", "E", "C", "A", "A", "C", "A"))
- Step 2: At first glance, there doesn’t seem to be much of a problem with students4. All columns are variables and all rows are observations. However, notice that each id, name, and sex is repeated twice, which seems quite redundant. This is a hint that our data contains multiple observational units in a single table. Our solution will be to break students4 into two separate tables: one containing basic student information (id, name, and sex) and the other containing grades (id, class, midterm, final).
student_info <- students4 %>%
select(id, name, sex) %>%
unique %>%
print
gradebook <- students4 %>%
select(id, class, midterm, final) %>%
print
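Splitting the table loses nothing as long as id links the two parts. The sketch below uses a hypothetical four-row excerpt (`students4_toy`) and dplyr's left_join() to rebuild the original columns from the two tables.

```r
library(dplyr)

# Hypothetical four-row excerpt in the shape of students4.
students4_toy <- data.frame(
  id      = c(168, 168, 588, 588),
  name    = c("Brian", "Brian", "Sally", "Sally"),
  sex     = c("F", "F", "M", "M"),
  class   = c(1, 5, 1, 3),
  midterm = c("B", "A", "A", "B"),
  final   = c("B", "C", "C", "C")
)

# One row per student.
student_info_toy <- students4_toy %>%
  select(id, name, sex) %>%
  distinct()

# One row per student and class.
gradebook_toy <- students4_toy %>%
  select(id, class, midterm, final)

# Joining on id restores name and sex for every grade row.
rejoined <- left_join(gradebook_toy, student_info_toy, by = "id")

rejoined
```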
Type five: a single observational unit is stored in multiple tables.
- Step 1: Create dataset satisfying type five condition.
# passed
passed <- data.frame(name = c("Brian", "Roger", "Roger", "Karen"), class = c(1, 2, 5, 4), final = c("B", "A", "A", "A"), stringsAsFactors = FALSE)
# factor() maps each grade to its level by value
passed$final <- factor(passed$final, levels = LETTERS[1:5])
# failed
failed <- data.frame(name = c("Brian", "Sally", "Sally", "Jeff", "Jeff", "Karen"), class = c(5, 1, 3, 2, 4, 3), final = c("C", "C", "C", "E", "C", "C"), stringsAsFactors = FALSE)
failed$final <- factor(failed$final, levels = LETTERS[1:5])
- Step 2: Use mutate() to add a status column to each dataset.
passed <- mutate(passed, status = "passed")
failed <- mutate(failed, status = "failed")
- Step 3: bind passed and failed datasets.
bind_rows(passed, failed)
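Steps 2 and 3 can also be combined: when bind_rows() receives named arguments together with `.id`, it creates the label column itself. A minimal sketch with hypothetical tables `passed_toy` and `failed_toy`:

```r
library(dplyr)

# Hypothetical excerpts of the two tables.
passed_toy <- data.frame(name = c("Brian", "Roger"), class = c(1, 2),
                         final = c("B", "A"))
failed_toy <- data.frame(name = "Sally", class = 1, final = "C")

# The argument names become the values of the new status column.
combined <- bind_rows(passed = passed_toy, failed = failed_toy, .id = "status")

combined
```

This avoids the two separate mutate() calls, at the cost of status becoming the first column rather than the last.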
Bringing the five types together to deal with real data.
- Step 1: Create data set “sat”.
sat <- data.frame(score_range = c("700-800", "600-690", "500-590", "400-490", "300-390", "200-290"),
read_male = c(40151, 121950, 227141, 241554, 113568, 30728),
read_fem = c(38898, 126084, 259553, 296793, 133473, 29154),
read_total = c(79049, 24803, 486694, 539347, 247041, 59882),
math_male = c(74461, 162564, 233141, 204670, 82468, 18788),
math_fem = c(46040, 133954, 257678, 288696, 131025, 16562),
math_total = c(120501, 196518, 490819, 493366, 213493, 45350),
write_male = c(31574, 100963, 202326, 262623, 146106, 32500),
write_fem = c(39101, 125368, 247239, 302933, 144381, 24933),
write_total = c(60675, 226331, 449565, 565556, 2904787, 57433))
sat <- tbl_df(sat)
- Step 2: Drop the total columns, then gather and separate the remaining columns.
# Accomplish the following three goals:
#
# 1. select() all columns that do NOT contain the word "total",
# since if we have the male and female data, we can always
# recreate the total count in a separate column, if we want it.
# Hint: Use the contains() function, which you'll
# find detailed in 'Special functions' section of ?select.
#
# 2. gather() all columns EXCEPT score_range, using
# key = part_sex and value = count.
#
# 3. separate() part_sex into two separate variables (columns),
# called "part" and "sex", respectively. You may need to check
# the 'Examples' section of ?separate to remember how the 'into'
# argument should be phrased.
#
sat %>%
select(-contains("total")) %>%
gather(part_sex, count, -score_range) %>%
separate(col = part_sex, into = c("part", "sex")) %>%
print
- Step 3: Use the group_by() function to group the data by part and sex, and the mutate() function to add two new columns whose values are automatically computed group by group.
# Append two more function calls to accomplish the following:
#
# 1. Use group_by() (from dplyr) to group the data by part and
# sex, in that order.
#
# 2. Use mutate to add two new columns, whose values will be
# automatically computed group-by-group:
#
# * total = sum(count)
# * prop = count / total
#
sat %>%
select(-contains("total")) %>%
gather(part_sex, count, -score_range) %>%
separate(part_sex, c("part", "sex")) %>%
group_by(part, sex) %>%
mutate(total = sum(count),
prop = count/ total) %>%
print
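A quick sanity check on a grouped mutate() is that the proportions add up to 1 within every group. The sketch below uses a hypothetical long table `sat_toy` in the shape produced by the gather() and separate() steps.

```r
library(dplyr)

# Hypothetical long-format data: two score ranges per part/sex group.
sat_toy <- data.frame(
  score_range = rep(c("700-800", "600-690"), times = 2),
  part        = c("read", "read", "math", "math"),
  sex         = c("male", "male", "male", "male"),
  count       = c(10, 30, 20, 20)
)

with_prop <- sat_toy %>%
  group_by(part, sex) %>%
  mutate(total = sum(count),      # sum within each part/sex group
         prop  = count / total)   # share of the group total

# Each group's proportions should sum to 1.
check <- with_prop %>%
  summarize(prop_sum = sum(prop))

check
```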
Dates and Times with lubridate
- Step 1: Load the package and work with dates.
lct <- Sys.getlocale("LC_TIME")
Sys.setlocale("LC_TIME", "C")
library(lubridate)
this_day <- today()
this_day
year(this_day)
month(this_day)
day(this_day)
wday(this_day)
wday(this_day, label = TRUE)
- Step 2: deal with date and time combinations.
this_moment <- now()
this_moment
hour(this_moment)
minute(this_moment)
second(this_moment)
- Step 3: Parse dates and times with ymd(), dmy(), hms(), ymd_hms(), etc., and modify them with update().
my_date <- ymd("1989-05-17")
my_date
class(my_date)
# update() returns a modified copy; this_moment itself is unchanged until reassigned
update(this_moment, hours = 8, minutes = 34, seconds = 55)
this_moment
this_moment <- update(this_moment, hours = 15, minutes = 54)
this_moment
- Step 4: Work with different time zones.
nyc <- now("America/New_York")
depart <- nyc + days(2)
depart
depart <- update(depart, hours = 17, minutes = 34)
arrive <- depart + hours(15) + minutes(50)
arrive <- with_tz(arrive, "Asia/Hong_Kong")
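with_tz() changes only how an instant is displayed, whereas lubridate's force_tz() keeps the clock reading and changes the instant. The sketch below uses a fixed, hypothetical departure time so the result is reproducible.

```r
library(lubridate)

# Hypothetical fixed departure time (noon in New York, summer time).
depart_toy <- ymd_hms("2014-07-08 12:00:00", tz = "America/New_York")

# with_tz(): the same instant, shown on a Hong Kong clock.
shown_hk <- with_tz(depart_toy, "Asia/Hong_Kong")

# force_tz(): the same clock reading, reinterpreted as Hong Kong time,
# which is a different instant.
forced_hk <- force_tz(depart_toy, "Asia/Hong_Kong")

same_instant <- shown_hk == depart_toy
gap_hours <- as.numeric(difftime(depart_toy, forced_hk, units = "hours"))

hour(shown_hk)
gap_hours
```

For the flight example, with_tz() is the right choice because the arrival is a real instant that only needs to be read off a local clock.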
- Step 5: Time difference.
last_time <- mdy("June 17, 2008", tz = "Singapore")
how_long <- interval(last_time, arrive)
how_long
as.period(how_long)
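Because `arrive` depends on the current time, the output of Step 5 changes on every run. A reproducible sketch with two fixed dates (the end date is hypothetical) shows what interval(), as.period() and time_length() return.

```r
library(lubridate)

# Fixed start and a hypothetical fixed end, both in UTC.
last_time_toy <- ymd("2008-06-17", tz = "UTC")
arrive_toy    <- ymd("2014-07-08", tz = "UTC")

# An interval is anchored to its two endpoints.
how_long_toy <- interval(last_time_toy, arrive_toy)

# as.period() expresses it in calendar units (years, months, days, ...).
as_period <- as.period(how_long_toy)

# time_length() gives the length as a single number in the chosen unit.
in_years <- time_length(how_long_toy, unit = "year")

as_period
in_years
```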