Lab 2: Data Manipulation and Visualization
In Lab 1, we briefly introduced what packages are in R and one specific package, tidyverse. If you wish to learn more about tidyverse, click here for more information. Lab 2 will focus on two packages that are included in tidyverse:
dplyr for data manipulation
ggplot2 for data visualization
But first, remember to load the package.
# install.packages("tidyverse") # install if needed
library(tidyverse)
1 Data Manipulation using dplyr
1.1 What is a Tidy Data Set?
Tidy data is a standard way of mapping the meaning of a dataset to its structure. A dataset is messy or tidy depending on how rows, columns and tables are matched up with observations, variables and types. Three rules make a dataset tidy:
Each variable must have its own column
Each observation must have its own row
Each value must have its own cell
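As a rough illustration of the three rules, the sketch below reshapes a small "messy" table into a tidy one. The data frame messy and its values are hypothetical and not part of the lab data; pivot_longer() comes from tidyr, which is loaded with tidyverse.

```r
library(tidyr)

# Hypothetical messy table: the variable "year" is hidden in the column names
messy <- data.frame(
  name = c("Henry", "Larry"),
  return_2022 = c(40, 90),
  return_2023 = c(45, 85)
)

# Move the two year columns into one "year" column and one "return" column,
# so that each row is one farmer-year observation
tidy <- pivot_longer(
  messy,
  cols = c(return_2022, return_2023),
  names_to = "year",
  names_prefix = "return_",
  values_to = "return"
)
tidy
```

After reshaping, each variable (name, year, return) has its own column and each farmer-year observation has its own row.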
1.2 Create a Farm Business Data Set
# farmers' info
name <- c("Henry", "Larry", "Alex", "Gaby", "Amy", "Ruby")
sex <- c("male", "male", "male", "female", "female", "female")
age <- c(43, 60, 25, 50, 28, 58)
# types of farm
type <- c("crop", "livestock", "urban", "dairy", "crop", "livestock")
# size of farm in acres
size <- c(550, 800, 10, 600, 1000, 700)
# net annual cash return from ag businesses, in $1000
return <- c(40, 90, 50, 90, 90, 95)
# combine the variables together as a data frame
farm <- data.frame(name, age, sex, type, size, return)
farm
name | age | sex | type | size | return |
---|---|---|---|---|---|
Henry | 43 | male | crop | 550 | 40 |
Larry | 60 | male | livestock | 800 | 90 |
Alex | 25 | male | urban | 10 | 50 |
Gaby | 50 | female | dairy | 600 | 90 |
Amy | 28 | female | crop | 1000 | 90 |
Ruby | 58 | female | livestock | 700 | 95 |
# glimpse the data set
glimpse(farm)
Rows: 6
Columns: 6
$ name <chr> "Henry", "Larry", "Alex", "Gaby", "Amy", "Ruby"
$ age <dbl> 43, 60, 25, 50, 28, 58
$ sex <chr> "male", "male", "male", "female", "female", "female"
$ type <chr> "crop", "livestock", "urban", "dairy", "crop", "livestock"
$ size <dbl> 550, 800, 10, 600, 1000, 700
$ return <dbl> 40, 90, 50, 90, 90, 95
1.3 Important Functions in dplyr
The six important functions in dplyr are:
select(): pick variables by their names
filter(): pick observations by their values
arrange(): reorder the rows
mutate(): create new variables with functions of existing variables
summarize(): collapse many values down to a single summary
group_by(): groups data by one or more variables, allowing subsequent operations to be applied independently to each group
Combined with the pipe operator %>%, dplyr can make data manipulation simple and intuitive.
You can always type “?FUNCTION_NAME” in the Console pane to check the R Documentation for the function. Try ?select.
1.3.1 select()
select() allows you to focus on the variables you’re interested in.
select(farm, c(type, size, return)) # select farm type, size and return
type | size | return |
---|---|---|
crop | 550 | 40 |
livestock | 800 | 90 |
urban | 10 | 50 |
dairy | 600 | 90 |
crop | 1000 | 90 |
livestock | 700 | 95 |
select(farm, sex:size) # select everything between sex and size
sex | type | size |
---|---|---|
male | crop | 550 |
male | livestock | 800 |
male | urban | 10 |
female | dairy | 600 |
female | crop | 1000 |
female | livestock | 700 |
select(farm, -name) # select everything but names
age | sex | type | size | return |
---|---|---|---|---|
43 | male | crop | 550 | 40 |
60 | male | livestock | 800 | 90 |
25 | male | urban | 10 | 50 |
50 | female | dairy | 600 | 90 |
28 | female | crop | 1000 | 90 |
58 | female | livestock | 700 | 95 |
1.3.2 filter()
filter() allows you to subset observations based on their values.
filter(farm, size > 500) # select farms with size > 500
name | age | sex | type | size | return |
---|---|---|---|---|---|
Henry | 43 | male | crop | 550 | 40 |
Larry | 60 | male | livestock | 800 | 90 |
Gaby | 50 | female | dairy | 600 | 90 |
Amy | 28 | female | crop | 1000 | 90 |
Ruby | 58 | female | livestock | 700 | 95 |
filter(farm, size > 500 & sex == "female") # select farms with size > 500 AND owned by female farmers
name | age | sex | type | size | return |
---|---|---|---|---|---|
Gaby | 50 | female | dairy | 600 | 90 |
Amy | 28 | female | crop | 1000 | 90 |
Ruby | 58 | female | livestock | 700 | 95 |
1.3.3 arrange()
arrange() orders the observations by one or more variables. Basically, it changes the order of rows.
arrange(farm, size) # order the data set by farm size, by default, in ascending order
name | age | sex | type | size | return |
---|---|---|---|---|---|
Alex | 25 | male | urban | 10 | 50 |
Henry | 43 | male | crop | 550 | 40 |
Gaby | 50 | female | dairy | 600 | 90 |
Ruby | 58 | female | livestock | 700 | 95 |
Larry | 60 | male | livestock | 800 | 90 |
Amy | 28 | female | crop | 1000 | 90 |
arrange(farm, desc(size)) # change the ordering to descending
name | age | sex | type | size | return |
---|---|---|---|---|---|
Amy | 28 | female | crop | 1000 | 90 |
Larry | 60 | male | livestock | 800 | 90 |
Ruby | 58 | female | livestock | 700 | 95 |
Gaby | 50 | female | dairy | 600 | 90 |
Henry | 43 | male | crop | 550 | 40 |
Alex | 25 | male | urban | 10 | 50 |
1.3.4 mutate()
mutate() modifies existing variables or adds new variables.
mutate(farm, return = return * 1000)
name | age | sex | type | size | return |
---|---|---|---|---|---|
Henry | 43 | male | crop | 550 | 40000 |
Larry | 60 | male | livestock | 800 | 90000 |
Alex | 25 | male | urban | 10 | 50000 |
Gaby | 50 | female | dairy | 600 | 90000 |
Amy | 28 | female | crop | 1000 | 90000 |
Ruby | 58 | female | livestock | 700 | 95000 |
mutate(farm, age.sq = age ^ 2)
name | age | sex | type | size | return | age.sq |
---|---|---|---|---|---|---|
Henry | 43 | male | crop | 550 | 40 | 1849 |
Larry | 60 | male | livestock | 800 | 90 | 3600 |
Alex | 25 | male | urban | 10 | 50 | 625 |
Gaby | 50 | female | dairy | 600 | 90 | 2500 |
Amy | 28 | female | crop | 1000 | 90 | 784 |
Ruby | 58 | female | livestock | 700 | 95 | 3364 |
mutate(farm, per.acre.return = return / size)
name | age | sex | type | size | return | per.acre.return |
---|---|---|---|---|---|---|
Henry | 43 | male | crop | 550 | 40 | 0.0727273 |
Larry | 60 | male | livestock | 800 | 90 | 0.1125000 |
Alex | 25 | male | urban | 10 | 50 | 5.0000000 |
Gaby | 50 | female | dairy | 600 | 90 | 0.1500000 |
Amy | 28 | female | crop | 1000 | 90 | 0.0900000 |
Ruby | 58 | female | livestock | 700 | 95 | 0.1357143 |
# Or, you can do all three in one step
mutate(farm,
return = return * 1000,
age.sq = age ^ 2,
per.acre.return = return / size
)
name | age | sex | type | size | return | age.sq | per.acre.return |
---|---|---|---|---|---|---|---|
Henry | 43 | male | crop | 550 | 40000 | 1849 | 72.72727 |
Larry | 60 | male | livestock | 800 | 90000 | 3600 | 112.50000 |
Alex | 25 | male | urban | 10 | 50000 | 625 | 5000.00000 |
Gaby | 50 | female | dairy | 600 | 90000 | 2500 | 150.00000 |
Amy | 28 | female | crop | 1000 | 90000 | 784 | 90.00000 |
Ruby | 58 | female | livestock | 700 | 95000 | 3364 | 135.71429 |
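Note that per.acre.return in the one-step version differs from the earlier separate call by a factor of 1000. A minimal sketch of why, assuming a small stand-in data frame (farm_demo is hypothetical): mutate() evaluates its arguments in order, so a later expression sees columns already modified by an earlier one.

```r
library(dplyr)

# Hypothetical two-row stand-in for the farm data
farm_demo <- data.frame(size = c(550, 800), return = c(40, 90))

# return is rescaled first, so per.acre.return is in dollars per acre
rescaled_first <- mutate(farm_demo,
                         return = return * 1000,
                         per.acre.return = return / size)

# the ratio is computed first, so per.acre.return is in $1000 per acre
ratio_first <- mutate(farm_demo,
                      per.acre.return = return / size,
                      return = return * 1000)

rescaled_first
ratio_first
```

If the units matter, placing the ratio before or after the rescaling is a deliberate choice rather than a detail.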
# change the classes of variables
glimpse(farm) # view the data before changes
Rows: 6
Columns: 6
$ name <chr> "Henry", "Larry", "Alex", "Gaby", "Amy", "Ruby"
$ age <dbl> 43, 60, 25, 50, 28, 58
$ sex <chr> "male", "male", "male", "female", "female", "female"
$ type <chr> "crop", "livestock", "urban", "dairy", "crop", "livestock"
$ size <dbl> 550, 800, 10, 600, 1000, 700
$ return <dbl> 40, 90, 50, 90, 90, 95
farm2 <- mutate(farm,
  sex = as.factor(sex),
  type = as.factor(type),
  age = as.integer(age)
)
glimpse(farm2) # view the data after changes
Rows: 6
Columns: 6
$ name <chr> "Henry", "Larry", "Alex", "Gaby", "Amy", "Ruby"
$ age <int> 43, 60, 25, 50, 28, 58
$ sex <fct> male, male, male, female, female, female
$ type <fct> crop, livestock, urban, dairy, crop, livestock
$ size <dbl> 550, 800, 10, 600, 1000, 700
$ return <dbl> 40, 90, 50, 90, 90, 95
The function ifelse() is often used in data manipulation. It assigns values to a variable based on whether a condition is satisfied.
mutate(farm,
size2 = ifelse(size > 600, "big", "small"),
dummy_urban = ifelse(type == "urban", 1, 0) # when testing for equality, use double ==
)
name | age | sex | type | size | return | size2 | dummy_urban |
---|---|---|---|---|---|---|---|
Henry | 43 | male | crop | 550 | 40 | small | 0 |
Larry | 60 | male | livestock | 800 | 90 | big | 0 |
Alex | 25 | male | urban | 10 | 50 | small | 1 |
Gaby | 50 | female | dairy | 600 | 90 | small | 0 |
Amy | 28 | female | crop | 1000 | 90 | big | 0 |
Ruby | 58 | female | livestock | 700 | 95 | big | 0 |
1.3.5 summarize()
summarize() computes summary statistics. If there are no grouping variables, it always returns a single row.
summarize(farm, tot.return = sum(return))
tot.return |
---|
455 |
summarize(farm, avg.return = mean(return))
avg.return |
---|
75.83333 |
summarize(farm,
youngest = min(age),
oldest = max(age),
median = median(age),
cor.size.return = cor(size, return))
youngest | oldest | median | cor.size.return |
---|---|---|---|
25 | 60 | 46.5 | 0.6787267 |
We often wish to compute summary statistics by group, e.g. average return by gender. Therefore, summarize() is usually combined with group_by() and the pipe operator %>%.
1.3.6 group_by() and %>%
1.3.6.1 group_by()
group_by() groups data by the named variables. On its own, group_by() does not change any values or the order of the rows; it only records the grouping, which subsequent operations then use.
group_by(farm, sex)
name | age | sex | type | size | return |
---|---|---|---|---|---|
Henry | 43 | male | crop | 550 | 40 |
Larry | 60 | male | livestock | 800 | 90 |
Alex | 25 | male | urban | 10 | 50 |
Gaby | 50 | female | dairy | 600 | 90 |
Amy | 28 | female | crop | 1000 | 90 |
Ruby | 58 | female | livestock | 700 | 95 |
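The printed rows look the same as the ungrouped data because the grouping is stored as metadata. A small sketch for inspecting it, assuming a hypothetical stand-in data frame farm_demo; group_vars(), n_groups() and ungroup() are dplyr functions.

```r
library(dplyr)

# Hypothetical stand-in for the farm data
farm_demo <- data.frame(
  name = c("Henry", "Gaby", "Larry", "Amy"),
  sex  = c("male", "female", "male", "female"),
  size = c(550, 600, 800, 1000)
)

grouped <- group_by(farm_demo, sex)

group_vars(grouped)  # names of the grouping variables
n_groups(grouped)    # number of groups
grouped$name         # same row order as farm_demo

ungrouped <- ungroup(grouped)  # drop the grouping again
group_vars(ungrouped)
```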
1.3.6.2 %>%
However, the main purpose of group_by() is to group your data so that the following operations are performed by group. To achieve this, you will also need the pipe operator %>%. Functioning like a pipe, %>% uses the output of one function as the input to the next function.
Suppose you wish to perform the steps below, based on the data farm:
- calculate the return per acre, called “per.acre.return”
- keep only the farms that are owned by farmers above 40 years old
- create a new data frame that contains only the names and ages of the farmers, and the return per acre
### without %>%
farm_wo_pipe1 <- mutate(farm, per.acre.return = return / size)
farm_wo_pipe2 <- filter(farm_wo_pipe1, age > 40)
farm_wo_pipe3 <- select(farm_wo_pipe2, c(name, age, per.acre.return))
farm_wo_pipe3
name | age | per.acre.return |
---|---|---|
Henry | 43 | 0.0727273 |
Larry | 60 | 0.1125000 |
Gaby | 50 | 0.1500000 |
Ruby | 58 | 0.1357143 |
### with %>%
farm_w_pipe <- farm %>% mutate(per.acre.return = return / size) %>%
  filter(age > 40) %>%
  select(name, age, per.acre.return)
farm_w_pipe
name | age | per.acre.return |
---|---|---|
Henry | 43 | 0.0727273 |
Larry | 60 | 0.1125000 |
Gaby | 50 | 0.1500000 |
Ruby | 58 | 0.1357143 |
1.3.6.3 Combining group_by() with %>%
Now, let’s calculate summary statistics by groups, using group_by() with %>%.
farm %>%
  group_by(sex) %>%
  summarize(num.farmer = n(),
            youngest = min(age),
            oldest = max(age),
            tot.return = sum(return),
            avg.return = mean(return),
            avg.per.acre.return = mean(return/size),
            avg.size = mean(size))
sex | num.farmer | youngest | oldest | tot.return | avg.return | avg.per.acre.return | avg.size |
---|---|---|---|---|---|---|---|
female | 3 | 28 | 58 | 275 | 91.66667 | 0.1252381 | 766.6667 |
male | 3 | 25 | 60 | 180 | 60.00000 | 1.7284091 | 453.3333 |
1.3.7 Other Functions/Verbs
1.3.7.1 slice() and Its Variants
You can use slice() to select rows by position, or its variants:
slice_head() and slice_tail(): to select first/last rows
slice_min() and slice_max(): to select rows with minimum/maximum values
slice_sample(): to select random samples
farm
name | age | sex | type | size | return |
---|---|---|---|---|---|
Henry | 43 | male | crop | 550 | 40 |
Larry | 60 | male | livestock | 800 | 90 |
Alex | 25 | male | urban | 10 | 50 |
Gaby | 50 | female | dairy | 600 | 90 |
Amy | 28 | female | crop | 1000 | 90 |
Ruby | 58 | female | livestock | 700 | 95 |
farm %>% slice(3) # pick the observation in row 3
name | age | sex | type | size | return |
---|---|---|---|---|---|
Alex | 25 | male | urban | 10 | 50 |
farm %>% slice(1:3) # pick observations from row 1 through row 3
name | age | sex | type | size | return |
---|---|---|---|---|---|
Henry | 43 | male | crop | 550 | 40 |
Larry | 60 | male | livestock | 800 | 90 |
Alex | 25 | male | urban | 10 | 50 |
farm %>% slice_head(n = 3) # pick first 3 rows, slice_tail would pick the last 3 rows
name | age | sex | type | size | return |
---|---|---|---|---|---|
Henry | 43 | male | crop | 550 | 40 |
Larry | 60 | male | livestock | 800 | 90 |
Alex | 25 | male | urban | 10 | 50 |
farm %>% slice_min(age, n = 3) # pick 3 rows with the youngest ages, slice_max would pick 3 rows with the largest ages
name | age | sex | type | size | return |
---|---|---|---|---|---|
Alex | 25 | male | urban | 10 | 50 |
Amy | 28 | female | crop | 1000 | 90 |
Henry | 43 | male | crop | 550 | 40 |
farm %>% slice_sample(n = 3) # randomly pick 3 observations
name | age | sex | type | size | return |
---|---|---|---|---|---|
Gaby | 50 | female | dairy | 600 | 90 |
Amy | 28 | female | crop | 1000 | 90 |
Henry | 43 | male | crop | 550 | 40 |
farm %>% slice_sample(prop = 0.5) # randomly pick 50% of the data
name | age | sex | type | size | return |
---|---|---|---|---|---|
Larry | 60 | male | livestock | 800 | 90 |
Gaby | 50 | female | dairy | 600 | 90 |
Ruby | 58 | female | livestock | 700 | 95 |
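Because slice_sample() draws at random, the rows you get will generally differ from those shown above and from run to run. A minimal sketch of making a draw reproducible with base R's set.seed(); the seed value 123 is an arbitrary choice and farm_demo is a hypothetical stand-in.

```r
library(dplyr)

# Hypothetical stand-in for the farm data
farm_demo <- data.frame(name = c("Henry", "Larry", "Alex", "Gaby", "Amy", "Ruby"))

set.seed(123)  # fix the random number generator state
draw1 <- slice_sample(farm_demo, n = 3)

set.seed(123)  # resetting the same seed reproduces the same draw
draw2 <- slice_sample(farm_demo, n = 3)

identical(draw1, draw2)
```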
1.3.7.2 count()
count() counts the number of observations for each category.
count(farm) # count the number of observations
n |
---|
6 |
count(farm, type) # count observations per type of farm
type | n |
---|---|
crop | 2 |
dairy | 1 |
livestock | 2 |
urban | 1 |
count(farm, type, sort = TRUE) # sort = TRUE orders the result by count, largest first
type | n |
---|---|
crop | 2 |
livestock | 2 |
dairy | 1 |
urban | 1 |
count(farm, type, wt = return, sort = TRUE) # add argument for weight
type | n |
---|---|
livestock | 185 |
crop | 130 |
dairy | 90 |
urban | 50 |
1.4 Export and Import Data
This section introduces functions in base R allowing you to export your data for later usage or import your saved data. To learn more about import/export data, check out this link.
1.4.1 RData Format
### export
save(farm, file = "farm.Rdata") # save to the current working directory
# specify the file path if you wish to save to a different location
### import
load("farm.Rdata") # load from the current working directory
# specify the file path if your file is stored in a different location
1.4.2 csv Format
### export
write.csv(farm, "farm.csv")
### import
farm <- read.csv("farm.csv")
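One detail worth checking after a CSV round trip: by default, write.csv() also writes the row names, which read.csv() then reads back as an extra column. The sketch below, using a hypothetical two-row data frame and a temporary file, compares the default with row.names = FALSE.

```r
# Hypothetical stand-in data and a temporary file path
farm_demo <- data.frame(name = c("Henry", "Larry"), size = c(550, 800))
path <- tempfile(fileext = ".csv")

# default: row names are written as an unnamed first column
write.csv(farm_demo, path)
cols_default <- names(read.csv(path))
cols_default

# row.names = FALSE keeps only the original columns
write.csv(farm_demo, path, row.names = FALSE)
cols_clean <- names(read.csv(path))
cols_clean
```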
1.4.3 Other Format
If you are working with SPSS, Stata or SAS data files, haven is a good package for importing and exporting files of those formats.
A handy trick for importing data interactively, without specifying a path, is read.csv(file.choose()).
1.5 Useful Resources
1.5.1 dplyr Cheat Sheet
Click here for more information
1.5.2 R for Data Science
See Chapter 5 of R for Data Science, by Wickham, H., & Grolemund, G.
1.6 Exercise
Part A
Continuing with the farm business dataset “farm” used in Lab 2, work through the exercises below.
Create a new variable called size3 that meets the following criteria:
size3 = “small” if size <= 200
size3 = “medium” if 200 < size <= 600
size3 = “big” if size > 600
Finally, convert size3 to a factor variable that is ordinal from “small” to “big”.
Generate the following summary statistics, for each type of the farms:
the sum of all returns, called tot.return
the average returns, called avg.return
Finally, rearrange the data based on the value of avg.return, in the descending order.
Part B
Import data “mpg” and work through the coding below.
Drop the variables displ, drv and fl, then exclude cars that were manufactured by Hyundai and Pontiac.
Continue from your dataframe above and generate the summary statistics, for each manufacturer, model, and year:
the average of “cty”, called avg.cty
the average of “hwy”, called avg.hwy
the total number of cars produced, called tot.cars
2 Data Visualization using ggplot
library(tidyverse)
library(gapminder) # for additional data
library(patchwork) # optional, used to show graphs side by side
2.1 Introduction to ggplot2
ggplot2 is a plotting package that provides powerful commands to create graphs from data in a data frame. It offers a more programmatic interface for specifying which variables to plot, how they are displayed, and general visual properties. Therefore, we only need minimal changes if the underlying data change or if we decide to switch from a bar plot to a scatterplot. This helps in creating publication-quality plots with minimal adjustments and tweaking. Reference.
- The “gg” here refers to “grammar of graphics”.
- Every graph consists of one or more geometric layers.
For demonstration, we will first use the built-in data set mpg.
data(mpg)
head(mpg)
manufacturer | model | displ | year | cyl | trans | drv | cty | hwy | fl | class |
---|---|---|---|---|---|---|---|---|---|---|
audi | a4 | 1.8 | 1999 | 4 | auto(l5) | f | 18 | 29 | p | compact |
audi | a4 | 1.8 | 1999 | 4 | manual(m5) | f | 21 | 29 | p | compact |
audi | a4 | 2.0 | 2008 | 4 | manual(m6) | f | 20 | 31 | p | compact |
audi | a4 | 2.0 | 2008 | 4 | auto(av) | f | 21 | 30 | p | compact |
audi | a4 | 2.8 | 1999 | 6 | auto(l5) | f | 16 | 26 | p | compact |
audi | a4 | 2.8 | 1999 | 6 | manual(m5) | f | 18 | 26 | p | compact |
2.2 Layered Grammar of Graphics
For our illustration of functions in ggplot2 in Lab 2, the layered grammar of graphics follows the template below. We will go through them step by step in the following sections.
ggplot(data = <DATA>) +
<GEOM_FUNCTION>(mapping = aes(<MAPPINGS>)) +
<FACET_FUNCTION> +
<SCALE_FUNCTION> +
<LABS_FUNCTION> +
<THEME_FUNCTION>
2.3 Layers in ggplot2
2.3.1 Geometric Layers
2.3.1.1 Commonly Used geom Functions
Below is a list of commonly used geom functions. We will explore all of them in the rest of this section:
geom_point(): creates scatterplots
geom_line(): creates line plots
geom_bar(): creates bar charts of counts
geom_col(): creates bar charts of values
geom_boxplot(): shows distributions and outliers with boxplots
geom_smooth(): adds a fitted trend line
geom_jitter(): aids the visualization of points by adding “jitter” to their positions
# create a scatter plot
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
# add another layer
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy), color = "red") + # you can also request a specific color
geom_smooth(mapping = aes(x = displ, y = hwy))
The geom_xxx() functions can inherit both the data and aesthetic mappings from the top level of the plot, because the argument inherit.aes is TRUE by default (as specified in the R Documentation). As a result, you can simplify your code as follows:
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(color = "red") +
geom_smooth()
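Inheritance also determines which layers an aesthetic applies to. The sketch below is an illustration, not part of the original lab; the object names p_local and p_global are made up. It contrasts an aesthetic mapped inside a single layer with the same aesthetic mapped at the top level.

```r
library(ggplot2)

# color mapped only in geom_point(): points are colored by drv,
# while geom_smooth() inherits just x and y and fits one trend line
p_local <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(color = drv)) +
  geom_smooth()

# color mapped at the top level: both layers inherit it,
# so geom_smooth() fits one trend line per value of drv
p_global <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  geom_smooth()

p_local
p_global
```

A layer-level mapping is added to the inherited mappings, so only the aesthetics that differ from the top level need to be listed.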
2.3.1.2 Aesthetic Mapping
Recall our previous code. Aesthetics in a geom_xxx() call can be specified in two ways:
inside the aes() function, which maps variables to aesthetics to represent or enhance visual features.
outside the aes() function, which takes fixed values. This step is usually optional.
The general form is geom_xxx(aes(ARGUMENTS = variable, ...), ARGUMENTS = fixed values). Some commonly used aesthetics are:
x, y: define the variables for the x- and y-axes (must be inside aes()).
color: defines the color of lines and strokes.
fill: defines the color inside areas of geoms.
shape: defines the symbols of points.
size: defines the size of points.
alpha: defines the opacity of geoms.
The examples below show the difference between mapping variables and mapping fixed values to aesthetics.
p1 <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = drv)) # map variable to color
p2 <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "red") # color is now set to a fixed value
p1 + p2 # enabled by "patchwork"
p3 <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, shape = drv)) # map variable to shape
p4 <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), shape = 2) # shape is now set to a fixed value
p3 + p4
p5 <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, size = drv)) # map variable to size
p6 <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), size = 3) # size is now set to a fixed value
p5 + p6
p7 <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, alpha = drv)) # map variable to alpha
p8 <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), alpha = 0.1) # alpha is now set to a fixed value
p7 + p8
p9 <- ggplot(data = mpg) +
  geom_bar(mapping = aes(x = class, fill = drv)) # map variable drv to fill
p10 <- ggplot(data = mpg) +
  geom_bar(mapping = aes(x = class), fill = "red") # fill is now set to a fixed value
p9 + p10
2.3.1.3 Commonly Used Fixed Values
As shown above, fixed values assigned to aesthetics are usually numbers or strings. Below are some commonly used fixed values for geom functions:
color and fill: see R color cheatsheet.
linetype and shape: see Cookbook for R.
size: takes numeric values; larger values correspond to larger sizes.
alpha: takes values between 0 and 1; larger values correspond to less transparency.
Some extra examples are shown below.
mpg_class_year_hwy <- mpg %>% group_by(class, year) %>% summarize(mean_hwy = mean(hwy))
p11 <- ggplot(data = mpg_class_year_hwy, aes(year, mean_hwy, color = class)) +
  geom_line(size = 2, linetype = "dotdash") # in ggplot2 >= 3.4.0, linewidth = 2 is preferred for lines
p12 <- ggplot(data = mpg_class_year_hwy, aes(year, mean_hwy, color = class)) +
  geom_line(size = 2, linetype = 4) +
  geom_point(size = 4, shape = 15)
p11 + p12
mpg_class_hwy <- mpg %>% group_by(class) %>% summarize(mean_hwy = mean(hwy))
p13 <- ggplot(data = mpg_class_hwy, mapping = aes(x = class, y = mean_hwy)) +
  geom_col(fill = "lightblue1")
p14 <- ggplot(data = mpg_class_hwy, mapping = aes(x = class, y = mean_hwy)) +
  geom_col(fill = "lightblue3")
p13 + p14
p15 <- ggplot(data = mpg, aes(x = drv, y = hwy, color = drv)) +
  geom_boxplot()
p16 <- ggplot(data = mpg, aes(x = drv, y = hwy, color = drv)) +
  geom_boxplot() +
  geom_jitter(width = 0.333)
p15 + p16
2.3.2 Facets
facet_wrap()
: partitions a plot into a matrix of panels, typically based on the values of one faceting variable. Each panel shows a different subset of the data.
facet_grid()
: partitions a plot into a matrix of panels, based on the combination of two faceting variables.
Following the previous practice, let us create a line plot to show the trend of average fuel efficiency for each manufacturer.
mpg_mfr_year_hwy <- mpg %>% group_by(manufacturer, year) %>% summarize(mean_hwy = mean(hwy))
p17 <- ggplot(data = mpg_mfr_year_hwy, aes(year, mean_hwy, color = manufacturer)) +
  geom_line(size = 2)
p18 <- ggplot(data = mpg_mfr_year_hwy, aes(year, mean_hwy, color = manufacturer)) +
  geom_line(size = 2) +
  facet_wrap(~ manufacturer)
p17 + p18
As you can see in “p17”, the graph is not effective in showing the trend in fuel efficiency across time for some of the manufacturers. “p18” partitions the plot into subplots for each individual manufacturer. Since all panels share the same scale for “mean_hwy”, the graph is still not effective. To improve it, you can add scales = "free" to allow different scales for subplots.
p19 <- ggplot(data = mpg_mfr_year_hwy, aes(year, mean_hwy, color = manufacturer)) +
  geom_line(size = 2) +
  facet_wrap(~ manufacturer, scales = "free")
p18 + p19
Now, let’s see how facet_wrap() differs from facet_grid(), starting with graph(s) showing “hwy” vs. “displ”.
p20 <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth()
p21 <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth() +
  facet_wrap(~drv)
p20 + p21
As shown below, facet_wrap() shows the relationship for each value of “drv”, while facet_grid() shows the relationship for each combination of the values of “drv” and “fl”.
p22 <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth() +
  facet_wrap(~drv, nrow = 3, strip.position = "right") # adding nrow and strip.position for better visualization
p23 <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth() +
  facet_grid(drv ~ fl)
p22 + p23
2.3.3 Scales
2.3.3.1 Scales of Axes
This section explores some common functions that scale the axes of your graph.
When mapping discrete variables: the default functions are scale_x_discrete() for the x-axis, and scale_y_discrete() for the y-axis.
When mapping continuous variables: the default functions are scale_x_continuous() for the x-axis, and scale_y_continuous() for the y-axis.
- Built-in functions like scale_x_log10(), scale_x_sqrt(), and scale_x_reverse() provide easy access to common transformations: base-10 logarithm, square root, and reversed order.
For this section, let’s use a new dataset from the package gapminder. See how the argument limits controls the min/max of the axes, and how breaks displays ticks only at specified values.
glimpse(gapminder)
Rows: 1,704
Columns: 6
$ country <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
$ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
$ year <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
$ lifeExp <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…
$ pop <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12…
$ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, …
base_plot <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_point()
scale_x_y <- base_plot +
  scale_x_continuous(limits = c(0, 50000), breaks = c(10000, 25000, 50000)) +
  scale_y_continuous(limits = c(40, 70), breaks = c(40, 50, 70))
base_plot + scale_x_y
built_in_log10 <- base_plot + scale_x_log10()
manual_log10 <- ggplot(gapminder, aes(x = log10(gdpPercap), y = lifeExp, color = continent)) +
  geom_point()
built_in_log10 + manual_log10
base_plot_reverse <- base_plot + scale_x_reverse() + scale_y_reverse()
base_plot + base_plot_reverse
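One caveat about limits in the scale functions: observations outside the limits are removed before any statistics are computed, which can change a fitted trend line. A minimal sketch of the difference, using mpg rather than gapminder so that it only needs ggplot2; the object names and the range c(20, 30) are chosen for illustration. coord_cartesian() zooms the view without dropping data.

```r
library(ggplot2)

base <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth()

# scale limits: points with hwy outside 20-30 are dropped (with a warning)
# before the trend line is fitted
p_scale <- base + scale_y_continuous(limits = c(20, 30))

# coordinate limits: all points are kept for the fit, only the view is zoomed
p_zoom <- base + coord_cartesian(ylim = c(20, 30))

p_scale
p_zoom
```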
2.3.3.2 Scales of Colors
Recall you can set the color of your graphs using the aesthetics color and/or fill.
# building a simple dataframe
mydata <- data.frame(x = c("a", "b", "c", "d"), y = c(1, 2, 3, 4))
mydata
x | y |
---|---|
a | 1 |
b | 2 |
c | 3 |
d | 4 |
bar_color <- ggplot(mydata, aes(x = x, y = y, color = x)) + geom_col()
bar_fill <- ggplot(mydata, aes(x = x, y = y, fill = x)) + geom_col()
bar_color + bar_fill
bar_color_scale <- bar_color + scale_color_discrete() # since x is discrete now
bar_fill_scale <- bar_fill + scale_fill_discrete()
bar_color_scale + bar_fill_scale # no changes since we are using defaults
To assign different colors to different values of “x”, we can utilize the function scale_fill_brewer(), which uses the color palettes from the package RColorBrewer without the need of installing and loading the package. To see all the colors, check this link or type RColorBrewer::display.brewer.all().
bar_fill_brewer_1 <- bar_fill + scale_fill_brewer(palette = "OrRd")
bar_fill_brewer_2 <- bar_fill + scale_fill_brewer(palette = "BrBG") # the RColorBrewer palette is named "BrBG"
bar_fill_brewer_1 + bar_fill_brewer_2
2.3.4 Labs and Themes
To modify elements of a plot other than the data, such as axes, legend, or title, use labs() and theme(). See how the example below applies changes to the axes, legend and plot title of the graph “bar_fill”.
bar_fill_mod <- bar_fill + labs(x = "Letters", y = "Numbers", fill = "Legend", title = "bar_fill Modified") +
  theme(
    axis.title = element_text(size = 24),
    axis.text = element_text(size = 20),
    axis.text.x = element_text(angle = 45),
    legend.title = element_text(size = 16),
    legend.text = element_text(size = 12),
    legend.position = "bottom",
    plot.title = element_text(size = 30, face = "bold")
  )
bar_fill + bar_fill_mod
2.3.5 Built-in Themes
A number of built-in themes come with ggplot2, so you do not need to specify every element of your graphs. The default theme of ggplot2 is theme_grey(). Let’s see some examples on our base plot below.
base_theme <- ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point(color = "red") +
  geom_smooth() +
  labs(x = "Engine Displacement", y = "Highway Mileage", title = "Fuel Efficiency")
base_theme_default <- base_theme + theme_grey() + labs(title = "theme_grey()")
base_theme1 <- base_theme + theme_bw() + labs(title = "theme_bw()")
base_theme2 <- base_theme + theme_classic() + labs(title = "theme_classic()")
base_theme3 <- base_theme + theme_minimal() + labs(title = "theme_minimal()")
base_theme4 <- base_theme + theme_linedraw() + labs(title = "theme_linedraw()")
base_theme5 <- base_theme + theme_light() + labs(title = "theme_light()")
base_theme6 <- base_theme + theme_dark() + labs(title = "theme_dark()")
base_theme7 <- base_theme + theme_void() + labs(title = "theme_void()")
(base_theme_default + base_theme1 + base_theme2 + base_theme3 +
  base_theme4 + base_theme5 + base_theme6 + base_theme7) +
  plot_layout(ncol = 4, nrow = 2)
Extra themes and scales can be acquired by installing the package ggthemes. To see the complete list, visit this website.
2.4 Annotating and Saving Plots
Sometimes, you may wish to add text to your graphs to highlight certain elements. Take “base_plot” in Section 2.3.3 as an example: say you wish to highlight countries with high GDP per capita (above $50,000) but low life expectancy (below 70 years).
h_country <- gapminder %>% filter(gdpPercap > 50000 & lifeExp < 70) # create the data that's to be highlighted in the graph
base_plot_text <- base_plot +
  geom_text(data = h_country, aes(label = country), size = 4, vjust = 1.5, show.legend = FALSE) +
  geom_text(data = h_country, aes(label = year), size = 4, vjust = 2.75, show.legend = FALSE)
base_plot + base_plot_text
To save a created plot, you can use the function ggsave() as below.
ggsave("gdp_lifeExp.png", plot = base_plot_text, width = 10, height = 7) # save to your current working directory
# you can save ggplot as one of "eps", "ps", "tex", "pdf", "jpeg", "tiff", "png", "bmp", "svg" or "wmf"
# you can also use different units = c("in", "cm", "mm", "px"),
# ggsave("myplot.pdf", width = 20, height = 20, units = "cm")
2.5 Useful resources
2.5.1 ggplot2 Cheat Sheet
Click here for more information
2.5.2 R for Data Science
See Chapters 3 and 28 of R for Data Science, by Wickham, H., & Grolemund, G.
2.5.3 ggplot2: Elegant Graphics for Data Analysis
See the third edition of ggplot2: Elegant Graphics for Data Analysis, by Hadley Wickham, Danielle Navarro, and Thomas Lin Pedersen.
2.6 Exercise
Part A
Import data “mpg” and work through the exercises below.
Create a scatterplot that shows all of the following:
use “cty” on the x-axis and “hwy” on the y-axis
assign different colors for “class” and different shapes for “year”.
(tip: since continuous variables cannot be mapped to shape, you will need to convert “year” to a factor using factor())
fit only one trend line for all points
Create a scatterplot that shows all of the following:
use “displ” on the x-axis and “hwy” on the y-axis
partition the graph into a grid by the combination of “drv” and “fl”, but only include “fl” that equals “p” or “r”
fit a trend line in each graph within the grid
(Tip: you can build from the example “p23” in Lab 2.)
Create a bar chart that shows the average “hwy” of all cars produced in 2008 by each of the following manufacturers:
Audi, Hyundai, Nissan and Volkswagen. Assign a different color to each manufacturer.
Part B
Import the data “gapminder” and try to reproduce the graph below.