Lab 2: Data Manipulation and Visualization

Author

Feng Qiu, Liyuan Xuan

Published

September 22, 2025

In Lab 1, we briefly introduced R packages and one package in particular, tidyverse. If you wish to learn more about tidyverse, click here for more information. Lab 2 will focus on two packages that are included in tidyverse:

dplyr for data manipulation
ggplot2 for data visualization

But first, remember to load the package.

# install.packages("tidyverse")     # install if needed
library(tidyverse)

1 Data Manipulation using dplyr

1.1 What is a Tidy Data Set?

Tidy data is a standard way of mapping the meaning of a dataset to its structure. A dataset is messy or tidy depending on how rows, columns and tables are matched up with observations, variables and types. Three rules make a dataset tidy:

Each variable must have its own column
Each observation must have its own row
Each value must have its own cell
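
As a small illustration of these rules, the sketch below reshapes a "wide" table into tidy form with pivot_longer() from tidyr (part of tidyverse). The data frame messy, its column names and its values are invented for this example and are not part of the lab data.

```r
library(tidyr)

# hypothetical messy data: the variable "year" is hidden in the column names
messy <- data.frame(
  name        = c("Henry", "Larry"),
  return_2023 = c(40, 90),
  return_2024 = c(45, 85)
)

# tidy version: one row per farmer-year, one column per variable
tidy <- pivot_longer(
  messy,
  cols         = starts_with("return_"),  # columns to stack
  names_to     = "year",                  # new column holding the old column names
  names_prefix = "return_",               # strip this prefix from the names
  values_to    = "return"                 # new column holding the values
)
tidy
```

After reshaping, each of the three rules holds: name, year and return each have their own column, and every farmer-year combination has its own row.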

1.2 Create a Farm Business Data Set

# farmers' info
name <- c("Henry", "Larry", "Alex", "Gaby", "Amy", "Ruby")
sex <- c("male", "male", "male", "female", "female", "female")
age <- c(43, 60, 25, 50, 28, 58)

# types of farm
type <- c("crop", "livestock", "urban", "dairy", "crop", "livestock")

# size of farm in acres
size <- c(550, 800, 10, 600, 1000, 700)

# net annual cash return from ag businesses, in $1000
return <- c(40, 90, 50, 90, 90, 95)

# combine the variables together as a data frame
farm <- data.frame(name, age, sex, type, size, return)
farm

name	age	sex	type	size	return
Henry	43	male	crop	550	40
Larry	60	male	livestock	800	90
Alex	25	male	urban	10	50
Gaby	50	female	dairy	600	90
Amy	28	female	crop	1000	90
Ruby	58	female	livestock	700	95

# glimpse the data set
glimpse(farm)

Rows: 6
Columns: 6
$ name   <chr> "Henry", "Larry", "Alex", "Gaby", "Amy", "Ruby"
$ age    <dbl> 43, 60, 25, 50, 28, 58
$ sex    <chr> "male", "male", "male", "female", "female", "female"
$ type   <chr> "crop", "livestock", "urban", "dairy", "crop", "livestock"
$ size   <dbl> 550, 800, 10, 600, 1000, 700
$ return <dbl> 40, 90, 50, 90, 90, 95

1.3 Important Functions in dplyr

Six important functions in dplyr are:

select(): pick variables by their names
filter(): pick observations by their values
arrange(): reorder the rows
mutate(): create new variables with functions of existing variables
summarize(): collapse many values down to a single summary
group_by(): groups data by one or more variables, allowing subsequent operations to be applied independently to each group

Combined with the pipe operator %>%, these functions make data manipulation simple and intuitive.

Tip

You can always type “?FUNCTION_NAME” in the Console pane to check the R Documentation for the function. Try ?select.

1.3.1 `select()`

select() allows you to focus on the variables you’re interested in.

select(farm, c(type, size, return))     # select farm type, size and return

type	size	return
crop	550	40
livestock	800	90
urban	10	50
dairy	600	90
crop	1000	90
livestock	700	95

select(farm, sex:size)      # select everything between sex and size

sex	type	size
male	crop	550
male	livestock	800
male	urban	10
female	dairy	600
female	crop	1000
female	livestock	700

select(farm, -name)     # select everything but names

age	sex	type	size	return
43	male	crop	550	40
60	male	livestock	800	90
25	male	urban	10	50
50	female	dairy	600	90
28	female	crop	1000	90
58	female	livestock	700	95

1.3.2 `filter()`

filter() allows you to subset observations based on their values.

filter(farm, size > 500)      # select farms with size > 500

name	age	sex	type	size	return
Henry	43	male	crop	550	40
Larry	60	male	livestock	800	90
Gaby	50	female	dairy	600	90
Amy	28	female	crop	1000	90
Ruby	58	female	livestock	700	95

filter(farm, size > 500 & sex == "female")      # select farms with size > 500 AND owned by female farmers

name	age	sex	type	size	return
Gaby	50	female	dairy	600	90
Amy	28	female	crop	1000	90
Ruby	58	female	livestock	700	95
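
Conditions can also be combined in other ways. The sketch below recreates the farm data so that it runs on its own, then shows OR (|), membership (%in%), and comma-separated conditions, which filter() treats as AND. The object names small_or_rich, crop_or_dairy and big_female are made up for illustration.

```r
library(dplyr)

# same farm data as in Section 1.2
farm <- data.frame(
  name   = c("Henry", "Larry", "Alex", "Gaby", "Amy", "Ruby"),
  age    = c(43, 60, 25, 50, 28, 58),
  sex    = c("male", "male", "male", "female", "female", "female"),
  type   = c("crop", "livestock", "urban", "dairy", "crop", "livestock"),
  size   = c(550, 800, 10, 600, 1000, 700),
  return = c(40, 90, 50, 90, 90, 95)
)

# OR: farms smaller than 100 acres OR with a return above 90
small_or_rich <- filter(farm, size < 100 | return > 90)

# %in%: farms whose type is one of several values
crop_or_dairy <- filter(farm, type %in% c("crop", "dairy"))

# comma-separated conditions are combined with AND
big_female <- filter(farm, size > 500, sex == "female")

small_or_rich
crop_or_dairy
big_female
```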

1.3.3 `arrange()`

arrange() orders the observations by one or more variables. Basically, it changes the order of rows.

arrange(farm, size)      # order the data set by farm size, by default, in ascending order

name	age	sex	type	size	return
Alex	25	male	urban	10	50
Henry	43	male	crop	550	40
Gaby	50	female	dairy	600	90
Ruby	58	female	livestock	700	95
Larry	60	male	livestock	800	90
Amy	28	female	crop	1000	90

arrange(farm, desc(size))      # change the ordering to descending

name	age	sex	type	size	return
Amy	28	female	crop	1000	90
Larry	60	male	livestock	800	90
Ruby	58	female	livestock	700	95
Gaby	50	female	dairy	600	90
Henry	43	male	crop	550	40
Alex	25	male	urban	10	50

1.3.4 `mutate()`

mutate() modifies existing variables or adds new variables.

mutate(farm, return = return * 1000)

name	age	sex	type	size	return
Henry	43	male	crop	550	40000
Larry	60	male	livestock	800	90000
Alex	25	male	urban	10	50000
Gaby	50	female	dairy	600	90000
Amy	28	female	crop	1000	90000
Ruby	58	female	livestock	700	95000

mutate(farm, age.sq = age ^ 2)

name	age	sex	type	size	return	age.sq
Henry	43	male	crop	550	40	1849
Larry	60	male	livestock	800	90	3600
Alex	25	male	urban	10	50	625
Gaby	50	female	dairy	600	90	2500
Amy	28	female	crop	1000	90	784
Ruby	58	female	livestock	700	95	3364

mutate(farm, per.acre.return = return / size)

name	age	sex	type	size	return	per.acre.return
Henry	43	male	crop	550	40	0.0727273
Larry	60	male	livestock	800	90	0.1125000
Alex	25	male	urban	10	50	5.0000000
Gaby	50	female	dairy	600	90	0.1500000
Amy	28	female	crop	1000	90	0.0900000
Ruby	58	female	livestock	700	95	0.1357143

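# Or, you can do all three in one step.
# Note: mutate() evaluates its arguments in order, so per.acre.return below is
# computed from the already rescaled return and is 1000 times larger than above.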
mutate(farm,
  return = return * 1000, 
  age.sq = age ^ 2,
  per.acre.return = return / size
)

name	age	sex	type	size	return	age.sq	per.acre.return
Henry	43	male	crop	550	40000	1849	72.72727
Larry	60	male	livestock	800	90000	3600	112.50000
Alex	25	male	urban	10	50000	625	5000.00000
Gaby	50	female	dairy	600	90000	2500	150.00000
Amy	28	female	crop	1000	90000	784	90.00000
Ruby	58	female	livestock	700	95000	3364	135.71429
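
The one-step result for per.acre.return differs from the separate call because mutate() evaluates its arguments from left to right. A minimal sketch of this behaviour, using only the columns needed (the object names one_step and reordered are hypothetical):

```r
library(dplyr)

farm <- data.frame(
  name   = c("Henry", "Larry", "Alex", "Gaby", "Amy", "Ruby"),
  size   = c(550, 800, 10, 600, 1000, 700),
  return = c(40, 90, 50, 90, 90, 95)
)

# return is rescaled first, so per.acre.return uses the new values (in $)
one_step <- mutate(farm,
  return = return * 1000,
  per.acre.return = return / size
)

# per.acre.return is computed first, so it uses the original values (in $1000)
reordered <- mutate(farm,
  per.acre.return = return / size,
  return = return * 1000
)

one_step$per.acre.return[1]    # 40000 / 550
reordered$per.acre.return[1]   # 40 / 550
```

If the order of the arguments matters for your result, put the arguments in the order you intend, or split the work into separate mutate() calls.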

# change the classes of variables
glimpse(farm)      # view the data before changes

Rows: 6
Columns: 6
$ name   <chr> "Henry", "Larry", "Alex", "Gaby", "Amy", "Ruby"
$ age    <dbl> 43, 60, 25, 50, 28, 58
$ sex    <chr> "male", "male", "male", "female", "female", "female"
$ type   <chr> "crop", "livestock", "urban", "dairy", "crop", "livestock"
$ size   <dbl> 550, 800, 10, 600, 1000, 700
$ return <dbl> 40, 90, 50, 90, 90, 95

farm2 <- mutate(farm,
                sex = as.factor(sex), 
                type = as.factor(type), 
                age = as.integer(age)
         )
glimpse(farm2)       # view the data after changes

Rows: 6
Columns: 6
$ name   <chr> "Henry", "Larry", "Alex", "Gaby", "Amy", "Ruby"
$ age    <int> 43, 60, 25, 50, 28, 58
$ sex    <fct> male, male, male, female, female, female
$ type   <fct> crop, livestock, urban, dairy, crop, livestock
$ size   <dbl> 550, 800, 10, 600, 1000, 700
$ return <dbl> 40, 90, 50, 90, 90, 95

The function ifelse() is often used in data manipulation. It assigns values to a variable based on whether a condition is satisfied.

mutate(farm,
       size2 = ifelse(size > 600, "big", "small"),   
       dummy_urban = ifelse(type == "urban", 1, 0)      # when testing for equality, use double ==
)

name	age	sex	type	size	return	size2	dummy_urban
Henry	43	male	crop	550	40	small	0
Larry	60	male	livestock	800	90	big	0
Alex	25	male	urban	10	50	small	1
Gaby	50	female	dairy	600	90	small	0
Amy	28	female	crop	1000	90	big	0
Ruby	58	female	livestock	700	95	big	0
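
When there are more than two categories, nested ifelse() calls become hard to read. One alternative is dplyr's case_when(), sketched below on age; the cut-offs (30 and 55) and the labels are arbitrary choices made for this illustration.

```r
library(dplyr)

farm <- data.frame(
  name = c("Henry", "Larry", "Alex", "Gaby", "Amy", "Ruby"),
  age  = c(43, 60, 25, 50, 28, 58)
)

# conditions are checked in order; the first one that is TRUE determines the value
farm_age <- mutate(farm,
  age.group = case_when(
    age < 30 ~ "young",
    age < 55 ~ "middle",
    TRUE     ~ "senior"    # fallback for all remaining rows
  )
)
farm_age
```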

1.3.5 `summarize()`

summarize() computes summary statistics. It always produces a single row if there are no grouping variables.

summarize(farm, tot.return = sum(return))

tot.return
455

summarize(farm, avg.return = mean(return))

avg.return
75.83333

summarize(farm,
          youngest = min(age),
          oldest = max(age),
          median = median(age),
          cor.size.return = cor(size, return))

youngest	oldest	median	cor.size.return
25	60	46.5	0.6787267

Tip

It is often the case that we wish to know the summary statistics for certain groups, e.g. average return by gender. Therefore, summarize() is usually combined with group_by() and the pipe operator %>%.

1.3.6 `group_by()` and `%>%`

1.3.6.1 `group_by()`

group_by() groups data by the named variables. On its own, group_by() does not change any values or the order of the rows; it only records the grouping, so that subsequent operations are performed separately for each group.

group_by(farm, sex)

name	age	sex	type	size	return
Henry	43	male	crop	550	40
Larry	60	male	livestock	800	90
Alex	25	male	urban	10	50
Gaby	50	female	dairy	600	90
Amy	28	female	crop	1000	90
Ruby	58	female	livestock	700	95
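
Because the printed rows look the same as before, it can help to inspect the grouping directly. The sketch below uses group_vars() and ungroup() from dplyr; the object name by_sex is hypothetical.

```r
library(dplyr)

farm <- data.frame(
  name   = c("Henry", "Larry", "Alex", "Gaby", "Amy", "Ruby"),
  sex    = c("male", "male", "male", "female", "female", "female"),
  return = c(40, 90, 50, 90, 90, 95)
)

by_sex <- group_by(farm, sex)

identical(by_sex$name, farm$name)   # the row order is unchanged
group_vars(by_sex)                  # the grouping variable that was recorded
group_vars(ungroup(by_sex))         # ungroup() removes the grouping again
```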

1.3.6.2 `%>%`

However, the main purpose of group_by() is to group your data so that the operations that follow are performed by group. To achieve this, you will also need the pipe operator %>%. Functioning like a pipe, %>% uses the output of one function as the input to the next function.

Suppose you wish to perform the steps below, based on the data farm:

calculate the return per acre, called “per.acre.return”
keep only the farms that are owned by farmers above 40 years old
create a new data frame that contains only the names and ages of the farmers and the return per acre

### without %>%
farm_wo_pipe1 <- mutate(farm, per.acre.return = return / size)
farm_wo_pipe2 <- filter(farm_wo_pipe1, age > 40)
farm_wo_pipe3 <- select(farm_wo_pipe2, c(name, age, per.acre.return))
farm_wo_pipe3

name	age	per.acre.return
Henry	43	0.0727273
Larry	60	0.1125000
Gaby	50	0.1500000
Ruby	58	0.1357143

### with %>%

farm_w_pipe <- farm %>% mutate(per.acre.return = return / size) %>%
                        filter(age > 40) %>%
                        select(name, age, per.acre.return)
farm_w_pipe

name	age	per.acre.return
Henry	43	0.0727273
Larry	60	0.1125000
Gaby	50	0.1500000
Ruby	58	0.1357143

1.3.6.3 Combining `group_by()` with `%>%`

Now, let’s calculate summary statistics by groups, using group_by() with %>%.

farm %>% group_by(sex) %>% summarize(num.farmer = n(),
                                     youngest = min(age),
                                     oldest = max(age),  
                                     
                                     tot.return = sum(return),
                                     avg.return = mean(return),
                                     avg.per.acre.return = mean(return/size),
                                     avg.size = mean(size))

sex	num.farmer	youngest	oldest	tot.return	avg.return	avg.per.acre.return	avg.size
female	3	28	58	275	91.66667	0.1252381	766.6667
male	3	25	60	180	60.00000	1.7284091	453.3333
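
group_by() also works with mutate(). In that case all rows are kept and the calculation is done within each group. The sketch below computes each farmer's share of the total return of their own group; the column name share.of.group is made up for this example.

```r
library(dplyr)

farm <- data.frame(
  name   = c("Henry", "Larry", "Alex", "Gaby", "Amy", "Ruby"),
  sex    = c("male", "male", "male", "female", "female", "female"),
  return = c(40, 90, 50, 90, 90, 95)
)

shares <- farm %>%
  group_by(sex) %>%
  mutate(share.of.group = return / sum(return)) %>%   # sum() is taken within each sex
  ungroup()                                           # drop the grouping when done
shares
```

Calling ungroup() at the end is a precaution, so that later operations on shares are not silently performed by group.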

1.3.7 Other Functions/Verbs

1.3.7.1 `slice()` and Its Variants

You can use slice() to select rows by position, or one of its variants:

slice_head() and slice_tail(): to select first/last rows
slice_min() and slice_max(): to select rows with minimum/maximum values
slice_sample(): to select random samples

farm

name	age	sex	type	size	return
Henry	43	male	crop	550	40
Larry	60	male	livestock	800	90
Alex	25	male	urban	10	50
Gaby	50	female	dairy	600	90
Amy	28	female	crop	1000	90
Ruby	58	female	livestock	700	95

farm %>% slice(3)     # pick the observation in row 3

name	age	sex	type	size	return
Alex	25	male	urban	10	50

farm %>% slice(1:3)     # pick observations from row 1 through row 3

name	age	sex	type	size	return
Henry	43	male	crop	550	40
Larry	60	male	livestock	800	90
Alex	25	male	urban	10	50

farm %>% slice_head(n = 3)      # pick first 3 rows, slice_tail would pick the last 3 rows

name	age	sex	type	size	return
Henry	43	male	crop	550	40
Larry	60	male	livestock	800	90
Alex	25	male	urban	10	50

farm %>% slice_min(age, n = 3)      # pick 3 rows with the youngest ages, slice_max would pick 3 rows with the largest ages

name	age	sex	type	size	return
Alex	25	male	urban	10	50
Amy	28	female	crop	1000	90
Henry	43	male	crop	550	40

farm %>% slice_sample(n = 3)      # randomly pick 3 observations

name	age	sex	type	size	return
Gaby	50	female	dairy	600	90
Amy	28	female	crop	1000	90
Henry	43	male	crop	550	40

farm %>% slice_sample(prop = 0.5)     # randomly pick 50% of the data

name	age	sex	type	size	return
Larry	60	male	livestock	800	90
Gaby	50	female	dairy	600	90
Ruby	58	female	livestock	700	95

1.3.7.2 `count()`

count() counts the number of observations for each category.

count(farm)     # count the number of observations

n
6

count(farm, type)     # count observations per type of farm

type	n
crop	2
dairy	1
livestock	2
urban	1

count(farm, type, sort = TRUE)     # sort = TRUE orders the result by count, largest first

type	n
crop	2
livestock	2
dairy	1
urban	1

count(farm, type, wt = return, sort = TRUE)     # add argument for weight

type	n
livestock	185
crop	130
dairy	90
urban	50
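
count() can be thought of as a shortcut for group_by() followed by summarize(). The sketch below checks this on the farm types; the object names by_count and by_summarize are hypothetical.

```r
library(dplyr)

farm <- data.frame(
  name = c("Henry", "Larry", "Alex", "Gaby", "Amy", "Ruby"),
  type = c("crop", "livestock", "urban", "dairy", "crop", "livestock")
)

by_count <- count(farm, type)

by_summarize <- farm %>%
  group_by(type) %>%
  summarize(n = n())     # n() counts the rows in each group

by_count
by_summarize
```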

1.4 Export and Import Data

This section introduces functions in base R allowing you to export your data for later usage or import your saved data. To learn more about import/export data, check out this link.

1.4.1 RData Format

### export 
save(farm, file = "farm.Rdata")     # save to the current working directory
# specify the file path if you wish to save to a different location

### import
load("farm.Rdata")      # load from the current working directory
# specify the file path if your file is stored in a different location

1.4.2 csv Format

### export
write.csv(farm, "farm.csv", row.names = FALSE)     # row.names = FALSE avoids writing an extra column of row numbers

### import
farm <- read.csv("farm.csv")
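
By default, write.csv() also writes the row names, which come back as an extra column (named X) when the file is read again. The sketch below shows the difference on a small, made-up data frame, writing to a temporary file so that nothing in your working directory is touched.

```r
# hypothetical two-row data frame, just for this illustration
small_farm <- data.frame(name = c("Henry", "Larry"), size = c(550, 800))
path <- tempfile(fileext = ".csv")

# default: row names are written as an unnamed first column
write.csv(small_farm, path)
with_rownames <- read.csv(path)
names(with_rownames)       # "X" "name" "size"

# row.names = FALSE: only the actual variables are written
write.csv(small_farm, path, row.names = FALSE)
without_rownames <- read.csv(path)
names(without_rownames)    # "name" "size"
```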

1.4.3 Other Format

If you are working with SPSS, Stata or SAS data files, haven is a good package for importing and exporting files of those formats.

Tip

A handy trick to import data interactively, without specifying a path: try read.csv(file.choose()).

1.5 Useful Resources

1.5.1 dplyr Cheat Sheet

Click here for more information

1.5.2 R for Data Science

See Chapter 5 of R for Data Science, by Wickham, H., & Grolemund, G.

1.6 Exercise

Part A

Continuing with the farm business dataset “farm” used in Lab 2, work through the exercises below.

Create a new variable called size3 that meets the following criteria:
1. size3 = “small” if size <= 200
2. size3 = “medium” if 200 < size <= 600
3. size3 = “big” if size > 600
Finally, convert size3 to a factor variable that is ordinal from “small” to “big”.
Generate the following summary statistics, for each type of the farms:
1. the sum of all returns, called tot.return
2. the average returns, called avg.return
Finally, rearrange the data based on the value of avg.return, in the descending order.

Part B

Import data “mpg” and work through the coding below.

Drop the variables displ, drv and fl, then exclude cars that were manufactured by Hyundai and Pontiac.
Continue from your dataframe above and generate the summary statistics, for each manufacturer, model, and year:
1. the average of “cty”, called avg.cty
2. the average of “hwy”, called avg.hwy
3. the total number of cars produced, called tot.cars

2 Data Visualization using ggplot

library(tidyverse) 
library(gapminder) # for additional data
library(patchwork) # optional, used to show graphs side by side

2.1 Introduction to ggplot2

ggplot2 is a plotting package that provides powerful commands to create graphs from data in a data frame. It offers a more programmatic interface for specifying which variables to plot, how they are displayed, and general visual properties. Therefore, we only need minimal changes if the underlying data change or if we decide to switch from a bar plot to a scatterplot. This helps in creating publication-quality plots with minimal adjustments and tweaking. Reference.

The “gg” here refers to “grammar of graphics”.
Every graph consists of one or more geometric layers.

For demonstration, we will use the built-in data set, mpg, first.

data(mpg)     
head(mpg)

manufacturer	model	displ	year	cyl	trans	drv	cty	hwy	fl	class
audi	a4	1.8	1999	4	auto(l5)	f	18	29	p	compact
audi	a4	1.8	1999	4	manual(m5)	f	21	29	p	compact
audi	a4	2.0	2008	4	manual(m6)	f	20	31	p	compact
audi	a4	2.0	2008	4	auto(av)	f	21	30	p	compact
audi	a4	2.8	1999	6	auto(l5)	f	16	26	p	compact
audi	a4	2.8	1999	6	manual(m5)	f	18	26	p	compact

2.2 Layered Grammar of Graphics

For our illustration of functions in ggplot2 in Lab 2, the layered grammar of graphics follows the template below. We will go through them step by step in the following sections.

 ggplot(data = <DATA>) + 
     <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>)) +
     <FACET_FUNCTION> +
     <SCALE_FUNCTION> +
     <LABS_FUNCTION> +
     <THEME_FUNCTION>

2.3 Layers in ggplot2

2.3.1 Geometric Layers

2.3.1.1 Commonly Used `geom` Functions

Below is a list of commonly used geom functions. We will explore all of them in the rest of this section:

geom_point(): creates scatterplots
geom_line(): creates line plots
geom_bar(): creates bar charts of counts
geom_col(): creates bar charts of values
geom_boxplot(): shows distributions and outliers with boxplots
geom_smooth(): adds a fitted trend line
geom_jitter(): aids the visualization of points by adding “jitter” to their positions

# create a scatter plot  
ggplot(data = mpg) +
    geom_point(mapping = aes(x = displ, y = hwy))

# add another layer
ggplot(data = mpg) +
    geom_point(mapping = aes(x = displ, y = hwy), color = "red") +      # you can also request a specific color
    geom_smooth(mapping = aes(x = displ, y = hwy))

The geom_xxx() functions can inherit both the data and aesthetic mappings from the top level of the plot, due to the argument inherit.aes = TRUE by default (as specified in the R Documentation). As a result, you can simplify your code as follows:

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
    geom_point(color = "red") +
    geom_smooth()

2.3.1.2 Aesthetic Mapping

Recall from our previous code that color = "red" was set outside aes(), while x and y were set inside it. Aesthetics in a geom_xxx() call can be specified in two ways:

inside the aes() function, which maps variables to aesthetics to represent or enhance visual features.
outside the aes() function, which takes fixed values. This step is usually optional.

geom_xxx(aes(ARGUMENTS = variable, ...), ARGUMENTS = fixed values). Some commonly used aesthetics are:

x, y: define the variables for the x- and y-axes (must be inside aes()).
color: defines the color of lines and strokes.
fill: defines the color inside areas of geoms.
shape: defines the symbols of points.
size: defines the size of points.
alpha: defines the opacity of geoms.

The examples below show the difference between mapping variables and mapping fixed values to aesthetics.

p1 <- ggplot(data = mpg) +
        geom_point(mapping = aes(x = displ, y = hwy, color = drv)) # map variable to color 
  
p2 <- ggplot(data = mpg) +
        geom_point(mapping = aes(x = displ, y = hwy), color = "red") # color now is mapped by a fixed value 

p1 + p2 # enabled by "patchwork"

p3 <- ggplot(data = mpg) +
        geom_point(mapping = aes(x = displ, y = hwy, shape = drv))  # map variable to shape 
  
p4 <- ggplot(data = mpg) +
        geom_point(mapping = aes(x = displ, y = hwy), shape = 2)  # shape now is mapped by a fixed value 

p3 + p4

p5 <- ggplot(data = mpg) +
        geom_point(mapping = aes(x = displ, y = hwy, size = drv))  # map variable to size 
  
p6 <- ggplot(data = mpg) +
        geom_point(mapping = aes(x = displ, y = hwy), size = 3)  # size now is mapped by a fixed value 

p5 + p6

p7 <- ggplot(data = mpg) +
        geom_point(mapping = aes(x = displ, y = hwy, alpha = drv))  # map variable to alpha 
  
p8 <- ggplot(data = mpg) +
        geom_point(mapping = aes(x = displ, y = hwy), alpha = 0.1)  # alpha now is mapped by a fixed value 

p7 + p8

p9 <- ggplot(data = mpg) +
        geom_bar(mapping = aes(x = class, fill = drv))  # map variable drv to fill
  
p10 <- ggplot(data = mpg) +
       geom_bar(mapping = aes(x = class), fill = "red")  # fill now is mapped by a fixed value 

p9 + p10
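
A common pitfall follows from this distinction: a fixed value placed inside aes() is treated as a variable with a single level, not as a color name. The sketch below compares the two versions and uses layer_data() from ggplot2 to look at the colors that are actually drawn; the object names wrong and right are hypothetical.

```r
library(ggplot2)

# inside aes(): "red" is treated as a one-level variable and gets a default palette color
wrong <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = "red"))

# outside aes(): "red" is used as the color itself
right <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "red")

unique(layer_data(wrong)$colour)   # a default palette color, not "red"
unique(layer_data(right)$colour)   # "red"
```

The first plot also shows a legend with the single entry "red", which is usually a sign that a fixed value ended up inside aes().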

2.3.1.3 Commonly Used Fixed Values

As shown above, fixed values mapped to aesthetics are usually numbers or strings. Below are some commonly used fixed values for geom functions:

color and fill: see R color cheatsheet.
linetype and shape: see Cookbook for R
size: takes numeric values; larger values correspond to larger sizes.
alpha: takes values between 0 and 1; larger values correspond to less transparency.
Some extra examples are shown below

mpg_class_year_hwy <- mpg %>% group_by(class, year) %>% summarize(mean_hwy = mean(hwy))

p11 <- ggplot(data = mpg_class_year_hwy, aes(year, mean_hwy, color = class)) +
        geom_line(linewidth = 2, linetype = "dotdash")   # for lines, linewidth replaces size in ggplot2 3.4.0 and later
  
p12 <- ggplot(data = mpg_class_year_hwy, aes(year, mean_hwy, color = class)) +
        geom_line(linewidth = 2, linetype = 4) +
        geom_point(size = 4, shape = 15)

p11 + p12

mpg_class_hwy <- mpg %>% group_by(class) %>% summarize(mean_hwy = mean(hwy))

p13 <- ggplot(data = mpg_class_hwy, mapping = aes(x = class, y = mean_hwy)) +
        geom_col(fill = "lightblue1")  
  
p14 <- ggplot(data = mpg_class_hwy, mapping = aes(x = class, y = mean_hwy)) +
        geom_col(fill = "lightblue3")   

p13 + p14

p15 <- ggplot(data = mpg, aes(x = drv, y = hwy, color = drv)) +
        geom_boxplot() 
  
p16 <- ggplot(data = mpg, aes(x = drv, y = hwy, color = drv)) +
        geom_boxplot() + 
        geom_jitter(width = 0.333) 

p15 + p16

2.3.2 Facets

facet_wrap(): partitions a plot into a matrix of panels, typically based on the values of one faceting variable. Each panel shows a different subset of the data.

facet_grid(): partitions a plot into a matrix of panels, based on the combination of two faceting variables.

Following the previous practice, let us create a line plot to show the trend of average fuel efficiency for each manufacturer.

mpg_mfr_year_hwy <- mpg %>% group_by(manufacturer, year) %>% summarize(mean_hwy = mean(hwy))
 
p17 <- ggplot(data = mpg_mfr_year_hwy, aes(year, mean_hwy, color = manufacturer)) +
        geom_line(linewidth = 2)
  
p18 <- ggplot(data = mpg_mfr_year_hwy, aes(year, mean_hwy, color = manufacturer)) +
        geom_line(linewidth = 2) + 
        facet_wrap(~ manufacturer) 

p17 + p18

As you can see in “p17”, the graph is not effective in showing the trend in fuel efficiency across time for some of the manufacturers. “p18” partitions the plot into subplots for each individual manufacturer. Since all panels share the same scale for “mean_hwy”, the graph is still not effective. To improve, you can add scales = "free" to allow different scales for subplots.

p19 <- ggplot(data = mpg_mfr_year_hwy, aes(year, mean_hwy, color = manufacturer)) +
        geom_line(linewidth = 2) + 
        facet_wrap(~ manufacturer, scales = "free")  

p18 + p19

Now, let’s see how facet_wrap() differs from facet_grid(), starting with graph(s) showing “hwy” vs. “displ”.

p20 <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
        geom_point() +
        geom_smooth()

p21 <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
        geom_point() +
        geom_smooth() +
        facet_wrap(~drv)

p20 + p21

As shown below, facet_wrap() shows the relationship for each value of “drv”, while facet_grid() shows the relationship for each combination of the values of “drv” and “fl”.

p22 <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
        geom_point() +
        geom_smooth() +
        facet_wrap(~drv, nrow = 3, strip.position = "right") # adding nrow and strip.position for better visualization

p23 <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
        geom_point() +
        geom_smooth() +
        facet_grid(drv ~ fl)

p22 + p23

2.3.3 Scales

2.3.3.1 Scales of Axes

This section explores some common functions that scale the axes of your graph.

When mapping discrete variables: the default functions are scale_x_discrete() for the x-axis, and scale_y_discrete() for the y-axis.
When mapping continuous variables: the default functions are scale_x_continuous() for the x-axis, and scale_y_continuous() for the y-axis.
Built-in functions like scale_x_log10(), scale_x_sqrt(), and scale_x_reverse() provide easy access to common transformations: base-10 logarithm, square root, and reversed order.

For this section, let’s use a new dataset from the package gapminder. See how the argument limits controls the min/max of the axes, and how breaks displays ticks only at specified values.

glimpse(gapminder)

Rows: 1,704
Columns: 6
$ country   <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
$ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
$ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
$ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…
$ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12…
$ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, …

base_plot <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp, color = continent)) +
              geom_point() 

scale_x_y <- base_plot + 
              scale_x_continuous(limits = c(0, 50000), breaks = c(10000, 25000, 50000)) +
              scale_y_continuous(limits = c(40, 70), breaks = c(40, 50, 70))

base_plot + scale_x_y

built_in_log10 <- base_plot + scale_x_log10()

manual_log10 <- ggplot(gapminder, aes(x = log10(gdpPercap), y = lifeExp, color = continent)) +
                  geom_point() 

built_in_log10 + manual_log10

base_plot_reverse <- base_plot + scale_x_reverse() + scale_y_reverse()
 
base_plot + base_plot_reverse
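
One detail about limits is worth knowing: limits set in a scale function turn observations outside the range into missing values before plotting, whereas coord_cartesian() only zooms in and keeps all the data. The sketch below contrasts the two on the gapminder data; the object names clipped and zoomed are hypothetical, and layer_data() is used only to inspect the data behind each plot.

```r
library(ggplot2)
library(gapminder)

base_plot <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
  geom_point()

# scale limits: observations with gdpPercap outside [0, 50000] become NA
clipped <- base_plot + scale_x_continuous(limits = c(0, 50000))

# coord_cartesian: same visible range, but no observation is dropped
zoomed <- base_plot + coord_cartesian(xlim = c(0, 50000))

sum(is.na(layer_data(clipped)$x))   # number of observations set to NA by the limits
sum(is.na(layer_data(zoomed)$x))    # no observation is set to NA
```

For a scatterplot the two look the same, but with layers that compute statistics (such as geom_smooth() or geom_boxplot()) the results can differ, because the statistics are calculated after the scale limits are applied.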

2.3.3.2 Scales of Colors

Recall you can set the color of your graphs using the aesthetics color and/or fill.

# building a simple dataframe
mydata <- data.frame(x = c("a", "b", "c", "d"), y = c(1, 2, 3, 4))
mydata

x	y
a	1
b	2
c	3
d	4

bar_color <- ggplot(mydata, aes(x = x, y = y, color = x)) + geom_col()
bar_fill  <- ggplot(mydata, aes(x = x, y = y, fill = x))  + geom_col()

bar_color + bar_fill

bar_color_scale <- bar_color + scale_color_discrete()  # since x is discrete now
bar_fill_scale  <- bar_fill  + scale_fill_discrete()  

bar_color_scale + bar_fill_scale # no changes since we are using defaults

To assign different colors to different values of “x”, we can utilize the function scale_fill_brewer(), which uses the color palettes from the package RColorBrewer without the need of installing and loading the package. To see all the colors, check this link or type RColorBrewer::display.brewer.all().

bar_fill_brewer_1 <- bar_fill + scale_fill_brewer(palette = "OrRd")

bar_fill_brewer_2 <- bar_fill + scale_fill_brewer(palette = "BrBG")   # palette names are case-sensitive

bar_fill_brewer_1 + bar_fill_brewer_2
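
If you would rather choose the colors yourself, scale_fill_manual() accepts a named vector that assigns a color to each value. The sketch below highlights one bar; the choice of colors and the name bar_fill_manual are just for illustration.

```r
library(ggplot2)

mydata <- data.frame(x = c("a", "b", "c", "d"), y = c(1, 2, 3, 4))

bar_fill_manual <- ggplot(mydata, aes(x = x, y = y, fill = x)) +
  geom_col() +
  scale_fill_manual(
    values = c(a = "grey70", b = "grey70", c = "grey70", d = "firebrick")  # names match values of x
  )
bar_fill_manual
```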

2.3.4 Labs and Themes

To modify elements of a plot other than the data, such as axes, legend, or title, use labs() and theme(). See how the example below applies changes to the graph “bar_fill” on axis, legend and plot title.

bar_fill_mod <- bar_fill + labs(x = "Letters", y = "Numbers", fill = "Legend", title = "bar_fill Modified") +
                                theme(
                                  axis.title = element_text(size = 24),
                                  axis.text = element_text(size = 20),
                                  axis.text.x = element_text(angle = 45),
                                  
                                  legend.title = element_text(size = 16),
                                  legend.text = element_text(size = 12),
                                  legend.position = "bottom",
                                  
                                  plot.title = element_text(size = 30, face = "bold")
                                )
        

bar_fill + bar_fill_mod

2.3.5 Built-in Themes

A number of built-in themes come with ggplot2, which you can use without specifying every element of your graphs. The default theme of ggplot2 is theme_grey(). Let’s see some examples on our base plot below.

base_theme <- ggplot(data = mpg, aes(x = displ, y = hwy)) +
                geom_point(color = "red") +
                geom_smooth() +
                labs(x = "Engine Displacement", y = "Highway Mileage", title = "Fuel Efficiency") 
base_theme

base_theme_default <- base_theme + theme_grey() + labs(title = "theme_grey()")
        
base_theme1 <- base_theme + theme_bw() + labs(title = "theme_bw()")

base_theme2 <- base_theme + theme_classic() + labs(title = "theme_classic()")

base_theme3 <- base_theme + theme_minimal() + labs(title = "theme_minimal()")
        
base_theme4 <- base_theme + theme_linedraw() + labs(title = "theme_linedraw()")

base_theme5 <- base_theme + theme_light() + labs(title = "theme_light()")

base_theme6 <- base_theme + theme_dark() + labs(title = "theme_dark()")

base_theme7 <- base_theme + theme_void() + labs(title = "theme_void()")

(base_theme_default + base_theme1 + base_theme2 + base_theme3 +
 base_theme4 + base_theme5 + base_theme6 + base_theme7) + 
  plot_layout(ncol = 4, nrow = 2)

Extra themes and scales can be obtained by installing the package ggthemes. To see the complete list, visit this website.

2.4 Annotating and Saving Plots

Sometimes, you may wish to add text to your graphs to highlight certain elements. Take “base_plot” from Section 2.3.3 as an example: suppose you wish to highlight countries with high GDP per capita (above $50,000) but low life expectancy (below 70 years).

h_country <- gapminder %>% filter(gdpPercap > 50000 & lifeExp < 70) # create the data that's to be highlighted in the graph

base_plot_text <- base_plot + 
                    geom_text(data = h_country, aes(label = country), size = 4, vjust = 1.5, show.legend = FALSE) + 
                    geom_text(data = h_country, aes(label = year), size = 4, vjust = 2.75, show.legend = FALSE) 

base_plot + base_plot_text

To save a created plot, you can use the function ggsave() as below.

ggsave("gdp_lifeExp.png", plot = base_plot_text, width = 10, height = 7)  # save to your current working directory

# you can save ggplot as one of "eps", "ps", "tex", "pdf", "jpeg", "tiff", "png", "bmp", "svg" or "wmf" 
# you can also use different units = c("in", "cm", "mm", "px"),  
# ggsave("myplot.pdf", width = 20, height = 20, units = "cm")
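
To try ggsave() without cluttering your working directory, you can save to a temporary folder and check that the file was written. This is only a sketch: the plot, the file name displ_hwy.png and the dimensions are arbitrary.

```r
library(ggplot2)

p_save <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point()

# build a path inside R's temporary directory
out_file <- file.path(tempdir(), "displ_hwy.png")

# width and height are in inches by default; dpi controls the resolution
ggsave(out_file, plot = p_save, width = 6, height = 4, dpi = 300)

file.exists(out_file)   # check that the file was written
```

Passing the plot explicitly through the plot argument avoids relying on the default, which saves the last plot that was displayed.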

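2.5 Useful resources

2.5.1 ggplot2 Cheat Sheet

2.5.2 R for Data Science

2.5.3 ggplot2: Elegant Graphics for Data Analysis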

2.6 Exercise

Part A

Import data “mpg” and work through the exercises below.

Create a scatterplot that shows all of the following:
1. use “cty” on the x-axis and “hwy” on the y-axis
2. assign different colors for “class” and different shapes for “year”.
  (tip: since continuous variables cannot be mapped to shape, you will need to convert “year” to a factor using factor())
3. fit only one trend line for all points
Create a scatterplot that shows all of the following:
1. use “displ” on the x-axis and “hwy” on the y-axis
2. partition the graph into a grid by the combination of “drv” and “fl”, but only include “fl” that equals “p” or “r”
3. fit a trend line in each graph within the grid
(Tip: you can build from the example “p23” in Lab 2.)
Create a bar chart that shows the average “hwy” of all cars produced in 2008 by each of the following manufacturers:
Audi, Hyundai, Nissan and Volkswagen. Assign a different color to each manufacturer.

Part B

Import the data “gapminder” and try to reproduce the graph below.
