Graphics with ggplot2

STAT 220

Bastola

Visualization in the data science workflow

Data visualization is a key skill for data scientists. It is useful for:

  • Identifying outliers
  • Guiding recoding operations
  • Summarizing distributions
  • Discovering patterns and relationships
  • Visualizing uncertainty

Which visualization do you prefer?

Which quantity do I want to visualize?

  • Amounts
  • Distributions
  • Proportions
  • Associations
  • Trends
  • Estimates
  • Uncertainty

Which question do I want to answer?

  • “Is the distribution normal (or uniform or…)?” \(\rightarrow\) Histogram, density plot, Q-Q plot
  • “Are univariate distributions across subgroups different?” \(\rightarrow\) Boxplots
  • “How do differences in amounts between groups compare?” \(\rightarrow\) Barplot
  • “What is the relationship between \(\mathrm{x}\) and \(\mathrm{y}\)?” \(\rightarrow\) Scatterplot, contour plot, hex bins
  • “Are the data clustered by subgroup?” \(\rightarrow\) Scatterplot with color
  • “How uncertain are estimates?” \(\rightarrow\) Error bars, confidence bands

ggplot2 — Overview

ggplot2 is a powerful package for visualizing data, widely used in both academia and industry.

Our building blocks 🧱

Essentials

  • Data: the data frame, or data frames, we will use to plot
  • Aesthetics: the variables we will be working with
  • Geometric objects: the type of visualization

Additional elements

  • Theme adjustments: linewidth, text, colors, etc.
  • Facets
  • Coordinate system
  • Statistical transformations
  • Position adjustments
  • Scales
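
Two of the less obvious items in this list, statistical transformations and position adjustments, can be illustrated with a short sketch. It assumes the built-in mtcars data frame; the object name p_bars is only illustrative.

```r
library(ggplot2)

# Statistical transformation: geom_bar() counts the rows in each group
# via stat = "count" (its default, written out here for clarity).
# Position adjustment: "dodge" places the bars for each transmission
# type side by side instead of stacking them.
p_bars <- ggplot(mtcars, aes(x = factor(cyl), fill = factor(am))) +
  geom_bar(stat = "count", position = "dodge")

p_bars
```

Changing position = "dodge" to "stack" or "fill" redraws the same counts as stacked bars or as proportions.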

Data

In ggplot2, we always specify a data frame with:

ggplot(name_of_your_df)
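
As a minimal sketch, assuming the built-in mtcars data frame stands in for name_of_your_df, calling ggplot() with only a data frame gives a blank canvas, because no aesthetics or geometric objects have been supplied yet:

```r
library(ggplot2)

# Only the data layer is specified: the plot object exists,
# but it has no layers to draw.
p_empty <- ggplot(mtcars)

p_empty  # renders an empty panel
```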

Aesthetics

Specify the variables in the data frame we will be using and what role they play. Use the function aes() within the ggplot() function after the data frame.

ggplot(name_of_your_df, aes(x = your_x_axis_variable, 
                            y = your_y_axis_variable))

Beyond the axes, you can add more aesthetics that represent further dimensions of the data in the two-dimensional graphic plane, such as shape, linewidth, color, fill, and alpha, to name a few.
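
The following sketch shows what extra aesthetics might look like in practice. It assumes the built-in mtcars data frame and uses geom_point(), introduced in the next section, to draw the points; the object name p_aes is only illustrative.

```r
library(ggplot2)

# color and shape are mapped to variables inside aes(),
# so ggplot2 builds a legend for each of them automatically.
p_aes <- ggplot(mtcars, aes(x = wt, y = mpg,
                            color = factor(cyl),
                            shape = factor(am))) +
  # size and alpha are set to constants outside aes(),
  # so they apply to every point and produce no legend.
  geom_point(size = 3, alpha = 0.7)

p_aes
```

The distinction worth noticing is that values inside aes() vary with the data, while values outside it are fixed settings.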

Geometric objects

The third layer required to create our plot is a geometric object, which determines the kind of visualization, such as a bar plot or scatter plot.

To add one, append a plus (+) to the end of the initial line and specify the desired geometric object, such as geom_point() for a scatter plot or geom_bar() for a bar plot.

ggplot(name_of_your_df, aes(x = your_x_axis_variable, 
                            y = your_y_axis_variable)) +
  geom_point()

Theme and Axes

At this stage, our plot might require a few finishing touches. We might want to adjust the axis names or remove the default gray background. To accomplish this, we add another layer, preceded by a plus sign (+).

To modify the axis names, we can use the labs() function. Additionally, we can apply some of the pre-defined themes, such as theme_minimal().

ggplot(name_of_your_df, aes(x = your_x_axis_variable,
                            y = your_y_axis_variable)) +
  geom_point() +
  theme_minimal() +
  labs(x = "Your x label",
       y = "Your y label",
       title = "Your plot title")

Common ggplot2 options

ggplot(data) +    # data
  <geometry_funs>(aes(<variables>)) + # aesthetic variable mapping
  <label_funs> +  # add context
  <facet_funs> +  # add facets (optional)
  <coordinate_funs> +  # play with coords (optional)
  <scale_funs> + # play with scales (optional)
  <theme_funs> # play with axes, colors, etc (optional)
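
One possible way to fill in this template is sketched below, assuming the built-in mtcars data frame. The specific choices (facet_wrap(), coord_cartesian(), scale_color_brewer(), and the object name p_template) are illustrative picks for each slot, not the only options.

```r
library(ggplot2)

p_template <- ggplot(mtcars) +                              # data
  geom_point(aes(x = wt, y = mpg, color = factor(cyl))) +   # aesthetic variable mapping
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon",
       color = "Cylinders") +                               # add context
  facet_wrap(~ am) +                                        # one panel per transmission type
  coord_cartesian(xlim = c(1, 6)) +                         # zoom the x axis without dropping data
  scale_color_brewer(palette = "Dark2") +                   # choose a color palette
  theme_minimal()                                           # drop the gray background

p_template
```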

Example Visualization 1

# Histogram of mpg in the mtcars dataset
ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 2, 
                 fill = "cornflowerblue") +
  theme_minimal() +
  labs(title = "Distribution of Miles per Gallon (mpg)")

Example Visualization 2

# Histogram with density overlay for mpg in the mtcars dataset
ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(aes(y = after_stat(density)), 
                 binwidth = 2, 
                 fill = "skyblue", 
                 alpha = 0.5) +
  geom_density(color = "firebrick") +
  theme_minimal() +
  labs(title = "Histogram and Density of Miles per Gallon (mpg)")

Example Visualization 3

# Barplot of number of cars by number of cylinders in the mtcars dataset
ggplot(mtcars, aes(x = factor(cyl))) +
  geom_bar(fill = "steelblue") +
  theme_minimal() +
  labs(title = "Count of Cars by Cylinder Count",
       x = "Number of Cylinders", 
       y = "Count")

Example Visualization 4

# Boxplot of mpg by number of cylinders in the mtcars dataset
ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot(fill = "lightgreen") +
  theme_minimal() +
  labs(title = "Miles per Gallon (mpg) by Cylinder Count", 
       x = "Number of Cylinders", 
       y = "mpg")

Example Visualization 5

# Scatterplot of mpg vs. weight in the mtcars dataset
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(color = "purple") +
  theme_minimal() +
  labs(title = "Miles per Gallon (mpg) vs. Weight", 
       x = "Weight", 
       y = "mpg")

Example Visualization 6

# Contour plot of mpg vs. weight in the mtcars dataset
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_density_2d(color = "chocolate") +
  theme_minimal() +
  labs(title = "Miles per Gallon (mpg) vs. Weight",
       x = "Weight", 
       y = "mpg")

Example Visualization 7

# Q-Q plot for mpg in the mtcars dataset
ggplot(mtcars, aes(sample = mpg)) + 
  stat_qq(color = "darkorange") +
  geom_qq_line(color = "maroon") +
  theme_minimal() +
  labs(title = "Q-Q Plot for Miles per Gallon (mpg)")

Example Visualization 8

# Error bars for mpg by number of cylinders in the mtcars dataset
ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_bar(stat = "summary", fun = "mean", fill = "turquoise") +
  geom_errorbar(stat = "summary", fun.data = "mean_se", width = 0.2) +
  theme_minimal() +
  labs(title = "Mean mpg by Number of Cylinders", 
       x = "Number of Cylinders", 
       y = "Mean mpg")

Example Visualization 9

# Example data for trend plot
set.seed(42)
example_data <- data.frame(
  x = 1:10,
  y = 2 * (1:10) + rnorm(10, mean = 0, sd = 3),
  se = runif(10, min = 1, max = 3)
)
# Trend plot with error bars
ggplot(example_data, aes(x = x, y = y)) +
  geom_point(color = "skyblue", size = 3) +
  geom_line(color = "skyblue") +
  geom_errorbar(aes(ymin = y - se, ymax = y + se), width = 0.2) +
  theme_minimal()

Example Visualization 10

# Uncertainty - Confidence Bands plot
set.seed(42)
example_data <- data.frame(
  x = 1:10,
  y = 2 * (1:10) + rnorm(10, mean = 0, sd = 3),
  se = runif(10, min = 1, max = 3)
)

ggplot(example_data, aes(x = x, y = y)) +
  geom_point(color = "skyblue", size = 3) +
  geom_smooth(method = "loess", se = TRUE, 
              color = "skyblue", 
              fill = "skyblue", alpha = 0.3) +
  theme_minimal()

Example Visualization 11

# Scatterplot of price vs. carat in the diamonds dataset (stored as p1)
ggplot(data = diamonds, aes(x = carat, y = price)) +
  geom_point(alpha = 0.5) +  
  theme_minimal() +
  labs(x = "Carat", y = "Price", 
       title = "Scatter plot of Diamond Price by Carat Weight") -> p1

Example Visualization 12

# Scatterplot with a linear fit, price on a log10 scale (stored as p2)
ggplot(data = diamonds, aes(x = carat, y = price)) +
  geom_point(alpha = 0.5, color = "green") +
  geom_smooth(method = "lm", color = "black", linewidth = 1, se = FALSE) +
  scale_y_log10() +
  theme_minimal() +
  labs(x = "Carat", y = "Price (log scale)", 
       title = "Scatter plot with Linear Fit of Log-scaled Price by Carat Weight") -> p2

Example Visualization 13

# Scatterplot with a linear fit, both axes on a log10 scale (stored as p3)
ggplot(data = diamonds, aes(x = carat, y = price)) +
  geom_point(alpha = 0.4, color = "red") +
  geom_smooth(method = "lm", color = "black", linewidth = 1, se = FALSE) +
  scale_x_log10() + scale_y_log10() +
  theme_minimal() +
  labs(x = "Carat (log scale)", y = "Price (log scale)", 
       title = "Log-Log Scatter plot with Linear Model Fit of Price by Carat Weight") -> p3

Example Visualization 14

library(ggplot2)
library(patchwork)

# p1 | p2 places the two plots side by side; / stacks p3 beneath them
combined_plot <- (p1 | p2) / p3 

combined_plot + 
  plot_layout(guides = 'collect') + 
  plot_annotation(title = "Layering Geoms with Patchwork")

Group Activity 1

  • Please clone the ca4-yourusername repository from GitHub
  • Please do the problems on the class activity for today
  • Submit to Gradescope on Moodle when done!
