STAT 220
Data visualization is a key skill for data scientists. Useful for:
A powerful package for visualizing data, widely used in academia and industry alike. Some useful resources:
ggplot2
community
Essentials
Additional elements
Specify the variables in the data frame we will be using and what role they play. Use the function aes() within the ggplot() function, after the data frame.
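As a minimal sketch of this mapping step, the snippet below uses the built-in mtcars data frame (an illustrative choice, not part of the original slides) and maps two of its columns to the x and y axes. With no geometric object added yet, printing the plot should show only empty axes.

```r
library(ggplot2)

# Assumption: mtcars is used purely for illustration.
# Map car weight to the x-axis and fuel economy to the y-axis.
p <- ggplot(mtcars, aes(x = wt, y = mpg))

# No geom layer has been added, so this draws the axes without any data.
p
```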
Beyond the x and y axes, you can add more aesthetics that represent further dimensions of the data in the two-dimensional graphic plane, such as shape, linewidth, color, fill, and alpha, to name a few.
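A short sketch of extra aesthetics, again assuming the mtcars data frame for illustration: color and shape are mapped to variables inside aes(), while size and alpha are set to constants outside it.

```r
library(ggplot2)

# Assumption: mtcars and the chosen columns are illustrative only.
p <- ggplot(mtcars, aes(x = wt, y = mpg,
                        color = factor(cyl),   # third dimension: number of cylinders
                        shape = factor(am))) + # fourth dimension: transmission type
  geom_point(size = 3, alpha = 0.8)            # constants outside aes() apply to every point

p
```

Values placed inside aes() vary with the data and get a legend; values placed outside aes() are fixed for the whole layer.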
The third layer required to create our plot (which determines the specific kind of visualization, such as a bar plot or scatter plot) involves adding a geometric object.
To do this, we append a plus (+) at the end of the initial line and specify the desired geometric object type, like geom_point() for a scatter plot or geom_bar() for a bar plot.
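The two geoms mentioned here can be sketched as follows, assuming the mtcars data frame as an example. Note that geom_bar() counts rows per category by default, so it needs only an x aesthetic.

```r
library(ggplot2)

# Assumption: mtcars is an illustrative dataset.

# Scatter plot: one point per car.
scatter <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# Bar plot: geom_bar() counts the rows in each cylinder category.
bars <- ggplot(mtcars, aes(x = factor(cyl))) +
  geom_bar()

scatter
bars
```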
At this stage, our plot might require a few finishing touches. We might want to adjust the axis names or remove the default gray background. To accomplish this, we add another layer, preceded by a plus sign (+).
To modify the axis names, we can use the labs() function. Additionally, we can apply one of the pre-defined themes, such as theme_minimal().
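A minimal sketch of these finishing touches, assuming the mtcars scatter plot as a starting point (the label text is illustrative):

```r
library(ggplot2)

# Assumption: mtcars and the label wording are illustrative.
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(x = "Weight (1000 lbs)",         # rename the x-axis
       y = "Miles per gallon",          # rename the y-axis
       title = "Fuel economy by car weight") +
  theme_minimal()                       # replaces the default gray background

p
```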
ggplot(data) +                           # data
  <geometry_funs>(aes(<variables>)) +    # aesthetic variable mapping
  <label_funs> +                         # add context
  <facet_funs> +                         # add facets (optional)
  <coordinate_funs> +                    # play with coords (optional)
  <scale_funs> +                         # play with scales (optional)
  <theme_funs>                           # play with axes, colors, etc (optional)
See the ggplot2 cheatsheets for more details
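One possible way to fill in the template above is sketched here. It assumes the mpg dataset that ships with ggplot2; the particular facet, coordinate, scale, and theme functions are illustrative picks, not the only options.

```r
library(ggplot2)

# Assumption: the mpg dataset and every function choice below are illustrative.
p <- ggplot(mpg) +                                      # data
  geom_point(aes(x = displ, y = hwy, color = drv)) +    # geometry + aesthetic mapping
  labs(x = "Engine displacement (L)",
       y = "Highway mpg",
       color = "Drive") +                               # add context
  facet_wrap(~ class) +                                 # one panel per vehicle class
  coord_cartesian(ylim = c(10, 45)) +                   # zoom the y-axis without dropping data
  scale_color_brewer(palette = "Dark2") +               # swap the default color scale
  theme_minimal()                                       # lighter theme

p
```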
# Load ggplot2 (needed for all plots that follow)
library(ggplot2)

# Mean mpg with standard-error bars, by number of cylinders (mtcars dataset)
ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_bar(stat = "summary", fun = "mean", fill = "turquoise") +        # bar height = group mean
  geom_errorbar(stat = "summary", fun.data = "mean_se", width = 0.2) +  # mean +/- 1 standard error
  theme_minimal() +
  labs(title = "Mean mpg by Number of Cylinders",
       x = "Number of Cylinders",
       y = "Mean mpg")
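To see what the error bars above represent, the mean_se() helper from ggplot2 can be called directly on one group. This is a small sketch; the choice of the 4-cylinder group is only an example.

```r
library(ggplot2)

# mpg values for the 4-cylinder cars (illustrative subset).
mpg_4cyl <- mtcars$mpg[mtcars$cyl == 4]

# mean_se() returns a one-row data frame with columns y, ymin, and ymax:
# the mean, and the mean minus/plus one standard error.
s <- mean_se(mpg_4cyl)
s

# The same standard error computed by hand, for comparison.
se_by_hand <- sd(mpg_4cyl) / sqrt(length(mpg_4cyl))
se_by_hand
```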
# Example data for trend plot
set.seed(42)
example_data <- data.frame(
  x = 1:10,
  y = 2 * (1:10) + rnorm(10, mean = 0, sd = 3),
  se = runif(10, min = 1, max = 3)
)
# Trend plot with error bars
ggplot(example_data, aes(x = x, y = y)) +
  geom_point(color = "skyblue", size = 3) +  # points take size, not linewidth
  geom_line(color = "skyblue") +
  geom_errorbar(aes(ymin = y - se, ymax = y + se), width = 0.2) +
  theme_minimal()
# Uncertainty - Confidence Bands plot
set.seed(42)
example_data <- data.frame(
  x = 1:10,
  y = 2 * (1:10) + rnorm(10, mean = 0, sd = 3),
  se = runif(10, min = 1, max = 3)
)
ggplot(example_data, aes(x = x, y = y)) +
  geom_point(color = "skyblue", size = 3) +  # points take size, not linewidth
  geom_smooth(method = "loess", se = TRUE,   # se = TRUE draws the confidence band
              color = "skyblue",
              fill = "skyblue", alpha = 0.3) +
  theme_minimal()
# Scatter plot with a linear fit, price on a log scale
p2 <- ggplot(data = diamonds, aes(x = carat, y = price)) +
  geom_point(alpha = 0.5, color = "green") +
  geom_smooth(method = "lm", color = "black", linewidth = 1, se = FALSE) +  # lines take linewidth (ggplot2 >= 3.4.0)
  scale_y_log10() +
  theme_minimal() +
  labs(x = "Carat", y = "Price (log scale)",
       title = "Scatter plot with Linear Fit of Log-scaled Price by Carat Weight")
# Log-log scatter plot with a linear fit
p3 <- ggplot(data = diamonds, aes(x = carat, y = price)) +
  geom_point(alpha = 0.4, color = "red") +
  geom_smooth(method = "lm", color = "black", linewidth = 1, se = FALSE) +  # lines take linewidth (ggplot2 >= 3.4.0)
  scale_x_log10() + scale_y_log10() +
  theme_minimal() +
  labs(x = "Carat (log scale)", y = "Price (log scale)",
       title = "Log-Log Scatter plot with Linear Model Fit of Price by Carat Weight")
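Because scale transformations are applied before the smoothing statistic is computed, the straight line in the log-log plot should correspond to a linear model on the log10-transformed variables. The sketch below fits that model directly with base R's lm(); it is meant as a way to inspect the fit, not as part of the original slides.

```r
library(ggplot2)  # provides the diamonds dataset

# Linear model on log10-transformed price and carat,
# matching what geom_smooth(method = "lm") fits on log-log axes.
fit <- lm(log10(price) ~ log10(carat), data = diamonds)

# Intercept and slope on the log10 scale; the slope describes how
# log10(price) changes per unit change in log10(carat).
coef(fit)
```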