R and R Markdown Basics

STAT 220

Bastola

Reproducible data science

What Does Reproducibility Mean in Data Science?

Short-term goals

  • Are the tables and figures reproducible from the code and data?
  • Does the code work as intended?
  • In addition to what was done, is it clear why it was done? (e.g., how were parameter settings chosen?)

Long-term goals

  • Can the code be used for other data?
  • Can you extend the code to do other things?

Toolkit for reproducibility

  • Scriptability → R
  • Literate programming (code, narrative, output in one place) → R Markdown (in Posit's RStudio)
  • Version control → Git / GitHub

Tour: R Markdown (.Rmd content)


Tour: R Markdown (rendered HTML)


Text

Simple rules for:

  • section headers (#,##,etc)
  • lists (need ~2 tabs to create sublists)
  • formatting (bold **, italics *)
  • tables
  • inline R code (wrap in backticks `)
  • web links [linked text](url)
  • LaTeX math equations $\beta_1 + \beta_2$
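
The rules above can be tried out by writing a small .Rmd file from R. This is only a sketch: the file contents and the names `rmd_lines` and `rmd_path` are made up for illustration, and a four-space indent is assumed for the sublist.

```r
# A minimal sketch: build a tiny R Markdown document as a character vector.
# All content below is illustrative, not taken from the course files.
fence <- strrep("`", 3)  # three backticks, built here so the example stays readable

rmd_lines <- c(
  "# A level-one header",
  "## A level-two header",
  "",
  "Some **bold** text, some *italic* text, and inline code like `mean(x)`.",
  "",
  "- a list item",
  "    - a sublist item (indented four spaces)",
  "",
  "[linked text](https://www.r-project.org)",
  "",
  "Inline math: $\\beta_1 + \\beta_2$",
  "",
  paste0(fence, "{r}"),
  "1 + 1",
  fence
)

# Write the document to a temporary file and confirm it exists
rmd_path <- tempfile(fileext = ".Rmd")
writeLines(rmd_lines, rmd_path)
file.exists(rmd_path)
```

Opening the temporary file in RStudio and knitting it should show each rule in action.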

Code chunks, defined by three backticks

```{r}
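library(babynames)
library(dplyr)    # provides %>% and filter()
library(ggplot2)  # provides ggplot()

# Keep rows for "Aimee", excluding the earliest and latest years in babynames
filtered_names <- babynames %>%
  filter(name == "Aimee", year < max(year), year > min(year))

ggplot(data = filtered_names, aes(x = year, y = prop)) +
  geom_line(aes(colour = sex)) +
  labs(x = 'Year',
       y = 'Prop. of Babies Named Aimee')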
```

Adding chunks

Add chunks with button or:

  • Command (or Cmd) + Option (or Alt) + i (Mac)
  • Ctrl + Alt + i (Windows/Linux)

Running chunks

Run chunks by:

  • Run current chunk button (interactive)
  • Knit button / run all chunks
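
Knitting can also be started from the console. A minimal sketch, assuming the rmarkdown package and pandoc are installed; the file written here is a throwaway example, not a course file.

```r
# Write a tiny throwaway .Rmd file (illustrative content only)
fence <- strrep("`", 3)
rmd_path <- tempfile(fileext = ".Rmd")
writeLines(c("# Demo", "", paste0(fence, "{r}"), "1 + 1", fence), rmd_path)

# Render only if rmarkdown and pandoc are available on this machine
can_knit <- requireNamespace("rmarkdown", quietly = TRUE) &&
  rmarkdown::pandoc_available()

if (can_knit) {
  # Runs every chunk in a fresh session, similar in spirit to the Knit button
  html_path <- rmarkdown::render(rmd_path, output_format = "html_document",
                                 quiet = TRUE)
  file.exists(html_path)
} else {
  message("rmarkdown or pandoc not found; skipping the render step")
}
```

Because knitting starts from a clean session, it is a good check that the document does not depend on objects left over in the interactive workspace.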

Inline code

How many babies were born with the name ‘Aimee’?

`r filtered_names %>% summarise(total = sum(n)) %>% pull(total)`

There were 53476 babies in total.


In what year was the proportion of babies named Aimee highest?

`r filtered_names %>% filter(prop == max(prop)) %>% pull(year)`

The name Aimee was most popular in 1973.
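
Inline code reads best when the expression returns a single value. A small base R sketch of that idea; `toy_names` and its counts are hypothetical, made up only for illustration.

```r
# Hypothetical toy data; the counts are made up for illustration
toy_names <- data.frame(
  year = c(1971, 1972, 1973, 1974),
  n    = c(1000, 1500, 2000, 1200),
  prop = c(0.0010, 0.0015, 0.0020, 0.0012)
)

# Each expression returns one value, so it can sit inside `r ...` in the text
total_babies <- sum(toy_names$n)
peak_year    <- toy_names$year[which.max(toy_names$prop)]

# format() adds a thousands separator for nicer inline output
format(total_babies, big.mark = ",")  # "5,700"
peak_year                             # 1973
```

Computing the value in a chunk first and then referring to the stored object inline keeps the sentence itself short.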

Chunk labels

```{r peek, echo = FALSE, results = "hide"}
glimpse(filtered_names)
```
  • Include the chunk label immediately after the language identifier within curly braces → {r label}
    • Warning: Do not duplicate chunk labels
  • Configure options after the label, separated by commas, e.g. → echo = FALSE
```{r peek2, echo = TRUE, results = "show"}
glimpse(filtered_names)
```
glimpse(filtered_names)
Rows: 150
Columns: 5
$ year <dbl> 1880, 1881, 1882, 1883, 1884, 1885, 1886, 1887, 1888, 1889, 1890,…
$ sex  <chr> "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", …
$ name <chr> "Aimee", "Aimee", "Aimee", "Aimee", "Aimee", "Aimee", "Aimee", "A…
$ n    <int> 13, 11, 13, 11, 15, 17, 17, 18, 12, 16, 18, 14, 15, 17, 13, 13, 2…
$ prop <dbl> 0.00013319, 0.00011127, 0.00011236, 0.00009162, 0.00010902, 0.000…

Chunk options: echo and eval

```{r echo = TRUE, eval = FALSE}
glimpse(filtered_names)
```
glimpse(filtered_names)
```{r echo = TRUE, eval = TRUE}
glimpse(filtered_names)
```
glimpse(filtered_names)
Rows: 150
Columns: 5
$ year <dbl> 1880, 1881, 1882, 1883, 1884, 1885, 1886, 1887, 1888, 1889, 1890,…
$ sex  <chr> "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", …
$ name <chr> "Aimee", "Aimee", "Aimee", "Aimee", "Aimee", "Aimee", "Aimee", "A…
$ n    <int> 13, 11, 13, 11, 15, 17, 17, 18, 12, 16, 18, 14, 15, 17, 13, 13, 2…
$ prop <dbl> 0.00013319, 0.00011127, 0.00011236, 0.00009162, 0.00010902, 0.000…

Saving images using chunk options

```{r plot1, fig.path="img/"}
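library(babynames)
library(dplyr)    # provides %>% and filter()
library(ggplot2)  # provides ggplot()

your_name <- "Dee"
your_name_data <- babynames %>% filter(name == your_name)

# Points plus one line per sex; the figure is saved under img/ via fig.path
ggplot(data = your_name_data, aes(x = year, y = prop)) +
  geom_point(size = 3, alpha = 0.6) +
  geom_line(aes(colour = sex), linewidth = 1) +  # linewidth replaces size for lines in ggplot2 >= 3.4.0
  scale_color_brewer(palette = "Set1") +
  labs(x = 'Year',
       y = stringr::str_c('Prop. of Babies Named ', your_name),
       title = stringr::str_c('Trends in Names: ', your_name))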
```

Loading saved images

knitr::include_graphics("img/plot1-1.png")
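
The file name above follows knitr's naming pattern: fig.path, then the chunk label, then a figure number and extension. A minimal sketch, assuming the default PNG device; the variable names are illustrative.

```r
# knitr names figure files as: fig.path + chunk label + "-" + figure number + extension
fig_path    <- "img/"   # value of the fig.path chunk option
chunk_label <- "plot1"  # chunk label
fig_file    <- paste0(fig_path, chunk_label, "-1.png")
fig_file                # "img/plot1-1.png"

# Only include the image if an earlier knit actually wrote it
if (file.exists(fig_file) && requireNamespace("knitr", quietly = TRUE)) {
  knitr::include_graphics(fig_file)
} else {
  message("No saved figure at ", fig_file, " yet; knit the plotting chunk first")
}
```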

Summary

| Chunk option | Outcome |
|---|---|
| echo = FALSE | The code is not included in the final document. |
| include = FALSE | Neither the code nor its results appear in the document. However, the code executes, and results can be used later. |
| message = FALSE | Any messages produced by the code are not shown in the document. |
| warning = FALSE | Any warnings generated by the code are omitted from the document. |
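
The options summarized above can also be set once for the whole document, usually in a first setup chunk. A minimal sketch, assuming knitr is installed; the particular defaults chosen here are only an example.

```r
# Set defaults for every chunk, typically in a first "setup" chunk.
# Guarded so the example still runs where knitr is not installed.
if (requireNamespace("knitr", quietly = TRUE)) {
  knitr::opts_chunk$set(
    echo    = TRUE,   # show code
    message = FALSE,  # hide messages
    warning = FALSE   # hide warnings
  )
  # Inspect one of the defaults now in effect
  knitr::opts_chunk$get("message")
}
```

Options written in an individual chunk header still override these defaults for that chunk.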

 Group Activity 1

  • Please clone the ca2-yourusername repository from GitHub
  • Do the class activity in your group
  • You may find the class helper web page useful
  • Regularly push your changes back to GitHub


Variables

Variables are used to store data, figures, model output, etc.

Assign just one value

x <- 5
x
[1] 5

Assign a vector of values

a <- 3:10
a
[1]  3  4  5  6  7  8  9 10

Combine numbers into a vector

b <- c(5, 12, 2, 100, 8)
b
[1]   5  12   2 100   8

Combine character strings into a vector

names <- c("Amy", "Dee", "Lux")
names
[1] "Amy" "Dee" "Lux"
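
Once values are stored, they can be inspected and reused. A short sketch using the same values as above; `first_names` is a name chosen here to avoid masking the base function names().

```r
x <- 5
a <- 3:10
b <- c(5, 12, 2, 100, 8)
first_names <- c("Amy", "Dee", "Lux")

length(a)           # 8 elements
class(b)            # "numeric"
class(first_names)  # "character"
b[2]                # second element: 12
a * x               # arithmetic is applied element by element
```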

A few things to remember

  • Do not use special characters such as $ or %. Common symbols that are used in variable names include . or _.
  • Remember that R is case sensitive.
  • To assign values to objects, we use the assignment operator <-. We recommend using <- for assignment and = for arguments within function calls.
  • The # symbol is used for commenting and demarcation. Any code following # will not be executed.
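
The points above in a short sketch; all object names here are made up for illustration.

```r
# Names may contain letters, digits, . and _
my_value <- 10
my.value <- 20

# R is case sensitive: these are two different objects
total <- 1
Total <- 2
total == Total  # FALSE

# Use <- for assignment and = for function arguments
rounded <- round(3.14159, digits = 2)
rounded         # 3.14

# Everything after # on a line is ignored
y <- 1 + 1  # + 100 is never evaluated
y           # 2
```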

R Objects

  • Vectors and data frames are examples of objects in R.
  • There are other types of R objects to store data, such as matrices, arrays, and lists.
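
A quick sketch of how each object type can be created in base R; the values (including the scores) are made up for illustration.

```r
v   <- c(1.5, 2.5, 3.5)                    # vector: values of one type
m   <- matrix(1:6, nrow = 2, ncol = 3)     # matrix: two dimensions, one type
arr <- array(1:24, dim = c(2, 3, 4))       # array: any number of dimensions
lst <- list(id = 1, tags = c("a", "b"))    # list: elements may differ in type and length
df  <- data.frame(name  = c("Amy", "Dee", "Lux"),
                  score = c(90, 85, 77))   # data frame: columns of equal length

dim(m)      # 2 3
dim(arr)    # 2 3 4
class(lst)  # "list"
nrow(df)    # 3
str(df)     # compact overview of the columns and their types
```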

 Group Activity 2

  • Please do Problem 2 on the class activity for today
  • Submit to Gradescope on Moodle when done!
