<polite session> https://www.mncorn.org/corn-facts/
    User-agent: data enthusiast
    robots.txt: 2 rules are defined for 1 bots
   Crawl delay: 5 sec
  The path is scrapable for this user-agent
STAT 220
Web scraping is the process of downloading, parsing, and extracting data presented in an HTML file and then converting it into a structured format that allows us to analyze it.
Screen scraping: extract data from the source code of a website, with an HTML parser (easy) or regular expression matching (less easy).
Web APIs (application programming interfaces): the website offers a set of structured HTTP requests that return JSON or XML files.
polite package
polite::bow() and polite::scrape() define and realize a web harvesting session; the scraped page is an HTML document that rvest can work with.
{html_document}
<html class="no-js" lang="en-US">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
[2] <body class="page-template page-template-page-cornfacts page-template-pag ...
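The bow-then-scrape pattern above can be wrapped in a small helper. This is a minimal sketch, assuming the polite package is installed and the machine has network access; `polite_read`, `agent`, and `delay` are hypothetical names chosen for illustration, and the defaults simply mirror the session printed above.

```r
# Hypothetical helper (not part of polite): bow to the host, then scrape.
# polite::bow() reads robots.txt and sets the crawl delay;
# polite::scrape() fetches the page only if the path is allowed.
polite_read <- function(url, agent = "data enthusiast", delay = 5) {
  session <- polite::bow(url, user_agent = agent, delay = delay)
  polite::scrape(session)
}

# Usage (needs network access, so it is left commented out):
# corn <- polite_read("https://www.mncorn.org/corn-facts/")
# corn
```

Keeping the user agent and delay as arguments makes it easy to identify yourself and slow down further when a site asks for it.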
An HTML page consists of a series of elements that browsers use to interpret how to display the content.
While HTML is structured (hierarchical/tree-based), it is often not available in a form useful for analysis (flat/tidy).
Try HTML code yourself by clicking here
HTML uses tags to describe different aspects of document content. rvest is a collection of functions that make basic processing and manipulation of HTML data straightforward.
| Tag | Example | 
|---|---|
| heading | <h1>My Title</h1> | 
| paragraph | <p>A paragraph of content...</p> | 
| table | <table> ... </table> | 
| anchor (with attribute) | <a href="https://www.mncorn.org/">click here for link</a> | 
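To see how the tags in the table form a tree, the sketch below parses a small hand-written HTML string instead of a live page. It assumes rvest 1.0 or later is installed; the HTML snippet and the variable names are made up for illustration.

```r
library(rvest)

# A tiny hand-written page that uses the tags from the table above.
html <- '<html><body>
  <h1>My Title</h1>
  <p>A paragraph of content...</p>
  <a href="https://www.mncorn.org/">click here for link</a>
</body></html>'

doc <- read_html(html)

# The tags nested directly under <body>, in document order.
child_tags <- doc %>% html_element("body") %>% html_children() %>% html_name()

# Text between a tag pair, and the value of an attribute.
title_text <- doc %>% html_element("h1") %>% html_text2()
link_href  <- doc %>% html_element("a") %>% html_attr("href")

child_tags
title_text
link_href
```

The same calls work on a scraped page because scrape() and read_html() both return an HTML document.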
rvest documentation
| Function | Description | 
|---|---|
| read_html | Read HTML data from a url or character string | 
| html_element | Find HTML element using CSS selectors | 
| html_elements | Find HTML elements using CSS selectors | 
| html_node | Select a specified node from HTML document | 
| html_nodes | Select specified nodes from HTML document | 
| html_table | Parse an HTML table into a data frame | 
| html_text | Extract tag pairs’ content | 
| html_name | Extract tags’ names | 
| html_attrs | Extract all of each tag’s attributes | 
| html_attr | Extract tags’ attribute value by name | 
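Several of the functions in the table can be tried on an inline document without touching the network. This is a sketch assuming rvest 1.0 or later; the table values, the `yield` id, and the `ext` class are toy values invented for this example, not data from any site.

```r
library(rvest)

# Toy document: one table and one link (all values are made up).
doc <- read_html('<html><body>
  <table id="yield">
    <tr><th>Year</th><th>Bushels</th></tr>
    <tr><td>2022</td><td>10</td></tr>
    <tr><td>2023</td><td>12</td></tr>
  </table>
  <a class="ext" href="https://www.mncorn.org/">Corn facts</a>
</body></html>')

all_tables  <- html_table(doc)                            # document in: list of tibbles out
yield_table <- html_table(html_element(doc, "#yield"))    # single node in: one tibble out

link       <- html_element(doc, "a")
link_name  <- html_name(link)           # the tag name
link_attrs <- html_attrs(link)          # named character vector of all attributes
link_href  <- html_attr(link, "href")   # one attribute by name
link_text  <- html_text(link)           # content between the tag pair

yield_table
link_attrs
```

Note the difference between html_table() on a whole document, which returns a list, and on a single element, which returns one tibble.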
{html_document}
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-0 vector-feature-appearance-disabled vector-feature-appearance-pinned-clientpref-0 vector-feature-night-mode-disabled skin-theme-clientpref-day vector-toc-not-available" lang="en" dir="ltr">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
[2] <body class="skin-vector skin-vector-search-vue mediawiki ltr sitedir-ltr ...
[[1]]
# A tibble: 6 × 1
  `Formula One`                                                                 
  <chr>                                                                         
1 "Current season\n2024 Formula One World Championship"                         
2 "Related articles\nHistory of Formula One\nFormula One racing\nFormula One re…
3 "Lists\nDrivers (GP winnersSprint winnersPolesittersFastest lapsChampionsNumb…
4 "Records\nDrivers\nConstructors\nEngines\nTyres\nRaces"                       
5 "Organisations\nFIA\nFIA World Motor Sport Council\nFormula One Group\nLibert…
6 ".mw-parser-output .navbar{display:inline;font-size:88%;font-weight:normal}.m…
[[2]]
# A tibble: 3 × 2
  Symbol Meaning                                                                
  <chr>  <chr>                                                                  
1 ~      Currently active world champions(driver competes in 2024 and has won t…
2 *      Currently active drivers(driver competes in 2024 and has not won the W…
3 ^      Former world champions(driver has won the World Drivers' Championship …
[[3]]
# A tibble: 869 × 11
   `Driver name`     Nationality    `Seasons competed` `Drivers' Championships`
   <chr>             <chr>          <chr>              <chr>                   
 1 Carlo Abate       Italy          1962–1963          0                       
 2 George Abecassis  United Kingdom 1951–1952          0                       
 3 Kenny Acheson     United Kingdom 1983, 1985         0                       
 4 Andrea de Adamich Italy          1968, 1970–1973    0                       
 5 Philippe Adams    Belgium        1994               0                       
 6 Walt Ader         United States  1950               0                       
 7 Kurt Adolff       West Germany   1953               0                       
 8 Fred Agabashian   United States  1950–1957          0                       
 9 Kurt Ahrens Jr.   West Germany   1966–1969          0                       
10 Jack Aitken       United Kingdom 2020               0                       
# ℹ 859 more rows
# ℹ 7 more variables: `Race entries` <chr>, `Race starts` <chr>,
#   `Pole positions` <chr>, `Race wins` <chr>, Podiums <chr>,
#   `Fastest laps` <chr>, `Points[a]` <chr>
[[4]]
# A tibble: 42 × 7
   Country    Totaldrivers Champions Championships `Race wins` `First driver(s)`
   <chr>      <chr>        <chr>     <chr>         <chr>       <chr>            
 1 Argentina… 25           1(Fangio… 5(1951, 1954… "38\n(Fang… Juan Manuel Fang…
 2 Australia… 18           2(Brabha… 4(1959, 1960… "43\n(Brab… Tony Gaze(1952 B…
 3 Austriade… 16           2(Rindt,… 4(1970, 1975… "41\n(Rind… Jochen Rindt(196…
 4 Belgiumde… 24           0         0             "11\n(Ickx… Johnny Claes(195…
 5 Brazildet… 32           3(Fittip… 8(1972, 1974… "101\n(Fit… Chico Landi(1951…
 6 Canadadet… 15           1(J. Vil… 1(1997)       "17\n(G. V… Peter Ryan(1961 …
 7 Chile      1            0         0             "0"         Eliseo Salazar(1…
 8 China      1            0         0             "0"         Zhou Guanyu(2022…
 9 Colombiad… 3            0         0             "7\n(Monto… Ricardo Londoño(…
10 Czech Rep… 1            0         0             "0"         Tomáš Enge(2001 …
# ℹ 32 more rows
# ℹ 1 more variable: `Most recent driver(s)/Current driver(s)` <chr>
[[5]]
# A tibble: 1 × 2
  `vteFormula One drivers by country`                     vteFormula One drive…¹
  <chr>                                                   <chr>                 
1 "Argentina\nAustralia\nAustria\nBelgium\nBrazil\nCanad… "Argentina\nAustralia…
# ℹ abbreviated name: ¹`vteFormula One drivers by country`
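library(polite)  # bow(), scrape()
library(rvest)   # html_table(); also provides %>%
bow("https://en.wikipedia.org/wiki/List_of_Formula_One_drivers") %>%
  scrape() %>%
  html_table() %>%
  purrr::pluck(3)  # keep only the third table, the list of drivers
# A tibble: 869 × 11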
   `Driver name`     Nationality    `Seasons competed` `Drivers' Championships`
   <chr>             <chr>          <chr>              <chr>                   
 1 Carlo Abate       Italy          1962–1963          0                       
 2 George Abecassis  United Kingdom 1951–1952          0                       
 3 Kenny Acheson     United Kingdom 1983, 1985         0                       
 4 Andrea de Adamich Italy          1968, 1970–1973    0                       
 5 Philippe Adams    Belgium        1994               0                       
 6 Walt Ader         United States  1950               0                       
 7 Kurt Adolff       West Germany   1953               0                       
 8 Fred Agabashian   United States  1950–1957          0                       
 9 Kurt Ahrens Jr.   West Germany   1966–1969          0                       
10 Jack Aitken       United Kingdom 2020               0                       
# ℹ 859 more rows
# ℹ 7 more variables: `Race entries` <chr>, `Race starts` <chr>,
#   `Pole positions` <chr>, `Race wins` <chr>, Podiums <chr>,
#   `Fastest laps` <chr>, `Points[a]` <chr>
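Because html_table() on a whole page returns a list with one tibble per table, it helps to inspect the list before deciding which position to pluck. The sketch below shows that step on an inline page with two toy tables; it assumes rvest 1.0 or later and purrr are installed, and the driver names and win counts are invented placeholders.

```r
library(rvest)

# Toy page with two tables (placeholder values, not real driver data).
page <- read_html('<html><body>
  <table>
    <tr><th>Symbol</th><th>Meaning</th></tr>
    <tr><td>~</td><td>Currently active world champion</td></tr>
  </table>
  <table>
    <tr><th>Driver</th><th>Wins</th></tr>
    <tr><td>Driver A</td><td>3</td></tr>
    <tr><td>Driver B</td><td>0</td></tr>
  </table>
</body></html>')

all_tables <- html_table(page)

# Rows and columns of each table, to find the one we want.
table_dims <- lapply(all_tables, dim)

# Pull the second table out of the list by position.
drivers <- purrr::pluck(all_tables, 2)

table_dims
drivers
```

Selecting by position is fragile if the page layout changes, so checking the column names after plucking is a cheap safeguard.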
CSS (Cascading Style Sheets) is a language that describes how HTML elements should be displayed.
CSS selectors:
SelectorGadget is a point-and-click tool for finding CSS selectors. It is available as a Chrome extension.
Click here for a list of selectors
Select all elements that are related to the object you want. Next, de-select anything highlighted in yellow that you do not want.
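The most common selector forms (tag, class, id, and descendant) can be compared side by side on an inline page. This is a sketch assuming rvest 1.0 or later; the class names `note` and `highlight` and the table values are made up, while `#div_team_stats` reuses the id style that appears later in these notes.

```r
library(rvest)

# Toy page: a table inside a div with an id, plus three paragraphs.
page <- read_html('<html><body>
  <div id="div_team_stats">
    <table>
      <tr><th>Team</th><th>PF</th></tr>
      <tr><td>MIN</td><td>344</td></tr>
    </table>
  </div>
  <p class="note">First note</p>
  <p class="note highlight">Second note</p>
  <p>Plain paragraph</p>
</body></html>')

by_tag   <- page %>% html_elements("p") %>% html_text2()            # every <p>
by_class <- page %>% html_elements(".note") %>% html_text2()        # class "note"
by_both  <- page %>% html_elements("p.highlight") %>% html_text2()  # <p> with class "highlight"

# "#id tag" selects a <table> nested anywhere inside the element with that id.
by_id <- page %>% html_element("#div_team_stats table") %>% html_table()

by_tag
by_class
by_both
by_id
```

A selector copied from SelectorGadget is passed to html_elements() in exactly the same way.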
MinnesotaVikings <- bow("https://www.pro-football-reference.com/teams/min/2023.htm") %>% 
  scrape()
MinnesotaVikings
{html_document}
<html data-version="klecko-" data-root="/home/pfr/build" lang="en" class="no-js">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
[2] <body class="pfr">\n<div id="wrap">\n  \n  <div id="header" role="banner" ...
Team_Stats <- MinnesotaVikings %>% 
  html_elements("#div_team_stats") %>% 
  html_table() %>% .[[1]]  # same as purrr::pluck(1)
library(magrittr)  # %<>%
library(dplyr)     # slice(), mutate(), across(), na_if()
Team_Stats %<>% # %<>% Allows for direct assignment within the pipe
  set_names(.[1, ]) %>% # Set column names to the first row
  janitor::clean_names() %>% # Clean names
  slice(-1) %>% # Remove the first row
  mutate(across(everything(), ~na_if(.x, ""))) %>% # Convert empty strings to NA
  type.convert(as.is = TRUE) # Convert columns to their most appropriate type
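The cleaning steps above can be checked on a small stand-in table where the real header has landed in the first data row. This is a sketch assuming dplyr, tibble, and janitor are installed; the `raw` tibble is a toy built from a few values in the table shown in these notes, with one cell blanked on purpose. It uses base `setNames()` and an explicit `header` vector rather than the `%<>%` pipeline, which is a substitution for clarity.

```r
library(dplyr)

# Toy stand-in for a scraped table: the true header sits in row 1.
raw <- tibble::tibble(
  X1 = c("Player", "Team Stats", "Opp. Stats"),
  X2 = c("PF", "344", "362"),
  X3 = c("Y/P", "5.5", "")   # empty string blanked on purpose for this example
)

# First row as a plain character vector of column names.
header <- unlist(raw[1, ], use.names = FALSE)

cleaned <- raw %>%
  setNames(header) %>%                             # promote the first row to names
  janitor::clean_names() %>%                       # lower-case, syntactic names
  slice(-1) %>%                                    # drop the row that held the header
  mutate(across(everything(), ~ na_if(.x, ""))) %>%  # empty strings become NA
  type.convert(as.is = TRUE)                       # guess integer/double/character

cleaned
```

Running na_if() before type.convert() matters: all columns are still character at that point, so the comparison with "" is type-safe.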
Team_Stats %>% knitr::kable(caption = "Scraped data for various team stats")
| player | pf | yds | ply | y_p | to | fl | x1st_d | cmp | att | yds_2 | td | int | ny_a | x1st_d_2 | att_2 | yds_3 | td_2 | y_a | x1st_d_3 | pen | yds_4 | x1st_py | number_dr | sc_percent | to_percent | start | time | plays | yds_5 | pts | 
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Team Stats | 344 | 5912 | 1071 | 5.5 | 34 | 15 | 340 | 424 | 631 | 4359 | 30 | 19 | 6.4 | 220 | 393 | 1553 | 7 | 4.0 | 79 | 89 | 670 | 41 | 179 | 34.1 | 18.4 | Own 26.6 | 2:47 | 6.15 | 32.9 | 1.85 | 
| Opp. Stats | 362 | 5664 | 1095 | 5.2 | 22 | 11 | 328 | 426 | 606 | 3986 | 23 | 11 | 6.1 | 208 | 446 | 1678 | 14 | 3.8 | 104 | 111 | 916 | 16 | 186 | 37.1 | 11.3 | Own 30.1 | 2:51 | 6.10 | 30.5 | 1.91 | 
| Lg Rank Offense | 22 | 10 | NA | NA | 31 | 30 | 10 | NA | 4 | 5 | 4 | 29 | 10.0 | NA | 28 | 29 | 30 | 24.0 | NA | NA | NA | NA | NA | 22.0 | 1.0 | 29 | 13 | 9.00 | 10.0 | 18.00 | 
| Lg Rank Defense | 13 | 16 | NA | NA | 19 | 8 | 16 | NA | 26 | 24 | 17 | 19 | 15.0 | NA | 13 | 8 | 11 | 4.0 | NA | NA | NA | NA | NA | 13.0 | 19.0 | 28 | 26 | 24.00 | 18.0 | 18.00 | 

ca16-yourusername repository from GitHub