<polite session> https://www.mncorn.org/corn-facts/
User-agent: data enthusiast
robots.txt: 2 rules are defined for 1 bots
Crawl delay: 5 sec
The path is scrapable for this user-agent
STAT 220
Web scraping is the process of downloading, parsing, and extracting data presented in an HTML file and then converting it into a structured format that we can analyze.
Screen scraping: extract data from the source code of a website, with an HTML parser (easy) or regular expression matching (less easy).
Web APIs (application programming interfaces): the website offers a set of structured HTTP requests that return JSON or XML files.
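The three routes above can be sketched on a tiny inlined example. The page source, the JSON reply, and the value 185 are made up for illustration (they are not taken from a real site), and the sketch assumes the rvest and jsonlite packages are installed.

```r
library(rvest)

# Hypothetical page source, inlined so the example runs without a network call
page_source <- '<html><body><p class="yield">Corn yield: <b>185</b> bushels</p></body></html>'

# Screen scraping, option 1: parse the HTML, then select the element
yield_parsed <- read_html(page_source) %>%
  html_element("p.yield b") %>%
  html_text() %>%
  as.numeric()

# Screen scraping, option 2: a regular expression on the raw text
# (works here, but breaks easily when the markup changes)
yield_regex <- as.numeric(sub(".*<b>([0-9]+)</b>.*", "\\1", page_source))

# Web API: the server would return structured JSON, so no HTML parsing is needed
# (hypothetical reply, inlined as a string)
api_reply <- jsonlite::fromJSON('{"crop": "corn", "yield": 185}')
yield_api <- api_reply$yield
```

All three approaches recover the same number; the parser version tends to be the most robust to small changes in page layout.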
The polite package uses bow() and scrape() to define and realize a web harvesting session, and it works together with rvest.
polite::bow()
polite::scrape()
{html_document}
<html class="no-js" lang="en-US">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
[2] <body class="page-template page-template-page-cornfacts page-template-pag ...
An HTML page consists of a series of elements, which browsers use to interpret how to display the content.
While HTML is structured (hierarchical/tree based), it is often not available in a form useful for analysis (flat/tidy).
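As a rough illustration of going from the tree to a flat table, the sketch below walks a small inlined HTML fragment and collects its links into a data frame. The fragment and its paths (/corn, /soy) are hypothetical.

```r
library(rvest)

# Hypothetical HTML fragment: a list nested inside the body
page <- read_html('
<html><body>
  <ul>
    <li><a href="/corn">Corn</a></li>
    <li><a href="/soy">Soybeans</a></li>
  </ul>
</body></html>')

# Hierarchical view: the only element directly under <body> is the <ul>
body_children <- page %>%
  html_element("body") %>%
  html_children() %>%
  html_name()

# Flat view: one row per link, one column per piece of information
items <- page %>% html_elements("li a")
flat <- data.frame(
  label = html_text(items),
  href  = html_attr(items, "href")
)
```

The flat data frame has one row per list item, which is the shape most analysis tools expect.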
HTML uses tags to describe different aspects of document content. rvest is a collection of functions that make basic processing and manipulation of HTML data straightforward.
Tag | Example |
---|---|
heading | <h1>My Title</h1> |
paragraph | <p>A paragraph of content...</p> |
table | <table> ... </table> |
anchor (with attribute) | <a href="https://www.mncorn.org/">click here for link</a> |
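A minimal sketch of reading the tags from the table with rvest, assuming the package is installed. The HTML is inlined as a string so the example does not depend on any website.

```r
library(rvest)

# The example tags from the table, wrapped in a small document
html_snippet <- '
<html><body>
  <h1>My Title</h1>
  <p>A paragraph of content...</p>
  <a href="https://www.mncorn.org/">click here for link</a>
</body></html>'

page <- read_html(html_snippet)

# Text between an opening and a closing tag
title     <- page %>% html_element("h1") %>% html_text()
paragraph <- page %>% html_element("p") %>% html_text()

# An anchor has both text and an attribute (href)
link_text <- page %>% html_element("a") %>% html_text()
link_url  <- page %>% html_element("a") %>% html_attr("href")
```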
Rvest documentation
Function | Description |
---|---|
read_html | Read HTML data from a url or character string |
html_element | Find HTML element using CSS selectors |
html_elements | Find HTML elements using CSS selectors |
html_node | Select a specified node from HTML document |
html_nodes | Select specified nodes from HTML document |
html_table | Parse an HTML table into a data frame |
html_text | Extract tag pairs’ content |
html_name | Extract tags’ names |
html_attrs | Extract all of each tag’s attributes |
html_attr | Extract tags’ attribute value by name |
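The sketch below tries several of these functions on a small inlined document. The table contents, the link URLs, and the id and class names are placeholders invented for the example.

```r
library(rvest)

# Hypothetical document with one table and two links
page <- read_html('
<html><body>
  <table id="drivers">
    <tr><th>Driver</th><th>Wins</th></tr>
    <tr><td>Driver A</td><td>3</td></tr>
    <tr><td>Driver B</td><td>0</td></tr>
  </table>
  <a href="https://example.com/a" class="ref">first</a>
  <a href="https://example.com/b" class="ref">second</a>
</body></html>')

# html_table() returns a list with one tibble per <table>
tables  <- page %>% html_elements("table") %>% html_table()
drivers <- tables[[1]]

# html_elements() returns every match, html_element() only the first
links      <- page %>% html_elements("a")
first_link <- page %>% html_element("a") %>% html_text()

link_names <- html_name(links)          # tag names
link_hrefs <- html_attr(links, "href")  # one attribute, by name
link_attrs <- html_attrs(links)         # all attributes, as a list
```

Because `html_table()` always returns a list when given several nodes, a single table still has to be pulled out with `[[1]]` or `purrr::pluck(1)`.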
{html_document}
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-0 vector-feature-appearance-disabled vector-feature-appearance-pinned-clientpref-0 vector-feature-night-mode-disabled skin-theme-clientpref-day vector-toc-not-available" lang="en" dir="ltr">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
[2] <body class="skin-vector skin-vector-search-vue mediawiki ltr sitedir-ltr ...
[[1]]
# A tibble: 6 × 1
`Formula One`
<chr>
1 "Current season\n2024 Formula One World Championship"
2 "Related articles\nHistory of Formula One\nFormula One racing\nFormula One re…
3 "Lists\nDrivers (GP winnersSprint winnersPolesittersFastest lapsChampionsNumb…
4 "Records\nDrivers\nConstructors\nEngines\nTyres\nRaces"
5 "Organisations\nFIA\nFIA World Motor Sport Council\nFormula One Group\nLibert…
6 ".mw-parser-output .navbar{display:inline;font-size:88%;font-weight:normal}.m…
[[2]]
# A tibble: 3 × 2
Symbol Meaning
<chr> <chr>
1 ~ Currently active world champions(driver competes in 2024 and has won t…
2 * Currently active drivers(driver competes in 2024 and has not won the W…
3 ^ Former world champions(driver has won the World Drivers' Championship …
[[3]]
# A tibble: 869 × 11
`Driver name` Nationality `Seasons competed` `Drivers' Championships`
<chr> <chr> <chr> <chr>
1 Carlo Abate Italy 1962–1963 0
2 George Abecassis United Kingdom 1951–1952 0
3 Kenny Acheson United Kingdom 1983, 1985 0
4 Andrea de Adamich Italy 1968, 1970–1973 0
5 Philippe Adams Belgium 1994 0
6 Walt Ader United States 1950 0
7 Kurt Adolff West Germany 1953 0
8 Fred Agabashian United States 1950–1957 0
9 Kurt Ahrens Jr. West Germany 1966–1969 0
10 Jack Aitken United Kingdom 2020 0
# ℹ 859 more rows
# ℹ 7 more variables: `Race entries` <chr>, `Race starts` <chr>,
# `Pole positions` <chr>, `Race wins` <chr>, Podiums <chr>,
# `Fastest laps` <chr>, `Points[a]` <chr>
[[4]]
# A tibble: 42 × 7
Country Totaldrivers Champions Championships `Race wins` `First driver(s)`
<chr> <chr> <chr> <chr> <chr> <chr>
1 Argentina… 25 1(Fangio… 5(1951, 1954… "38\n(Fang… Juan Manuel Fang…
2 Australia… 18 2(Brabha… 4(1959, 1960… "43\n(Brab… Tony Gaze(1952 B…
3 Austriade… 16 2(Rindt,… 4(1970, 1975… "41\n(Rind… Jochen Rindt(196…
4 Belgiumde… 24 0 0 "11\n(Ickx… Johnny Claes(195…
5 Brazildet… 32 3(Fittip… 8(1972, 1974… "101\n(Fit… Chico Landi(1951…
6 Canadadet… 15 1(J. Vil… 1(1997) "17\n(G. V… Peter Ryan(1961 …
7 Chile 1 0 0 "0" Eliseo Salazar(1…
8 China 1 0 0 "0" Zhou Guanyu(2022…
9 Colombiad… 3 0 0 "7\n(Monto… Ricardo Londoño(…
10 Czech Rep… 1 0 0 "0" Tomáš Enge(2001 …
# ℹ 32 more rows
# ℹ 1 more variable: `Most recent driver(s)/Current driver(s)` <chr>
[[5]]
# A tibble: 1 × 2
`vteFormula One drivers by country` vteFormula One drive…¹
<chr> <chr>
1 "Argentina\nAustralia\nAustria\nBelgium\nBrazil\nCanad… "Argentina\nAustralia…
# ℹ abbreviated name: ¹`vteFormula One drivers by country`
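library(polite) # bow() and scrape()
library(rvest) # html_table() and the %>% pipe
bow("https://en.wikipedia.org/wiki/List_of_Formula_One_drivers") %>% # introduce ourselves and check robots.txt
scrape() %>% # download and parse the page
html_table() %>% # parse every table on the page into a list of tibbles
purrr::pluck(3) # keep the third table, the list of drivers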
# A tibble: 869 × 11
`Driver name` Nationality `Seasons competed` `Drivers' Championships`
<chr> <chr> <chr> <chr>
1 Carlo Abate Italy 1962–1963 0
2 George Abecassis United Kingdom 1951–1952 0
3 Kenny Acheson United Kingdom 1983, 1985 0
4 Andrea de Adamich Italy 1968, 1970–1973 0
5 Philippe Adams Belgium 1994 0
6 Walt Ader United States 1950 0
7 Kurt Adolff West Germany 1953 0
8 Fred Agabashian United States 1950–1957 0
9 Kurt Ahrens Jr. West Germany 1966–1969 0
10 Jack Aitken United Kingdom 2020 0
# ℹ 859 more rows
# ℹ 7 more variables: `Race entries` <chr>, `Race starts` <chr>,
# `Pole positions` <chr>, `Race wins` <chr>, Podiums <chr>,
# `Fastest laps` <chr>, `Points[a]` <chr>
CSS (Cascading Style Sheets) is a language that describes how HTML elements should be displayed.
CSS selectors:
SelectorGadget is a point-and-click tool for finding CSS selectors. It is built for Chrome and is available as a Chrome extension.
In SelectorGadget, click an element on the page to select all elements related to it. Next, click anything highlighted in yellow that you do not want, to de-select it.
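Once a selector has been found, it can be passed to `html_elements()`. The sketch below shows the most common selector forms (tag, class, id, and descendant) on an inlined fragment. The ids and class names are hypothetical; the two numbers are the Vikings' points and yards from the team stats table in these notes.

```r
library(rvest)

# Hypothetical fragment with ids and classes to select on
page <- read_html('
<html><body>
  <div id="team_stats">
    <p class="stat">Points: 344</p>
    <p class="stat">Yards: 5912</p>
  </div>
  <div id="other"><p class="note">Not a stat</p></div>
</body></html>')

# Tag selector: every <p> element
by_tag <- page %>% html_elements("p") %>% html_text()

# Class selector: elements with class="stat"
by_class <- page %>% html_elements(".stat") %>% html_text()

# Id selector: the single element with id="team_stats"
by_id <- page %>% html_elements("#team_stats") %>% html_name()

# Descendant selector: <p> elements inside the element with id="other"
nested <- page %>% html_elements("#other p") %>% html_text()
```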
MinnesotaVikings <- bow("https://www.pro-football-reference.com/teams/min/2023.htm") %>%
scrape()
MinnesotaVikings
{html_document}
<html data-version="klecko-" data-root="/home/pfr/build" lang="en" class="no-js">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
[2] <body class="pfr">\n<div id="wrap">\n \n <div id="header" role="banner" ...
Team_Stats <- MinnesotaVikings %>%
html_elements("#div_team_stats") %>%
html_table() %>% .[[1]] # same as purrr::pluck(1)
library(magrittr)
Team_Stats %<>% # %<>% Allows for direct assignment within the pipe
set_names(.[1, ]) %>% # Set column names to the first row
janitor::clean_names() %>% # Clean names
slice(-1) %>% # Remove the first row
mutate(across(everything(), ~na_if(.x, ""))) %>% # Convert empty strings to NA
type.convert(as.is = TRUE) # Convert columns to their most appropriate type
Team_Stats %>% knitr::kable(caption = "Scraped data for various team stats")
player | pf | yds | ply | y_p | to | fl | x1st_d | cmp | att | yds_2 | td | int | ny_a | x1st_d_2 | att_2 | yds_3 | td_2 | y_a | x1st_d_3 | pen | yds_4 | x1st_py | number_dr | sc_percent | to_percent | start | time | plays | yds_5 | pts |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Team Stats | 344 | 5912 | 1071 | 5.5 | 34 | 15 | 340 | 424 | 631 | 4359 | 30 | 19 | 6.4 | 220 | 393 | 1553 | 7 | 4.0 | 79 | 89 | 670 | 41 | 179 | 34.1 | 18.4 | Own 26.6 | 2:47 | 6.15 | 32.9 | 1.85 |
Opp. Stats | 362 | 5664 | 1095 | 5.2 | 22 | 11 | 328 | 426 | 606 | 3986 | 23 | 11 | 6.1 | 208 | 446 | 1678 | 14 | 3.8 | 104 | 111 | 916 | 16 | 186 | 37.1 | 11.3 | Own 30.1 | 2:51 | 6.10 | 30.5 | 1.91 |
Lg Rank Offense | 22 | 10 | NA | NA | 31 | 30 | 10 | NA | 4 | 5 | 4 | 29 | 10.0 | NA | 28 | 29 | 30 | 24.0 | NA | NA | NA | NA | NA | 22.0 | 1.0 | 29 | 13 | 9.00 | 10.0 | 18.00 |
Lg Rank Defense | 13 | 16 | NA | NA | 19 | 8 | 16 | NA | 26 | 24 | 17 | 19 | 15.0 | NA | 13 | 8 | 11 | 4.0 | NA | NA | NA | NA | NA | 13.0 | 19.0 | 28 | 26 | 24.00 | 18.0 | 18.00 |
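The cleaning steps used for this table can be tried on a toy version of the raw scrape. The sketch below uses only dplyr, tibble, and base R: the generic column names X1 to X3 are an assumption about what a raw table might look like, `setNames()` stands in for the `set_names()` call, and the `janitor::clean_names()` step is left out. The values are a small subset of the table above.

```r
library(dplyr)

# Toy raw table: the real header sits in the first row, every column is text,
# and missing values appear as empty strings
raw <- tibble::tibble(
  X1 = c("Player", "Team Stats", "Lg Rank Offense"),
  X2 = c("PF", "344", "22"),
  X3 = c("Y/P", "5.5", "")
)

# The first row holds the intended column names
header <- unlist(raw[1, ], use.names = FALSE)

cleaned <- raw %>%
  slice(-1) %>%                                      # drop the header row
  setNames(header) %>%                               # promote it to column names
  mutate(across(everything(), ~ na_if(.x, ""))) %>%  # empty strings become NA
  type.convert(as.is = TRUE)                         # guess a type for each column
```

After conversion, `PF` is an integer column and `Y/P` is numeric with a missing value, while `Player` stays character because `as.is = TRUE` prevents conversion to a factor.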
ca16-yourusername repository from GitHub