Web Scraping

STAT 220

Bastola

Web scraping

The process of downloading, parsing, and extracting data presented in an HTML file, then converting it into a structured format that we can analyze.

Two different scenarios:

  1. Screen scraping: extract data from the source code of a website, with an HTML parser (easy) or regular expression matching (less easy).

  2. Web APIs (application programming interfaces): the website offers a set of structured HTTP requests that return JSON or XML files.
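
The two scenarios can be contrasted on toy inputs. The sketch below is only an illustration: the HTML string, the JSON string, and every variable name are made up for this example, and nothing is downloaded. It extracts the same text once with an HTML parser and once with a regular expression, then parses a JSON string of the kind a web API might return.

```r
library(rvest)

# Hypothetical page source held in a string, so no network access is needed
page_source <- '<html><body>
  <p class="fact">First fact</p>
  <p class="fact">Second fact</p>
</body></html>'

# Scenario 1a: HTML parser, select the <p class="fact"> elements and read their text
parsed_facts <- read_html(page_source) %>%
  html_elements("p.fact") %>%
  html_text2()

# Scenario 1b: regular expression on the raw source (works here, but breaks
# as soon as the markup changes, e.g. an extra attribute or a line break)
pattern <- '(?<=<p class="fact">).*?(?=</p>)'
regex_facts <- regmatches(page_source, gregexpr(pattern, page_source, perl = TRUE))[[1]]

# Scenario 2: a web API typically returns structured text such as JSON
# (hypothetical response body)
api_response <- '[{"driver_id": 1, "wins": 3}, {"driver_id": 2, "wins": 0}]'
api_data <- jsonlite::fromJSON(api_response)  # becomes a data frame
```

Both screen-scraping approaches return the same two strings here; the parser version is the one that tends to survive small changes to the page.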

polite package

  • Two main functions, bow() and scrape(), define and realize a web-harvesting session
  • Builds on toolkits for defining and managing HTTP sessions (httr and rvest)

Polite Documentation

Can we scrape this webpage?

https://www.mncorn.org/corn-facts/

polite::bow()

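library(polite)  # provides bow() and scrape()
library(rvest)   # provides html_table(), html_elements(), and the %>% pipe
bow("https://www.mncorn.org/corn-facts/", user_agent = "data enthusiast")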
<polite session> https://www.mncorn.org/corn-facts/
    User-agent: data enthusiast
    robots.txt: 2 rules are defined for 1 bots
   Crawl delay: 5 sec
  The path is scrapable for this user-agent

polite::scrape()

bow("https://www.mncorn.org/corn-facts/", user_agent = "data enthusiast") %>% 
  scrape()
{html_document}
<html class="no-js" lang="en-US">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
[2] <body class="page-template page-template-page-cornfacts page-template-pag ...

HyperText Markup Language (HTML)

An HTML page consists of a series of elements that browsers use to determine how to display the content.

While HTML is structured (hierarchical, tree-based), its data is often not available in a form useful for analysis (flat, tidy).

<html>
  <head>
    <title>This is a title</title>
  </head>
  <body>
    <p align="center">Hello world!</p>
  </body>
</html>

Try HTML code yourself by clicking here
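
To see the gap between the tree structure and a flat format, the example page above can be parsed from a string. This is a minimal sketch: the variable names are assumptions made for illustration, and read_html() is given the HTML text directly rather than a URL.

```r
library(rvest)

# The example page from the slide, stored as a string
html_source <- '<html>
  <head>
    <title>This is a title</title>
  </head>
  <body>
    <p align="center">Hello world!</p>
  </body>
</html>'

doc <- read_html(html_source)

# The tree: the root <html> node has two children, <head> and <body>
child_names <- doc %>% html_children() %>% html_name()

# Pull individual pieces out of the tree
page_title <- doc %>% html_element("title") %>% html_text2()
paragraph  <- doc %>% html_element("p") %>% html_text2()
alignment  <- doc %>% html_element("p") %>% html_attr("align")

# Flatten the pieces into a one-row data frame ready for analysis
flat <- data.frame(title = page_title, paragraph = paragraph, align = alignment)
```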

HTML tags

HTML uses tags to describe different aspects of document content. rvest is a collection of functions that make basic processing and manipulation of HTML data straightforward.

Tag                      Example
heading                  <h1>My Title</h1>
paragraph                <p>A paragraph of content...</p>
table                    <table> ... </table>
anchor (with attribute)  <a href="https://www.mncorn.org/">click here for link</a>

Rvest documentation

Useful functions

Function       Description
read_html      Read HTML data from a URL or character string
html_element   Find the first matching HTML element using a CSS selector
html_elements  Find all matching HTML elements using a CSS selector
html_node      Select a specified node from an HTML document (superseded by html_element)
html_nodes     Select specified nodes from an HTML document (superseded by html_elements)
html_table     Parse an HTML table into a data frame
html_text      Extract the text between an element's opening and closing tags
html_name      Extract tag names
html_attrs     Extract all of each tag's attributes
html_attr      Extract the value of a single attribute by name
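
Several of these functions can be tried on the tag examples from the earlier slide. The sketch below assumes rvest 1.0 or later (for minimal_html() and html_text2()); the object names are illustrative only.

```r
library(rvest)

# Build a small document from the tag examples, no download required
snippet <- minimal_html('
  <h1>My Title</h1>
  <p>A paragraph of content...</p>
  <a href="https://www.mncorn.org/">click here for link</a>
')

# html_element() returns the first match only
link <- snippet %>% html_element("a")

link_tag   <- html_name(link)          # the tag name
link_text  <- html_text2(link)         # the text between <a> and </a>
link_href  <- html_attr(link, "href")  # one attribute, selected by name
link_attrs <- html_attrs(link)         # all attributes as a named character vector

# html_elements() returns every match, in document order
headline_and_paragraph <- snippet %>%
  html_elements("h1, p") %>%
  html_text2()
```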

https://en.wikipedia.org/wiki/List_of_Formula_One_drivers

Read Wikipedia Tables into R

bow("https://en.wikipedia.org/wiki/List_of_Formula_One_drivers") 
<polite session> https://en.wikipedia.org/wiki/List_of_Formula_One_drivers
    User-agent: polite R package
    robots.txt: 456 rules are defined for 33 bots
   Crawl delay: 5 sec
  The path is scrapable for this user-agent

Read Wikipedia Tables into R

bow("https://en.wikipedia.org/wiki/List_of_Formula_One_drivers") %>%
  scrape() 
{html_document}
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-0 vector-feature-appearance-disabled vector-feature-appearance-pinned-clientpref-0 vector-feature-night-mode-disabled skin-theme-clientpref-day vector-toc-not-available" lang="en" dir="ltr">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
[2] <body class="skin-vector skin-vector-search-vue mediawiki ltr sitedir-ltr ...

Read Wikipedia Tables into R

bow("https://en.wikipedia.org/wiki/List_of_Formula_One_drivers") %>%
  scrape() %>% 
  html_table() 
[[1]]
# A tibble: 6 × 1
  `Formula One`                                                                 
  <chr>                                                                         
1 "Current season\n2024 Formula One World Championship"                         
2 "Related articles\nHistory of Formula One\nFormula One racing\nFormula One re…
3 "Lists\nDrivers (GP winnersSprint winnersPolesittersFastest lapsChampionsNumb…
4 "Records\nDrivers\nConstructors\nEngines\nTyres\nRaces"                       
5 "Organisations\nFIA\nFIA World Motor Sport Council\nFormula One Group\nLibert…
6 ".mw-parser-output .navbar{display:inline;font-size:88%;font-weight:normal}.m…

[[2]]
# A tibble: 3 × 2
  Symbol Meaning                                                                
  <chr>  <chr>                                                                  
1 ~      Currently active world champions(driver competes in 2024 and has won t…
2 *      Currently active drivers(driver competes in 2024 and has not won the W…
3 ^      Former world champions(driver has won the World Drivers' Championship …

[[3]]
# A tibble: 869 × 11
   `Driver name`     Nationality    `Seasons competed` `Drivers' Championships`
   <chr>             <chr>          <chr>              <chr>                   
 1 Carlo Abate       Italy          1962–1963          0                       
 2 George Abecassis  United Kingdom 1951–1952          0                       
 3 Kenny Acheson     United Kingdom 1983, 1985         0                       
 4 Andrea de Adamich Italy          1968, 1970–1973    0                       
 5 Philippe Adams    Belgium        1994               0                       
 6 Walt Ader         United States  1950               0                       
 7 Kurt Adolff       West Germany   1953               0                       
 8 Fred Agabashian   United States  1950–1957          0                       
 9 Kurt Ahrens Jr.   West Germany   1966–1969          0                       
10 Jack Aitken       United Kingdom 2020               0                       
# ℹ 859 more rows
# ℹ 7 more variables: `Race entries` <chr>, `Race starts` <chr>,
#   `Pole positions` <chr>, `Race wins` <chr>, Podiums <chr>,
#   `Fastest laps` <chr>, `Points[a]` <chr>

[[4]]
# A tibble: 42 × 7
   Country    Totaldrivers Champions Championships `Race wins` `First driver(s)`
   <chr>      <chr>        <chr>     <chr>         <chr>       <chr>            
 1 Argentina… 25           1(Fangio… 5(1951, 1954… "38\n(Fang… Juan Manuel Fang…
 2 Australia… 18           2(Brabha… 4(1959, 1960… "43\n(Brab… Tony Gaze(1952 B…
 3 Austriade… 16           2(Rindt,… 4(1970, 1975… "41\n(Rind… Jochen Rindt(196…
 4 Belgiumde… 24           0         0             "11\n(Ickx… Johnny Claes(195…
 5 Brazildet… 32           3(Fittip… 8(1972, 1974… "101\n(Fit… Chico Landi(1951…
 6 Canadadet… 15           1(J. Vil… 1(1997)       "17\n(G. V… Peter Ryan(1961 …
 7 Chile      1            0         0             "0"         Eliseo Salazar(1…
 8 China      1            0         0             "0"         Zhou Guanyu(2022…
 9 Colombiad… 3            0         0             "7\n(Monto… Ricardo Londoño(…
10 Czech Rep… 1            0         0             "0"         Tomáš Enge(2001 …
# ℹ 32 more rows
# ℹ 1 more variable: `Most recent driver(s)/Current driver(s)` <chr>

[[5]]
# A tibble: 1 × 2
  `vteFormula One drivers by country`                     vteFormula One drive…¹
  <chr>                                                   <chr>                 
1 "Argentina\nAustralia\nAustria\nBelgium\nBrazil\nCanad… "Argentina\nAustralia…
# ℹ abbreviated name: ¹​`vteFormula One drivers by country`

Read Wikipedia Tables into R

bow("https://en.wikipedia.org/wiki/List_of_Formula_One_drivers") %>%
  scrape() %>% 
  html_table() %>%
  purrr::pluck(3) 
# A tibble: 869 × 11
   `Driver name`     Nationality    `Seasons competed` `Drivers' Championships`
   <chr>             <chr>          <chr>              <chr>                   
 1 Carlo Abate       Italy          1962–1963          0                       
 2 George Abecassis  United Kingdom 1951–1952          0                       
 3 Kenny Acheson     United Kingdom 1983, 1985         0                       
 4 Andrea de Adamich Italy          1968, 1970–1973    0                       
 5 Philippe Adams    Belgium        1994               0                       
 6 Walt Ader         United States  1950               0                       
 7 Kurt Adolff       West Germany   1953               0                       
 8 Fred Agabashian   United States  1950–1957          0                       
 9 Kurt Ahrens Jr.   West Germany   1966–1969          0                       
10 Jack Aitken       United Kingdom 2020               0                       
# ℹ 859 more rows
# ℹ 7 more variables: `Race entries` <chr>, `Race starts` <chr>,
#   `Pole positions` <chr>, `Race wins` <chr>, Podiums <chr>,
#   `Fastest laps` <chr>, `Points[a]` <chr>
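
The pattern above (html_table() returns one tibble per table, and pluck() picks one by position) can be reproduced offline. The sketch below uses a hypothetical two-table page that borrows a few values from the output above; the object names and the nth-of-type selector are assumptions for illustration, not part of the original workflow.

```r
library(rvest)

# Hypothetical page with two tables, standing in for the Wikipedia article
two_tables <- minimal_html('
  <table>
    <tr><th>Symbol</th><th>Meaning</th></tr>
    <tr><td>*</td><td>Currently active drivers</td></tr>
  </table>
  <table>
    <tr><th>Driver name</th><th>Nationality</th></tr>
    <tr><td>Carlo Abate</td><td>Italy</td></tr>
    <tr><td>George Abecassis</td><td>United Kingdom</td></tr>
  </table>
')

# On a whole document, html_table() returns a list with one tibble per <table>
tables <- two_tables %>% html_table()

# Pick the table of interest by position (same result as tables[[2]])
drivers <- purrr::pluck(tables, 2)

# Alternative: select a single <table> element first, then parse only that one
drivers_direct <- two_tables %>%
  html_element("table:nth-of-type(2)") %>%
  html_table()
```

Selecting by position is fragile: if the page gains or loses a table, the index shifts, so it is worth checking the column names of the result after each scrape.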

CSS

  • CSS (Cascading Style Sheets) is a language that describes how HTML elements should be displayed.

  • CSS selectors:

    • shortcuts for selecting HTML elements to style
    • can also be used to extract the content of these elements
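
A minimal sketch of the three most common selector types (tag, class, and id), run against a hypothetical snippet. The markup, the class name stat, and the id main-title are invented for this example.

```r
library(rvest)

# Hypothetical markup with a tag, a class, and an id to select on
page <- minimal_html('
  <h1 id="main-title">Team page</h1>
  <p class="stat">Points for: 344</p>
  <p class="stat">Points against: 362</p>
  <p>Unrelated paragraph</p>
')

by_tag   <- page %>% html_elements("p") %>% html_text2()           # every <p>
by_class <- page %>% html_elements(".stat") %>% html_text2()       # class="stat" only
by_id    <- page %>% html_element("#main-title") %>% html_text2()  # the element with that id
```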

SelectorGadget

SelectorGadget is a point-and-click tool for finding CSS selectors. It is available as a Chrome extension (click to install).

Click here for a list of selectors

Click the element you want to select; everything matched by the generated selector is highlighted in yellow. Next, click any yellow element you do not want in order to deselect it.

Read HTML into R

MinnesotaVikings <- bow("https://www.pro-football-reference.com/teams/min/2023.htm") %>% 
  scrape()
MinnesotaVikings
{html_document}
<html data-version="klecko-" data-root="/home/pfr/build" lang="en" class="no-js">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
[2] <body class="pfr">\n<div id="wrap">\n  \n  <div id="header" role="banner" ...

Click here to go to webpage

Scraped Table

Team_Stats <- MinnesotaVikings %>% 
  html_elements("#div_team_stats") %>% 
  html_table() %>% .[[1]]  # same as purrr::pluck(1)

library(magrittr) 
Team_Stats %<>% # %<>% Allows for direct assignment within the pipe
  set_names(.[1, ]) %>% # Set column names to the first row
  janitor::clean_names() %>% # Clean names
  slice(-1) %>% # Remove the first row
  mutate(across(everything(), ~na_if(.x, ""))) %>% # Convert empty strings to NA
  type.convert(as.is = TRUE) # Convert columns to their most appropriate type
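
The cleaning steps can be checked on a toy table whose real header sits in the first data row. This is a sketch under assumptions: the tibble below is invented (including the blank cell, which is there only to show the NA conversion), and base setNames() stands in for the set_names() call used above.

```r
library(dplyr)

# Hypothetical scraped table: placeholder column names, real header in row 1
raw <- tibble(
  X1 = c("Player", "Team Stats", "Opp. Stats"),
  X2 = c("PF", "344", "362"),
  X3 = c("Yds", "5912", "")
)

cleaned <- raw %>%
  setNames(as.character(unlist(raw[1, ]))) %>%     # promote the first row to column names
  slice(-1) %>%                                    # drop the row that held the names
  mutate(across(everything(), ~ na_if(.x, ""))) %>% # empty strings become NA
  type.convert(as.is = TRUE)                       # guess each column's type; keep strings as character
```

After these steps the numeric-looking columns are integers and the blank cell is NA, which is the same shape the Team_Stats pipeline aims for.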

Team_Stats %>% knitr::kable(caption = "Scraped data for various team stats") 
Scraped data for various team stats
player           pf   yds   ply   y_p  to  fl  x1st_d  cmp  att  yds_2  td  int  ny_a  x1st_d_2  att_2  yds_3  td_2  y_a   x1st_d_3  pen  yds_4  x1st_py  number_dr  sc_percent  to_percent  start     time  plays  yds_5  pts
Team Stats       344  5912  1071  5.5  34  15  340     424  631  4359   30  19   6.4   220       393    1553   7     4.0   79        89   670    41       179        34.1        18.4        Own 26.6  2:47  6.15   32.9   1.85
Opp. Stats       362  5664  1095  5.2  22  11  328     426  606  3986   23  11   6.1   208       446    1678   14    3.8   104       111  916    16       186        37.1        11.3        Own 30.1  2:51  6.10   30.5   1.91
Lg Rank Offense  22   10    NA    NA   31  30  10      NA   4    5      4   29   10.0  NA        28     29     30    24.0  NA        NA   NA     NA       NA         22.0        1.0         29        13    9.00   10.0   18.00
Lg Rank Defense  13   16    NA    NA   19  8   16      NA   26   24     17  19   15.0  NA        13     8      11    4.0   NA        NA   NA     NA       NA         13.0        19.0        28        26    24.00  18.0   18.00

Group Activity 1

  • Please clone the ca16-yourusername repository from GitHub
  • Please do problem 1 in today's class activity
