More HW2 questions (thank you for bringing them up)
weekly total cases = weekly cases = the sum of new cases in a given week.
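As a minimal sketch of this definition, the snippet below sums daily new cases within each week using base R. The data frame `daily` and its columns `date` and `new_cases` are made-up names for illustration, not the HW2 data, and weeks are assumed to start on Monday.

```r
# Hypothetical daily data: 14 days of new case counts, starting on a Monday.
daily <- data.frame(
  date = as.Date("2023-01-02") + 0:13,
  new_cases = c(5, 3, 0, 2, 4, 1, 6, 2, 2, 3, 0, 1, 5, 4)
)

# Label each day with the date that starts its week (Monday by default).
daily$week_start <- as.Date(cut(daily$date, breaks = "week"))

# Weekly cases = sum of new cases within each week.
weekly <- aggregate(new_cases ~ week_start, data = daily, FUN = sum)
weekly
```

If your data already carry a week identifier, group by that column instead of recomputing the week from the date.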
sessionInfo()
## R version 4.3.0 (2023-04-21)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Ventura 13.5.2
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/Chicago
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] digest_0.6.33 R6_2.5.1 fastmap_1.1.1 xfun_0.39
## [5] cachem_1.0.8 knitr_1.42 htmltools_0.5.5 rmarkdown_2.21
## [9] cli_3.6.1 sass_0.4.6 jquerylib_0.1.4 compiler_4.3.0
## [13] rstudioapi_0.14 tools_4.3.0 evaluate_0.20 bslib_0.4.2
## [17] yaml_2.3.7 rlang_1.1.1 jsonlite_1.8.4
Dr. Hua Zhou’s slides
Josh McCrain’s RSelenium tutorial
HTML Introduction from GeeksforGeeks
Getting started with HTML MDN Web Docs
Load tidyverse and other packages for this lecture:
library("tidyverse")
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library("rvest")
##
## Attaching package: 'rvest'
##
## The following object is masked from 'package:readr':
##
## guess_encoding
We first cover a survival-level introduction to HTML.
HTML stands for HyperText Markup Language.
It is used to design web pages using a markup language.
It is a combination of hypertext and a markup language.
Hypertext defines the links between web pages.
A markup language defines the text document within tags, which in turn define the structure of web pages.
Elements can also have attributes. Attributes look like this:
Attributes contain extra information about the element that won’t appear in the content.
In this example, the class attribute is an identifying name used to target the element with style information.
An attribute should have:
A space between it and the element name. (For an element with more than one attribute, the attributes should be separated by spaces too.)
The attribute name, followed by an equal sign.
An attribute value, wrapped with opening and closing quote marks.
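To make the attribute rules concrete, here is a small sketch that parses a one-element HTML string and reads back its class attribute with rvest. The paragraph text and the class name `editor-note` are illustrative values chosen for this example.

```r
library(rvest)

# One element with one attribute: name, equals sign, quoted value.
snippet <- read_html('<p class="editor-note">My cat is very grumpy</p>')

# A CSS selector of the form tag.class targets the element by its class.
p_node <- html_node(snippet, "p.editor-note")

html_attr(p_node, "class")  # the attribute value
html_text(p_node)           # the content; the attribute itself is not part of it
```

The same class name is what a CSS selector such as `.editor-note` matches, which is why class attributes matter for scraping later on.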
Another example of an element is <a>. This stands for anchor. An anchor can make the text it encloses into a hyperlink. Anchors can take a number of attributes; several are as follows:
href: This attribute’s value specifies the web address for the link. For example: href="https://www.mozilla.org/".
title: The title attribute specifies extra information about the link, such as a description of the page that is being linked to. For example, title="The Mozilla homepage". This appears as a tooltip when a cursor hovers over the element.
target: The target attribute specifies the browsing context used to display the link. For example, target="_blank" will display the link in a new tab. If you want to display the linked content in the current tab, just omit this attribute.
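The anchor attributes described above can be read programmatically. The sketch below builds an anchor from the example values in the text (the link text "favorite website" is an assumption added for illustration) and extracts its attributes.

```r
library(rvest)

# An anchor carrying the three attributes discussed above.
anchor_html <- paste0(
  '<a href="https://www.mozilla.org/" ',
  'title="The Mozilla homepage" target="_blank">favorite website</a>'
)
link <- html_node(read_html(anchor_html), "a")

html_attr(link, "href")    # a single attribute by name
html_attrs(link)           # all attributes as a named character vector
html_text(link)            # the text that the anchor turns into a hyperlink
```

When scraping, `html_attr(x, "href")` is typically how you collect the URLs that a page links to.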
The basic structure of an HTML page is laid out below.
It contains the essential building-block elements upon which all web pages are created.
doctype declaration
HTML
head
title
body elements
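As an illustration of this layout, the sketch below stores a minimal page in an R string, parses it, and inspects its structure. The title and paragraph text are placeholders.

```r
library(rvest)

# A minimal page: doctype declaration, html, head (with title), and body.
page_source <- '<!DOCTYPE html>
<html>
  <head>
    <title>My test page</title>
  </head>
  <body>
    <p>This is my page</p>
  </body>
</html>'

page <- read_html(page_source)

html_name(html_children(page))                     # top-level elements
html_text(html_node(page, "title"))                # the page title
html_name(html_children(html_node(page, "body")))  # elements inside body
```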
To write an HTML comment, wrap it in the special markers <!-- and -->. For example:
<p>I'm not inside a comment</p>
<!-- <p>I am!</p> -->
generates:
I’m not inside a comment
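Comments matter for scraping too: a paragraph wrapped in comment markers is not an element, so a CSS selector will not find it. A small sketch using the example above:

```r
library(rvest)

doc <- read_html("<p>I'm not inside a comment</p>\n<!-- <p>I am!</p> -->")

# Only the real paragraph is selected; the commented-out one is skipped.
paragraphs <- html_nodes(doc, "p")
length(paragraphs)
html_text(paragraphs)
```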
There is a wealth of data on the internet. How can we scrape and analyze it?
rvest is an R package written by Hadley Wickham which makes web scraping easy.
We follow the instructions in a blog post by Saurav Kaushik to find the most popular feature films of 2019.
Install the SelectorGadget extension for Chrome.
The 100 most popular feature films released in 2019 can be accessed at page https://www.imdb.com/search/title?count=100&release_date=2019,2019&title_type=feature.
#Loading the rvest and tidyverse package
#Specifying the url for desired website to be scraped
url <- "http://www.imdb.com/search/title?count=100&release_date=2019,2019&title_type=feature"
#Reading the HTML code from the website
# (webpage <- read_html(url)) # This line gives me an error with later commands.
# As pointed out https://stackoverflow.com/questions/56261745/r-rvest-error-error-in-doc-namespacesdoc-external-pointer-is-not-valid
(webpage <- xml2::read_html(url))
## {html_document}
## <html xmlns:og="http://ogp.me/ns#" xmlns:fb="http://www.facebook.com/2008/fbml">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body id="styleguide-v2" class="fixed">\n <img height="1" widt ...
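Suppose we want to scrape the following 11 features from this page: rank, title, description, runtime, genre, rating, votes, director, actor, metascore, and gross revenue.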
Use the CSS selector to get the rankings
# Use CSS selectors to scrape the rankings section
(rank_data_html <- html_nodes(webpage, '.text-primary'))
## {xml_nodeset (100)}
## [1] <span class="lister-item-index unbold text-primary">1.</span>
## [2] <span class="lister-item-index unbold text-primary">2.</span>
## [3] <span class="lister-item-index unbold text-primary">3.</span>
## [4] <span class="lister-item-index unbold text-primary">4.</span>
## [5] <span class="lister-item-index unbold text-primary">5.</span>
## [6] <span class="lister-item-index unbold text-primary">6.</span>
## [7] <span class="lister-item-index unbold text-primary">7.</span>
## [8] <span class="lister-item-index unbold text-primary">8.</span>
## [9] <span class="lister-item-index unbold text-primary">9.</span>
## [10] <span class="lister-item-index unbold text-primary">10.</span>
## [11] <span class="lister-item-index unbold text-primary">11.</span>
## [12] <span class="lister-item-index unbold text-primary">12.</span>
## [13] <span class="lister-item-index unbold text-primary">13.</span>
## [14] <span class="lister-item-index unbold text-primary">14.</span>
## [15] <span class="lister-item-index unbold text-primary">15.</span>
## [16] <span class="lister-item-index unbold text-primary">16.</span>
## [17] <span class="lister-item-index unbold text-primary">17.</span>
## [18] <span class="lister-item-index unbold text-primary">18.</span>
## [19] <span class="lister-item-index unbold text-primary">19.</span>
## [20] <span class="lister-item-index unbold text-primary">20.</span>
## ...
# (rank_data_html <- html_nodes(webpage, '.lister-item-content .text-primary'))
# Convert the ranking data to text
(rank_data <- html_text(rank_data_html))
## [1] "1." "2." "3." "4." "5." "6." "7." "8." "9." "10."
## [11] "11." "12." "13." "14." "15." "16." "17." "18." "19." "20."
## [21] "21." "22." "23." "24." "25." "26." "27." "28." "29." "30."
## [31] "31." "32." "33." "34." "35." "36." "37." "38." "39." "40."
## [41] "41." "42." "43." "44." "45." "46." "47." "48." "49." "50."
## [51] "51." "52." "53." "54." "55." "56." "57." "58." "59." "60."
## [61] "61." "62." "63." "64." "65." "66." "67." "68." "69." "70."
## [71] "71." "72." "73." "74." "75." "76." "77." "78." "79." "80."
## [81] "81." "82." "83." "84." "85." "86." "87." "88." "89." "90."
## [91] "91." "92." "93." "94." "95." "96." "97." "98." "99." "100."
# Turn into numerical values
(rank_data <- as.integer(rank_data))
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
## [19] 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
## [37] 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54
## [55] 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
## [73] 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
## [91] 91 92 93 94 95 96 97 98 99 100
Use SelectorGadget to find the CSS selector
.lister-item-header a
.
# Using CSS selectors to scrape the title section
(title_data_html <- html_nodes(webpage, '.lister-item-header a'))
## {xml_nodeset (100)}
## [1] <a href="/title/tt8772262/?ref_=adv_li_tt">Midsommar</a>
## [2] <a href="/title/tt7286456/?ref_=adv_li_tt">Joker</a>
## [3] <a href="/title/tt7131622/?ref_=adv_li_tt">Once Upon a Time in Hollywood ...
## [4] <a href="/title/tt8367814/?ref_=adv_li_tt">The Gentlemen</a>
## [5] <a href="/title/tt4126476/?ref_=adv_li_tt">After</a>
## [6] <a href="/title/tt5606664/?ref_=adv_li_tt">Doctor Sleep</a>
## [7] <a href="/title/tt6751668/?ref_=adv_li_tt">Parasite</a>
## [8] <a href="/title/tt8946378/?ref_=adv_li_tt">Knives Out</a>
## [9] <a href="/title/tt7349950/?ref_=adv_li_tt">It Chapter Two</a>
## [10] <a href="/title/tt4154796/?ref_=adv_li_tt">Avengers: Endgame</a>
## [11] <a href="/title/tt3281548/?ref_=adv_li_tt">Little Women</a>
## [12] <a href="/title/tt6146586/?ref_=adv_li_tt">John Wick: Chapter 3 - Parabe ...
## [13] <a href="/title/tt7798634/?ref_=adv_li_tt">Ready or Not</a>
## [14] <a href="/title/tt6857112/?ref_=adv_li_tt">Us</a>
## [15] <a href="/title/tt1302006/?ref_=adv_li_tt">The Irishman</a>
## [16] <a href="/title/tt0837563/?ref_=adv_li_tt">Pet Sematary</a>
## [17] <a href="/title/tt6105098/?ref_=adv_li_tt">The Lion King</a>
## [18] <a href="/title/tt2527338/?ref_=adv_li_tt">Star Wars: The Rise Of Skywal ...
## [19] <a href="/title/tt1950186/?ref_=adv_li_tt">Ford v Ferrari</a>
## [20] <a href="/title/tt0437086/?ref_=adv_li_tt">Alita: Battle Angel</a>
## ...
# Converting the title data to text
(title_data <- html_text(title_data_html))
## [1] "Midsommar"
## [2] "Joker"
## [3] "Once Upon a Time in Hollywood"
## [4] "The Gentlemen"
## [5] "After"
## [6] "Doctor Sleep"
## [7] "Parasite"
## [8] "Knives Out"
## [9] "It Chapter Two"
## [10] "Avengers: Endgame"
## [11] "Little Women"
## [12] "John Wick: Chapter 3 - Parabellum"
## [13] "Ready or Not"
## [14] "Us"
## [15] "The Irishman"
## [16] "Pet Sematary"
## [17] "The Lion King"
## [18] "Star Wars: The Rise Of Skywalker"
## [19] "Ford v Ferrari"
## [20] "Alita: Battle Angel"
## [21] "Aladdin"
## [22] "The Lighthouse"
## [23] "1917"
## [24] "I See You"
## [25] "Jojo Rabbit"
## [26] "Curiosa"
## [27] "Scary Stories to Tell in the Dark"
## [28] "Queen of Hearts"
## [29] "Uncut Gems"
## [30] "The Blackout"
## [31] "The King"
## [32] "Captain Marvel"
## [33] "Vivarium"
## [34] "The Lodge"
## [35] "Haunt"
## [36] "Five Feet Apart"
## [37] "Ma"
## [38] "The Platform"
## [39] "Brightburn"
## [40] "Anna"
## [41] "Bombshell"
## [42] "Shazam!"
## [43] "Terminator: Dark Fate"
## [44] "El Camino: A Breaking Bad Movie"
## [45] "Escape Room"
## [46] "Ad Astra"
## [47] "Pokémon: Detective Pikachu"
## [48] "Booksmart"
## [49] "Official Secrets"
## [50] "Toy Story 4"
## [51] "Glass"
## [52] "Portrait of a Lady on Fire"
## [53] "Polar"
## [54] "Fast & Furious Presents: Hobbs & Shaw"
## [55] "Rocketman"
## [56] "Men in Black: International"
## [57] "The Curse of La Llorona"
## [58] "Annabelle Comes Home"
## [59] "6 Underground"
## [60] "Cats"
## [61] "The Gangster, the Cop, the Devil"
## [62] "Marriage Story"
## [63] "Child's Play"
## [64] "Hellboy"
## [65] "The Goldfinch"
## [66] "Jumanji: The Next Level"
## [67] "The Addams Family"
## [68] "Zombieland: Double Tap"
## [69] "Angel Has Fallen"
## [70] "Fractured"
## [71] "Spider-Man: Far from Home"
## [72] "Saint Maud"
## [73] "The Dead Don't Die"
## [74] "Midway"
## [75] "The Peanut Butter Falcon"
## [76] "Tolkien"
## [77] "Yesterday"
## [78] "In the Tall Grass"
## [79] "Dark Phoenix"
## [80] "Burn"
## [81] "Cold Pursuit"
## [82] "Frozen II"
## [83] "3 from Hell"
## [84] "In the Shadow of the Moon"
## [85] "Hustlers"
## [86] "Dark Waters"
## [87] "Sound of Metal"
## [88] "Murder Mystery"
## [89] "Synchronic"
## [90] "Triple Frontier"
## [91] "Downton Abbey"
## [92] "Color Out of Space"
## [93] "Crawl"
## [94] "Guns Akimbo"
## [95] "Rambo: Last Blood"
## [96] "Godzilla: King of the Monsters"
## [97] "The Silence"
## [98] "Maleficent: Mistress of Evil"
## [99] "The Outpost"
## [100] "The Lego Movie 2: The Second Part"
# Using CSS selectors to scrape the description section
(description_data_html <- html_nodes(webpage, '.lister-item-content .ratings-bar+.text-muted'))
## {xml_nodeset (100)}
## [1] <p class="text-muted">\nA couple travels to Northern Europe to visit a r ...
## [2] <p class="text-muted">\nDuring the 1980s, a failed stand-up comedian is ...
## [3] <p class="text-muted">\nA faded television actor and his stunt double st ...
## [4] <p class="text-muted">\nAn American expat tries to sell off his highly p ...
## [5] <p class="text-muted">\nA young woman falls for a guy with a dark secret ...
## [6] <p class="text-muted">\nYears following the events of <a href="/title/tt ...
## [7] <p class="text-muted">\nGreed and class discrimination threaten the newl ...
## [8] <p class="text-muted">\nA detective investigates the death of the patria ...
## [9] <p class="text-muted">\nTwenty-seven years after their first encounter w ...
## [10] <p class="text-muted">\nAfter the devastating events of <a href="/title/ ...
## [11] <p class="text-muted">\nJo March reflects back and forth on her life, te ...
## [12] <p class="text-muted">\nJohn Wick is on the run after killing a member o ...
## [13] <p class="text-muted">\nA bride's wedding night takes a sinister turn wh ...
## [14] <p class="text-muted">\nA family's serene beach vacation turns to chaos ...
## [15] <p class="text-muted">\nAn illustration of Frank Sheeran's life, from W. ...
## [16] <p class="text-muted">\nDr. Louis Creed and his wife, Rachel, relocate f ...
## [17] <p class="text-muted">\nAfter the murder of his father, a young lion pri ...
## [18] <p class="text-muted">\nIn the riveting conclusion of the landmark Skywa ...
## [19] <p class="text-muted">\nAmerican car designer <a href="/name/nm0790961"> ...
## [20] <p class="text-muted">\nA deactivated cyborg's revived, but can't rememb ...
## ...
# Converting the description data to text
description_data <- html_text(description_data_html)
# take a look at first few
head(description_data)
## [1] "\nA couple travels to Northern Europe to visit a rural hometown's fabled Swedish mid-summer festival. What begins as an idyllic retreat quickly devolves into an increasingly violent and bizarre competition at the hands of a pagan cult."
## [2] "\nDuring the 1980s, a failed stand-up comedian is driven insane and turns to a life of crime and chaos in Gotham City while becoming an infamous psychopathic crime figure."
## [3] "\nA faded television actor and his stunt double strive to achieve fame and success in the final years of Hollywood's Golden Age in 1969 Los Angeles."
## [4] "\nAn American expat tries to sell off his highly profitable marijuana empire in London, triggering plots, schemes, bribery and blackmail in an attempt to steal his domain out from under him."
## [5] "\nA young woman falls for a guy with a dark secret and the two embark on a rocky relationship. Based on the novel by Anna Todd."
## [6] "\nYears following the events of The Shining (1980), a now-adult Dan Torrance must protect a young girl with similar powers from a cult known as The True Knot, who prey on children with powers to remain immortal."
# strip the '\n'
description_data <- str_replace_all(description_data, "^\\n", "")
head(description_data)
## [1] "A couple travels to Northern Europe to visit a rural hometown's fabled Swedish mid-summer festival. What begins as an idyllic retreat quickly devolves into an increasingly violent and bizarre competition at the hands of a pagan cult."
## [2] "During the 1980s, a failed stand-up comedian is driven insane and turns to a life of crime and chaos in Gotham City while becoming an infamous psychopathic crime figure."
## [3] "A faded television actor and his stunt double strive to achieve fame and success in the final years of Hollywood's Golden Age in 1969 Los Angeles."
## [4] "An American expat tries to sell off his highly profitable marijuana empire in London, triggering plots, schemes, bribery and blackmail in an attempt to steal his domain out from under him."
## [5] "A young woman falls for a guy with a dark secret and the two embark on a rocky relationship. Based on the novel by Anna Todd."
## [6] "Years following the events of The Shining (1980), a now-adult Dan Torrance must protect a young girl with similar powers from a cult known as The True Knot, who prey on children with powers to remain immortal."
# Using CSS selectors to scrape the movie runtime section (all steps in one pipeline)
(runtime_data <- webpage %>%
html_nodes('.runtime') %>%
html_text() %>%
str_replace(" min", "") %>%
as.integer())
## [1] 148 122 161 113 105 152 132 130 169 181 135 130 95 116 209 101 118 141
## [19] 152 122 128 109 119 98 108 107 108 127 135 127 140 123 97 108 92 116
## [37] 99 94 90 118 109 132 128 122 99 123 104 102 112 100 129 122 118 137
## [55] 121 114 93 106 128 110 109 137 90 120 149 123 86 99 121 99 129 84
## [73] 104 138 97 112 116 101 113 88 119 103 115 115 110 126 120 97 102 125
## [91] 122 111 87 98 89 132 90 119 123 107
# The same runtime extraction, one step at a time: select the runtime nodes
runtime_data_html <- html_nodes(webpage, '.runtime')
# Converting the runtime data to text
runtime_data <- html_text(runtime_data_html)
# Let's have a look at the runtime
head(runtime_data)
## [1] "148 min" "122 min" "161 min" "113 min" "105 min" "152 min"
# Data-Preprocessing: removing mins and converting it to numerical
runtime_data <- str_replace(runtime_data, " min", "")
runtime_data <- as.numeric(runtime_data)
#Let's have another look at the runtime data
head(runtime_data)
## [1] 148 122 161 113 105 152
Collect the (first) genre of each movie:
# Using CSS selectors to scrape the movie genre section
genre_data_html <- html_nodes(webpage, '.genre')
# Converting the genre data to text
genre_data <- html_text(genre_data_html)
# Let's have a look at the genre data
head(genre_data)
## [1] "\nDrama, Horror, Mystery "
## [2] "\nCrime, Drama, Thriller "
## [3] "\nComedy, Drama "
## [4] "\nAction, Crime "
## [5] "\nDrama, Romance "
## [6] "\nDrama, Fantasy, Horror "
# Data-Preprocessing: retrieve the first word
genre_data <- str_extract(genre_data, "[:alpha:]+")
# Converting each genre from text to factor
#genre_data <- as.factor(genre_data)
# Let's have another look at the genre data
head(genre_data)
## [1] "Drama" "Crime" "Comedy" "Action" "Drama" "Drama"
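One caveat with taking the first run of letters: a hyphenated genre would be cut at the hyphen. The sketch below uses a made-up second entry ("Sci-Fi, Thriller", not taken from the scraped output above) to show an alternative that keeps everything up to the first comma, and also how to keep all genres.

```r
library(stringr)

# First entry mirrors the scraped format; the second is hypothetical.
genre_raw <- c("\nDrama, Horror, Mystery            ",
               "\nSci-Fi, Thriller            ")

# Trim surrounding whitespace, then keep everything before the first comma.
first_genre <- str_extract(str_trim(genre_raw), "^[^,]+")
first_genre

# Keep every genre as a list of character vectors.
all_genres <- str_split(str_trim(genre_raw), ",\\s*")
all_genres
```

Whether this matters depends on which genres appear first on the page you scrape, so it is worth checking `unique(genre_data)` after extraction.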
# Using CSS selectors to scrape the IMDB rating section
rating_data_html <- html_nodes(webpage, '.ratings-imdb-rating strong')
# Converting the ratings data to text
rating_data <- html_text(rating_data_html)
# Let's have a look at the ratings
head(rating_data)
## [1] "7.1" "8.4" "7.6" "7.8" "5.3" "7.3"
# Data-Preprocessing: converting ratings to numerical
rating_data <- as.numeric(rating_data)
# Let's have another look at the ratings data
rating_data
## [1] 7.1 8.4 7.6 7.8 5.3 7.3 8.5 7.9 6.5 8.4 7.8 7.4 6.9 6.8 7.8 5.7 6.8 6.4
## [19] 8.1 7.3 6.9 7.4 8.2 6.8 7.9 5.4 6.2 7.1 7.4 6.0 7.3 6.8 5.9 6.0 6.3 7.2
## [37] 5.6 7.0 6.1 6.6 6.8 7.0 6.2 7.3 6.4 6.5 6.5 7.1 7.3 7.7 6.6 8.1 6.3 6.5
## [55] 7.3 5.6 5.3 5.9 6.1 2.8 6.9 7.9 5.7 5.2 6.4 6.7 5.8 6.7 6.4 6.4 7.4 6.7
## [73] 5.5 6.7 7.6 6.8 6.8 5.5 5.7 5.7 6.2 6.8 5.4 6.2 6.3 7.6 7.7 6.0 6.2 6.5
## [91] 7.4 6.2 6.1 6.3 6.1 6.0 5.3 6.6 6.8 6.6
# Using CSS selectors to scrape the votes section
votes_data_html <- html_nodes(webpage, '.sort-num_votes-visible span:nth-child(2)')
# Converting the votes data to text
votes_data <- html_text(votes_data_html)
# Let's have a look at the votes data
head(votes_data)
## [1] "372,304" "1,408,366" "803,311" "372,302" "62,170" "208,182"
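# Data-Preprocessing: removing commas
# (note: str_replace() removes only the first comma in each string,
# so counts of one million or more still contain a comma and become NA below)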
votes_data <- str_replace(votes_data, ",", "")
# Data-Preprocessing: converting votes to numerical
votes_data <- as.numeric(votes_data)
## Warning: NAs introduced by coercion
#Let's have another look at the votes data
votes_data
## [1] 372304 NA 803311 372302 62170 208182 893174 742953 289434 NA
## [11] 229747 407425 171717 325094 414523 96249 259977 478069 439212 285250
## [21] 282516 240363 644457 62334 422985 3055 82424 13324 305220 10201
## [31] 143582 592293 69371 53764 33780 78213 56839 256794 104759 89617
## [41] 123159 373793 188266 278589 136342 253215 177562 127299 52248 269826
## [51] 259135 103150 96112 228884 187816 142620 55545 83594 190551 54441
## [61] 19837 333439 55003 95563 25882 271166 43573 193650 106301 87179
## [71] 533361 43784 83397 91719 98226 44820 161742 62108 198729 6293
## [81] 73941 186571 17063 58550 105283 96254 142596 161276 38217 139877
## [91] 60852 53743 91163 66940 106122 195265 49210 114521 38359 74043
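The two NA values above come from vote counts with two commas, such as "1,408,366": `str_replace()` replaces only the first match. A sketch of the fix, using the first three strings from the output above:

```r
library(stringr)

votes_raw <- c("372,304", "1,408,366", "803,311")

# str_replace() removes only the first comma, leaving "1408,366".
first_only <- str_replace(votes_raw, ",", "")
partial <- suppressWarnings(as.numeric(first_only))
partial

# str_replace_all() removes every comma, so all values convert cleanly.
votes <- as.numeric(str_replace_all(votes_raw, ",", ""))
votes

# readr::parse_number() is another option; it drops grouping marks itself.
votes_parsed <- readr::parse_number(votes_raw)
```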
# Using CSS selectors to scrape the directors section
(directors_data_html <- html_nodes(webpage,'.text-muted+ p a:nth-child(1)'))
## {xml_nodeset (100)}
## [1] <a href="/name/nm4170048/?ref_=adv_li_dr_0">Ari Aster</a>
## [2] <a href="/name/nm0680846/?ref_=adv_li_dr_0">Todd Phillips</a>
## [3] <a href="/name/nm0000233/?ref_=adv_li_dr_0">Quentin Tarantino</a>
## [4] <a href="/name/nm0005363/?ref_=adv_li_dr_0">Guy Ritchie</a>
## [5] <a href="/name/nm1788310/?ref_=adv_li_dr_0">Jenny Gage</a>
## [6] <a href="/name/nm1093039/?ref_=adv_li_dr_0">Mike Flanagan</a>
## [7] <a href="/name/nm0094435/?ref_=adv_li_dr_0">Bong Joon Ho</a>
## [8] <a href="/name/nm0426059/?ref_=adv_li_dr_0">Rian Johnson</a>
## [9] <a href="/name/nm0615592/?ref_=adv_li_dr_0">Andy Muschietti</a>
## [10] <a href="/name/nm0751577/?ref_=adv_li_dr_0">Anthony Russo</a>
## [11] <a href="/name/nm1950086/?ref_=adv_li_dr_0">Greta Gerwig</a>
## [12] <a href="/name/nm0821432/?ref_=adv_li_dr_0">Chad Stahelski</a>
## [13] <a href="/name/nm2366012/?ref_=adv_li_dr_0">Matt Bettinelli-Olpin</a>
## [14] <a href="/name/nm1443502/?ref_=adv_li_dr_0">Jordan Peele</a>
## [15] <a href="/name/nm0000217/?ref_=adv_li_dr_0">Martin Scorsese</a>
## [16] <a href="/name/nm1556116/?ref_=adv_li_dr_0">Kevin Kölsch</a>
## [17] <a href="/name/nm0269463/?ref_=adv_li_dr_0">Jon Favreau</a>
## [18] <a href="/name/nm0009190/?ref_=adv_li_dr_0">J.J. Abrams</a>
## [19] <a href="/name/nm0003506/?ref_=adv_li_dr_0">James Mangold</a>
## [20] <a href="/name/nm0001675/?ref_=adv_li_dr_0">Robert Rodriguez</a>
## ...
# Converting the directors data to text
directors_data <- html_text(directors_data_html)
# Let's have a look at the directors data
directors_data
## [1] "Ari Aster" "Todd Phillips" "Quentin Tarantino"
## [4] "Guy Ritchie" "Jenny Gage" "Mike Flanagan"
## [7] "Bong Joon Ho" "Rian Johnson" "Andy Muschietti"
## [10] "Anthony Russo" "Greta Gerwig" "Chad Stahelski"
## [13] "Matt Bettinelli-Olpin" "Jordan Peele" "Martin Scorsese"
## [16] "Kevin Kölsch" "Jon Favreau" "J.J. Abrams"
## [19] "James Mangold" "Robert Rodriguez" "Guy Ritchie"
## [22] "Robert Eggers" "Sam Mendes" "Adam Randall"
## [25] "Taika Waititi" "Lou Jeunet" "André Øvredal"
## [28] "May el-Toukhy" "Benny Safdie" "Egor Baranov"
## [31] "David Michôd" "Anna Boden" "Lorcan Finnegan"
## [34] "Severin Fiala" "Scott Beck" "Justin Baldoni"
## [37] "Tate Taylor" "Galder Gaztelu-Urrutia" "David Yarovesky"
## [40] "Luc Besson" "Jay Roach" "David F. Sandberg"
## [43] "Tim Miller" "Vince Gilligan" "Adam Robitel"
## [46] "James Gray" "Rob Letterman" "Olivia Wilde"
## [49] "Gavin Hood" "Josh Cooley" "M. Night Shyamalan"
## [52] "Céline Sciamma" "Jonas Åkerlund" "David Leitch"
## [55] "Dexter Fletcher" "F. Gary Gray" "Michael Chaves"
## [58] "Gary Dauberman" "Michael Bay" "Tom Hooper"
## [61] "Won-Tae Lee" "Noah Baumbach" "Lars Klevberg"
## [64] "Neil Marshall" "John Crowley" "Jake Kasdan"
## [67] "Greg Tiernan" "Ruben Fleischer" "Ric Roman Waugh"
## [70] "Brad Anderson" "Jon Watts" "Rose Glass"
## [73] "Jim Jarmusch" "Roland Emmerich" "Tyler Nilson"
## [76] "Dome Karukoski" "Danny Boyle" "Vincenzo Natali"
## [79] "Simon Kinberg" "Mike Gan" "Hans Petter Moland"
## [82] "Chris Buck" "Rob Zombie" "Jim Mickle"
## [85] "Lorene Scafaria" "Todd Haynes" "Darius Marder"
## [88] "Kyle Newacheck" "Justin Benson" "J.C. Chandor"
## [91] "Michael Engler" "Richard Stanley" "Alexandre Aja"
## [94] "Jason Howden" "Adrian Grunberg" "Michael Dougherty"
## [97] "John R. Leonetti" "Joachim Rønning" "Rod Lurie"
## [100] "Mike Mitchell"
# Using CSS selectors to scrape the actors section
(actors_data_html <- html_nodes(webpage, '.lister-item-content .ghost+ a'))
## {xml_nodeset (100)}
## [1] <a href="/name/nm6073955/?ref_=adv_li_st_0">Florence Pugh</a>
## [2] <a href="/name/nm0001618/?ref_=adv_li_st_0">Joaquin Phoenix</a>
## [3] <a href="/name/nm0000138/?ref_=adv_li_st_0">Leonardo DiCaprio</a>
## [4] <a href="/name/nm0000190/?ref_=adv_li_st_0">Matthew McConaughey</a>
## [5] <a href="/name/nm6466214/?ref_=adv_li_st_0">Josephine Langford</a>
## [6] <a href="/name/nm0000191/?ref_=adv_li_st_0">Ewan McGregor</a>
## [7] <a href="/name/nm0814280/?ref_=adv_li_st_0">Song Kang-ho</a>
## [8] <a href="/name/nm0185819/?ref_=adv_li_st_0">Daniel Craig</a>
## [9] <a href="/name/nm1567113/?ref_=adv_li_st_0">Jessica Chastain</a>
## [10] <a href="/name/nm0000375/?ref_=adv_li_st_0">Robert Downey Jr.</a>
## [11] <a href="/name/nm1519680/?ref_=adv_li_st_0">Saoirse Ronan</a>
## [12] <a href="/name/nm0000206/?ref_=adv_li_st_0">Keanu Reeves</a>
## [13] <a href="/name/nm3034977/?ref_=adv_li_st_0">Samara Weaving</a>
## [14] <a href="/name/nm2143282/?ref_=adv_li_st_0">Lupita Nyong'o</a>
## [15] <a href="/name/nm0000134/?ref_=adv_li_st_0">Robert De Niro</a>
## [16] <a href="/name/nm0164809/?ref_=adv_li_st_0">Jason Clarke</a>
## [17] <a href="/name/nm2255973/?ref_=adv_li_st_0">Donald Glover</a>
## [18] <a href="/name/nm5397459/?ref_=adv_li_st_0">Daisy Ridley</a>
## [19] <a href="/name/nm0000354/?ref_=adv_li_st_0">Matt Damon</a>
## [20] <a href="/name/nm4023073/?ref_=adv_li_st_0">Rosa Salazar</a>
## ...
# Converting the actors data to text
actors_data <- html_text(actors_data_html)
# Let's have a look at the actors data
head(actors_data)
## [1] "Florence Pugh" "Joaquin Phoenix" "Leonardo DiCaprio"
## [4] "Matthew McConaughey" "Josephine Langford" "Ewan McGregor"
Be careful with missing data.
# Using CSS selectors to scrape the metascore section
metascore_data_html <- html_nodes(webpage, '.metascore')
# Converting the metascore data to text
metascore_data <- html_text(metascore_data_html)
# Let's have a look at the metascore
head(metascore_data)
## [1] "72 " "59 " "83 " "51 " "30 "
## [6] "59 "
# Data-Preprocessing: removing extra space in metascore
metascore_data <- str_replace(metascore_data, "\\s*$", "")
metascore_data <- as.numeric(metascore_data)
metascore_data
## [1] 72 59 83 51 30 59 96 82 58 78 91 73 64 81 94 57 55 53 81 53 53 83 78 65 58
## [26] 61 67 92 62 64 64 64 69 53 53 73 44 40 64 71 54 72 48 80 53 84 63 84 43 95
## [51] 19 60 69 38 41 53 41 32 65 94 48 31 40 58 46 55 45 36 69 83 53 47 70 48 55
## [76] 46 43 50 57 64 50 48 79 73 82 38 64 61 64 70 60 42 26 48 25 43 71 65
# Lets check the length of metascore data
length(metascore_data)
## [1] 98
# Only 2 of the 100 movies lack a metascore (the length above is 98).
# Visual inspection of the page finds that ranks 26 and 30 are the ones missing.
ms <- rep(NA, 100)
ms[-c(26, 30)] <- metascore_data
(metascore_data <- ms)
## [1] 72 59 83 51 30 59 96 82 58 78 91 73 64 81 94 57 55 53 81 53 53 83 78 65 58
## [26] NA 61 67 92 NA 62 64 64 64 69 53 53 73 44 40 64 71 54 72 48 80 53 84 63 84
## [51] 43 95 19 60 69 38 41 53 41 32 65 94 48 31 40 58 46 55 45 36 69 83 53 47 70
## [76] 48 55 46 43 50 57 64 50 48 79 73 82 38 64 61 64 70 60 42 26 48 25 43 71 65
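Hard-coded positions go stale whenever the page changes. One alternative, sketched here on a toy page, is to select each movie's container first and then call `html_node()` (singular) inside it: it returns one result per container and yields NA where the element is absent. The toy markup below reuses the class names from the selectors in this lecture, but its structure is an assumption, not a copy of the IMDb page.

```r
library(rvest)

# Toy page: three movie containers; the second has no metascore.
toy_page <- read_html('
<div class="lister-item-content"><span class="text-primary">1.</span><span class="metascore favorable">72 </span></div>
<div class="lister-item-content"><span class="text-primary">2.</span></div>
<div class="lister-item-content"><span class="text-primary">3.</span><span class="metascore mixed">59 </span></div>')

# One node per movie.
items <- html_nodes(toy_page, ".lister-item-content")

# html_node() keeps one entry per movie; missing elements give NA text.
metascore_text <- html_text(html_node(items, ".metascore"))
metascore <- as.numeric(trimws(metascore_text))
metascore
```

Because the result always has the same length as `items`, it lines up with the other per-movie vectors without any manual bookkeeping.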
Be careful with missing data.
# Using CSS selectors to scrape the gross revenue section
gross_data_html <- html_nodes(webpage,'.ghost~ .text-muted+ span')
# Converting the gross revenue data to text
gross_data <- html_text(gross_data_html)
# Let's have a look at the gross data
head(gross_data)
## [1] "$27.33M" "$335.45M" "$142.50M" "$36.47M" "$12.14M" "$31.58M"
# Data-Preprocessing: removing '$' and 'M' signs
gross_data <- str_replace(gross_data, "M", "")
gross_data <- str_sub(gross_data, 2, 10)
#(gross_data <- str_extract(gross_data, "[:digit:]+.[:digit:]+"))
gross_data <- as.numeric(gross_data)
# Let's check the length of gross data
length(gross_data)
## [1] 70
# Visual inspection finds below movies don't have gross
#gs_data <- rep(NA, 100)
#gs_data[-c(1, 2, 3, 5, 61, 69, 71, 74, 78, 82, 84:87, 90)] <- gross_data
#(gross_data <- gs_data)
30 (out of 100) movies don’t have gross data yet! We need a better way to figure out missing entries.
(rank_and_gross <- webpage %>%
html_nodes('.ghost~ .text-muted+ span , .text-primary') %>%
html_text() %>%
str_replace("\\s+", "") %>%
str_replace_all("[$M]", ""))
## [1] "1." "27.33" "2." "335.45" "3." "142.50" "4." "36.47"
## [9] "5." "12.14" "6." "31.58" "7." "53.37" "8." "165.36"
## [17] "9." "211.59" "10." "858.37" "11." "108.10" "12." "171.02"
## [25] "13." "28.71" "14." "175.08" "15." "7.00" "16." "54.72"
## [33] "17." "543.64" "18." "515.20" "19." "117.62" "20." "85.71"
## [41] "21." "355.56" "22." "0.43" "23." "159.23" "24." "25."
## [49] "33.37" "26." "27." "68.95" "28." "29." "30." "31."
## [57] "32." "426.83" "33." "34." "35." "36." "45.73" "37."
## [65] "45.37" "38." "39." "17.30" "40." "7.74" "41." "42."
## [73] "140.37" "43." "62.25" "44." "45." "57.01" "46." "50.19"
## [81] "47." "144.11" "48." "22.68" "49." "0.40" "50." "434.04"
## [89] "51." "111.05" "52." "3.76" "53." "54." "173.96" "55."
## [97] "96.37" "56." "80.00" "57." "54.73" "58." "74.15" "59."
## [105] "60." "61." "0.22" "62." "2.00" "63." "29.21" "64."
## [113] "21.90" "65." "5.33" "66." "316.83" "67." "100.04" "68."
## [121] "73.12" "69." "69.03" "70." "71." "390.53" "72." "73."
## [129] "6.56" "74." "56.85" "75." "13.12" "76." "4.54" "77."
## [137] "73.29" "78." "79." "65.85" "80." "81." "32.14" "82."
## [145] "477.37" "83." "84." "85." "104.96" "86." "87." "88."
## [153] "89." "90." "91." "96.85" "92." "93." "39.01" "94."
## [161] "95." "44.82" "96." "110.50" "97." "98." "113.93" "99."
## [169] "100." "105.81"
isrank <- str_detect(rank_and_gross, "\\.$")
ismissing <- isrank[1:(length(rank_and_gross) - 1)] & isrank[2:(length(rank_and_gross))]
ismissing[length(ismissing)+1] <- isrank[length(isrank)]
missingpos <- as.integer(rank_and_gross[ismissing])
gs_data <- rep(NA, 100)
gs_data[-missingpos] <- gross_data
(gross_data <- gs_data)
## [1] 27.33 335.45 142.50 36.47 12.14 31.58 53.37 165.36 211.59 858.37
## [11] 108.10 171.02 28.71 175.08 7.00 54.72 543.64 515.20 117.62 85.71
## [21] 355.56 0.43 159.23 NA 33.37 NA 68.95 NA NA NA
## [31] NA 426.83 NA NA NA 45.73 45.37 NA 17.30 7.74
## [41] NA 140.37 62.25 NA 57.01 50.19 144.11 22.68 0.40 434.04
## [51] 111.05 3.76 NA 173.96 96.37 80.00 54.73 74.15 NA NA
## [61] 0.22 2.00 29.21 21.90 5.33 316.83 100.04 73.12 69.03 NA
## [71] 390.53 NA 6.56 56.85 13.12 4.54 73.29 NA 65.85 NA
## [81] 32.14 477.37 NA NA 104.96 NA NA NA NA NA
## [91] 96.85 NA 39.01 NA 44.82 110.50 NA 113.93 NA 105.81
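The logic above is easier to follow on a short vector. In the toy example below (values invented for illustration), a rank has no gross figure when the next element is also a rank, or when it is the last element.

```r
library(stringr)

# Interleaved ranks and gross values for four movies; ranks 2 and 4 lack gross.
rank_and_gross <- c("1.", "27.33", "2.", "3.", "142.50", "4.")
n <- length(rank_and_gross)

# Ranks end with a period; gross values do not.
isrank <- str_detect(rank_and_gross, "\\.$")

# A rank followed by another rank has no gross; so does a rank in last place.
ismissing <- c(isrank[-n] & isrank[-1], isrank[n])
missingpos <- as.integer(rank_and_gross[ismissing])
missingpos

# Fill the observed values into every position that is not missing.
gross <- rep(NA_real_, 4)
gross[-missingpos] <- c(27.33, 142.50)
gross
```

One edge case to keep in mind: if no entries are missing, `missingpos` is empty and `x[-integer(0)]` selects nothing, so a guard such as `if (length(missingpos) > 0)` is needed before the negative-index assignment.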
The following code programmatically figures out the missing entries for metascore.
# Use CSS selectors to scrape the rankings and metascores together
(rank_metascore_data_html <- html_nodes(webpage, '.unfavorable , .favorable , .mixed , .text-primary'))
## {xml_nodeset (198)}
## [1] <span class="lister-item-index unbold text-primary">1.</span>
## [2] <span class="metascore favorable">72 </span>
## [3] <span class="lister-item-index unbold text-primary">2.</span>
## [4] <span class="metascore mixed">59 </span>
## [5] <span class="lister-item-index unbold text-primary">3.</span>
## [6] <span class="metascore favorable">83 </span>
## [7] <span class="lister-item-index unbold text-primary">4.</span>
## [8] <span class="metascore mixed">51 </span>
## [9] <span class="lister-item-index unbold text-primary">5.</span>
## [10] <span class="metascore unfavorable">30 </span>
## [11] <span class="lister-item-index unbold text-primary">6.</span>
## [12] <span class="metascore mixed">59 </span>
## [13] <span class="lister-item-index unbold text-primary">7.</span>
## [14] <span class="metascore favorable">96 </span>
## [15] <span class="lister-item-index unbold text-primary">8.</span>
## [16] <span class="metascore favorable">82 </span>
## [17] <span class="lister-item-index unbold text-primary">9.</span>
## [18] <span class="metascore mixed">58 </span>
## [19] <span class="lister-item-index unbold text-primary">10.</span>
## [20] <span class="metascore favorable">78 </span>
## ...
# Convert the rank and metascore nodes to text
(rank_metascore_data <- html_text(rank_metascore_data_html))
## [1] "1." "72 " "2." "59 " "3."
## [6] "83 " "4." "51 " "5." "30 "
## [11] "6." "59 " "7." "96 " "8."
## [16] "82 " "9." "58 " "10." "78 "
## [21] "11." "91 " "12." "73 " "13."
## [26] "64 " "14." "81 " "15." "94 "
## [31] "16." "57 " "17." "55 " "18."
## [36] "53 " "19." "81 " "20." "53 "
## [41] "21." "53 " "22." "83 " "23."
## [46] "78 " "24." "65 " "25." "58 "
## [51] "26." "27." "61 " "28." "67 "
## [56] "29." "92 " "30." "31." "62 "
## [61] "32." "64 " "33." "64 " "34."
## [66] "64 " "35." "69 " "36." "53 "
## [71] "37." "53 " "38." "73 " "39."
## [76] "44 " "40." "40 " "41." "64 "
## [81] "42." "71 " "43." "54 " "44."
## [86] "72 " "45." "48 " "46." "80 "
## [91] "47." "53 " "48." "84 " "49."
## [96] "63 " "50." "84 " "51." "43 "
## [101] "52." "95 " "53." "19 " "54."
## [106] "60 " "55." "69 " "56." "38 "
## [111] "57." "41 " "58." "53 " "59."
## [116] "41 " "60." "32 " "61." "65 "
## [121] "62." "94 " "63." "48 " "64."
## [126] "31 " "65." "40 " "66." "58 "
## [131] "67." "46 " "68." "55 " "69."
## [136] "45 " "70." "36 " "71." "69 "
## [141] "72." "83 " "73." "53 " "74."
## [146] "47 " "75." "70 " "76." "48 "
## [151] "77." "55 " "78." "46 " "79."
## [156] "43 " "80." "50 " "81." "57 "
## [161] "82." "64 " "83." "50 " "84."
## [166] "48 " "85." "79 " "86." "73 "
## [171] "87." "82 " "88." "38 " "89."
## [176] "64 " "90." "61 " "91." "64 "
## [181] "92." "70 " "93." "60 " "94."
## [186] "42 " "95." "26 " "96." "48 "
## [191] "97." "25 " "98." "43 " "99."
## [196] "71 " "100." "65 "
# Strip spaces
(rank_metascore_data <- str_replace(rank_metascore_data, "\\s+", ""))
## [1] "1." "72" "2." "59" "3." "83" "4." "51" "5." "30"
## [11] "6." "59" "7." "96" "8." "82" "9." "58" "10." "78"
## [21] "11." "91" "12." "73" "13." "64" "14." "81" "15." "94"
## [31] "16." "57" "17." "55" "18." "53" "19." "81" "20." "53"
## [41] "21." "53" "22." "83" "23." "78" "24." "65" "25." "58"
## [51] "26." "27." "61" "28." "67" "29." "92" "30." "31." "62"
## [61] "32." "64" "33." "64" "34." "64" "35." "69" "36." "53"
## [71] "37." "53" "38." "73" "39." "44" "40." "40" "41." "64"
## [81] "42." "71" "43." "54" "44." "72" "45." "48" "46." "80"
## [91] "47." "53" "48." "84" "49." "63" "50." "84" "51." "43"
## [101] "52." "95" "53." "19" "54." "60" "55." "69" "56." "38"
## [111] "57." "41" "58." "53" "59." "41" "60." "32" "61." "65"
## [121] "62." "94" "63." "48" "64." "31" "65." "40" "66." "58"
## [131] "67." "46" "68." "55" "69." "45" "70." "36" "71." "69"
## [141] "72." "83" "73." "53" "74." "47" "75." "70" "76." "48"
## [151] "77." "55" "78." "46" "79." "43" "80." "50" "81." "57"
## [161] "82." "64" "83." "50" "84." "48" "85." "79" "86." "73"
## [171] "87." "82" "88." "38" "89." "64" "90." "61" "91." "64"
## [181] "92." "70" "93." "60" "94." "42" "95." "26" "96." "48"
## [191] "97." "25" "98." "43" "99." "71" "100." "65"
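`str_replace()` removes only the first match in each string. That is enough here because each scraped metascore carries a single trailing run of spaces. A small sketch on made-up strings (`scores` is hypothetical) shows how the related stringr functions differ:

```r
library(stringr)

# Hypothetical messy strings: trailing, leading, and inner whitespace
scores <- c("72 ", " 59 ", "8 3 ")

first_only <- str_replace(scores, "\\s+", "")      # first run of whitespace only
all_runs   <- str_replace_all(scores, "\\s+", "")  # every run of whitespace
trimmed    <- str_trim(scores)                     # leading and trailing only

first_only
all_runs
trimmed
```

`str_trim()` is usually the safest choice when inner spaces are meaningful and should be kept.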
# a rank followed by another rank means the metascore for the 1st rank is missing
(isrank <- str_detect(rank_metascore_data, "\\.$"))
## [1] TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE
## [13] TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE
## [25] TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE
## [37] TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE
## [49] TRUE FALSE TRUE TRUE FALSE TRUE FALSE TRUE FALSE TRUE TRUE FALSE
## [61] TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE
## [73] TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE
## [85] TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE
## [97] TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE
## [109] TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE
## [121] TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE
## [133] TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE
## [145] TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE
## [157] TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE
## [169] TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE
## [181] TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE
## [193] TRUE FALSE TRUE FALSE TRUE FALSE
ismissing <- isrank[1:(length(rank_metascore_data) - 1)] &
isrank[2:length(rank_metascore_data)]
ismissing[length(ismissing)+1] <- isrank[length(isrank)]
(missingpos <- as.integer(rank_metascore_data[ismissing]))
## [1] 26 30
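Index ranges like the ones used above depend on R's operator precedence: `:` binds more tightly than binary `-`, so `1:n-1` means `(1:n) - 1`, not `1:(n - 1)`. A small sketch on a toy vector (not the scraped data) shows that the two forms happen to select the same elements, because R silently drops a zero index. The parenthesized form states the intent directly.

```r
n <- 5
a <- 1:n - 1    # parsed as (1:n) - 1, i.e. 0, 1, 2, 3, 4
b <- 1:(n - 1)  # 1, 2, 3, 4

x <- c(TRUE, FALSE, TRUE, TRUE, FALSE)  # toy logical vector

# A zero index selects nothing, so both subsets contain x[1], ..., x[4]
same_subset <- identical(x[a], x[b])
same_subset
```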
#(rank_metascore_data <- as.integer(rank_metascore_data))
You (students) should work out the code for finding missing positions for gross.
Form a tibble:
# Combine all the vectors to form a tibble
movies <- tibble(Rank = rank_data,
Title = title_data,
Description = description_data,
Runtime = runtime_data,
Genre = genre_data,
Rating = rating_data,
Metascore = metascore_data,
Votes = votes_data,
Gross_Earning_in_Mil = gross_data,
Director = directors_data,
Actor = actors_data)
movies %>% print(width=Inf)
## # A tibble: 100 × 11
## Rank Title
## <int> <chr>
## 1 1 Midsommar
## 2 2 Joker
## 3 3 Once Upon a Time in Hollywood
## 4 4 The Gentlemen
## 5 5 After
## 6 6 Doctor Sleep
## 7 7 Parasite
## 8 8 Knives Out
## 9 9 It Chapter Two
## 10 10 Avengers: Endgame
## Description
## <chr>
## 1 A couple travels to Northern Europe to visit a rural hometown's fabled Swedi…
## 2 During the 1980s, a failed stand-up comedian is driven insane and turns to a…
## 3 A faded television actor and his stunt double strive to achieve fame and suc…
## 4 An American expat tries to sell off his highly profitable marijuana empire i…
## 5 A young woman falls for a guy with a dark secret and the two embark on a roc…
## 6 Years following the events of The Shining (1980), a now-adult Dan Torrance m…
## 7 Greed and class discrimination threaten the newly formed symbiotic relations…
## 8 A detective investigates the death of the patriarch of an eccentric, combati…
## 9 Twenty-seven years after their first encounter with the terrifying Pennywise…
## 10 After the devastating events of Avengers: Infinity War (2018), the universe …
## Runtime Genre Rating Metascore Votes Gross_Earning_in_Mil Director
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 148 Drama 7.1 72 372304 27.3 Ari Aster
## 2 122 Crime 8.4 59 NA 335. Todd Phillips
## 3 161 Comedy 7.6 83 803311 142. Quentin Tarantino
## 4 113 Action 7.8 51 372302 36.5 Guy Ritchie
## 5 105 Drama 5.3 30 62170 12.1 Jenny Gage
## 6 152 Drama 7.3 59 208182 31.6 Mike Flanagan
## 7 132 Drama 8.5 96 893174 53.4 Bong Joon Ho
## 8 130 Comedy 7.9 82 742953 165. Rian Johnson
## 9 169 Drama 6.5 58 289434 212. Andy Muschietti
## 10 181 Action 8.4 78 NA 858. Anthony Russo
## Actor
## <chr>
## 1 Florence Pugh
## 2 Joaquin Phoenix
## 3 Leonardo DiCaprio
## 4 Matthew McConaughey
## 5 Josephine Langford
## 6 Ewan McGregor
## 7 Song Kang-ho
## 8 Daniel Craig
## 9 Jessica Chastain
## 10 Robert Downey Jr.
## # ℹ 90 more rows
How many top 100 movies are in each genre? (Be careful with interpretation.)
movies %>%
ggplot() +
geom_bar(mapping = aes(x = Genre))
Which genre is most profitable in terms of average gross earnings?
movies %>%
group_by(Genre) %>%
summarise(avg_earning = mean(Gross_Earning_in_Mil, na.rm=TRUE)) %>%
ggplot() +
geom_col(mapping = aes(x = Genre, y = avg_earning)) +
labs(y = "avg earning in millions")
ggplot(data = movies) +
geom_boxplot(mapping = aes(x = Genre, y = Gross_Earning_in_Mil)) +
labs(y = "Gross earning in millions")
## Warning: Removed 30 rows containing non-finite values (`stat_boxplot()`).
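The 30 rows in the warning are the 30 `NA` values in `gross_data`. A minimal sketch on made-up numbers (`gross` here is hypothetical) shows how to count missing values and why `na.rm = TRUE` is needed for the averages:

```r
# Hypothetical gross earnings with two missing entries
gross <- c(27.33, 335.45, NA, 36.47, NA)

n_missing  <- sum(is.na(gross))        # number of missing values
mean_naive <- mean(gross)              # NA: a single NA makes the mean NA
mean_obs   <- mean(gross, na.rm = TRUE)  # mean of the observed values only

n_missing
mean_naive
mean_obs
```

Note that `na.rm = TRUE` averages over the observed movies only, so genres with many missing values are summarized by fewer films.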
Is there a relationship between gross earning and rating? Find the best-selling movie (by gross earning) in each genre.
library("ggrepel")
(best_in_genre <- movies %>%
group_by(Genre) %>%
filter(row_number(desc(Gross_Earning_in_Mil)) == 1))
## # A tibble: 8 × 11
## # Groups: Genre [8]
## Rank Title Description Runtime Genre Rating Metascore Votes
## <int> <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 2 Joker During the 1980… 122 Crime 8.4 59 NA
## 2 8 Knives Out A detective inv… 130 Come… 7.9 82 742953
## 3 9 It Chapter Two Twenty-seven ye… 169 Drama 6.5 58 289434
## 4 10 Avengers: Endgame After the devas… 181 Acti… 8.4 78 NA
## 5 14 Us A family's sere… 116 Horr… 6.8 81 325094
## 6 17 The Lion King After the murde… 118 Adve… 6.8 55 259977
## 7 55 Rocketman A musical fanta… 121 Biog… 7.3 38 187816
## 8 82 Frozen II Anna, Elsa, Kri… 103 Anim… 6.8 50 186571
## # ℹ 3 more variables: Gross_Earning_in_Mil <dbl>, Director <chr>, Actor <chr>
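The `row_number(desc(...)) == 1` idiom keeps one top row per group and drops rows whose value is `NA`, since their rank is `NA`. A sketch on a made-up data frame (`toy` and its columns are hypothetical) compares it with `slice_max()`, assuming dplyr 1.0.0 or later:

```r
library(dplyr)

# Hypothetical movies; movie C has a missing gross
toy <- data.frame(
  title = c("A", "B", "C", "D", "E"),
  genre = c("Drama", "Drama", "Crime", "Crime", "Crime"),
  gross = c(10, 25, NA, 40, 5)
)

# Same idiom as above: rank within genre, keep the top-ranked row
best_a <- toy %>%
  group_by(genre) %>%
  filter(row_number(desc(gross)) == 1) %>%
  ungroup()

# Alternative: slice_max() picks the row with the largest gross per genre
best_b <- toy %>%
  group_by(genre) %>%
  slice_max(gross, n = 1, with_ties = FALSE) %>%
  ungroup()

best_a
best_b
```

Both return the same movies here; the row order can differ because `filter()` keeps the original order while `slice_max()` returns rows group by group.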
ggplot(movies, mapping = aes(x = Rating, y = Gross_Earning_in_Mil)) +
geom_point(mapping = aes(size = Votes, color = Genre)) +
ggrepel::geom_label_repel(aes(label = Title), data = best_in_genre) +
labs(y = "Gross earning in millions")
## Warning: Removed 32 rows containing missing values (`geom_point()`).
Many websites dynamically pull data from databases using JavaScript and jQuery, which makes them difficult to scrape.
The FCC’s dtvmaps webpage has a simple form in which you enter a zip code and it gives you the available local TV stations in that zip code and their signal strength.
You’ll also notice the URL stays fixed with different zip codes.
RSelenium loads the page that we want to scrape and downloads the HTML from that page.
particularly useful when scraping something behind a login
simulate human behavior on a website (e.g., mouse clicking)
rvest provides typical scraping tools
rm(list = ls()) # clean-up workspace
library("RSelenium")
library("tidyverse")
library("rvest")
rD <- rsDriver(browser = "firefox", port = sample(1024:7360L, 1), verbose = FALSE) # random port, avoiding privileged ports below 1024
remDr <- rD[["client"]]
Open a webpage
remDr$navigate("https://www.fcc.gov/media/engineering/dtvmaps")
We want to send a string of text (zip code) into the form.
zip <- "70118"
# remDr$findElement(using = "id", value = "startpoint")$clearElement()
remDr$findElement(using = "id", value = "startpoint")$sendKeysToElement(list(zip))
# other possible ("xpath", "css selector", "id", "name", "tag name", "class name", "link text", "partial link text")
Click on the button Go!
remDr$findElements("id", "btnSub")[[1]]$clickElement()
save HTML to an object
use rvest for the rest
Sys.sleep(5) # give the page time to fully load, in seconds
html <- remDr$getPageSource()[[1]]
# important to close the client
remDr$close()
signals <- read_html(html) %>%
html_nodes("table.tbl_mapReception") %>% # extract table nodes with class = "tbl_mapReception"
.[3] %>% # keep the third of these tables
.[[1]] %>% # keep the first element of this list
html_table(fill=T) # have rvest turn it into a dataframe
signals
## # A tibble: 37 × 6
## Callsign Callsign Network `Ch#` Band IA
## <chr> <chr> <chr> <chr> <chr> <lgl>
## 1 "Click on callsign for detail" "Click on callsign … "Click… "Cli… "Cli… NA
## 2 "" "WWL-TV" "CBS" "4" "UHF" NA
## 3 "" "" "" "" "" NA
## 4 "" "WUPL" "MYNE" "54" "UHF" NA
## 5 "" "" "" "" "" NA
## 6 "" "WPXL-TV" "ION" "49" "UHF" NA
## 7 "" "" "" "" "" NA
## 8 "" "WHNO" "IND" "20" "UHF" NA
## 9 "" "" "" "" "" NA
## 10 "" "WGNO" "ABC" "26" "UHF" NA
## # ℹ 27 more rows
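The fixed `Sys.sleep(5)` used above before grabbing the page source either wastes time or is too short on a slow connection. One alternative is to poll until a condition holds. The helper below is a hypothetical sketch in base R (`wait_until()` is not part of RSelenium), and the stand-in condition simply becomes true on the third check:

```r
# Poll `ready()` until it returns TRUE or `timeout` seconds have passed.
# Returns TRUE on success and FALSE on timeout. (Hypothetical helper.)
wait_until <- function(ready, timeout = 10, interval = 0.5) {
  deadline <- Sys.time() + timeout
  repeat {
    if (isTRUE(ready())) return(TRUE)
    if (Sys.time() >= deadline) return(FALSE)
    Sys.sleep(interval)
  }
}

# Stand-in condition for illustration: TRUE on the third check
checks <- 0
ok <- wait_until(function() {
  checks <<- checks + 1
  checks >= 3
}, timeout = 5, interval = 0.1)
ok
```

With a live session, `ready` could be something like `function() grepl("tbl_mapReception", remDr$getPageSource()[[1]], fixed = TRUE)`, assuming the results table's class name only appears in the page source once the results have loaded.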
More formatting on signals
names(signals) <- c("rm", "callsign", "network", "ch_num", "band", "rm2") # rename columns
signals <- signals %>%
slice(2:n()) %>% # drop unnecessary first row
filter(callsign != "") %>% # drop blank rows
select(callsign:band) # drop unnecessary columns
signals
## # A tibble: 18 × 4
## callsign network ch_num band
## <chr> <chr> <chr> <chr>
## 1 WWL-TV "CBS" "4" UHF
## 2 WUPL "MYNE" "54" UHF
## 3 WPXL-TV "ION" "49" UHF
## 4 WHNO "IND" "20" UHF
## 5 WGNO "ABC" "26" UHF
## 6 WVUE-DT "FOX" "8" UHF
## 7 WDSU "NBC" "6" UHF
## 8 WNOL-TV "CW" "38" UHF
## 9 KGLA-DT "IND" "42" UHF
## 10 WTNO-CD "" "" UHF
## 11 WYES-TV "PBS" "12" Hi-V
## 12 WLAE-TV "PBS" "32" UHF
## 13 KNOV-CD "" "" UHF
## 14 WBXN-CD "" "" UHF
## 15 WVLA-TV "NBC" "33" UHF
## 16 WBRZ-TV "ABC" "2" Hi-V
## 17 WGMB-TV "FOX" "44" UHF
## 18 WAFB "CBS" "9" Hi-V
Capture the detail text shown when clicking on each callsign; it is stored in each callsign's onclick attribute, so no clicking is needed
read_html(html) %>%
html_nodes(".callsign") %>%
html_attr("onclick")
## [1] "getdetail(15230,74192,'WWL-TV Facility ID: 74192 <br>WWL-TV (<a href=https://enterpriseefiling.fcc.gov/dataentry/public/tv/publicFacilityDetails.html?facilityId=74192 target=_new>Licensing</a>) (<a href=https://publicfiles.fcc.gov/tv-profile/74192 target=_new>Public File</a>)<br>City of License: NEW ORLEANS, LA<br>RF Channel: 27<br>RX Strength: 115 dbuV/m<br>Tower Distance: 6 mi; Direction: 132°','WWL-TV<br>Distance to Tower: 6 miles<br>Direction to Tower: 132 deg',29.9063611111111,-90.0394722222222,'WWL-TV')"
## [2] "getdetail(15231,13938,'WUPL Facility ID: 13938 <br>WUPL (<a href=https://enterpriseefiling.fcc.gov/dataentry/public/tv/publicFacilityDetails.html?facilityId=13938 target=_new>Licensing</a>) (<a href=https://publicfiles.fcc.gov/tv-profile/13938 target=_new>Public File</a>)<br>City of License: SLIDELL, LA<br>RF Channel: 17<br>RX Strength: 114 dbuV/m<br>Tower Distance: 6 mi; Direction: 132°','WUPL<br>Distance to Tower: 6 miles<br>Direction to Tower: 132 deg',29.9063611111111,-90.0394722222222,'WUPL')"
## [3] "getdetail(15800,21729,'WPXL-TV Facility ID: 21729 <br>WPXL-TV (<a href=https://enterpriseefiling.fcc.gov/dataentry/public/tv/publicFacilityDetails.html?facilityId=21729 target=_new>Licensing</a>) (<a href=https://publicfiles.fcc.gov/tv-profile/21729 target=_new>Public File</a>)<br>City of License: NEW ORLEANS, LA<br>RF Channel: 33<br>RX Strength: 111 dbuV/m<br>Tower Distance: 10 mi; Direction: 82°','WPXL-TV<br>Distance to Tower: 10 miles<br>Direction to Tower: 82 deg',29.9827777777778,-89.9494444444445,'WPXL-TV')"
## [4] "getdetail(16584,37106,'WHNO Facility ID: 37106 <br>WHNO (<a href=https://enterpriseefiling.fcc.gov/dataentry/public/tv/publicFacilityDetails.html?facilityId=37106 target=_new>Licensing</a>) (<a href=https://publicfiles.fcc.gov/tv-profile/37106 target=_new>Public File</a>)<br>City of License: NEW ORLEANS, LA<br>RF Channel: 21<br>RX Strength: 111 dbuV/m<br>Tower Distance: 6 mi; Direction: 120°','WHNO<br>Distance to Tower: 6 miles<br>Direction to Tower: 120 deg',29.9203055555556,-90.0245833333333,'WHNO')"
## [5] "getdetail(15221,72119,'WGNO Facility ID: 72119 <br>WGNO (<a href=https://enterpriseefiling.fcc.gov/dataentry/public/tv/publicFacilityDetails.html?facilityId=72119 target=_new>Licensing</a>) (<a href=https://publicfiles.fcc.gov/tv-profile/72119 target=_new>Public File</a>)<br>City of License: NEW ORLEANS, LA<br>RF Channel: 26<br>RX Strength: 111 dbuV/m<br>Tower Distance: 9 mi; Direction: 96°','WGNO<br>Distance to Tower: 9 miles<br>Direction to Tower: 96 deg',29.95,-89.9577777777778,'WGNO')"
## [6] "getdetail(15232,4149,'WVUE-DT Facility ID: 4149 <br>WVUE-DT (<a href=https://enterpriseefiling.fcc.gov/dataentry/public/tv/publicFacilityDetails.html?facilityId=4149 target=_new>Licensing</a>) (<a href=https://publicfiles.fcc.gov/tv-profile/4149 target=_new>Public File</a>)<br>City of License: NEW ORLEANS, LA<br>RF Channel: 29<br>RX Strength: 111 dbuV/m<br>Tower Distance: 10 mi; Direction: 94°','WVUE-DT<br>Distance to Tower: 10 miles<br>Direction to Tower: 94 deg',29.9541388888889,-89.9495277777778,'WVUE-DT')"
## [7] "getdetail(15212,71357,'WDSU Facility ID: 71357 <br>WDSU (<a href=https://enterpriseefiling.fcc.gov/dataentry/public/tv/publicFacilityDetails.html?facilityId=71357 target=_new>Licensing</a>) (<a href=https://publicfiles.fcc.gov/tv-profile/71357 target=_new>Public File</a>)<br>City of License: NEW ORLEANS, LA<br>RF Channel: 19<br>RX Strength: 110 dbuV/m<br>Tower Distance: 9 mi; Direction: 96°','WDSU<br>Distance to Tower: 9 miles<br>Direction to Tower: 96 deg',29.95,-89.9577777777778,'WDSU')"
## [8] "getdetail(15220,54280,'WNOL-TV Facility ID: 54280 <br>WNOL-TV (<a href=https://enterpriseefiling.fcc.gov/dataentry/public/tv/publicFacilityDetails.html?facilityId=54280 target=_new>Licensing</a>) (<a href=https://publicfiles.fcc.gov/tv-profile/54280 target=_new>Public File</a>)<br>City of License: NEW ORLEANS, LA<br>RF Channel: 15<br>RX Strength: 110 dbuV/m<br>Tower Distance: 9 mi; Direction: 96°','WNOL-TV<br>Distance to Tower: 9 miles<br>Direction to Tower: 96 deg',29.95,-89.9577777777778,'WNOL-TV')"
## [9] "getdetail(15195,83945,'KGLA-DT Facility ID: 83945 <br>KGLA-DT (<a href=https://enterpriseefiling.fcc.gov/dataentry/public/tv/publicFacilityDetails.html?facilityId=83945 target=_new>Licensing</a>) (<a href=https://publicfiles.fcc.gov/tv-profile/83945 target=_new>Public File</a>)<br>City of License: HAMMOND, LA<br>RF Channel: 35<br>RX Strength: 110 dbuV/m<br>Tower Distance: 10 mi; Direction: 84°','KGLA-DT<br>Distance to Tower: 10 miles<br>Direction to Tower: 84 deg',29.9783333333333,-89.9405555555556,'KGLA-DT')"
## [10] "getdetail(16703,24981,'WTNO-CD Facility ID: 24981 <br>WTNO-CD (<a href=https://enterpriseefiling.fcc.gov/dataentry/public/tv/publicFacilityDetails.html?facilityId=24981 target=_new>Licensing</a>) (<a href=https://publicfiles.fcc.gov/tv-profile/24981 target=_new>Public File</a>)<br>City of License: NEW ORLEANS, LA<br>RF Channel: 22<br>RX Strength: 109 dbuV/m<br>Tower Distance: 2 mi; Direction: 292°','WTNO-CD<br>Distance to Tower: 2 miles<br>Direction to Tower: 292 deg',29.9746111111111,-90.1434722222222,'WTNO-CD')"
## [11] "getdetail(16332,25090,'WYES-TV Facility ID: 25090 <br>WYES-TV (<a href=https://enterpriseefiling.fcc.gov/dataentry/public/tv/publicFacilityDetails.html?facilityId=25090 target=_new>Licensing</a>) (<a href=https://publicfiles.fcc.gov/tv-profile/25090 target=_new>Public File</a>)<br>City of License: NEW ORLEANS, LA<br>RF Channel: 11<br>RX Strength: 101 dbuV/m<br>Tower Distance: 10 mi; Direction: 94°','WYES-TV<br>Distance to Tower: 10 miles<br>Direction to Tower: 94 deg',29.9538888888889,-89.9494444444445,'WYES-TV')"
## [12] "getdetail(15857,18819,'WLAE-TV Facility ID: 18819 <br>WLAE-TV (<a href=https://enterpriseefiling.fcc.gov/dataentry/public/tv/publicFacilityDetails.html?facilityId=18819 target=_new>Licensing</a>) (<a href=https://publicfiles.fcc.gov/tv-profile/18819 target=_new>Public File</a>)<br>City of License: NEW ORLEANS, LA<br>RF Channel: 23<br>RX Strength: 105 dbuV/m<br>Tower Distance: 10 mi; Direction: 82°','WLAE-TV<br>Distance to Tower: 10 miles<br>Direction to Tower: 82 deg',29.9827777777778,-89.9525,'WLAE-TV')"
## [13] "getdetail(16837,64048,'KNOV-CD Facility ID: 64048 <br>KNOV-CD (<a href=https://enterpriseefiling.fcc.gov/dataentry/public/tv/publicFacilityDetails.html?facilityId=64048 target=_new>Licensing</a>) (<a href=https://publicfiles.fcc.gov/tv-profile/64048 target=_new>Public File</a>)<br>City of License: NEW ORLEANS, LA<br>RF Channel: 31<br>RX Strength: 102 dbuV/m<br>Tower Distance: 3 mi; Direction: 107°','KNOV-CD<br>Distance to Tower: 3 miles<br>Direction to Tower: 107 deg',29.9521388888889,-90.0702777777778,'KNOV-CD')"
## [14] "getdetail(16816,70419,'WBXN-CD Facility ID: 70419 <br>WBXN-CD (<a href=https://enterpriseefiling.fcc.gov/dataentry/public/tv/publicFacilityDetails.html?facilityId=70419 target=_new>Licensing</a>) (<a href=https://publicfiles.fcc.gov/tv-profile/70419 target=_new>Public File</a>)<br>City of License: NEW ORLEANS, LA<br>RF Channel: 36<br>RX Strength: 92 dbuV/m<br>Tower Distance: 6 mi; Direction: 132°','WBXN-CD<br>Distance to Tower: 6 miles<br>Direction to Tower: 132 deg',29.9063611111111,-90.0394722222222,'WBXN-CD')"
## [15] "getdetail(16563,70021,'WVLA-TV Facility ID: 70021 <br>WVLA-TV (<a href=https://enterpriseefiling.fcc.gov/dataentry/public/tv/publicFacilityDetails.html?facilityId=70021 target=_new>Licensing</a>) (<a href=https://publicfiles.fcc.gov/tv-profile/70021 target=_new>Public File</a>)<br>City of License: BATON ROUGE, LA<br>RF Channel: 34<br>RX Strength: 53 dbuV/m<br>Tower Distance: 74 mi; Direction: 290°','WVLA-TV<br>Distance to Tower: 74 miles<br>Direction to Tower: 290 deg',30.3262777777778,-91.2766944444444,'WVLA-TV')"
## [16] "getdetail(16251,38616,'WBRZ-TV Facility ID: 38616 <br>WBRZ-TV (<a href=https://enterpriseefiling.fcc.gov/dataentry/public/tv/publicFacilityDetails.html?facilityId=38616 target=_new>Licensing</a>) (<a href=https://publicfiles.fcc.gov/tv-profile/38616 target=_new>Public File</a>)<br>City of License: BATON ROUGE, LA<br>RF Channel: 13<br>RX Strength: 46 dbuV/m<br>Tower Distance: 69 mi; Direction: 290°','WBRZ-TV<br>Distance to Tower: 69 miles<br>Direction to Tower: 290 deg',30.2969444444444,-91.1936111111111,'WBRZ-TV')"
## [17] "getdetail(15727,12520,'WGMB-TV Facility ID: 12520 <br>WGMB-TV (<a href=https://enterpriseefiling.fcc.gov/dataentry/public/tv/publicFacilityDetails.html?facilityId=12520 target=_new>Licensing</a>) (<a href=https://publicfiles.fcc.gov/tv-profile/12520 target=_new>Public File</a>)<br>City of License: BATON ROUGE, LA<br>RF Channel: 24<br>RX Strength: 50 dbuV/m<br>Tower Distance: 74 mi; Direction: 290°','WGMB-TV<br>Distance to Tower: 74 miles<br>Direction to Tower: 290 deg',30.3262777777778,-91.2766944444444,'WGMB-TV')"
## [18] "getdetail(16368,589,'WAFB Facility ID: 589 <br>WAFB (<a href=https://enterpriseefiling.fcc.gov/dataentry/public/tv/publicFacilityDetails.html?facilityId=589 target=_new>Licensing</a>) (<a href=https://publicfiles.fcc.gov/tv-profile/589 target=_new>Public File</a>)<br>City of License: BATON ROUGE, LA<br>RF Channel: 9<br>RX Strength: 38 dbuV/m<br>Tower Distance: 71 mi; Direction: 293°','WAFB<br>Distance to Tower: 71 miles<br>Direction to Tower: 293 deg',30.3663888888889,-91.2130555555556,'WAFB')"
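The same rvest calls can be tried on a small static page. This is a toy sketch: the HTML string, the class name `demo`, and the `getdetail(...)` values are made up for illustration.

```r
library(rvest)

# A made-up page with one table and two clickable callsigns
page <- read_html('
  <table class="demo">
    <tr><th>Callsign</th><th>Network</th></tr>
    <tr><td><a class="callsign" onclick="getdetail(1)">AAA-TV</a></td><td>CBS</td></tr>
    <tr><td><a class="callsign" onclick="getdetail(2)">BBB</a></td><td>NBC</td></tr>
  </table>')

# Select by class and read the attribute; nothing is clicked
callsign_nodes <- html_nodes(page, ".callsign")
onclick <- html_attr(callsign_nodes, "onclick")
labels  <- html_text(callsign_nodes)

# Turn the first matching table node into a data frame
tbl <- html_table(html_nodes(page, "table.demo")[[1]])

onclick
labels
tbl
```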
Extract signal strength by string operations
strength <- read_html(html) %>%
html_nodes(".callsign") %>%
html_attr("onclick") %>%
str_extract("(?<=RX Strength: )\\s*\\-*[0-9.]+")
# (?<=…) is the regex syntax for a positive lookbehind
signals <- cbind(signals, strength)
signals
## callsign network ch_num band strength
## 1 WWL-TV CBS 4 UHF 115
## 2 WUPL MYNE 54 UHF 114
## 3 WPXL-TV ION 49 UHF 111
## 4 WHNO IND 20 UHF 111
## 5 WGNO ABC 26 UHF 111
## 6 WVUE-DT FOX 8 UHF 111
## 7 WDSU NBC 6 UHF 110
## 8 WNOL-TV CW 38 UHF 110
## 9 KGLA-DT IND 42 UHF 110
## 10 WTNO-CD UHF 109
## 11 WYES-TV PBS 12 Hi-V 101
## 12 WLAE-TV PBS 32 UHF 105
## 13 KNOV-CD UHF 102
## 14 WBXN-CD UHF 92
## 15 WVLA-TV NBC 33 UHF 53
## 16 WBRZ-TV ABC 2 Hi-V 46
## 17 WGMB-TV FOX 44 UHF 50
## 18 WAFB CBS 9 Hi-V 38
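A positive lookbehind requires the label to precede the match without including it in the result. A short sketch on made-up detail strings (the second entry, without a strength field, is hypothetical) shows the difference and the conversion to numbers:

```r
library(stringr)

detail <- c(
  "RF Channel: 27<br>RX Strength: 115 dbuV/m<br>Tower Distance: 6 mi",
  "RF Channel: 9<br>Tower Distance: 71 mi"  # hypothetical: no strength field
)

# Without lookbehind the label is part of the match
with_label <- str_extract(detail, "RX Strength: [0-9.]+")

# With lookbehind only the number is returned; no match gives NA
value_only <- str_extract(detail, "(?<=RX Strength: )-?[0-9.]+")

# str_extract() returns character, so convert before doing arithmetic
strength_num <- as.numeric(value_only)

with_label
value_only
strength_num
```

The `strength` column bound to `signals` above is likewise a character vector, so `as.numeric()` is needed before sorting or plotting by strength.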