Announcement

sessionInfo()
## R version 4.3.0 (2023-04-21)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Ventura 13.5.2
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/Chicago
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] digest_0.6.33   R6_2.5.1        fastmap_1.1.1   xfun_0.39      
##  [5] cachem_1.0.8    knitr_1.42      htmltools_0.5.5 rmarkdown_2.21 
##  [9] cli_3.6.1       sass_0.4.6      jquerylib_0.1.4 compiler_4.3.0 
## [13] rstudioapi_0.14 tools_4.3.0     evaluate_0.20   bslib_0.4.2    
## [17] yaml_2.3.7      rlang_1.1.1     jsonlite_1.8.4

Acknowledgement

Dr. Hua Zhou’s slides

Josh McCrain’s RSelenium tutorial

HTML Introduction from GeeksforGeeks

Getting started with HTML MDN Web Docs

Load tidyverse and other packages for this lecture:

library("tidyverse")
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library("rvest")
## 
## Attaching package: 'rvest'
## 
## The following object is masked from 'package:readr':
## 
##     guess_encoding

HTML introduction

We first give a survival-level introduction to HTML.

HTML stands for HyperText Markup Language.

Elements and Tags

  • HTML uses predefined tags and elements which tell the browser how to properly display the content.

  • Remember to include closing tags. If omitted, the browser applies the effect of the opening tag until the end of the page.

Attributes

Elements can also have attributes. Attributes look like this:

  • Attributes contain extra information about the element that won’t appear in the content.

  • In this example, the class attribute is an identifying name used to target the element with style information.

An attribute should have:

  • A space between it and the element name. (For an element with more than one attribute, the attributes should be separated by spaces too.)

  • The attribute name, followed by an equal sign.

  • An attribute value, wrapped with opening and closing quote marks.
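To make the attribute rules concrete, here is a minimal sketch that parses a single element with rvest and reads its attributes back. The HTML string, the `editor-note` class, and the `intro` id are illustrative assumptions, not taken from a real page.

```r
library(rvest)

# Hypothetical one-element document, used only for illustration
page <- minimal_html('<p class="editor-note" id="intro">My cat is very grumpy</p>')

p_node <- html_element(page, "p")

p_text  <- html_text(p_node)           # the visible content of the element
p_class <- html_attr(p_node, "class")  # value of a single attribute
p_attrs <- html_attrs(p_node)          # all attributes as a named character vector

p_text
p_class
p_attrs
```

Note that the attribute values never show up in `html_text()`; they have to be requested explicitly with `html_attr()` or `html_attrs()`.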

Anchors

Another example of an element is <a>, which stands for anchor. An anchor turns the text it encloses into a hyperlink. Anchors can take a number of attributes; some common ones are:

  • href: This attribute’s value specifies the web address for the link. For example: href="https://www.mozilla.org/".

  • title: The title attribute specifies extra information about the link, such as a description of the page that is being linked to. For example, title="The Mozilla homepage". This appears as a tooltip when a cursor hovers over the element.

  • target: The target attribute specifies the browsing context used to display the link. For example, target="_blank" will display the link in a new tab. If you want to display the linked content in the current tab, just omit this attribute.
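When scraping, the anchor attributes are often what we are after, since `href` holds the link target. The sketch below collects the attributes described above into a data frame. The small HTML fragment (including the `/about` link) is a made-up example.

```r
library(rvest)

# Hypothetical fragment with two anchors; the second has no title or target
page <- minimal_html('
  <p>Visit <a href="https://www.mozilla.org/" title="The Mozilla homepage" target="_blank">Mozilla</a>
  or <a href="/about">the about page</a>.</p>')

links <- html_elements(page, "a")

link_tbl <- data.frame(
  text   = html_text2(links),
  href   = html_attr(links, "href"),
  title  = html_attr(links, "title"),   # NA when the attribute is absent
  target = html_attr(links, "target")   # NA when the attribute is absent
)

link_tbl
```

Because `html_attr()` returns `NA` for a missing attribute, the columns stay aligned even when some anchors lack a `title` or `target`.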

HTML page structure

  • The basic structure of an HTML page is laid out below.

  • It contains the essential building-block elements upon which all web pages are created.

    • doctype declaration

    • HTML

    • head

    • title

    • body elements
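As a rough illustration of these building blocks, the sketch below stores a bare-bones page in an R string and parses it with rvest. The page content (title, heading, paragraph) is invented for the example.

```r
library(rvest)

# A hypothetical minimal page: doctype, html, head, title, and body
html_src <- '<!DOCTYPE html>
<html>
  <head>
    <title>Page title</title>
  </head>
  <body>
    <h1>A heading</h1>
    <p>A paragraph.</p>
  </body>
</html>'

page <- read_html(html_src)

# The title lives in the head; the visible elements live in the body
page_title    <- html_text2(html_element(page, "title"))
body_children <- html_name(html_children(html_element(page, "body")))

page_title
body_children
```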

HTML comments

To write an HTML comment, wrap it in the special markers <!-- and -->. For example:

<p>I'm not inside a comment</p>

<!-- <p>I am!</p> -->

generates:

I’m not inside a comment


Web scraping

There is a wealth of data on the internet. How do we scrape and analyze it?

rvest

rvest is an R package written by Hadley Wickham which makes web scraping easy.
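The usual rvest workflow is: read the page, select elements with a CSS selector, extract their text, and clean it up. Below is a minimal sketch on an inline HTML fragment; the `item` and `rank` class names and the two movie titles are placeholders. It uses `html_elements()`, which rvest 1.0.0 introduced as the preferred name for `html_nodes()`; both select the same elements.

```r
library(rvest)

# Hypothetical two-row listing, used instead of a live web page
page <- minimal_html('
  <div class="item"><span class="rank">1.</span> <a href="/title/a">First Movie</a></div>
  <div class="item"><span class="rank">2.</span> <a href="/title/b">Second Movie</a></div>')

# Select by class, extract the text, then convert "1." and "2." to integers
ranks <- page %>%
  html_elements(".rank") %>%
  html_text2() %>%
  as.integer()

# A descendant selector: anchors inside elements of class "item"
titles <- page %>%
  html_elements(".item a") %>%
  html_text2()

data.frame(rank = ranks, title = titles)
```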

Example: Scraping from webpage

Rank

  • Use SelectorGadget to highlight the element we want to scrape

  • Use the CSS selector to get the rankings

    # Use CSS selectors to scrape the rankings section
    (rank_data_html <- html_nodes(webpage, '.text-primary'))
    ## {xml_nodeset (100)}
    ##  [1] <span class="lister-item-index unbold text-primary">1.</span>
    ##  [2] <span class="lister-item-index unbold text-primary">2.</span>
    ##  [3] <span class="lister-item-index unbold text-primary">3.</span>
    ##  [4] <span class="lister-item-index unbold text-primary">4.</span>
    ##  [5] <span class="lister-item-index unbold text-primary">5.</span>
    ##  [6] <span class="lister-item-index unbold text-primary">6.</span>
    ##  [7] <span class="lister-item-index unbold text-primary">7.</span>
    ##  [8] <span class="lister-item-index unbold text-primary">8.</span>
    ##  [9] <span class="lister-item-index unbold text-primary">9.</span>
    ## [10] <span class="lister-item-index unbold text-primary">10.</span>
    ## [11] <span class="lister-item-index unbold text-primary">11.</span>
    ## [12] <span class="lister-item-index unbold text-primary">12.</span>
    ## [13] <span class="lister-item-index unbold text-primary">13.</span>
    ## [14] <span class="lister-item-index unbold text-primary">14.</span>
    ## [15] <span class="lister-item-index unbold text-primary">15.</span>
    ## [16] <span class="lister-item-index unbold text-primary">16.</span>
    ## [17] <span class="lister-item-index unbold text-primary">17.</span>
    ## [18] <span class="lister-item-index unbold text-primary">18.</span>
    ## [19] <span class="lister-item-index unbold text-primary">19.</span>
    ## [20] <span class="lister-item-index unbold text-primary">20.</span>
    ## ...
    # (rank_data_html <- html_nodes(webpage, '.lister-item-content .text-primary'))
    # Convert the ranking data to text
    (rank_data <- html_text(rank_data_html))
    ##   [1] "1."   "2."   "3."   "4."   "5."   "6."   "7."   "8."   "9."   "10." 
    ##  [11] "11."  "12."  "13."  "14."  "15."  "16."  "17."  "18."  "19."  "20." 
    ##  [21] "21."  "22."  "23."  "24."  "25."  "26."  "27."  "28."  "29."  "30." 
    ##  [31] "31."  "32."  "33."  "34."  "35."  "36."  "37."  "38."  "39."  "40." 
    ##  [41] "41."  "42."  "43."  "44."  "45."  "46."  "47."  "48."  "49."  "50." 
    ##  [51] "51."  "52."  "53."  "54."  "55."  "56."  "57."  "58."  "59."  "60." 
    ##  [61] "61."  "62."  "63."  "64."  "65."  "66."  "67."  "68."  "69."  "70." 
    ##  [71] "71."  "72."  "73."  "74."  "75."  "76."  "77."  "78."  "79."  "80." 
    ##  [81] "81."  "82."  "83."  "84."  "85."  "86."  "87."  "88."  "89."  "90." 
    ##  [91] "91."  "92."  "93."  "94."  "95."  "96."  "97."  "98."  "99."  "100."
    # Turn into numerical values
    (rank_data <- as.integer(rank_data))
    ##   [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
    ##  [19]  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
    ##  [37]  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54
    ##  [55]  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
    ##  [73]  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90
    ##  [91]  91  92  93  94  95  96  97  98  99 100

Title

  • Use SelectorGadget to find the CSS selector .lister-item-header a.

  • CSS selector reference

    # Using CSS selectors to scrape the title section
    (title_data_html <- html_nodes(webpage, '.lister-item-header a'))
    ## {xml_nodeset (100)}
    ##  [1] <a href="/title/tt8772262/?ref_=adv_li_tt">Midsommar</a>
    ##  [2] <a href="/title/tt7286456/?ref_=adv_li_tt">Joker</a>
    ##  [3] <a href="/title/tt7131622/?ref_=adv_li_tt">Once Upon a Time in Hollywood ...
    ##  [4] <a href="/title/tt8367814/?ref_=adv_li_tt">The Gentlemen</a>
    ##  [5] <a href="/title/tt4126476/?ref_=adv_li_tt">After</a>
    ##  [6] <a href="/title/tt5606664/?ref_=adv_li_tt">Doctor Sleep</a>
    ##  [7] <a href="/title/tt6751668/?ref_=adv_li_tt">Parasite</a>
    ##  [8] <a href="/title/tt8946378/?ref_=adv_li_tt">Knives Out</a>
    ##  [9] <a href="/title/tt7349950/?ref_=adv_li_tt">It Chapter Two</a>
    ## [10] <a href="/title/tt4154796/?ref_=adv_li_tt">Avengers: Endgame</a>
    ## [11] <a href="/title/tt3281548/?ref_=adv_li_tt">Little Women</a>
    ## [12] <a href="/title/tt6146586/?ref_=adv_li_tt">John Wick: Chapter 3 - Parabe ...
    ## [13] <a href="/title/tt7798634/?ref_=adv_li_tt">Ready or Not</a>
    ## [14] <a href="/title/tt6857112/?ref_=adv_li_tt">Us</a>
    ## [15] <a href="/title/tt1302006/?ref_=adv_li_tt">The Irishman</a>
    ## [16] <a href="/title/tt0837563/?ref_=adv_li_tt">Pet Sematary</a>
    ## [17] <a href="/title/tt6105098/?ref_=adv_li_tt">The Lion King</a>
    ## [18] <a href="/title/tt2527338/?ref_=adv_li_tt">Star Wars: The Rise Of Skywal ...
    ## [19] <a href="/title/tt1950186/?ref_=adv_li_tt">Ford v Ferrari</a>
    ## [20] <a href="/title/tt0437086/?ref_=adv_li_tt">Alita: Battle Angel</a>
    ## ...
    # Converting the title data to text
    (title_data <- html_text(title_data_html))
    ##   [1] "Midsommar"                            
    ##   [2] "Joker"                                
    ##   [3] "Once Upon a Time in Hollywood"        
    ##   [4] "The Gentlemen"                        
    ##   [5] "After"                                
    ##   [6] "Doctor Sleep"                         
    ##   [7] "Parasite"                             
    ##   [8] "Knives Out"                           
    ##   [9] "It Chapter Two"                       
    ##  [10] "Avengers: Endgame"                    
    ##  [11] "Little Women"                         
    ##  [12] "John Wick: Chapter 3 - Parabellum"    
    ##  [13] "Ready or Not"                         
    ##  [14] "Us"                                   
    ##  [15] "The Irishman"                         
    ##  [16] "Pet Sematary"                         
    ##  [17] "The Lion King"                        
    ##  [18] "Star Wars: The Rise Of Skywalker"     
    ##  [19] "Ford v Ferrari"                       
    ##  [20] "Alita: Battle Angel"                  
    ##  [21] "Aladdin"                              
    ##  [22] "The Lighthouse"                       
    ##  [23] "1917"                                 
    ##  [24] "I See You"                            
    ##  [25] "Jojo Rabbit"                          
    ##  [26] "Curiosa"                              
    ##  [27] "Scary Stories to Tell in the Dark"    
    ##  [28] "Queen of Hearts"                      
    ##  [29] "Uncut Gems"                           
    ##  [30] "The Blackout"                         
    ##  [31] "The King"                             
    ##  [32] "Captain Marvel"                       
    ##  [33] "Vivarium"                             
    ##  [34] "The Lodge"                            
    ##  [35] "Haunt"                                
    ##  [36] "Five Feet Apart"                      
    ##  [37] "Ma"                                   
    ##  [38] "The Platform"                         
    ##  [39] "Brightburn"                           
    ##  [40] "Anna"                                 
    ##  [41] "Bombshell"                            
    ##  [42] "Shazam!"                              
    ##  [43] "Terminator: Dark Fate"                
    ##  [44] "El Camino: A Breaking Bad Movie"      
    ##  [45] "Escape Room"                          
    ##  [46] "Ad Astra"                             
    ##  [47] "Pokémon: Detective Pikachu"           
    ##  [48] "Booksmart"                            
    ##  [49] "Official Secrets"                     
    ##  [50] "Toy Story 4"                          
    ##  [51] "Glass"                                
    ##  [52] "Portrait of a Lady on Fire"           
    ##  [53] "Polar"                                
    ##  [54] "Fast & Furious Presents: Hobbs & Shaw"
    ##  [55] "Rocketman"                            
    ##  [56] "Men in Black: International"          
    ##  [57] "The Curse of La Llorona"              
    ##  [58] "Annabelle Comes Home"                 
    ##  [59] "6 Underground"                        
    ##  [60] "Cats"                                 
    ##  [61] "The Gangster, the Cop, the Devil"     
    ##  [62] "Marriage Story"                       
    ##  [63] "Child's Play"                         
    ##  [64] "Hellboy"                              
    ##  [65] "The Goldfinch"                        
    ##  [66] "Jumanji: The Next Level"              
    ##  [67] "The Addams Family"                    
    ##  [68] "Zombieland: Double Tap"               
    ##  [69] "Angel Has Fallen"                     
    ##  [70] "Fractured"                            
    ##  [71] "Spider-Man: Far from Home"            
    ##  [72] "Saint Maud"                           
    ##  [73] "The Dead Don't Die"                   
    ##  [74] "Midway"                               
    ##  [75] "The Peanut Butter Falcon"             
    ##  [76] "Tolkien"                              
    ##  [77] "Yesterday"                            
    ##  [78] "In the Tall Grass"                    
    ##  [79] "Dark Phoenix"                         
    ##  [80] "Burn"                                 
    ##  [81] "Cold Pursuit"                         
    ##  [82] "Frozen II"                            
    ##  [83] "3 from Hell"                          
    ##  [84] "In the Shadow of the Moon"            
    ##  [85] "Hustlers"                             
    ##  [86] "Dark Waters"                          
    ##  [87] "Sound of Metal"                       
    ##  [88] "Murder Mystery"                       
    ##  [89] "Synchronic"                           
    ##  [90] "Triple Frontier"                      
    ##  [91] "Downton Abbey"                        
    ##  [92] "Color Out of Space"                   
    ##  [93] "Crawl"                                
    ##  [94] "Guns Akimbo"                          
    ##  [95] "Rambo: Last Blood"                    
    ##  [96] "Godzilla: King of the Monsters"       
    ##  [97] "The Silence"                          
    ##  [98] "Maleficent: Mistress of Evil"         
    ##  [99] "The Outpost"                          
    ## [100] "The Lego Movie 2: The Second Part"

Description

# Using CSS selectors to scrape the description section
(description_data_html <- html_nodes(webpage, '.lister-item-content .ratings-bar+.text-muted'))
## {xml_nodeset (100)}
##  [1] <p class="text-muted">\nA couple travels to Northern Europe to visit a r ...
##  [2] <p class="text-muted">\nDuring the 1980s, a failed stand-up comedian is  ...
##  [3] <p class="text-muted">\nA faded television actor and his stunt double st ...
##  [4] <p class="text-muted">\nAn American expat tries to sell off his highly p ...
##  [5] <p class="text-muted">\nA young woman falls for a guy with a dark secret ...
##  [6] <p class="text-muted">\nYears following the events of <a href="/title/tt ...
##  [7] <p class="text-muted">\nGreed and class discrimination threaten the newl ...
##  [8] <p class="text-muted">\nA detective investigates the death of the patria ...
##  [9] <p class="text-muted">\nTwenty-seven years after their first encounter w ...
## [10] <p class="text-muted">\nAfter the devastating events of <a href="/title/ ...
## [11] <p class="text-muted">\nJo March reflects back and forth on her life, te ...
## [12] <p class="text-muted">\nJohn Wick is on the run after killing a member o ...
## [13] <p class="text-muted">\nA bride's wedding night takes a sinister turn wh ...
## [14] <p class="text-muted">\nA family's serene beach vacation turns to chaos  ...
## [15] <p class="text-muted">\nAn illustration of Frank Sheeran's life, from W. ...
## [16] <p class="text-muted">\nDr. Louis Creed and his wife, Rachel, relocate f ...
## [17] <p class="text-muted">\nAfter the murder of his father, a young lion pri ...
## [18] <p class="text-muted">\nIn the riveting conclusion of the landmark Skywa ...
## [19] <p class="text-muted">\nAmerican car designer <a href="/name/nm0790961"> ...
## [20] <p class="text-muted">\nA deactivated cyborg's revived, but can't rememb ...
## ...
# Converting the description data to text
description_data <- html_text(description_data_html)
# take a look at first few
head(description_data)
## [1] "\nA couple travels to Northern Europe to visit a rural hometown's fabled Swedish mid-summer festival. What begins as an idyllic retreat quickly devolves into an increasingly violent and bizarre competition at the hands of a pagan cult."
## [2] "\nDuring the 1980s, a failed stand-up comedian is driven insane and turns to a life of crime and chaos in Gotham City while becoming an infamous psychopathic crime figure."                                                                
## [3] "\nA faded television actor and his stunt double strive to achieve fame and success in the final years of Hollywood's Golden Age in 1969 Los Angeles."                                                                                       
## [4] "\nAn American expat tries to sell off his highly profitable marijuana empire in London, triggering plots, schemes, bribery and blackmail in an attempt to steal his domain out from under him."                                             
## [5] "\nA young woman falls for a guy with a dark secret and the two embark on a rocky relationship. Based on the novel by Anna Todd."                                                                                                            
## [6] "\nYears following the events of The Shining (1980), a now-adult Dan Torrance must protect a young girl with similar powers from a cult known as The True Knot, who prey on children with powers to remain immortal."
# strip the '\n'
description_data <- str_replace_all(description_data, "^\\n", "")
head(description_data)
## [1] "A couple travels to Northern Europe to visit a rural hometown's fabled Swedish mid-summer festival. What begins as an idyllic retreat quickly devolves into an increasingly violent and bizarre competition at the hands of a pagan cult."
## [2] "During the 1980s, a failed stand-up comedian is driven insane and turns to a life of crime and chaos in Gotham City while becoming an infamous psychopathic crime figure."                                                                
## [3] "A faded television actor and his stunt double strive to achieve fame and success in the final years of Hollywood's Golden Age in 1969 Los Angeles."                                                                                       
## [4] "An American expat tries to sell off his highly profitable marijuana empire in London, triggering plots, schemes, bribery and blackmail in an attempt to steal his domain out from under him."                                             
## [5] "A young woman falls for a guy with a dark secret and the two embark on a rocky relationship. Based on the novel by Anna Todd."                                                                                                            
## [6] "Years following the events of The Shining (1980), a now-adult Dan Torrance must protect a young girl with similar powers from a cult known as The True Knot, who prey on children with powers to remain immortal."

Runtime

# Using CSS selectors to scrape the movie runtime section (all steps in one pipeline)
(runtime_data <- webpage %>%
  html_nodes('.runtime') %>%
  html_text() %>%
  str_replace(" min", "") %>%
  as.integer())
##   [1] 148 122 161 113 105 152 132 130 169 181 135 130  95 116 209 101 118 141
##  [19] 152 122 128 109 119  98 108 107 108 127 135 127 140 123  97 108  92 116
##  [37]  99  94  90 118 109 132 128 122  99 123 104 102 112 100 129 122 118 137
##  [55] 121 114  93 106 128 110 109 137  90 120 149 123  86  99 121  99 129  84
##  [73] 104 138  97 112 116 101 113  88 119 103 115 115 110 126 120  97 102 125
##  [91] 122 111  87  98  89 132  90 119 123 107
# Equivalently, step by step: use CSS selectors to scrape the movie runtime section
runtime_data_html <- html_nodes(webpage, '.runtime')
# Converting the runtime data to text
runtime_data <- html_text(runtime_data_html)
# Let's have a look at the runtime
head(runtime_data)
## [1] "148 min" "122 min" "161 min" "113 min" "105 min" "152 min"
# Data-Preprocessing: removing mins and converting it to numerical
runtime_data <- str_replace(runtime_data, " min", "")
runtime_data <- as.numeric(runtime_data)
#Let's have another look at the runtime data
head(runtime_data)
## [1] 148 122 161 113 105 152

Genre

  • Collect the (first) genre of each movie:

    # Using CSS selectors to scrape the movie genre section
    genre_data_html <- html_nodes(webpage, '.genre')
    # Converting the genre data to text
    genre_data <- html_text(genre_data_html)
    # Let's have a look at the genre data
    head(genre_data)    
    ## [1] "\nDrama, Horror, Mystery            "
    ## [2] "\nCrime, Drama, Thriller            "
    ## [3] "\nComedy, Drama            "         
    ## [4] "\nAction, Crime            "         
    ## [5] "\nDrama, Romance            "        
    ## [6] "\nDrama, Fantasy, Horror            "
    # Data-Preprocessing: retrieve the first word
    genre_data <- str_extract(genre_data, "[:alpha:]+")
    # Converting each genre from text to factor
    #genre_data <- as.factor(genre_data)
    # Let's have another look at the genre data
    head(genre_data)
    ## [1] "Drama"  "Crime"  "Comedy" "Action" "Drama"  "Drama"

Rating

  • # Using CSS selectors to scrape the IMDb rating section
    rating_data_html <- html_nodes(webpage, '.ratings-imdb-rating strong')
    # Converting the ratings data to text
    rating_data <- html_text(rating_data_html)
    # Let's have a look at the ratings
    head(rating_data)
    ## [1] "7.1" "8.4" "7.6" "7.8" "5.3" "7.3"
    # Data-Preprocessing: converting ratings to numerical
    rating_data <- as.numeric(rating_data)
    # Let's have another look at the ratings data
    rating_data
    ##   [1] 7.1 8.4 7.6 7.8 5.3 7.3 8.5 7.9 6.5 8.4 7.8 7.4 6.9 6.8 7.8 5.7 6.8 6.4
    ##  [19] 8.1 7.3 6.9 7.4 8.2 6.8 7.9 5.4 6.2 7.1 7.4 6.0 7.3 6.8 5.9 6.0 6.3 7.2
    ##  [37] 5.6 7.0 6.1 6.6 6.8 7.0 6.2 7.3 6.4 6.5 6.5 7.1 7.3 7.7 6.6 8.1 6.3 6.5
    ##  [55] 7.3 5.6 5.3 5.9 6.1 2.8 6.9 7.9 5.7 5.2 6.4 6.7 5.8 6.7 6.4 6.4 7.4 6.7
    ##  [73] 5.5 6.7 7.6 6.8 6.8 5.5 5.7 5.7 6.2 6.8 5.4 6.2 6.3 7.6 7.7 6.0 6.2 6.5
    ##  [91] 7.4 6.2 6.1 6.3 6.1 6.0 5.3 6.6 6.8 6.6

Votes

  • # Using CSS selectors to scrape the votes section
    votes_data_html <- html_nodes(webpage, '.sort-num_votes-visible span:nth-child(2)')
    # Converting the votes data to text
    votes_data <- html_text(votes_data_html)
    # Let's have a look at the votes data
    head(votes_data)
    ## [1] "372,304"   "1,408,366" "803,311"   "372,302"   "62,170"    "208,182"
    # Data-Preprocessing: removing all commas (str_replace() would remove only the first one)
    votes_data <- str_replace_all(votes_data, ",", "")
    # Data-Preprocessing: converting votes to numerical
    votes_data <- as.numeric(votes_data)
    # Let's have another look at the votes data
    head(votes_data)
    ## [1]  372304 1408366  803311  372302   62170  208182
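Vote counts of a million or more contain two commas, so every comma has to be removed, not just the first one. The sketch below compares three ways of converting such strings, using the first three raw values shown above. `readr::parse_number()` is offered as an alternative, not as what the original analysis used.

```r
library(stringr)
library(readr)

votes_raw <- c("372,304", "1,408,366", "803,311")

# str_replace() touches only the first match, so "1,408,366" becomes "1408,366"
# and the conversion yields NA (warning suppressed here to keep the output quiet)
first_only <- suppressWarnings(as.numeric(str_replace(votes_raw, ",", "")))

# str_replace_all() removes every comma
all_commas <- as.numeric(str_replace_all(votes_raw, ",", ""))

# parse_number() treats "," as a grouping mark under the default locale
parsed <- parse_number(votes_raw)

first_only
all_commas
parsed
```

A quick `sum(is.na(votes_data))` after any conversion step is a cheap way to catch this kind of silent coercion problem.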

Director

  • CSS selector reference

    # Using CSS selectors to scrape the directors section
    (directors_data_html <- html_nodes(webpage,'.text-muted+ p a:nth-child(1)'))
    ## {xml_nodeset (100)}
    ##  [1] <a href="/name/nm4170048/?ref_=adv_li_dr_0">Ari Aster</a>
    ##  [2] <a href="/name/nm0680846/?ref_=adv_li_dr_0">Todd Phillips</a>
    ##  [3] <a href="/name/nm0000233/?ref_=adv_li_dr_0">Quentin Tarantino</a>
    ##  [4] <a href="/name/nm0005363/?ref_=adv_li_dr_0">Guy Ritchie</a>
    ##  [5] <a href="/name/nm1788310/?ref_=adv_li_dr_0">Jenny Gage</a>
    ##  [6] <a href="/name/nm1093039/?ref_=adv_li_dr_0">Mike Flanagan</a>
    ##  [7] <a href="/name/nm0094435/?ref_=adv_li_dr_0">Bong Joon Ho</a>
    ##  [8] <a href="/name/nm0426059/?ref_=adv_li_dr_0">Rian Johnson</a>
    ##  [9] <a href="/name/nm0615592/?ref_=adv_li_dr_0">Andy Muschietti</a>
    ## [10] <a href="/name/nm0751577/?ref_=adv_li_dr_0">Anthony Russo</a>
    ## [11] <a href="/name/nm1950086/?ref_=adv_li_dr_0">Greta Gerwig</a>
    ## [12] <a href="/name/nm0821432/?ref_=adv_li_dr_0">Chad Stahelski</a>
    ## [13] <a href="/name/nm2366012/?ref_=adv_li_dr_0">Matt Bettinelli-Olpin</a>
    ## [14] <a href="/name/nm1443502/?ref_=adv_li_dr_0">Jordan Peele</a>
    ## [15] <a href="/name/nm0000217/?ref_=adv_li_dr_0">Martin Scorsese</a>
    ## [16] <a href="/name/nm1556116/?ref_=adv_li_dr_0">Kevin Kölsch</a>
    ## [17] <a href="/name/nm0269463/?ref_=adv_li_dr_0">Jon Favreau</a>
    ## [18] <a href="/name/nm0009190/?ref_=adv_li_dr_0">J.J. Abrams</a>
    ## [19] <a href="/name/nm0003506/?ref_=adv_li_dr_0">James Mangold</a>
    ## [20] <a href="/name/nm0001675/?ref_=adv_li_dr_0">Robert Rodriguez</a>
    ## ...
    # Converting the directors data to text
    directors_data <- html_text(directors_data_html)
    # Let's have a look at the directors data
    directors_data
    ##   [1] "Ari Aster"              "Todd Phillips"          "Quentin Tarantino"     
    ##   [4] "Guy Ritchie"            "Jenny Gage"             "Mike Flanagan"         
    ##   [7] "Bong Joon Ho"           "Rian Johnson"           "Andy Muschietti"       
    ##  [10] "Anthony Russo"          "Greta Gerwig"           "Chad Stahelski"        
    ##  [13] "Matt Bettinelli-Olpin"  "Jordan Peele"           "Martin Scorsese"       
    ##  [16] "Kevin Kölsch"           "Jon Favreau"            "J.J. Abrams"           
    ##  [19] "James Mangold"          "Robert Rodriguez"       "Guy Ritchie"           
    ##  [22] "Robert Eggers"          "Sam Mendes"             "Adam Randall"          
    ##  [25] "Taika Waititi"          "Lou Jeunet"             "André Øvredal"         
    ##  [28] "May el-Toukhy"          "Benny Safdie"           "Egor Baranov"          
    ##  [31] "David Michôd"           "Anna Boden"             "Lorcan Finnegan"       
    ##  [34] "Severin Fiala"          "Scott Beck"             "Justin Baldoni"        
    ##  [37] "Tate Taylor"            "Galder Gaztelu-Urrutia" "David Yarovesky"       
    ##  [40] "Luc Besson"             "Jay Roach"              "David F. Sandberg"     
    ##  [43] "Tim Miller"             "Vince Gilligan"         "Adam Robitel"          
    ##  [46] "James Gray"             "Rob Letterman"          "Olivia Wilde"          
    ##  [49] "Gavin Hood"             "Josh Cooley"            "M. Night Shyamalan"    
    ##  [52] "Céline Sciamma"         "Jonas Åkerlund"         "David Leitch"          
    ##  [55] "Dexter Fletcher"        "F. Gary Gray"           "Michael Chaves"        
    ##  [58] "Gary Dauberman"         "Michael Bay"            "Tom Hooper"            
    ##  [61] "Won-Tae Lee"            "Noah Baumbach"          "Lars Klevberg"         
    ##  [64] "Neil Marshall"          "John Crowley"           "Jake Kasdan"           
    ##  [67] "Greg Tiernan"           "Ruben Fleischer"        "Ric Roman Waugh"       
    ##  [70] "Brad Anderson"          "Jon Watts"              "Rose Glass"            
    ##  [73] "Jim Jarmusch"           "Roland Emmerich"        "Tyler Nilson"          
    ##  [76] "Dome Karukoski"         "Danny Boyle"            "Vincenzo Natali"       
    ##  [79] "Simon Kinberg"          "Mike Gan"               "Hans Petter Moland"    
    ##  [82] "Chris Buck"             "Rob Zombie"             "Jim Mickle"            
    ##  [85] "Lorene Scafaria"        "Todd Haynes"            "Darius Marder"         
    ##  [88] "Kyle Newacheck"         "Justin Benson"          "J.C. Chandor"          
    ##  [91] "Michael Engler"         "Richard Stanley"        "Alexandre Aja"         
    ##  [94] "Jason Howden"           "Adrian Grunberg"        "Michael Dougherty"     
    ##  [97] "John R. Leonetti"       "Joachim Rønning"        "Rod Lurie"             
    ## [100] "Mike Mitchell"

Actor

  • # Using CSS selectors to scrape the actors section
    (actors_data_html <- html_nodes(webpage, '.lister-item-content .ghost+ a'))
    ## {xml_nodeset (100)}
    ##  [1] <a href="/name/nm6073955/?ref_=adv_li_st_0">Florence Pugh</a>
    ##  [2] <a href="/name/nm0001618/?ref_=adv_li_st_0">Joaquin Phoenix</a>
    ##  [3] <a href="/name/nm0000138/?ref_=adv_li_st_0">Leonardo DiCaprio</a>
    ##  [4] <a href="/name/nm0000190/?ref_=adv_li_st_0">Matthew McConaughey</a>
    ##  [5] <a href="/name/nm6466214/?ref_=adv_li_st_0">Josephine Langford</a>
    ##  [6] <a href="/name/nm0000191/?ref_=adv_li_st_0">Ewan McGregor</a>
    ##  [7] <a href="/name/nm0814280/?ref_=adv_li_st_0">Song Kang-ho</a>
    ##  [8] <a href="/name/nm0185819/?ref_=adv_li_st_0">Daniel Craig</a>
    ##  [9] <a href="/name/nm1567113/?ref_=adv_li_st_0">Jessica Chastain</a>
    ## [10] <a href="/name/nm0000375/?ref_=adv_li_st_0">Robert Downey Jr.</a>
    ## [11] <a href="/name/nm1519680/?ref_=adv_li_st_0">Saoirse Ronan</a>
    ## [12] <a href="/name/nm0000206/?ref_=adv_li_st_0">Keanu Reeves</a>
    ## [13] <a href="/name/nm3034977/?ref_=adv_li_st_0">Samara Weaving</a>
    ## [14] <a href="/name/nm2143282/?ref_=adv_li_st_0">Lupita Nyong'o</a>
    ## [15] <a href="/name/nm0000134/?ref_=adv_li_st_0">Robert De Niro</a>
    ## [16] <a href="/name/nm0164809/?ref_=adv_li_st_0">Jason Clarke</a>
    ## [17] <a href="/name/nm2255973/?ref_=adv_li_st_0">Donald Glover</a>
    ## [18] <a href="/name/nm5397459/?ref_=adv_li_st_0">Daisy Ridley</a>
    ## [19] <a href="/name/nm0000354/?ref_=adv_li_st_0">Matt Damon</a>
    ## [20] <a href="/name/nm4023073/?ref_=adv_li_st_0">Rosa Salazar</a>
    ## ...
    # Converting the actors data to text
    actors_data <- html_text(actors_data_html)
    # Let's have a look at the actors data
    head(actors_data)
    ## [1] "Florence Pugh"       "Joaquin Phoenix"     "Leonardo DiCaprio"  
    ## [4] "Matthew McConaughey" "Josephine Langford"  "Ewan McGregor"

Metascore

  • Be careful with missing data.

    # Using CSS selectors to scrape the metascore section
    metascore_data_html <- html_nodes(webpage, '.metascore')
    # Converting the metascore data to text
    metascore_data <- html_text(metascore_data_html)
    # Let's have a look at the metascore 
    head(metascore_data)
    ## [1] "72        " "59        " "83        " "51        " "30        "
    ## [6] "59        "
    # Data-Preprocessing: removing extra space in metascore
    metascore_data <- str_replace(metascore_data, "\\s*$", "")
    metascore_data <- as.numeric(metascore_data)
    metascore_data
    ##  [1] 72 59 83 51 30 59 96 82 58 78 91 73 64 81 94 57 55 53 81 53 53 83 78 65 58
    ## [26] 61 67 92 62 64 64 64 69 53 53 73 44 40 64 71 54 72 48 80 53 84 63 84 43 95
    ## [51] 19 60 69 38 41 53 41 32 65 94 48 31 40 58 46 55 45 36 69 83 53 47 70 48 55
    ## [76] 46 43 50 57 64 50 48 79 73 82 38 64 61 64 70 60 42 26 48 25 43 71 65
    # Let's check the length of metascore data
    length(metascore_data)
    ## [1] 98
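    # Positions 24, 85, 100 came from an earlier visual inspection and are stale:
    # 98 scores means only 2 are missing, hence the warning and misaligned values below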
    ms <- rep(NA, 100)
    ms[-c(24, 85, 100)] <- metascore_data
    ## Warning in ms[-c(24, 85, 100)] <- metascore_data: number of items to replace is
    ## not a multiple of replacement length
    (metascore_data <- ms)
    ##   [1] 72 59 83 51 30 59 96 82 58 78 91 73 64 81 94 57 55 53 81 53 53 83 78 NA 65
    ##  [26] 58 61 67 92 62 64 64 64 69 53 53 73 44 40 64 71 54 72 48 80 53 84 63 84 43
    ##  [51] 95 19 60 69 38 41 53 41 32 65 94 48 31 40 58 46 55 45 36 69 83 53 47 70 48
    ##  [76] 55 46 43 50 57 64 50 48 79 NA 73 82 38 64 61 64 70 60 42 26 48 25 43 71 NA
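
The trailing-space cleanup above can also be done in one step. A minimal sketch, assuming readr is attached (it is part of the tidyverse load at the top of this lecture); the variable names are illustrative only, and the sample strings are copied from the scraped output on this page:

```r
library(readr)

# Raw strings in the same form as the scraped metascore and gross values
metascore_raw <- c("72        ", "59        ", "83        ")
gross_raw <- c("$27.33M", "$335.45M")

# parse_number() keeps the first number in each string and drops
# surrounding non-numeric characters such as spaces, "$" and "M"
metascore_num <- parse_number(metascore_raw)
gross_num <- parse_number(gross_raw)
metascore_num
gross_num
```

This only cleans the values; it does not solve the alignment problem caused by movies with no score at all.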

Gross

  • Be careful with missing data.

    # Using CSS selectors to scrape the gross revenue section
    gross_data_html <- html_nodes(webpage,'.ghost~ .text-muted+ span')
    # Converting the gross revenue data to text
    gross_data <- html_text(gross_data_html)
    # Let's have a look at the gross data
    head(gross_data)
    ## [1] "$27.33M"  "$335.45M" "$142.50M" "$36.47M"  "$12.14M"  "$31.58M"
    # Data-Preprocessing: removing '$' and 'M' signs
    gross_data <- str_replace(gross_data, "M", "")
    gross_data <- str_sub(gross_data, 2, 10)
    #(gross_data <- str_extract(gross_data, "[:digit:]+.[:digit:]+"))
    gross_data <- as.numeric(gross_data)
    # Let's check the length of gross data
    length(gross_data)
    ## [1] 70
    # Visual inspection finds below movies don't have gross
    #gs_data <- rep(NA, 100)
    #gs_data[-c(1, 2, 3, 5, 61, 69, 71, 74, 78, 82, 84:87, 90)] <- gross_data
    #(gross_data <- gs_data)

    30 (out of 100) movies don’t have gross data yet! We need a better way to figure out missing entries.

    (rank_and_gross <- webpage %>%
      html_nodes('.ghost~ .text-muted+ span , .text-primary') %>%
      html_text() %>%
      str_replace("\\s+", "") %>%
      str_replace_all("[$M]", ""))
    ##   [1] "1."     "27.33"  "2."     "335.45" "3."     "142.50" "4."     "36.47" 
    ##   [9] "5."     "12.14"  "6."     "31.58"  "7."     "53.37"  "8."     "165.36"
    ##  [17] "9."     "211.59" "10."    "858.37" "11."    "108.10" "12."    "171.02"
    ##  [25] "13."    "28.71"  "14."    "175.08" "15."    "7.00"   "16."    "54.72" 
    ##  [33] "17."    "543.64" "18."    "515.20" "19."    "117.62" "20."    "85.71" 
    ##  [41] "21."    "355.56" "22."    "0.43"   "23."    "159.23" "24."    "25."   
    ##  [49] "33.37"  "26."    "27."    "68.95"  "28."    "29."    "30."    "31."   
    ##  [57] "32."    "426.83" "33."    "34."    "35."    "36."    "45.73"  "37."   
    ##  [65] "45.37"  "38."    "39."    "17.30"  "40."    "7.74"   "41."    "42."   
    ##  [73] "140.37" "43."    "62.25"  "44."    "45."    "57.01"  "46."    "50.19" 
    ##  [81] "47."    "144.11" "48."    "22.68"  "49."    "0.40"   "50."    "434.04"
    ##  [89] "51."    "111.05" "52."    "3.76"   "53."    "54."    "173.96" "55."   
    ##  [97] "96.37"  "56."    "80.00"  "57."    "54.73"  "58."    "74.15"  "59."   
    ## [105] "60."    "61."    "0.22"   "62."    "2.00"   "63."    "29.21"  "64."   
    ## [113] "21.90"  "65."    "5.33"   "66."    "316.83" "67."    "100.04" "68."   
    ## [121] "73.12"  "69."    "69.03"  "70."    "71."    "390.53" "72."    "73."   
    ## [129] "6.56"   "74."    "56.85"  "75."    "13.12"  "76."    "4.54"   "77."   
    ## [137] "73.29"  "78."    "79."    "65.85"  "80."    "81."    "32.14"  "82."   
    ## [145] "477.37" "83."    "84."    "85."    "104.96" "86."    "87."    "88."   
    ## [153] "89."    "90."    "91."    "96.85"  "92."    "93."    "39.01"  "94."   
    ## [161] "95."    "44.82"  "96."    "110.50" "97."    "98."    "113.93" "99."   
    ## [169] "100."   "105.81"
    isrank <- str_detect(rank_and_gross, "\\.$")
    ismissing <- isrank[1:(length(rank_and_gross) - 1)] & isrank[2:(length(rank_and_gross))]
    ismissing[length(ismissing)+1] <- isrank[length(isrank)]
    missingpos <- as.integer(rank_and_gross[ismissing])
    gs_data <- rep(NA, 100)
    gs_data[-missingpos] <- gross_data
    (gross_data <- gs_data)
    ##   [1]  27.33 335.45 142.50  36.47  12.14  31.58  53.37 165.36 211.59 858.37
    ##  [11] 108.10 171.02  28.71 175.08   7.00  54.72 543.64 515.20 117.62  85.71
    ##  [21] 355.56   0.43 159.23     NA  33.37     NA  68.95     NA     NA     NA
    ##  [31]     NA 426.83     NA     NA     NA  45.73  45.37     NA  17.30   7.74
    ##  [41]     NA 140.37  62.25     NA  57.01  50.19 144.11  22.68   0.40 434.04
    ##  [51] 111.05   3.76     NA 173.96  96.37  80.00  54.73  74.15     NA     NA
    ##  [61]   0.22   2.00  29.21  21.90   5.33 316.83 100.04  73.12  69.03     NA
    ##  [71] 390.53     NA   6.56  56.85  13.12   4.54  73.29     NA  65.85     NA
    ##  [81]  32.14 477.37     NA     NA 104.96     NA     NA     NA     NA     NA
    ##  [91]  96.85     NA  39.01     NA  44.82 110.50     NA 113.93     NA 105.81
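
The logic above (a rank immediately followed by another rank, or a rank at the very end, has no value) is easier to check on a toy vector. A minimal sketch in base R; the toy data and variable names are made up for illustration:

```r
# Toy interleaved vector: ranks end in "."; a value follows when present
rank_and_value <- c("1.", "27.33", "2.", "3.", "5.10", "4.")

is_rank <- grepl("\\.$", rank_and_value)
# Shift by one; the appended TRUE treats the end of the vector like
# "another rank", so a trailing rank is also flagged as missing its value
next_is_rank <- c(is_rank[-1], TRUE)
missing_pos <- as.integer(rank_and_value[is_rank & next_is_rank])

# Fill observed values into the non-missing slots; setdiff() also works
# when nothing is missing, where negative indexing with an empty vector fails
n_items <- sum(is_rank)
values <- rep(NA_real_, n_items)
values[setdiff(seq_len(n_items), missing_pos)] <-
  as.numeric(rank_and_value[!is_rank])
values
```

Here ranks 2 and 4 have no value, so `values` has `NA` in positions 2 and 4.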

Missing entries - more reproducible way

  • The following code programmatically figures out missing entries for metascore.

    # Use CSS selectors to scrape the rank and metascore nodes
    (rank_metascore_data_html <- html_nodes(webpage, '.unfavorable , .favorable , .mixed , .text-primary'))
    ## {xml_nodeset (198)}
    ##  [1] <span class="lister-item-index unbold text-primary">1.</span>
    ##  [2] <span class="metascore  favorable">72        </span>
    ##  [3] <span class="lister-item-index unbold text-primary">2.</span>
    ##  [4] <span class="metascore  mixed">59        </span>
    ##  [5] <span class="lister-item-index unbold text-primary">3.</span>
    ##  [6] <span class="metascore  favorable">83        </span>
    ##  [7] <span class="lister-item-index unbold text-primary">4.</span>
    ##  [8] <span class="metascore  mixed">51        </span>
    ##  [9] <span class="lister-item-index unbold text-primary">5.</span>
    ## [10] <span class="metascore  unfavorable">30        </span>
    ## [11] <span class="lister-item-index unbold text-primary">6.</span>
    ## [12] <span class="metascore  mixed">59        </span>
    ## [13] <span class="lister-item-index unbold text-primary">7.</span>
    ## [14] <span class="metascore  favorable">96        </span>
    ## [15] <span class="lister-item-index unbold text-primary">8.</span>
    ## [16] <span class="metascore  favorable">82        </span>
    ## [17] <span class="lister-item-index unbold text-primary">9.</span>
    ## [18] <span class="metascore  mixed">58        </span>
    ## [19] <span class="lister-item-index unbold text-primary">10.</span>
    ## [20] <span class="metascore  favorable">78        </span>
    ## ...
    # Convert the ranking data to text
    (rank_metascore_data <- html_text(rank_metascore_data_html))
    ##   [1] "1."         "72        " "2."         "59        " "3."        
    ##   [6] "83        " "4."         "51        " "5."         "30        "
    ##  [11] "6."         "59        " "7."         "96        " "8."        
    ##  [16] "82        " "9."         "58        " "10."        "78        "
    ##  [21] "11."        "91        " "12."        "73        " "13."       
    ##  [26] "64        " "14."        "81        " "15."        "94        "
    ##  [31] "16."        "57        " "17."        "55        " "18."       
    ##  [36] "53        " "19."        "81        " "20."        "53        "
    ##  [41] "21."        "53        " "22."        "83        " "23."       
    ##  [46] "78        " "24."        "65        " "25."        "58        "
    ##  [51] "26."        "27."        "61        " "28."        "67        "
    ##  [56] "29."        "92        " "30."        "31."        "62        "
    ##  [61] "32."        "64        " "33."        "64        " "34."       
    ##  [66] "64        " "35."        "69        " "36."        "53        "
    ##  [71] "37."        "53        " "38."        "73        " "39."       
    ##  [76] "44        " "40."        "40        " "41."        "64        "
    ##  [81] "42."        "71        " "43."        "54        " "44."       
    ##  [86] "72        " "45."        "48        " "46."        "80        "
    ##  [91] "47."        "53        " "48."        "84        " "49."       
    ##  [96] "63        " "50."        "84        " "51."        "43        "
    ## [101] "52."        "95        " "53."        "19        " "54."       
    ## [106] "60        " "55."        "69        " "56."        "38        "
    ## [111] "57."        "41        " "58."        "53        " "59."       
    ## [116] "41        " "60."        "32        " "61."        "65        "
    ## [121] "62."        "94        " "63."        "48        " "64."       
    ## [126] "31        " "65."        "40        " "66."        "58        "
    ## [131] "67."        "46        " "68."        "55        " "69."       
    ## [136] "45        " "70."        "36        " "71."        "69        "
    ## [141] "72."        "83        " "73."        "53        " "74."       
    ## [146] "47        " "75."        "70        " "76."        "48        "
    ## [151] "77."        "55        " "78."        "46        " "79."       
    ## [156] "43        " "80."        "50        " "81."        "57        "
    ## [161] "82."        "64        " "83."        "50        " "84."       
    ## [166] "48        " "85."        "79        " "86."        "73        "
    ## [171] "87."        "82        " "88."        "38        " "89."       
    ## [176] "64        " "90."        "61        " "91."        "64        "
    ## [181] "92."        "70        " "93."        "60        " "94."       
    ## [186] "42        " "95."        "26        " "96."        "48        "
    ## [191] "97."        "25        " "98."        "43        " "99."       
    ## [196] "71        " "100."       "65        "
    # Strip spaces
    (rank_metascore_data <- str_replace(rank_metascore_data, "\\s+", ""))
    ##   [1] "1."   "72"   "2."   "59"   "3."   "83"   "4."   "51"   "5."   "30"  
    ##  [11] "6."   "59"   "7."   "96"   "8."   "82"   "9."   "58"   "10."  "78"  
    ##  [21] "11."  "91"   "12."  "73"   "13."  "64"   "14."  "81"   "15."  "94"  
    ##  [31] "16."  "57"   "17."  "55"   "18."  "53"   "19."  "81"   "20."  "53"  
    ##  [41] "21."  "53"   "22."  "83"   "23."  "78"   "24."  "65"   "25."  "58"  
    ##  [51] "26."  "27."  "61"   "28."  "67"   "29."  "92"   "30."  "31."  "62"  
    ##  [61] "32."  "64"   "33."  "64"   "34."  "64"   "35."  "69"   "36."  "53"  
    ##  [71] "37."  "53"   "38."  "73"   "39."  "44"   "40."  "40"   "41."  "64"  
    ##  [81] "42."  "71"   "43."  "54"   "44."  "72"   "45."  "48"   "46."  "80"  
    ##  [91] "47."  "53"   "48."  "84"   "49."  "63"   "50."  "84"   "51."  "43"  
    ## [101] "52."  "95"   "53."  "19"   "54."  "60"   "55."  "69"   "56."  "38"  
    ## [111] "57."  "41"   "58."  "53"   "59."  "41"   "60."  "32"   "61."  "65"  
    ## [121] "62."  "94"   "63."  "48"   "64."  "31"   "65."  "40"   "66."  "58"  
    ## [131] "67."  "46"   "68."  "55"   "69."  "45"   "70."  "36"   "71."  "69"  
    ## [141] "72."  "83"   "73."  "53"   "74."  "47"   "75."  "70"   "76."  "48"  
    ## [151] "77."  "55"   "78."  "46"   "79."  "43"   "80."  "50"   "81."  "57"  
    ## [161] "82."  "64"   "83."  "50"   "84."  "48"   "85."  "79"   "86."  "73"  
    ## [171] "87."  "82"   "88."  "38"   "89."  "64"   "90."  "61"   "91."  "64"  
    ## [181] "92."  "70"   "93."  "60"   "94."  "42"   "95."  "26"   "96."  "48"  
    ## [191] "97."  "25"   "98."  "43"   "99."  "71"   "100." "65"
    # a rank followed by another rank means the metascore for the 1st rank is missing
    (isrank <- str_detect(rank_metascore_data, "\\.$"))
    ##   [1]  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE
    ##  [13]  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE
    ##  [25]  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE
    ##  [37]  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE
    ##  [49]  TRUE FALSE  TRUE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE  TRUE FALSE
    ##  [61]  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE
    ##  [73]  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE
    ##  [85]  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE
    ##  [97]  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE
    ## [109]  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE
    ## [121]  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE
    ## [133]  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE
    ## [145]  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE
    ## [157]  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE
    ## [169]  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE
    ## [181]  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE
    ## [193]  TRUE FALSE  TRUE FALSE  TRUE FALSE
    # parenthesize the upper bound: 1:n-1 parses as (1:n)-1, i.e. 0:(n-1)
    ismissing <- isrank[1:(length(rank_metascore_data) - 1)] & 
      isrank[2:length(rank_metascore_data)]
    ismissing[length(ismissing)+1] <- isrank[length(isrank)]
    (missingpos <- as.integer(rank_metascore_data[ismissing]))
    ## [1] 26 30
    #(rank_metascore_data <- as.integer(rank_metascore_data))
  • You (students) should work out the code for finding missing positions for gross.
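
An alternative that avoids interleaving ranks and values is to select one container per movie first and then look for the field inside each container. A minimal sketch, assuming rvest 1.0 or later; the markup below is a simplified, hypothetical stand-in for the real page, and the `.lister-item-content` class is an assumption about the container name:

```r
library(rvest)

# Hypothetical, simplified markup: the second item has no metascore
page <- minimal_html('
  <div class="lister-item-content">
    <span class="lister-item-index">1.</span>
    <span class="metascore favorable">72        </span>
  </div>
  <div class="lister-item-content">
    <span class="lister-item-index">2.</span>
  </div>
  <div class="lister-item-content">
    <span class="lister-item-index">3.</span>
    <span class="metascore mixed">59        </span>
  </div>
')

items <- html_elements(page, ".lister-item-content")
# html_element() (singular) returns exactly one result per item,
# so an item without a match yields NA instead of being dropped
metascore <- items %>%
  html_element(".metascore") %>%
  html_text(trim = TRUE) %>%
  as.numeric()
metascore
```

Because the result always has one entry per item, it lines up with the other columns without any bookkeeping of missing positions.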

Visualizing movie data

  • Form a tibble:

    # Combining all the vectors to form a data frame
    movies <- tibble(Rank = rank_data, 
                     Title = title_data,
                     Description = description_data, 
                     Runtime = runtime_data,
                     Genre = genre_data, 
                     Rating = rating_data,
                     Metascore = metascore_data, 
                     Votes = votes_data,
                     Gross_Earning_in_Mil = gross_data,
                     Director = directors_data, 
                     Actor = actors_data)
    movies %>% print(width=Inf)
    ## # A tibble: 100 × 11
    ##     Rank Title                        
    ##    <int> <chr>                        
    ##  1     1 Midsommar                    
    ##  2     2 Joker                        
    ##  3     3 Once Upon a Time in Hollywood
    ##  4     4 The Gentlemen                
    ##  5     5 After                        
    ##  6     6 Doctor Sleep                 
    ##  7     7 Parasite                     
    ##  8     8 Knives Out                   
    ##  9     9 It Chapter Two               
    ## 10    10 Avengers: Endgame            
    ##    Description                                                                  
    ##    <chr>                                                                        
    ##  1 A couple travels to Northern Europe to visit a rural hometown's fabled Swedi…
    ##  2 During the 1980s, a failed stand-up comedian is driven insane and turns to a…
    ##  3 A faded television actor and his stunt double strive to achieve fame and suc…
    ##  4 An American expat tries to sell off his highly profitable marijuana empire i…
    ##  5 A young woman falls for a guy with a dark secret and the two embark on a roc…
    ##  6 Years following the events of The Shining (1980), a now-adult Dan Torrance m…
    ##  7 Greed and class discrimination threaten the newly formed symbiotic relations…
    ##  8 A detective investigates the death of the patriarch of an eccentric, combati…
    ##  9 Twenty-seven years after their first encounter with the terrifying Pennywise…
    ## 10 After the devastating events of Avengers: Infinity War (2018), the universe …
    ##    Runtime Genre  Rating Metascore  Votes Gross_Earning_in_Mil Director         
    ##      <dbl> <chr>   <dbl>     <dbl>  <dbl>                <dbl> <chr>            
    ##  1     148 Drama     7.1        72 372304                 27.3 Ari Aster        
    ##  2     122 Crime     8.4        59     NA                335.  Todd Phillips    
    ##  3     161 Comedy    7.6        83 803311                142.  Quentin Tarantino
    ##  4     113 Action    7.8        51 372302                 36.5 Guy Ritchie      
    ##  5     105 Drama     5.3        30  62170                 12.1 Jenny Gage       
    ##  6     152 Drama     7.3        59 208182                 31.6 Mike Flanagan    
    ##  7     132 Drama     8.5        96 893174                 53.4 Bong Joon Ho     
    ##  8     130 Comedy    7.9        82 742953                165.  Rian Johnson     
    ##  9     169 Drama     6.5        58 289434                212.  Andy Muschietti  
    ## 10     181 Action    8.4        78     NA                858.  Anthony Russo    
    ##    Actor              
    ##    <chr>              
    ##  1 Florence Pugh      
    ##  2 Joaquin Phoenix    
    ##  3 Leonardo DiCaprio  
    ##  4 Matthew McConaughey
    ##  5 Josephine Langford 
    ##  6 Ewan McGregor      
    ##  7 Song Kang-ho       
    ##  8 Daniel Craig       
    ##  9 Jessica Chastain   
    ## 10 Robert Downey Jr.  
    ## # ℹ 90 more rows
  • How many top 100 movies are in each genre? (Be careful with interpretation.)

    movies %>%
      ggplot() +
      geom_bar(mapping = aes(x = Genre))

  • Which genre is most profitable in terms of average gross earnings?

    movies %>%
      group_by(Genre) %>%
      summarise(avg_earning = mean(Gross_Earning_in_Mil, na.rm=TRUE)) %>%
      ggplot() +
        geom_col(mapping = aes(x = Genre, y = avg_earning)) + 
        labs(y = "avg earning in millions")

    ggplot(data = movies) +
      geom_boxplot(mapping = aes(x = Genre, y = Gross_Earning_in_Mil)) + 
      labs(y = "Gross earning in millions")
    ## Warning: Removed 30 rows containing non-finite values (`stat_boxplot()`).

  • Is there a relationship between gross earning and rating? Find the best-selling movie (by gross earning) in each genre.

    library("ggrepel")
    (best_in_genre <- movies %>%
        group_by(Genre) %>%
        filter(row_number(desc(Gross_Earning_in_Mil)) == 1))
    ## # A tibble: 8 × 11
    ## # Groups:   Genre [8]
    ##    Rank Title             Description      Runtime Genre Rating Metascore  Votes
    ##   <int> <chr>             <chr>              <dbl> <chr>  <dbl>     <dbl>  <dbl>
    ## 1     2 Joker             During the 1980…     122 Crime    8.4        59     NA
    ## 2     8 Knives Out        A detective inv…     130 Come…    7.9        82 742953
    ## 3     9 It Chapter Two    Twenty-seven ye…     169 Drama    6.5        58 289434
    ## 4    10 Avengers: Endgame After the devas…     181 Acti…    8.4        78     NA
    ## 5    14 Us                A family's sere…     116 Horr…    6.8        81 325094
    ## 6    17 The Lion King     After the murde…     118 Adve…    6.8        55 259977
    ## 7    55 Rocketman         A musical fanta…     121 Biog…    7.3        38 187816
    ## 8    82 Frozen II         Anna, Elsa, Kri…     103 Anim…    6.8        50 186571
    ## # ℹ 3 more variables: Gross_Earning_in_Mil <dbl>, Director <chr>, Actor <chr>
    ggplot(movies, mapping = aes(x = Rating, y = Gross_Earning_in_Mil)) +
      geom_point(mapping = aes(size = Votes, color = Genre)) + 
      ggrepel::geom_label_repel(aes(label = Title), data = best_in_genre) +
      labs(y = "Gross earning in millions")
    ## Warning: Removed 32 rows containing missing values (`geom_point()`).
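
The plot can be complemented with a single number. A minimal sketch using only the first ten movies, whose ratings and gross earnings are copied from the outputs above; ten rows are not representative of the full list, so this only illustrates the call:

```r
# Rating and gross (in millions) for ranks 1 to 10, copied from the outputs above
rating <- c(7.1, 8.4, 7.6, 7.8, 5.3, 7.3, 8.5, 7.9, 6.5, 8.4)
gross <- c(27.33, 335.45, 142.50, 36.47, 12.14, 31.58, 53.37, 165.36, 211.59, 858.37)

# Spearman rank correlation is less sensitive to extreme earners
# than the default Pearson correlation
rho <- cor(rating, gross, method = "spearman", use = "complete.obs")
rho

# With the full tibble, the analogous call would be:
# cor(movies$Rating, movies$Gross_Earning_in_Mil,
#     method = "spearman", use = "complete.obs")
```

`use = "complete.obs"` drops movies with missing gross, the same rows that ggplot2 warns about removing.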

RSelenium Example: FCC’s television broadcast signal strength

Many websites dynamically pull data from databases using JavaScript and jQuery, which makes them difficult to scrape.

The FCC’s dtvmaps webpage has a simple form in which you enter a zip code and it gives you the available local TV stations in that zip code and their signal strength.

You’ll also notice the URL stays fixed with different zip codes.

Why RSelenium

  • RSelenium loads the page that we want to scrape and downloads the HTML from that page.

    • particularly useful when scraping something behind a login

    • simulate human behavior on a website (e.g., mouse clicking)

  • rvest provides typical scraping tools

rm(list = ls()) # clean-up workspace
library("RSelenium")
library("tidyverse")
library("rvest")

Open up a browser

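# pick a random port above the privileged range (ports below 1024 may need admin rights)
rD <- rsDriver(browser = "firefox", port = sample(1024:7360, 1), verbose = FALSE)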
remDr <- rD[["client"]]

Open a webpage

remDr$navigate("https://www.fcc.gov/media/engineering/dtvmaps")

We want to send a string of text (zip code) into the form.

zip <- "70118"
# remDr$findElement(using = "id", value = "startpoint")$clearElement()
remDr$findElement(using = "id", value = "startpoint")$sendKeysToElement(list(zip))
# other possible ("xpath", "css selector", "id", "name", "tag name", "class name", "link text", "partial link text")

Click on the button Go!

remDr$findElements("id", "btnSub")[[1]]$clickElement()

Extract data from HTML

  • save HTML to an object

  • use rvest for the rest

Sys.sleep(5) # give the page time to fully load, in seconds
html <- remDr$getPageSource()[[1]]
# important to close the client and stop the Selenium server
remDr$close()
rD[["server"]]$stop()
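
A fixed `Sys.sleep(5)` can be too short on a slow connection and wastefully long on a fast one. A minimal sketch of a polling helper in base R; `wait_until` is a hypothetical name, not part of RSelenium, and the commented usage assumes the results tables are the last thing to appear on the page:

```r
# Poll `condition` (a function returning TRUE/FALSE) until it is TRUE
# or until `timeout` seconds have passed. Returns TRUE on success, FALSE on timeout.
wait_until <- function(condition, timeout = 10, interval = 0.5) {
  deadline <- Sys.time() + timeout
  repeat {
    # Treat errors (e.g. element not found yet) as "not ready"
    ok <- tryCatch(isTRUE(condition()), error = function(e) FALSE)
    if (ok) return(TRUE)
    if (Sys.time() >= deadline) return(FALSE)
    Sys.sleep(interval)
  }
}

# Stand-alone demonstration: the condition becomes TRUE after about 0.2 seconds
start <- Sys.time()
ready <- wait_until(
  function() as.numeric(difftime(Sys.time(), start, units = "secs")) > 0.2,
  timeout = 2, interval = 0.05
)
ready

# Possible use in the session above, before calling getPageSource():
# wait_until(function() {
#   length(remDr$findElements("css selector", "table.tbl_mapReception")) >= 3
# })
```

The helper returns as soon as the condition holds, so the fast case costs almost nothing and the slow case is bounded by `timeout`.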

signals <- read_html(html) %>% 
  html_nodes("table.tbl_mapReception") %>% # extract table nodes with class = "tbl_mapReception"
  .[3] %>% # keep the third of these tables
  .[[1]] %>% # keep the first element of this list
  html_table(fill = TRUE) # have rvest turn it into a data frame
signals
## # A tibble: 37 × 6
##    Callsign                       Callsign             Network `Ch#` Band  IA   
##    <chr>                          <chr>                <chr>   <chr> <chr> <lgl>
##  1 "Click on callsign for detail" "Click on callsign … "Click… "Cli… "Cli… NA   
##  2 ""                             "WWL-TV"             "CBS"   "4"   "UHF" NA   
##  3 ""                             ""                   ""      ""    ""    NA   
##  4 ""                             "WUPL"               "MYNE"  "54"  "UHF" NA   
##  5 ""                             ""                   ""      ""    ""    NA   
##  6 ""                             "WPXL-TV"            "ION"   "49"  "UHF" NA   
##  7 ""                             ""                   ""      ""    ""    NA   
##  8 ""                             "WHNO"               "IND"   "20"  "UHF" NA   
##  9 ""                             ""                   ""      ""    ""    NA   
## 10 ""                             "WGNO"               "ABC"   "26"  "UHF" NA   
## # ℹ 27 more rows

More formatting on signals

names(signals) <- c("rm", "callsign", "network", "ch_num", "band", "rm2") # rename columns

signals <- signals %>%
  slice(2:n()) %>% # drop unnecessary first row
  filter(callsign != "") %>% # drop blank rows
  select(callsign:band) # drop unnecessary columns
signals
## # A tibble: 18 × 4
##    callsign network ch_num band 
##    <chr>    <chr>   <chr>  <chr>
##  1 WWL-TV   "CBS"   "4"    UHF  
##  2 WUPL     "MYNE"  "54"   UHF  
##  3 WPXL-TV  "ION"   "49"   UHF  
##  4 WHNO     "IND"   "20"   UHF  
##  5 WGNO     "ABC"   "26"   UHF  
##  6 WVUE-DT  "FOX"   "8"    UHF  
##  7 WDSU     "NBC"   "6"    UHF  
##  8 WNOL-TV  "CW"    "38"   UHF  
##  9 KGLA-DT  "IND"   "42"   UHF  
## 10 WTNO-CD  ""      ""     UHF  
## 11 WYES-TV  "PBS"   "12"   Hi-V 
## 12 WLAE-TV  "PBS"   "32"   UHF  
## 13 KNOV-CD  ""      ""     UHF  
## 14 WBXN-CD  ""      ""     UHF  
## 15 WVLA-TV  "NBC"   "33"   UHF  
## 16 WBRZ-TV  "ABC"   "2"    Hi-V 
## 17 WGMB-TV  "FOX"   "44"   UHF  
## 18 WAFB     "CBS"   "9"    Hi-V
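
The cleaned table still stores channel numbers as text and missing networks as empty strings. A minimal sketch of one further tidying step, assuming dplyr is attached; `signals_demo` holds three rows copied from the output above, and the object names are illustrative:

```r
library(dplyr)

# Three rows copied from the signals table above
signals_demo <- tibble(
  callsign = c("WWL-TV", "WTNO-CD", "WYES-TV"),
  network  = c("CBS", "", "PBS"),
  ch_num   = c("4", "", "12"),
  band     = c("UHF", "UHF", "Hi-V")
)

signals_clean <- signals_demo %>%
  mutate(
    network = na_if(network, ""),            # empty string becomes NA
    ch_num  = as.integer(na_if(ch_num, ""))  # text to integer, blanks become NA
  )
signals_clean
```

Explicit `NA` values make later filtering and summarising behave as expected, since `""` is otherwise treated as a valid value.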

Capture all text by clicking on each Callsign

read_html(html) %>% 
  html_nodes(".callsign") %>% 
  html_attr("onclick")
##  [1] "getdetail(15230,74192,'WWL-TV Facility ID: 74192 <br>WWL-TV (<a href=https://enterpriseefiling.fcc.gov/dataentry/public/tv/publicFacilityDetails.html?facilityId=74192 target=_new>Licensing</a>) (<a href=https://publicfiles.fcc.gov/tv-profile/74192 target=_new>Public File</a>)<br>City of License: NEW ORLEANS, LA<br>RF Channel: 27<br>RX Strength: 115 dbuV/m<br>Tower Distance: 6 mi; Direction: 132°','WWL-TV<br>Distance to Tower: 6 miles<br>Direction to Tower: 132 deg',29.9063611111111,-90.0394722222222,'WWL-TV')"     
##  [2] "getdetail(15231,13938,'WUPL Facility ID: 13938 <br>WUPL (<a href=https://enterpriseefiling.fcc.gov/dataentry/public/tv/publicFacilityDetails.html?facilityId=13938 target=_new>Licensing</a>) (<a href=https://publicfiles.fcc.gov/tv-profile/13938 target=_new>Public File</a>)<br>City of License: SLIDELL, LA<br>RF Channel: 17<br>RX Strength: 114 dbuV/m<br>Tower Distance: 6 mi; Direction: 132°','WUPL<br>Distance to Tower: 6 miles<br>Direction to Tower: 132 deg',29.9063611111111,-90.0394722222222,'WUPL')"                 
##  [3] "getdetail(15800,21729,'WPXL-TV Facility ID: 21729 <br>WPXL-TV (<a href=https://enterpriseefiling.fcc.gov/dataentry/public/tv/publicFacilityDetails.html?facilityId=21729 target=_new>Licensing</a>) (<a href=https://publicfiles.fcc.gov/tv-profile/21729 target=_new>Public File</a>)<br>City of License: NEW ORLEANS, LA<br>RF Channel: 33<br>RX Strength: 111 dbuV/m<br>Tower Distance: 10 mi; Direction: 82°','WPXL-TV<br>Distance to Tower: 10 miles<br>Direction to Tower: 82 deg',29.9827777777778,-89.9494444444445,'WPXL-TV')" 
##  [4] "getdetail(16584,37106,'WHNO Facility ID: 37106 <br>WHNO (<a href=https://enterpriseefiling.fcc.gov/dataentry/public/tv/publicFacilityDetails.html?facilityId=37106 target=_new>Licensing</a>) (<a href=https://publicfiles.fcc.gov/tv-profile/37106 target=_new>Public File</a>)<br>City of License: NEW ORLEANS, LA<br>RF Channel: 21<br>RX Strength: 111 dbuV/m<br>Tower Distance: 6 mi; Direction: 120°','WHNO<br>Distance to Tower: 6 miles<br>Direction to Tower: 120 deg',29.9203055555556,-90.0245833333333,'WHNO')"             
##  [5] "getdetail(15221,72119,'WGNO Facility ID: 72119 <br>WGNO (<a href=https://enterpriseefiling.fcc.gov/dataentry/public/tv/publicFacilityDetails.html?facilityId=72119 target=_new>Licensing</a>) (<a href=https://publicfiles.fcc.gov/tv-profile/72119 target=_new>Public File</a>)<br>City of License: NEW ORLEANS, LA<br>RF Channel: 26<br>RX Strength: 111 dbuV/m<br>Tower Distance: 9 mi; Direction: 96°','WGNO<br>Distance to Tower: 9 miles<br>Direction to Tower: 96 deg',29.95,-89.9577777777778,'WGNO')"                          
##  [6] "getdetail(15232,4149,'WVUE-DT Facility ID: 4149 <br>WVUE-DT (<a href=https://enterpriseefiling.fcc.gov/dataentry/public/tv/publicFacilityDetails.html?facilityId=4149 target=_new>Licensing</a>) (<a href=https://publicfiles.fcc.gov/tv-profile/4149 target=_new>Public File</a>)<br>City of License: NEW ORLEANS, LA<br>RF Channel: 29<br>RX Strength: 111 dbuV/m<br>Tower Distance: 10 mi; Direction: 94°','WVUE-DT<br>Distance to Tower: 10 miles<br>Direction to Tower: 94 deg',29.9541388888889,-89.9495277777778,'WVUE-DT')"     
##  [7] "getdetail(15212,71357,'WDSU Facility ID: 71357 <br>WDSU (<a href=https://enterpriseefiling.fcc.gov/dataentry/public/tv/publicFacilityDetails.html?facilityId=71357 target=_new>Licensing</a>) (<a href=https://publicfiles.fcc.gov/tv-profile/71357 target=_new>Public File</a>)<br>City of License: NEW ORLEANS, LA<br>RF Channel: 19<br>RX Strength: 110 dbuV/m<br>Tower Distance: 9 mi; Direction: 96°','WDSU<br>Distance to Tower: 9 miles<br>Direction to Tower: 96 deg',29.95,-89.9577777777778,'WDSU')"                          
##  [8] "getdetail(15220,54280,'WNOL-TV Facility ID: 54280 <br>WNOL-TV (<a href=https://enterpriseefiling.fcc.gov/dataentry/public/tv/publicFacilityDetails.html?facilityId=54280 target=_new>Licensing</a>) (<a href=https://publicfiles.fcc.gov/tv-profile/54280 target=_new>Public File</a>)<br>City of License: NEW ORLEANS, LA<br>RF Channel: 15<br>RX Strength: 110 dbuV/m<br>Tower Distance: 9 mi; Direction: 96°','WNOL-TV<br>Distance to Tower: 9 miles<br>Direction to Tower: 96 deg',29.95,-89.9577777777778,'WNOL-TV')"              
##  [9] "getdetail(15195,83945,'KGLA-DT Facility ID: 83945 <br>KGLA-DT (<a href=https://enterpriseefiling.fcc.gov/dataentry/public/tv/publicFacilityDetails.html?facilityId=83945 target=_new>Licensing</a>) (<a href=https://publicfiles.fcc.gov/tv-profile/83945 target=_new>Public File</a>)<br>City of License: HAMMOND, LA<br>RF Channel: 35<br>RX Strength: 110 dbuV/m<br>Tower Distance: 10 mi; Direction: 84°','KGLA-DT<br>Distance to Tower: 10 miles<br>Direction to Tower: 84 deg',29.9783333333333,-89.9405555555556,'KGLA-DT')"     
## [10] "getdetail(16703,24981,'WTNO-CD Facility ID: 24981 <br>WTNO-CD (<a href=https://enterpriseefiling.fcc.gov/dataentry/public/tv/publicFacilityDetails.html?facilityId=24981 target=_new>Licensing</a>) (<a href=https://publicfiles.fcc.gov/tv-profile/24981 target=_new>Public File</a>)<br>City of License: NEW ORLEANS, LA<br>RF Channel: 22<br>RX Strength: 109 dbuV/m<br>Tower Distance: 2 mi; Direction: 292°','WTNO-CD<br>Distance to Tower: 2 miles<br>Direction to Tower: 292 deg',29.9746111111111,-90.1434722222222,'WTNO-CD')" 
## [11] "getdetail(16332,25090,'WYES-TV Facility ID: 25090 <br>WYES-TV (<a href=https://enterpriseefiling.fcc.gov/dataentry/public/tv/publicFacilityDetails.html?facilityId=25090 target=_new>Licensing</a>) (<a href=https://publicfiles.fcc.gov/tv-profile/25090 target=_new>Public File</a>)<br>City of License: NEW ORLEANS, LA<br>RF Channel: 11<br>RX Strength: 101 dbuV/m<br>Tower Distance: 10 mi; Direction: 94°','WYES-TV<br>Distance to Tower: 10 miles<br>Direction to Tower: 94 deg',29.9538888888889,-89.9494444444445,'WYES-TV')" 
## [12] "getdetail(15857,18819,'WLAE-TV Facility ID: 18819 <br>WLAE-TV (<a href=https://enterpriseefiling.fcc.gov/dataentry/public/tv/publicFacilityDetails.html?facilityId=18819 target=_new>Licensing</a>) (<a href=https://publicfiles.fcc.gov/tv-profile/18819 target=_new>Public File</a>)<br>City of License: NEW ORLEANS, LA<br>RF Channel: 23<br>RX Strength: 105 dbuV/m<br>Tower Distance: 10 mi; Direction: 82°','WLAE-TV<br>Distance to Tower: 10 miles<br>Direction to Tower: 82 deg',29.9827777777778,-89.9525,'WLAE-TV')"          
## [13] "getdetail(16837,64048,'KNOV-CD Facility ID: 64048 <br>KNOV-CD (<a href=https://enterpriseefiling.fcc.gov/dataentry/public/tv/publicFacilityDetails.html?facilityId=64048 target=_new>Licensing</a>) (<a href=https://publicfiles.fcc.gov/tv-profile/64048 target=_new>Public File</a>)<br>City of License: NEW ORLEANS, LA<br>RF Channel: 31<br>RX Strength: 102 dbuV/m<br>Tower Distance: 3 mi; Direction: 107°','KNOV-CD<br>Distance to Tower: 3 miles<br>Direction to Tower: 107 deg',29.9521388888889,-90.0702777777778,'KNOV-CD')" 
## [14] "getdetail(16816,70419,'WBXN-CD Facility ID: 70419 <br>WBXN-CD (<a href=https://enterpriseefiling.fcc.gov/dataentry/public/tv/publicFacilityDetails.html?facilityId=70419 target=_new>Licensing</a>) (<a href=https://publicfiles.fcc.gov/tv-profile/70419 target=_new>Public File</a>)<br>City of License: NEW ORLEANS, LA<br>RF Channel: 36<br>RX Strength: 92 dbuV/m<br>Tower Distance: 6 mi; Direction: 132°','WBXN-CD<br>Distance to Tower: 6 miles<br>Direction to Tower: 132 deg',29.9063611111111,-90.0394722222222,'WBXN-CD')"  
## [15] "getdetail(16563,70021,'WVLA-TV Facility ID: 70021 <br>WVLA-TV (<a href=https://enterpriseefiling.fcc.gov/dataentry/public/tv/publicFacilityDetails.html?facilityId=70021 target=_new>Licensing</a>) (<a href=https://publicfiles.fcc.gov/tv-profile/70021 target=_new>Public File</a>)<br>City of License: BATON ROUGE, LA<br>RF Channel: 34<br>RX Strength: 53 dbuV/m<br>Tower Distance: 74 mi; Direction: 290°','WVLA-TV<br>Distance to Tower: 74 miles<br>Direction to Tower: 290 deg',30.3262777777778,-91.2766944444444,'WVLA-TV')"
## [16] "getdetail(16251,38616,'WBRZ-TV Facility ID: 38616 <br>WBRZ-TV (<a href=https://enterpriseefiling.fcc.gov/dataentry/public/tv/publicFacilityDetails.html?facilityId=38616 target=_new>Licensing</a>) (<a href=https://publicfiles.fcc.gov/tv-profile/38616 target=_new>Public File</a>)<br>City of License: BATON ROUGE, LA<br>RF Channel: 13<br>RX Strength: 46 dbuV/m<br>Tower Distance: 69 mi; Direction: 290°','WBRZ-TV<br>Distance to Tower: 69 miles<br>Direction to Tower: 290 deg',30.2969444444444,-91.1936111111111,'WBRZ-TV')"
## [17] "getdetail(15727,12520,'WGMB-TV Facility ID: 12520 <br>WGMB-TV (<a href=https://enterpriseefiling.fcc.gov/dataentry/public/tv/publicFacilityDetails.html?facilityId=12520 target=_new>Licensing</a>) (<a href=https://publicfiles.fcc.gov/tv-profile/12520 target=_new>Public File</a>)<br>City of License: BATON ROUGE, LA<br>RF Channel: 24<br>RX Strength: 50 dbuV/m<br>Tower Distance: 74 mi; Direction: 290°','WGMB-TV<br>Distance to Tower: 74 miles<br>Direction to Tower: 290 deg',30.3262777777778,-91.2766944444444,'WGMB-TV')"
## [18] "getdetail(16368,589,'WAFB Facility ID: 589 <br>WAFB (<a href=https://enterpriseefiling.fcc.gov/dataentry/public/tv/publicFacilityDetails.html?facilityId=589 target=_new>Licensing</a>) (<a href=https://publicfiles.fcc.gov/tv-profile/589 target=_new>Public File</a>)<br>City of License: BATON ROUGE, LA<br>RF Channel: 9<br>RX Strength: 38 dbuV/m<br>Tower Distance: 71 mi; Direction: 293°','WAFB<br>Distance to Tower: 71 miles<br>Direction to Tower: 293 deg',30.3663888888889,-91.2130555555556,'WAFB')"

Extract signal strength by string operations

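# Pull the onclick attribute of every .callsign node, then keep only the
# number that follows "RX Strength: " (the unit, dbuV/m, is dropped)
strength <- read_html(html) %>% 
  html_nodes(".callsign") %>% 
  html_attr("onclick") %>% 
  str_extract("(?<=RX Strength: )\\s*\\-*[0-9.]+")

# (?<=...) is a positive lookbehind: the match must be preceded by
# "RX Strength: ", but that prefix is not part of the extracted text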

# Append the strengths as a new column (str_extract() returns character)
signals <- cbind(signals, strength)
signals
##    callsign network ch_num band strength
## 1    WWL-TV     CBS      4  UHF      115
## 2      WUPL    MYNE     54  UHF      114
## 3   WPXL-TV     ION     49  UHF      111
## 4      WHNO     IND     20  UHF      111
## 5      WGNO     ABC     26  UHF      111
## 6   WVUE-DT     FOX      8  UHF      111
## 7      WDSU     NBC      6  UHF      110
## 8   WNOL-TV      CW     38  UHF      110
## 9   KGLA-DT     IND     42  UHF      110
## 10  WTNO-CD                 UHF      109
## 11  WYES-TV     PBS     12 Hi-V      101
## 12  WLAE-TV     PBS     32  UHF      105
## 13  KNOV-CD                 UHF      102
## 14  WBXN-CD                 UHF       92
## 15  WVLA-TV     NBC     33  UHF       53
## 16  WBRZ-TV     ABC      2 Hi-V       46
## 17  WGMB-TV     FOX     44  UHF       50
## 18     WAFB     CBS      9 Hi-V       38
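
To see what the lookbehind does in isolation, the pattern can be tried on a single string. The sketch below uses a shortened, made-up onclick value (the call sign WXYZ and the variable names onclick, strength_chr and strength_num are illustrative assumptions, not part of the scraped page). It also shows that the result is a character string, so a conversion with as.numeric() is needed before sorting or computing on it.

```r
library(stringr)

# A shortened, hypothetical onclick string in the same shape as the real ones
onclick <- "getdetail(1,2,'WXYZ<br>RF Channel: 31<br>RX Strength: 102 dbuV/m<br>Tower Distance: 3 mi')"

# Positive lookbehind: only digits that directly follow "RX Strength: " match
strength_chr <- str_extract(onclick, "(?<=RX Strength: )\\s*-?[0-9.]+")
strength_chr

# str_extract() returns character; convert before arithmetic or sorting
strength_num <- as.numeric(strength_chr)
strength_num

# Without the lookbehind, the first run of digits in the string is matched
# instead (here the first argument of getdetail())
str_extract(onclick, "[0-9.]+")

# When the prefix is absent, the result is NA rather than an error
str_extract("no strength here", "(?<=RX Strength: )\\s*-?[0-9.]+")
```

Keeping the prefix inside a lookbehind means nothing has to be trimmed off afterwards; an equivalent approach would be to capture the number in a group with str_match() and take the second column.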