The Truth Is Out There

By Katie Press in Projects XFiles

October 4, 2020

Background

I’ve had an idea floating around for a while to do a tidy text analysis on X-Files episode scripts. The X-Files has been around long enough that there are tons of fandom sites, and you can easily find transcripts of the original nine seasons. First, though, I wanted to get some basic information about the episodes, so that’s what this post will focus on. My first thought was to go to Wikipedia. There is a page with tables for each season, and I can use that as the base URL for scraping.

Packages used in this first episode:

  • Tidyverse, obviously. This is always the first package I load.
  • Janitor, which has a couple of functions I like to use, especially clean_names() to clean and remove special characters from column names in new datasets.
  • Rvest, which can be used to scrape data from websites.
  • Googlesheets4, which is an update of the original googlesheets package. I use it to store my data because I regularly work on more than one computer.
  • Extrafont (pretty self-explanatory).
  • Ggiraph for interactive graphics (hover tooltips on plots).
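Here’s a minimal setup chunk along those lines (a sketch - it also loads reactable and lubridate, which I use later in this post):

library(tidyverse)     # dplyr, ggplot2, purrr, stringr, and friends
library(janitor)       # clean_names()
library(rvest)         # web scraping
library(googlesheets4) # storing data across computers
library(extrafont)     # custom plot fonts
library(ggiraph)       # interactive ggplot tooltips
library(reactable)     # interactive tables
library(lubridate)     # ymd() for date parsing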

Now, on to the X-Files.

Scraping from Wikipedia Tables

To find the selector you need for the tables of interest, you can use a Chrome extension called SelectorGadget, or you can just right-click on the specific spot on a website and choose “Inspect” in the dropdown menu that comes up - which is what I usually do.

In this case it’s pretty easy: the HTML nodes I’m interested in are simply the “table” elements. At first glance, it looks like tables 2 through 14 have the “wikiepisodetable” class. That’s two more than I would expect, because there are nine original seasons plus the two follow-up seasons that came out more recently (10-11). However, there are also two X-Files movies, which appear to have their own tables (numbers 7 and 12). I don’t really want to deal with those right now, so I will leave them out when I scrape the tables.

wiki <- "https://en.wikipedia.org/wiki/List_of_The_X-Files_episodes"

wiki %>% 
  read_html() %>% 
  html_nodes(., "table")
## {xml_nodeset (19)}
##  [1] <table class="wikitable plainrowheaders" style="text-align:center"><tbod ...
##  [2] <table class="wikitable plainrowheaders wikiepisodetable" style="width:1 ...
##  [3] <table class="wikitable plainrowheaders wikiepisodetable" style="width:1 ...
##  [4] <table class="wikitable plainrowheaders wikiepisodetable" style="width:1 ...
##  [5] <table class="wikitable plainrowheaders wikiepisodetable" style="width:1 ...
##  [6] <table class="wikitable plainrowheaders wikiepisodetable" style="width:1 ...
##  [7] <table class="wikitable plainrowheaders wikiepisodetable" style="width:1 ...
##  [8] <table class="wikitable plainrowheaders wikiepisodetable" style="width:1 ...
##  [9] <table class="wikitable plainrowheaders wikiepisodetable" style="width:1 ...
## [10] <table class="wikitable plainrowheaders wikiepisodetable" style="width:1 ...
## [11] <table class="wikitable plainrowheaders wikiepisodetable" style="width:1 ...
## [12] <table class="wikitable plainrowheaders wikiepisodetable" style="width:1 ...
## [13] <table class="wikitable plainrowheaders wikiepisodetable" style="width:1 ...
## [14] <table class="wikitable plainrowheaders wikiepisodetable" style="width:1 ...
## [15] <table role="presentation" class="mbox-small plainlinks sistersitebox" s ...
## [16] <table class="nowraplinks hlist mw-collapsible autocollapse navbox-inner ...
## [17] <table class="nowraplinks navbox-subgroup" style="border-spacing:0"><tbo ...
## [18] <table class="nowraplinks navbox-subgroup" style="border-spacing:0"><tbo ...
## [19] <table class="nowraplinks hlist mw-collapsible autocollapse navbox-inner ...

So before we get the tables, I’m going to select only the nodes I actually want to collect (skipping the two movie tables and the navigation boxes at the end), then use html_table to parse them all into dataframes.

tables <- wiki %>%
  read_html() %>%
  html_nodes(., "table") %>%
  .[c(2:6, 8:11, 13:14)] %>%
  html_table(fill = TRUE)
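As an aside, one alternative is to select the episode tables by their class instead of by overall position, then drop the two movie tables - a sketch, where the positional indices (6 and 11 within the class-matched set) are an assumption that depends on the current page layout:

wiki %>%
  read_html() %>%
  html_nodes("table.wikiepisodetable") %>%
  .[-c(6, 11)] %>%   # drop the two movie tables (positions assumed)
  html_table(fill = TRUE)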

I won’t show all the tables, just for the sake of space, but here is the first one. It looks like the “prod code” column is going to cause issues when I try to combine the tables into one dataframe, because in some cases the header contains footnote links, which results in inconsistent column names. (Hint: the table is interactive, so you can flip over to the other columns or down to the next page.)

reactable(tables[[1]],
          bordered = TRUE,
          highlight = TRUE,
          striped = TRUE,
          compact = TRUE,
          wrap = FALSE)

I’m going to add names to this list of dataframes before I clean the column names.

names(tables) <- paste0("Season ", 1:11)

Now to get the column names. They are all the same aside from the issue I mentioned earlier.

names(tables[[1]] %>% clean_names())
## [1] "no_overall"           "no_inseason"          "title"               
## [4] "directed_by"          "written_by"           "original_air_date"   
## [7] "prod_code_40"         "u_s_viewers_millions"

I’ll replace prod_code_40 with prod_code, and I also want to rename the last column, u_s_viewers_millions, because it’s a bit long. Usually I would do this directly on a dataset, but it’s helpful to be able to see everything before I apply it to the list.

col.names <- names(tables[[1]] %>% clean_names())

col.names[7] <- "prod_code"
col.names[8] <- "no_viewers"

col.names
## [1] "no_overall"        "no_inseason"       "title"            
## [4] "directed_by"       "written_by"        "original_air_date"
## [7] "prod_code"         "no_viewers"

Now, apply this to the list of dataframes. I can use map and set_names from the purrr package to do this very easily.

tables <- tables %>% 
  map(. %>% set_names(col.names))

Now to gather this list of tables together into one dataframe, I’m using purrr::map_df. I find it useful in certain cases instead of using bind_rows on its own, because I can create a column that tells me where the data came from. In this case, I’m creating a “season” column because each season is a separate dataset. This will be helpful in filtering the data and creating charts later.

table_df <- tables %>%
  map_df(., bind_rows, .id = "season")

I know I want to put the date in a proper format, and I also want to fix the number of viewers column.

Fixing the date is simple. I’m just using the stringr function word to grab the date within the parentheses, and the ymd function from lubridate to change it from a character to a date. In the sep argument I used a double backslash because the parenthesis is a special character in regular expressions, so it needs to be escaped.
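As a quick illustration on a single made-up value in the Wikipedia cell format (the stray closing parenthesis left over from word is fine, because ymd tolerates trailing characters):

ymd(word("September 10, 1993 (1993-09-10)", 2, sep = "\\("))
## [1] "1993-09-10"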

I’m using the same method for the no_viewers column, except of course changing it to numeric instead of a date. A few cells aren’t plain numbers even after stripping the footnote brackets, so as.numeric() introduces some NAs - that’s the warning you’ll see below.

table_df <- table_df %>%
  mutate(original_air_date = ymd(word(original_air_date, 2, sep = "\\(")),
         no_viewers = as.numeric(word(no_viewers, 1, sep = "\\[")))
## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion
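If you want to see which rows were coerced to NA, a quick filter will show them (just a sanity check, not part of the pipeline):

table_df %>%
  filter(is.na(no_viewers)) %>%
  select(season, title, no_viewers)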

Scraping Data from IMDB

I also thought I would get some episode ratings from IMDB. First I’ll go to the IMDB page for X-Files episodes. This is a situation where I have to navigate to different pages, because there is one page for every season. However, all of the pages can be reached using the same base link, with the season number pasted onto the end of the string. This code results in a list of 11 links for the read_html function to follow.

imdb.link <- "https://www.imdb.com/title/tt0106179/episodes?season="

links <- paste0(imdb.link, 1:11)

I need to use map to iterate the code over the list of links. For each link (season), the code uses read_html, followed by html_nodes - which I got by using Inspect on the page, just like with Wikipedia. After that, it’s simply a matter of using html_text to get the text contained in those nodes and converting it to numeric.

Again I won’t print them all to save space, but this is what season 1 looks like:

temp <- links %>%
  map(
    . %>% read_html() %>%
      html_nodes(".ipl-rating-star.small .ipl-rating-star__rating") %>%
      html_text() %>%
      as.numeric()
  )

temp[1]
## [[1]]
##  [1] 8.3 8.1 8.5 7.5 7.0 7.2 6.7 8.7 6.2 7.8 8.1 7.3 8.6 7.3 7.0 7.3 8.4 7.1 7.0
## [20] 8.4 8.3 7.1 7.2 9.0

After all that is done, I will use flatten (because the list currently has two levels) and unlist to get the ratings into a vector of 217 ratings. That’s one for each episode, and it matches the number of rows in table_df, so I can just add it as a column.
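A quick defensive check along those lines before adding the column (both sides should be 217):

ratings <- temp %>% flatten() %>% unlist()
stopifnot(length(ratings) == nrow(table_df))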

table_df <- table_df %>%
  mutate(rating = temp %>% flatten() %>% unlist())

I’m also interested in getting the episode descriptions. I still want to look at the full scripts, but these are short and sweet and I think they’ll be useful. I’ll use the same method as for the episode ratings.

temp <- links %>%
  map(
    . %>% read_html() %>%
      html_nodes(".item_description") %>%
      html_text()
  )

table_df <- table_df %>%
  mutate(description = temp %>% flatten() %>% unlist() %>% str_trim())

Here’s the table so far.

reactable(table_df,
          bordered = TRUE,
          highlight = TRUE,
          striped = TRUE,
          compact = TRUE,
          wrap = FALSE)

Plotting Episode Ratings

There is still some cleaning to be done. I have to fix the last episode of season 9, because it’s actually two episodes (201/202) - in the Wikipedia table they are both on one line.

# Duplicate the combined finale row (row 201) and relabel it as Part I
table_df <- table_df %>%
  slice(201) %>%
  mutate(
    no_overall = 201,
    no_inseason = 19,
    title = paste0(title, " Part I")
  ) %>%
  bind_rows(table_df) 

# The combined row is now at position 202; relabel a second copy as Part II,
# then drop the original combined row (its fused episode number is far larger
# than any real no_overall) and restore episode order
table_df <- table_df %>%
  slice(202) %>%
  mutate(
    no_overall = 202,
    no_inseason = 20,
    title = paste0(title, " Part II")
  ) %>%
  bind_rows(table_df) %>%
  filter(no_overall < 1920) %>%
  arrange(no_overall)

table_df <- table_df %>%
  mutate(season = factor(season, levels = paste0("Season ", 1:11)))

table_df <- table_df %>%
  mutate(title = str_remove_all(title, "‡")) %>%
  mutate(full_title = paste0("Episode ", no_inseason, "\n", title))

Now I can plot the episodes by season and rating. The ggiraph package has a function that adds tooltips to a regular ggplot, so I’ve added the episode number and title, and they will pop up as your mouse hovers over each point. I put some additional theme formatting in my_theme (and the tooltip styling in tooltip.css) that I’m not showing here due to space, but there’s a rough sketch of both below.
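If you’re curious, here’s a minimal sketch of what objects like my_theme and tooltip.css could look like - these are illustrative stand-ins, not the versions actually used in this post:

# Hypothetical tooltip styling: a plain CSS string passed to opts_tooltip()
tooltip.css <- "background-color:white;color:black;padding:6px;border-radius:4px;"

# Hypothetical theme: season is already mapped to the y-axis, so hide the legend
my_theme <- theme_minimal(base_family = "Arial") +
  theme(legend.position = "none")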

p <- table_df %>% 
  filter(!is.na(rating)) %>%
  ggplot(aes(rating, fct_rev(season), color = season)) +
  geom_point_interactive(aes(tooltip = full_title, data_id = no_overall),
                         size = 3,
                         alpha = .75) +
  scale_x_continuous(breaks = seq(1, 10, .5)) +
  xlab("Viewer Rating") +
  ggtitle("X Files Episodes by Season and Viewer Rating") +
  my_theme
girafe(
  ggobj = p,
  fonts = list(sans = "Arial"),
  options = list(
    opts_tooltip(css = tooltip.css,
                 use_fill = TRUE),
    opts_sizing(width = .9)
  )
)

Don’t forget to hover over the graph to check out the pop-ups! In the next episode I will be web scraping episode scripts to use in text analysis.
