Tidy Tuesday: New York Times Bestsellers

By Katie Press in TidyTuesday

May 11, 2022

This week’s data is on New York Times Bestseller Lists. I love books so I was pretty interested in this data! I am using the nyt_full.tsv dataset from the Tidy Tuesday repo. I also created some supplemental data to use in this post, which you can download here on Kaggle.

nyt_full <- readr::read_tsv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-05-10/nyt_full.tsv', show_col_types = FALSE)

color.pal1 <- c("#34516F",  "#4E79A7", "#A0CBE8", #blue
                "#C0660C" , "#F28E2B", "#FFBE7D", #oranges
                "#3C6D36", "#59A14F", "#8CD17D", #green
                "#B6992D", "#F1CE63", #brown and yellow
                "#2E605E", "#499894", "#86BCB6", #sea green
                "#AD1F21" , "#E15759", "#FF9D9A", #pinks
                "#79706E", "#BAB0AC", #silvers
                "#D37295", "#FABFD2", #pinks
                "#B07AA1", "#D4A6C8", #purple
                "#9D7660", "#D7B5A6" #browns
                )

Longest Active Authors on the NYT Bestseller List

Data Cleaning and Aggregation

Just looking at the number of books per author, there is some cleaning to do here. A few titles start with a question mark or exclamation point and “by”.

count(nyt_full, author) %>% head()

## # A tibble: 6 × 2
##   author                    n
##   <chr>                 <int>
## 1 ! by Fannie Flagg        11
## 2 ! by Jackie Collins       6
## 3 ! by Terry Brooks         2
## 4 ! by Terry Pratchett      2
## 5 ! by W. E. B. Griffin     5
## 6 ? by Isabel Bolton        1

String Manipulation

Get rid of those unwanted characters using stringr package with regular expressions. Make sure to include the white space in the string so it doesn’t end up with leading white spaces.

nyt_full <- nyt_full %>% 
  mutate(author = str_remove(author, "! by |\\? by "))

I might want to do some analysis by decade, we already have the years. Using lubridate functions for this. The “year” will format as a 4-digit year, otherwise by just using floor_date we would get 1930-01-01. Which would probably be fine, except it’s not going to be that great for visualization.

nyt_full <- nyt_full %>% 
  mutate(decade = year(floor_date(week, years(10))))

Also converting the title to title case.

nyt_full <- nyt_full %>% 
  mutate(title = str_to_title(title))

Check to see if any of the authors are listed more than once in a week. Some do, but it looks like they are for different books. So I will count weeks once per author per book.

nyt_full %>% 
  get_dupes(author, week) %>% 
  head()

## # A tibble: 6 × 8
##   author           week       dupe_count  year  rank title_id title       decade
##   <chr>            <date>          <int> <dbl> <dbl>    <dbl> <chr>        <dbl>
## 1 Charlaine Harris 2009-10-25          2  2009     2      261 A Touch Of…   2000
## 2 Charlaine Harris 2009-10-25          2  2009    15     1323 Dead And G…   2000
## 3 Charlaine Harris 2009-11-15          2  2009     9     2045 Grave Secr…   2000
## 4 Charlaine Harris 2009-11-15          2  2009    15      261 A Touch Of…   2000
## 5 Dan Brown        2004-01-18          2  2004     1     4918 The Da Vin…   2000
## 6 Dan Brown        2004-01-18          2  2004    14      462 Angels & D…   2000

I noticed that some of the authors have books they collaborated on with others. For example, Terry Pratchet has quite a few books and only one of them has a co-author. I will split these into a separate column so that I can count the weeks or books per first author.

nyt_full %>% 
  filter(str_detect(author, "Terry Pra"))

## # A tibble: 9 × 7
##    year week        rank title_id title                  author           decade
##   <dbl> <date>     <dbl>    <dbl> <chr>                  <chr>             <dbl>
## 1  2005 2005-10-02     4     6847 Thud                   Terry Pratchett    2000
## 2  2005 2005-10-09    15     6847 Thud                   Terry Pratchett    2000
## 3  2007 2007-10-07     4     2885 Making Money           Terry Pratchett    2000
## 4  2007 2007-10-14    10     2885 Making Money           Terry Pratchett    2000
## 5  2009 2009-10-25    13     6589 The Unseen Academicals Terry Pratchett    2000
## 6  2011 2011-10-30     3     4176 Snuff                  Terry Pratchett    2010
## 7  2012 2012-07-08    13     5688 The Long Earth         Terry Pratchett…   2010
## 8  2014 2014-04-06     2     3712 Raising Steam          Terry Pratchett    2010
## 9  2014 2014-04-13    13     3712 Raising Steam          Terry Pratchett    2010

This is a good reason to use str_split_fixed, which returns a character matrix. All I’m doing here is creating a column for the first author and telling it to put in the first item in the matrix, and same for author2 and the second matrix item.

nyt_full <- nyt_full %>% 
  mutate(author1 = str_split_fixed(author, " and ", n = 2)[,1],
         author2 = str_split_fixed(author, " and ", n = 2)[,2])

Calculate Time Active by Author

Get the first and last week per author, not per book. Then calculate the number of years active.

nyt_full <- nyt_full %>% 
  group_by(author1) %>% 
  mutate(first_week = min(week),
         last_week = max(week)) %>% 
  mutate(total_author_years = round(time_length(first_week %--% last_week, "years"), 2))

I’m going to focus on the top 25 longest active authors. I can use slice_max() from dplyr to do this quickly and easily. I am going to exclude the author “Anonymous” because it probably represents several different authors.

author_long <- nyt_full %>% 
  ungroup() %>% 
  filter(author1 != "Anonymous") %>% 
  distinct(author1, first_week, last_week, total_author_years) %>% 
  slice_max(total_author_years, n = 25, with_ties = FALSE)

Looks like Ernest Hemingway is the winner for longest-active author on the NYT best seller list.

head(author_long)

## # A tibble: 6 × 4
##   author1           first_week last_week  total_author_years
##   <chr>             <date>     <date>                  <dbl>
## 1 Ernest Hemingway  1937-10-25 1999-08-15               61.8
## 2 Norman Mailer     1948-05-23 2007-02-25               58.8
## 3 John le Carré     1964-01-19 2019-11-10               55.8
## 4 Harper Lee        1960-08-07 2016-04-10               55.7
## 5 Margaret Mitchell 1936-07-06 1992-01-19               55.5
## 6 Irving Stone      1934-10-08 1985-12-01               51.2

Author Death Dates

I thought it would be interesting to look at the authors’ dates of death (most of them are no longer alive) to see how many were on the bestseller list after they died.

date_df <- tribble(~author1, ~death_date,
                   "Ernest Hemingway", "1961-07-02",
                   "Norman Mailer", "2007-11-10",
                   "John le Carré", "2020-12-12",
                   "Harper Lee", "2016-02-19",
                   "Margaret Mitchell", "1949-08-11",
                   "Irving Stone", "1989-08-26",
                   "Gore Vidal", "2012-07-31",          
                   "Graham Greene", "1991-04-03",
                   "Michael Crichton", "2008-11-04",
                   "John Updike", "2009-01-27",
                   "Frederick Forsyth", "",
                   "Herman Wouk", "2019-05-17",
                   "Saul Bellow", "2005-04-05",
                   "Philip Roth", "2018-05-22",
                   "Leon Uris", "2003-06-21",
                   "Thomas Harris", "",
                   "Clive Cussler", "2020-02-24",
                   "James A. Michener", "1997-10-16",
                   "Howard Fast", "2003-03-12",
                   "Stephen King", "",
                   "Harold Robbins", "1997-10-14",
                   "Agatha Christie", "1976-01-12",
                   "Mary Higgins Clark", "2020-01-31",
                   "Taylor Caldwell", "1985-08-30",
                   "Helen MacInnes", "1985-09-30"
                   ) %>% 
  mutate(death_date = ymd(death_date))

Join the death dates and the rest of the weeks back onto the author_long data. Creating a new column for author status and book status, I’m not sure which one I’ll use yet.

author_full <- author_long %>% 
  left_join(date_df) %>% 
  left_join(nyt_full) %>% 
  mutate(author_status = ifelse(week <= death_date | is.na(death_date), "Alive", "Dead")) %>% 
  group_by(author1, title) %>% 
  mutate(book_status = ifelse(week == min(week), "New Bestseller!", "Current Bestseller")) %>% 
  ungroup()

Some More Data Cleaning for Charts

Creating a vector of dates to serve as the x-axis for the plot (going by decades). Convert the author1 to a factor for ordering in charts. Also reversing the author factor because I want to have the author names in alphabetical order starting at the top of the y axis.

date.vec <- seq(ymd("1930-01-01"), ymd("2020-01-01"), by = "10 years")

author_full <- author_full %>% 
  mutate(Author = fct_rev(factor(author1)))

date_df <- date_df %>% 
  mutate(Author = fct_rev(factor(author1)))

Final Longest Listed Authors GGplot

Now for the plot, I’m using two different datsets so I need to specify the dataframe and aes() in each geom instead of in the overall ggplot(). Selecting the columns is just so I can rename them quickly and easily so they look good in the resulting tooltip (in the plotly version). I’m mapping the title to label and book status to group - this will not affect the ggplot chart, but these will show up in the tooltip when it’s converted using ggplotly. The Courier Prime font is from Google fonts.

p1 <- ggplot()+
  geom_point(data = author_full %>% 
               select("Week" = week,
                      Author,
                      "Title" = title,
                      "Book Status" = book_status),
             aes(x = Week,
                 y = Author,
                 color = Author,
                 label = Title, 
                 group = `Book Status`),
             shape = "square",
             size = 2
             )+
  geom_point(data = date_df %>% 
               select(Author, "Date of Death" = death_date),
             aes(x = `Date of Death`, y = Author, color = Author),
             shape = "asterisk",
             size = 3.5
             )+ 
  scale_color_manual(values = color.pal1)+
  scale_y_discrete(limits = levels(date_df$Author))+
  scale_x_date(breaks = date.vec, date_labels = "%Y")+
  labs(title = "New York Times Bestseller List",
       subtitle = "25 Longest Active Authors",
       caption = "On Bestseller List: □   Author's Death:  * \n Source: Post45 Data Collective           ")+
  theme_wsj()+
  theme(legend.position = "none",
        axis.title = element_blank(),
        panel.grid.minor.x = element_blank(),
        panel.grid.major.y = element_line(size = .5, color = "#F0DAA8"),
        axis.text.y = element_text(family = "Courier Prime"),
        axis.text.x = element_text(family = "Courier Prime"),
        plot.caption = element_text(size = 12, family = "Courier New"),
        plot.title = element_text(hjust = 1, size = 32),
        plot.subtitle = element_text(hjust = 1, size = 28))

p1

Final Interactive Longest Authors Plot

To turn this into an interactive plot with Plotly, all I have to do is use ggplotly(). This apparently removes titles and subtitles though, so I had to re-add them using the plotly layout(). Unfortunately the custom Olde English font is not showing up in the resulting html document, even though it shows up in the preview window of the knitted RMD. I also had to add an annotation to replace the caption from the ggplot chart.

ggplotly(p1, tooltip = c("x", "y", "label", "group")) %>%
  layout(
    title = list(
      "text" = paste0(
        "New York Times Bestseller List",
        '<br>',
        '<sup>',
        "Longest Active Authors"
      ),
      font = list(family = "Courier Prime",
                  size = 24)
    ),
    annotations = list(
      x = 1,
      y = -0.08,
      text = "On Bestseller List: □   Author's Death:  *",
      xref = 'paper',
      yref = 'paper',
      xanchor = 'right',
      yanchor = 'auto',
      showarrow = FALSE,
      font = list(size = 12, family = "Courier New")
    )
  )

Bestselling Books by Decade

I recently saw a ggplot2 extension called ggpage and thought it looked cool, and since this data is all about books, I figured this would be a great time to use it.

Data Aggregation

First calculate the number of weeks on the bestseller list by title.

author_full <- author_full %>% 
  group_by(author1, title) %>% 
  mutate(first_title_date = min(week), 
         last_title_date = max(week)) %>% 
  mutate(title_weeks_total = time_length(first_title_date %--% last_title_date, "weeks")) %>% 
  ungroup()

I will probably do this by decade, and some of these books have been popular in multiple decades. Gone With the Wind is of course the prime example. I think I’ll start with trying the all time weeks total and selecting the decade in which the title was originally published.

author_full %>% 
  distinct(author1, title, title_weeks_total, decade) %>% 
  group_by(decade) %>% 
  arrange(decade, desc(title_weeks_total)) %>% 
  slice(1)

## # A tibble: 10 × 4
## # Groups:   decade [10]
##    author1           title                  decade title_weeks_total
##    <chr>             <chr>                   <dbl>             <dbl>
##  1 Margaret Mitchell Gone With The Wind       1930             2898.
##  2 Norman Mailer     The Naked And The Dead   1940               61 
##  3 Herman Wouk       The Caine Mutiny         1950              127 
##  4 Harper Lee        To Kill A Mockingbird    1960               97 
##  5 Stephen King      The Stand                1970              628 
##  6 Margaret Mitchell Gone With The Wind       1980             2898.
##  7 Margaret Mitchell Gone With The Wind       1990             2898.
##  8 Thomas Harris     Hannibal                 2000               28 
##  9 Harper Lee        Go Set A Watchman        2010               36 
## 10 Stephen King      The Institute            2020               22

That’s not exactly what I want, because I need more variety so I don’t want to have Gone with the Wind on there three times. What about longest per decade?

author_full <- author_full %>% 
  group_by(author1, title, decade) %>% 
  mutate(first_title_date_d = min(week), 
         last_title_date_d = max(week)) %>% 
  mutate(title_weeks_total_decade = time_length(first_title_date_d %--% last_title_date_d, "weeks")) %>% 
  ungroup()

That looks better.

plot_df <- author_full %>% 
  distinct(author1, title, title_weeks_total_decade, decade) %>% 
  group_by(decade) %>% 
  arrange(decade, desc(title_weeks_total_decade)) %>% 
  slice(1)

plot_df

## # A tibble: 10 × 4
## # Groups:   decade [10]
##    author1           title                    decade title_weeks_total_decade
##    <chr>             <chr>                     <dbl>                    <dbl>
##  1 Margaret Mitchell Gone With The Wind         1930                      124
##  2 Norman Mailer     The Naked And The Dead     1940                       61
##  3 Herman Wouk       The Caine Mutiny           1950                      127
##  4 Harper Lee        To Kill A Mockingbird      1960                       97
##  5 Leon Uris         Trinity                    1970                       76
##  6 James A. Michener The Covenant               1980                       42
##  7 Stephen King      The Stand                  1990                       38
##  8 Philip Roth       The Plot Against America   2000                       17
##  9 Harper Lee        Go Set A Watchman          2010                       36
## 10 Stephen King      If It Bleeds               2020                       18

Plotting with ggpage

Read in the book summary data (linked from Kaggle earlier). To create this dataset, I just copied and pasted them into this Excel since I’m only using 10 books it was just easier that way. So we have one row per book, and the summary text is in a single column.

book_summary <- read_excel("book_summary.xlsx")

head(book_summary)

## # A tibble: 6 × 4
##   title                  book_no decade text                                    
##   <chr>                    <dbl>  <dbl> <chr>                                   
## 1 Gone With The Wind           1   1930 "Gone with the Wind, novel by Margaret …
## 2 The Naked And The Dead       2   1940 "Hailed as one of the finest novels to …
## 3 The Caine Mutiny             3   1950 "Winner of the Pulitzer Prize and a per…
## 4 To Kill A Mockingbird        4   1960 "Harper Lee's Pulitzer Prize-winning ma…
## 5 Trinity                      5   1970 "The \"terrible beauty\" that is Irelan…
## 6 The Covenant                 6   1980 "James A. Michener’s masterly chronicle…

Luckily, ggpage has a function to break up the text into lines for us. Now there are multiple rows per book (there should be 104 rows total).

book_summary <- book_summary %>% 
  group_by(book_no) %>% 
  nest_paragraphs(., text)

head(book_summary)

##                                                                                text
## 1     Gone with the Wind, novel by Margaret Mitchell, published in 1936. \\It won a
## 2   Pulitzer Prize in 1937. \\Gone with the Wind is a sweeping romantic story about
## 3 the American Civil War from the point of view of the Confederacy. \\In particular
## 4  it is the story of Scarlett O’Hara, a headstrong Southern belle who survives the
## 5     hardships of the war and afterward manages to establish a successful business
## 6   by capitalizing on the struggle to rebuild the South. \\Throughout the book she
##                title book_no decade
## 1 Gone With The Wind       1   1930
## 2 Gone With The Wind       1   1930
## 3 Gone With The Wind       1   1930
## 4 Gone With The Wind       1   1930
## 5 Gone With The Wind       1   1930
## 6 Gone With The Wind       1   1930

This took me a minute to figure out. I wanted to group the text by book, so that each little page is from one book. At first, it was not doing that at all. Then I saw there is a page.col specification, and I tried to use the “title” column for that and it didn’t work. I finally realized that whatever column you use for this just needs to be numeric, so I just assigned them 1-10 by decade (e.g., 1= 1930). I’m also telling it to create 5 columns, that way I can have two rows of five books/decades each.

book_par <- book_summary %>% 
  ggpage_build(page.col = "book_no", ncol = 5)

Check out the first plot - everything looks to be in order and separated correctly by book. I guess I would have preferred to have 1-5 going across the top row and 6-10 across the bottom, but it doesn’t bother me enough to mess around with this more.

book_par %>% 
  ggpage_plot(paper.show = TRUE, page.number = "top-left")

Sentiment Analysis Prep

Now I want to do sentiment analysis. These can be read in from the Tidytext package. The NRC sentiment dataset has 10 different sentiments:

nrc_df <- get_sentiments(lexicon = "nrc")

count(nrc_df, sentiment)

## # A tibble: 10 × 2
##    sentiment        n
##    <chr>        <int>
##  1 anger         1246
##  2 anticipation   837
##  3 disgust       1056
##  4 fear          1474
##  5 joy            687
##  6 negative      3318
##  7 positive      2308
##  8 sadness       1187
##  9 surprise       532
## 10 trust         1230

The only thing is, I know from prior experience that some words are connected to more than one sentiment. I don’t want that because I was planning on mapping the sentiments to colors. So to be fair, I’ll group by word and then choose a random sentiment so that I don’t pick them in alphabetical order.

nrc_slim <- nrc_df %>% 
  group_by(word) %>% 
  sample_n(1) %>% 
  ungroup()

Join the sentiments onto the book paragraph dataset.

book_par <- book_par %>% 
  left_join(nrc_slim) %>% 
  mutate(sentiment = str_to_title(sentiment))

Now try plotting again. This time, each word is assigned to a color. The words that weren’t matched to sentiments are gray, so the colored sentiment words really stand out. This looks pretty cool, so I will move on to styling the plot.

Note: I knew I wanted to use annotations to label the books, but I wasn’t sure what was going on with the plot area, axes, etc. So I just tried out adding a ggplot theme to it, and that allowed me to see what the x and y axis scales were.

book_par %>% 
  ggpage_plot(ggplot2::aes(fill = sentiment, label = word))+
  theme_light()

Some more formatting for plots

#color palette for sentiments
color.pal2 <- c("#E15759", #anger (red)
                "#F28E2B", #anticipation (orange)
                "#B6992D", #disgust (brownish yellow)
                "#59A14F", #fear (green)
                "#F1CE63", #joy (yellow)
                "#499894", #negative (teal blue)
                "#D37295", #positive (dark pink)

                "#4E79A7", #sadness (blue)
                "#B07AA1", #surprise (purple)
                "#D4A6C8", #(light pink)
                
                #"#F0DAA8"
                "#BAB0AC"
                )


#this will help make the legend nicer and get rid of the gray NA square
sent.chars <- book_par %>% 
  filter(!is.na(sentiment)) %>%  
  distinct(sentiment) %>% 
  arrange(sentiment) %>% 
  pull(sentiment)

#making a tibble to use for the annotations, creating a label
title_df <- book_summary %>% 
  left_join(author_full %>% distinct(author, title)) %>% 
  mutate(label = paste0(decade, "s",  "\n", title, "\n", author)) %>% 
  distinct(title, label)

#assigning the x and y coordinates
## I just eyeballed it and then made some adjustments. 
title_df <- title_df %>% 
  mutate(x = c(125, 125,
               218, 218,
               305, 305,
               397, 397,
               487, 487),
         y = c(rep(c(-90, -200), 5)))

Final Books by Decade Plot

It took a lot of formatting, but the end result looks pretty good!

p2 <- book_par %>%
  #mutate(sentiment = factor(sentiment)) %>% 
  mutate(sentiment = fct_explicit_na(factor(sentiment), na_level = "No Sentiment")) %>%
  ggpage_plot(ggplot2::aes(fill = sentiment, label = word)) +
  scale_y_continuous(limits = c(-300,-75)) +
  scale_fill_manual(values = color.pal2#, breaks = sent.chars
                    )+
  annotate(
    "text",
    x = title_df$x,
    y = title_df$y,
    label = title_df$label,
    size = 4,
    family = "Courier Prime",
    fontface = "bold"
  ) +
  annotate(
    "text",
    x = 300,
    y = -300,
    label = "Sentiment Analysis of Amazon Book Descriptions",
    family = "Courier Prime",
    size = 7,
    fontface = "bold"
  ) +
  labs(title = "NYT Bestsellers: Book of the Decade",
       subtitle = "Longest Lasting Book per Decade by Number of Weeks on the List",
       caption = "Source: Post45 Data Collective") +
  theme_wsj() +
  theme(
    legend.position = "bottom",
    legend.key.size = unit(.5, "cm"),
    legend.box.margin = margin(0, 0, 0, 0),
    legend.margin = margin(0, 0, 0, 0),
    legend.text = element_text(size = 12, family = "Courier Prime"),
    legend.title = element_blank(),
    plot.title = element_text(size = 32, hjust = 0.5),
    plot.subtitle = element_text(size = 18, hjust = 0.5),
    plot.caption = element_text(size = 12),
    panel.grid.major = element_blank(),
    axis.text.x = element_blank(),
    axis.text.y = element_blank(),
    axis.ticks.x = element_blank()
  ) +
  guides(fill = guide_legend(nrow = 3, ncol = 5, byrow = T))

p2

Final Books by Decade Interactive Plot

Now since I included the label in the aes(), I can feed this ggplot into ggplotly as well. Then you should be able to hover over each “word” and see what the actual word was. I’m getting a message that the horizontal legend is not supported yet, which I didn’t think was true, or maybe it’s only for ggplotly? In any case, at least the interactivity is there.

ggplotly(p2) %>%
  layout(
    title = list(
      "text" = paste0(
        "NYT Bestsellers: Book of the Decade",
        '<br>',
        '<sup>',
        "Longest Lasting Book per Decade by Number of Weeks on the List"
      ),
      font = list(family = "Courier Prime",
                  size = 24)
    ),
    
    legend = list(
      x = 0,
      xanchor = 'left',
      yanchor = 'bottom',
      orientation = 'h',
      font = list(size = 14),
      title = list(text = "")
    )
  )

I could probably do some more formatting for the ggplotly chart but it’s close enough for now.

Posted on:: May 11, 2022

Length:: 423 minute read, 90033 words

Categories:: TidyTuesday

See Also: