The Office Part II: The Smartest Guys In The Room

By Katie Press in Projects Enron

July 12, 2021

Background

In The Office Part I, I used a fake dataset, which was fine for what I wanted to demonstrate but not as interesting as real data. Text data is pretty fun to work with (at least I think so), so I decided to use the Enron email data for this post (title inspired by the book The Smartest Guys In The Room). This dataset was made publicly available during the Federal investigation into Enron’s accounting and business practices. You can download the original data at the link above, but in its current format it’s not super usable, so I’m going to clean it up first with tidyverse and tidytext tools.

Disclaimer: this post might not be 100% reproducible because of the amount of data and the time it takes to process, but further down I provide a link to the almost-tidy dataset that can be used for tidy text analysis. If you’re trying to work with the data and run into any issues, just message me and I’ll try to help.
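
One more note on setup: the code throughout this post leans on a handful of packages that are loaded once up front. Here’s a minimal sketch of what’s assumed to be attached (the color palettes pal.8 and pal.9 and the ggplot2 theme my_theme used in the charts are custom objects from my own setup and aren’t shown in the post):

# packages assumed to be loaded for the code below
library(tidyverse)   # dplyr, tidyr, purrr, stringr, readr, ggplot2
library(tidytext)    # unnest_tokens(), stop_words
library(janitor)     # clean_names()
library(lubridate)   # floor_date(), year(), %within%, %--%
library(lexicon)     # freq_first_names, function_words, etc.
library(broom)       # tidy() for the model output
library(wordcloud2)  # the interactive word cloud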

Getting and Cleaning Messy Text Data

First, get a list of all the files in the mail directory. Using recursive = TRUE returns the files in all of the subfolders, and full.names = TRUE returns the full file paths, which makes it easy to read the data in with map_dfr.

After getting the list, I’m subsetting it to only include one person’s mailbox for the purposes of this example. The very last person in the list has 557 emails to read in, which is not many compared to some of the other employees.

Then I can use map_dfr to read all of the emails into one data frame at once, using read_delim because I want tab rather than comma as the delimiter; otherwise I’d lose some of the data (like the email date). Read this way, each file comes in as a single column, so I’m naming that column “data” so the files all stack on top of each other and I don’t end up with 557 columns. The .id argument records which file each row came from, which is going to be helpful for cleaning.

files <- list.files(path = "~/Desktop/Rproj/enron/maildir/zufferli-j", recursive = TRUE, full.names = TRUE)

mail_df <- str_subset(files, "zufferli-j") %>%
  map_dfr(
    read_delim,
    delim = "\t",
    col_names = c("data"),
    escape_double = FALSE,
    trim_ws = TRUE,
    .id = "source"
  ) %>% 
  filter(!is.na(data))

Clean Header Data

After reviewing the data, I noticed that the message header almost always ends with the “X-FileName” row. I’m going to use that to separate out the header data and make a tidy dataframe from it. Using lag, I flag the row that comes immediately after each “X-FileName” row, then take a cumulative sum within each email, which gives me a column I can use to split every email into two parts (message header and body).

Rows flagged 0 belong to the message header, and the point where the flag switches to 1 marks the start of the message body (there’s a small toy illustration after the code below).

mail_df <- mail_df %>%
  mutate(header_flag = ifelse(str_detect(lag(data), "X-FileName:"), 1, 0)) %>%
  mutate(header_flag = replace_na(header_flag, 0)) %>%
  group_by(source) %>%
  mutate(header_flag = cumsum(header_flag)) %>%
  ungroup()
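
To see what the flag is doing, here’s a quick toy example (made-up rows, not the real data): the three header rows come out flagged 0 and the two body rows come out flagged 1.

# toy example: everything up to and including the "X-FileName:" row is flagged 0,
# everything after it is flagged 1
tibble(data = c("Message-ID: <123>",
                "Date: Mon, 14 May 2001",
                "X-FileName: jzufferli.nsf",
                "Hi team,",
                "Please see the attached file.")) %>%
  mutate(header_flag = ifelse(str_detect(lag(data), "X-FileName:"), 1, 0)) %>%
  mutate(header_flag = replace_na(header_flag, 0)) %>%
  mutate(header_flag = cumsum(header_flag))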

Now I can filter to only include the header info. Some emails have a lot of rows in the header due to CCs and BCCs, so I’m going to use str_extract to make a new column with just the variables I want in my tidy dataset. These will become the new column names.

temp <- mail_df %>%
  filter(header_flag == 0) %>%
  mutate(
     new_col = str_extract(
      data,
      "Message-ID:|Date:|From:|To:|Subject:|Cc:|Bcc:|X-From:|X-To:|X-cc:|X-bcc:|X-Folder:|X-Origin:|X-FileName:"
    )
  )

Tidy the Header Data

Finally, I filter out any rows that don’t contain info I want to use. I’m also using str_remove to strip the soon-to-be column names (the “Date:”, “From:”, etc. prefixes) out of the data column so the values are clean when they land in their final columns. Then I just use pivot_wider to spread new_col out into column names and fill them with the corresponding data values.

tidy_mail <- temp %>% 
  ungroup() %>% 
  filter(!is.na(new_col)) %>%
  mutate(data = str_trim(str_remove(data, new_col), side = "both")) %>% 
  select(-header_flag) %>% 
  pivot_wider(names_from = new_col, values_from = data) %>% 
  clean_names()

Now I have a tidy dataset with one row for each email. Each variable is a column and each value is stored in a cell.
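
A quick glimpse() is enough to check the structure; after clean_names() the columns should be the snake_case versions of the header fields (message_id, date, from, to, subject, x_from, x_origin, x_file_name, and so on), plus the source column identifying the file.

# one row per email, one column per header field
glimpse(tidy_mail)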

Clean The Message Body

Now take all the rows that contain message body text, group by source, and summarise them into a list column. That column can then be joined back onto the tidy dataset; the message body text itself can be cleaned later.

tidy_mail <- mail_df %>% 
  filter(header_flag == 1) %>% 
  group_by(source) %>% 
  summarise(message_body = list(data)) %>% 
  right_join(tidy_mail)

I highly recommend you do NOT do this for the full dataset in one go, because it’s a lot of data, it takes a long time, and there are a couple of other issues I had to handle that I’m not covering in this post. Basically, I put all of the above code into a function, mapped it over the top-level mail directories (one per person), and saved each person’s folder as an .RDS file. After a couple of adjustments to some problematic files, I read them all into one dataset using purrr’s map_dfr.

save_mail <- function(name){
  
  # where to save this person's tidied mailbox
  name.tosave <- paste0("tidy_mail/", name, "_tidy.RDS")
  
  # note: for the full run, `files` needs to be the listing of the entire
  # maildir (all mailboxes), not just the single mailbox from the example above
  df <- str_subset(files, name) %>%
    map_dfr(
      read_delim,
      delim = "\t",
      col_names = c("data"),
      escape_double = FALSE,
      trim_ws = TRUE,
      .id = "source"
    ) %>%
    filter(!is.na(data))
  
  # flag the row after each "X-FileName:" row, then cumsum within each email:
  # header rows end up as 0, body rows as 1
  df <- df %>%
    mutate(header_flag = ifelse(str_detect(lag(data), "X-FileName:"), 1, 0)) %>%
    mutate(header_flag = replace_na(header_flag, 0)) %>%
    group_by(source) %>%
    mutate(header_flag = cumsum(header_flag)) %>%
    ungroup()
  
  # pull out the header fields that will become column names
  temp <- df %>%
    filter(header_flag == 0) %>%
    mutate(
      new_col = str_extract(
        data,
        "Message-ID:|Date:|From:|To:|Subject:|Cc:|Bcc:|X-From:|X-To:|X-cc:|X-bcc:|X-Folder:|X-Origin:|X-FileName:"
      )
    )
  
  # tidy the header: one row per email, one column per field
  tidy_mail <- temp %>%
    ungroup() %>%
    filter(!is.na(new_col)) %>%
    mutate(data = str_trim(str_remove(data, new_col), side = "both")) %>%
    select(-header_flag) %>%
    pivot_wider(names_from = new_col, values_from = data) %>%
    janitor::clean_names()
  
  # collapse the message body into a list column and join it back on
  tidy_mail <- df %>%
    filter(header_flag == 1) %>%
    group_by(source) %>%
    summarise(message_body = list(data)) %>%
    right_join(tidy_mail)
  
  saveRDS(tidy_mail, name.tosave)
  
  gc()
}

# get the names of the top-level folders (one per person)
files2 <- list.files("maildir")

# tidy each person's folder and save it to .RDS using the save_mail function
map(files2, save_mail)

To save you the trouble of doing all that, I’ve uploaded the final tidy-ish dataset here. The message body column is unnested into sentences, so you can perform most tidy text mining operations on it by unnesting the sentences further (into words, n-grams, etc.).
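
That sentence-level step isn’t shown in this post, but if you’re starting from the list-column version built above rather than the download, a sketch like this (using tidytext’s sentence tokenizer) gets you to one row per sentence; the shared dataset also has a cleaned-up date column (date_clean) that I’m not covering here.

# sketch: flatten the message_body list column, then split into sentences
tidy_enron <- tidy_mail %>%
  unnest(message_body) %>%
  unnest_tokens(sentence, message_body, token = "sentences")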

Tidy Text Analysis

Using the sentences dataset, I first filtered out all of the “sentences” (rows) that had no letters at all (only numbers, special characters, or blank spaces). Then I unnest to words (one word per row), filter to letters again, and use an anti-join to remove stop words. The stop_words dataset is built into the tidytext package, so as long as that package is loaded you can use it directly in your code. I then did a word count to look for superfluous words I could exclude to cut the size of the dataset down further (there’s a quick example after the code below). I filtered out words with special characters, and words of 15 or more letters, since those are usually just long strings of nonsense (forwarded emails, etc.).

tidy_word <- tidy_enron %>%  
  unnest_tokens(word, sentence) %>% 
  filter(str_detect(word, "[[:alpha:]]")) %>% 
  anti_join(stop_words) %>% 
  filter(!str_detect(word, "-|_|\\.|:|\\d")) %>% 
  filter(str_length(word) < 15) %>% 
  select(source, x_origin, message_id, date_clean, word)
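
The word count mentioned above is just a quick frequency table, which is handy for spotting junk worth filtering out:

# most common words, to spot superfluous ones worth excluding
tidy_word %>%
  count(word, sort = TRUE)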

This is what the unnested dataset looks like now.

For this part I’ll need the dates, so I’m going to filter out all the dates that aren’t within a reasonable time frame.

tidy_word <- tidy_word %>% 
  filter(between(year(date_clean), 1998, 2002)) %>% 
  mutate(x_origin = str_to_lower(x_origin))

tidy_word <- tidy_word %>% 
  group_by(word) %>% 
  mutate(word_total = n()) %>%
  ungroup()

tidy_word <- tidy_word %>% 
  mutate(month = floor_date(date_clean, "months"))

There are still some things that could be cleaned up: some of these acronyms are part of email signatures, I don’t really want people’s names in here, and so on. Even though I have a tidy dataframe with one word per row, some of the words still aren’t usable, which isn’t surprising considering this data was semi-structured at best before tidying.

The lexicon package has some datasets that are useful for filtering out unwanted words, for example common first and last names. I also decided to use the function words dataset, which contains words like “almost” and “between”.

data(freq_first_names)
data(freq_last_names)
data(function_words)
data(key_corporate_social_responsibility)

freq_first_names <- freq_first_names %>% 
  mutate(word = str_to_lower(Name))

freq_last_names <- freq_last_names %>% 
  mutate(word = str_to_lower(Surname))

function_words <- tibble::enframe(function_words) %>% 
  rename("word" = value)

Exclude the words from the tidy_word dataset.

tidy_word <- tidy_word %>% 
  anti_join(freq_first_names) %>% 
  anti_join(freq_last_names) %>% 
  anti_join(function_words)

Get the number of emails sent per month; word counts by month will come in a later step.

month_count <- tidy_word %>%
  count(message_id, month) %>% 
  count(month) %>% 
  rename("month_total" = n)

The bulk of the emails are from 2000 through 2002, and I don’t want months with really small totals in this analysis because they would skew the percentages over time. I’m going to filter down to only the months that have 1,000 emails or more, which works out to August 1999 through April 2002 (there’s a quick sketch of that cutoff after the chart below).

ggplot(month_count, aes(x = month, y = month_total))+
  geom_line(color = pal.9[8], size = 1.2)+
  scale_x_date(name = "Year", date_breaks = "year", date_labels = "%Y")+
  scale_y_continuous(name = "Emails Per Month", labels = scales::comma)+
  ggtitle("Email Frequency Over Time")+
  my_theme
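
The date range used below comes from that 1,000-email cutoff; you could get the same set of months by filtering on the monthly totals directly, something like:

# months with at least 1,000 emails (roughly August 1999 through April 2002)
month_count %>%
  filter(month_total >= 1000)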

Now I’m filtering by date again, and I’m also keeping only words that appear at least 500 times overall and are five or more characters long, because the shorter words are not usually meaningful.

month_count <- month_count %>% 
  filter(month %within% ("1999-08-01" %--% "2002-04-01"))

word_month_counts <- tidy_word %>%
  filter(month %within% ("1999-08-01" %--% "2002-04-01")) %>% 
  filter(word_total >= 500, str_length(word) > 4) %>%
  count(word, month) %>%
  complete(word, month, fill = list(n = 0)) %>%
  inner_join(month_count, by = "month") %>%
  mutate(percent = n / month_total) %>%
  mutate(year = year(month) + yday(month) / 365)

word_month_counts <- word_month_counts %>% 
  filter(percent <= 1)

word_month_counts <- word_month_counts %>% 
  filter(!str_detect(word, "font|align|color|image|serif|arial|align|helvetica|padding|link|verdana|fff|space|width|span|spacing|script|servlet|size|email|http"))

The next step is to fit a regression model for each word to see whether its frequency increases over time. I got this idea from Variance Explained. However, that example uses news articles, where the expected result is that words increase (somewhat) steadily over time. I’m not necessarily expecting that to happen with the Enron dataset, but there should at least be some spikes of activity around the scandalous events leading up to Enron’s downfall.

# a one-sided formula works as a purrr lambda here: for each word, fit a
# binomial GLM of that word's monthly share (n out of month_total) against time
mod <- ~ glm(cbind(n, month_total - n) ~ year, ., family = "binomial")

slopes <- word_month_counts %>%
  group_nest(word) %>%
  mutate(model = map(data, mod)) %>%
  mutate(model = map(model, tidy)) %>% # broom::tidy pulls out the coefficients
  select(-data) %>% 
  unnest(model) %>%
  filter(term == "year") %>% # keep the slope on year for each word
  mutate_if(is.numeric, ~round(., 3))

Here are the words that increase over time. Very interesting that “underhanded”, “pocketbooks”, and “shredding” are all at the top of the list. Further down the list, you’ll also see words like “devastated”, “scandal”, and even “chewco”, which is one of the limited partnerships Fastow set up to hide Enron’s debt (and which contributed to its downfall).

I searched through the top results in the above dataframe, and picked out a few that might be interesting so I can plot them.

slope.words <- c("underhanded", "shredding", "pocketbooks", "devastated", "astronomical", "scandal", "chewco", "auditor", "partnerships", "enforceable", "sec's", "ousted", "retirement",  "plunged", "writeoffs", "investigators", "bankrupt", "downgrade", "debacle", "omissions", "disclosures", "testify", "reassure", "hidden", "risky", "probe", "insiders", "demise", "terminations", "bearish", "selloff",
                 "questionable", "meltdown", "fallout", "greed", "evidence")

word_month_counts %>% 
  filter(word %in% slope.words) %>% 
  ggplot(., aes(x = year, y = percent, color = word))+
  geom_line(size=1.2)+
  facet_wrap(~word, scales = "free_y")+
  my_theme+
  theme(axis.text.x = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks = element_blank(),
        axis.title = element_blank())+
  xlab("Time")+
  ylab("Frequency")

Let’s take a closer look at the word “shredding” since it seems pretty shady. Notice that the huge spike happens in January 2002, right around the time the U.S. Department of Justice announced a criminal investigation of Enron. Of course, the bankruptcy was announced in December 2001, so at first I thought it could be related to that. However, when I checked the word counts by month, there were 25 mentions of “shredding” in November 2001, none in December, and 459 in January 2002.
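
Those monthly counts are easy to pull straight out of word_month_counts, for example:

# raw monthly counts for "shredding" (n) alongside the monthly email totals
word_month_counts %>%
  filter(word == "shredding") %>%
  select(word, month, n, month_total) %>%
  arrange(desc(n))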

word_month_counts %>% 
  filter(word == "shredding") %>% 
  ggplot(., aes(x = month, y = percent, color = word))+
  geom_line(size=1.2)+
  scale_x_date(date_breaks = "3 months", date_labels = "%b-%y")+
  scale_y_continuous(labels = scales::percent_format(accuracy = .1))+
  ggtitle("Frequency of the Word Shredding")+
  my_theme+
  xlab("Time")+
  ylab("Frequency")

Here’s a fun word cloud of all the words from emails containing the word “shredding”. You can hover over the cloud and it tells you the total for each word.

temp <- tidy_word %>% 
  group_by(message_id) %>% 
  filter(any(word == "shredding")) %>% 
  ungroup() %>% 
  count(word) %>% 
  rename("freq" = n)

temp2 <- temp %>% 
  filter(str_length(word) > 4, freq > 50) %>% 
  filter(!str_detect(word, "enron"))
wordcloud2(temp2, color = rep(rev(pal.8), 50000), backgroundColor = "#232928", fontFamily = "Arial", rotateRatio = .5)

I read an article in Harvard Business Review recently and learned that banks with more women on their boards commit less fraud, which I found fascinating:

The financial institutions with greater female representation on their boards were fined less often and less significantly. We proved both correlation and causation by controlling for many other factors, including the number and dollar amount of fines received the previous year, board size, director tenure, director age, CEO tenure, CEO age, CEO turnover, bank size, banks’ return on equity, and the volatility of the banks’ stock returns. We even controlled for diversity itself. In other words, was the better behavior a function of boards’ being more diverse in general—with members representing a variety of ages, nationalities, and both executives and nonexecutives—rather than because boards had more women? It turns out that gender diversity was what mattered—though I should acknowledge that other types of diversity contribute to fewer or lower fines, too.

You might have noticed the corporate social responsibility words dataset I imported earlier. I thought I would use it to see whether any of those words are more likely to come from emails written by female employees than by male employees.

First I had to clean up the x-origin column, then I made a list of all the women.

tidy_word <- tidy_word %>% 
  mutate(x_origin = replace(x_origin, x_origin == "lavorado-j", "lavorato-j"),
         x_origin = replace(x_origin, x_origin == "baughman-e", "baughman-d"),
         x_origin = replace(x_origin, x_origin == "luchi-p", "lucci-p"),
         x_origin = replace(x_origin, x_origin == "mims-p", "mims-thurston-p"),
         x_origin = replace(x_origin, x_origin %in% c("weldon-v", "wheldon-c"), "weldon-c"),
         x_origin = replace(x_origin, x_origin == "williams-b", "williams-w3"),
         x_origin = replace(x_origin, x_origin == "zufferlie-j", "zufferli-j"))

women.list <- c("bailey-s", "beck-s", "blair-l", "brawner-s", "cash-m",
                "causholli-m", "corman-s", "davis-d", "dickson-s", "fischer-m",
                "gang-l", "geaccone-t", "heard-m", "jones-t", "kitchen-l", 
                "kuykendall-t", "lavorato-j", "lokay-m", "mann-k",
                "mims-thurston-p", "panus-s", "pereira-s", "perlingiere-d",
                "ring-a", "sager-e", "sanchez-m", "scholtes-d", "scott-s",
                "semperger-c", "shackleton-s", "smith-g", "staab-t", "symes-k",
                "taylor-l", "tholt-j", "townsend-j", "ward-k", "watson-k", "white-s")

I also pulled the message IDs that are specifically from “sent mail” folders to try and ensure that most of the emails would be written by the original owner of the folder. I made a column for gender, and then filtered so that only the words from the corporate social responsibility dataset are included.
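
The sent.ids vector isn’t built anywhere in the code shown in this post. One way to get it (a sketch, assuming the x_folder header column made it into the shared dataset and that sent-mail folders can be spotted by the word “sent” in that field) would be:

# sketch: message IDs whose X-Folder header points to a sent-mail folder
sent.ids <- tidy_enron %>%
  filter(str_detect(str_to_lower(x_folder), "sent")) %>%
  distinct(message_id) %>%
  pull(message_id)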

The corporate social responsibility words are grouped by four different dimensions, and some words can be in multiple dimensions. However, since the words had the same count no matter which dimension they belong to, I didn’t want to represent them twice in my final chart, so I just selected a random dimension for each word that belonged to more than one.

word_counts <- tidy_word %>%
  mutate(gender = ifelse(x_origin %in% women.list, "Female", "Male")) %>% 
  filter(message_id %in% sent.ids) %>%
  filter(word %in% key_corporate_social_responsibility$token) %>% 
  group_by(gender) %>% 
  mutate(gender_total = n()) %>% 
  count(gender, word, gender_total) %>% 
  mutate(pct = n/gender_total)

word_freq <- word_counts %>% 
  select(gender, word, pct) %>% 
  pivot_wider(names_from = gender, values_from = pct)

word_freq <- word_freq %>% 
  left_join(key_corporate_social_responsibility %>% 
              group_by(token) %>% 
              sample_n(1) %>% 
              select("word" = token, dimension)) %>% 
  mutate(dimension = str_to_title(str_replace_all(dimension, "_", " ")))

In the chart below, words close to the diagonal line are used about equally by men and women. Words above the line are used more frequently by men, and words below the line more frequently by women. To me this looks like women use social responsibility words more frequently than men, since there are more words below the line. It also looks like the words used more frequently by men tend to be in the environmental dimension (green), while women use more of the words from the other three dimensions.

ggplot(word_freq, aes(Female, Male)) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.25, height = 0.25) +
  geom_text(aes(label = word, color = dimension), check_overlap = TRUE, vjust = 1.5) +
  scale_color_manual(values = c(pal.9[6], pal.9[5], pal.9[9], pal.9[8]))+
  scale_x_log10(labels = scales::percent_format(accuracy = .01)) +
  scale_y_log10(labels = scales::percent_format(accuracy = .01)) +
  geom_abline(color = pal.9[2])+
  ggtitle("Word Frequency by Gender")+
  my_theme+
  theme(legend.position = "bottom",
        legend.title = element_blank())

That’s all for now…let me know what you think.
