Analytics X3: Text Analysis in R, Python, and Julia

Part 1

By Katie Press

August 14, 2022


Hello Quarto!

In my last AnalyticsX3 blog post (actually the first one ever), I complained about not having a good way to present the code and output for all three languages in one document. Thankfully, Quarto has solved that for me, so this is the first time I'm blogging with Quarto! There are still some issues I need to work out, namely getting the formatting of the Quarto blog posts to match the rest of the website, because I still want to use my Hugo Apero theme. Some of the code formatting is also off, especially for the Python and Julia functions. But in addition to three languages in one post, I now have code folding and copy/paste for code chunks!

Instructions for Setting Up

First, make sure you have Python and Julia installed on your system. I'm using a MacBook Pro with an M2 chip, and I had some issues with Julia until I installed the latest release (1.8.0).

  • Install the reticulate package in RStudio for Python
  • Install the JuliaCall package in RStudio for Julia
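
Both are regular R packages, so the installs look something like this:

Code
install.packages("reticulate")
install.packages("JuliaCall")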

I ended up installing the development version of JuliaCall from GitHub before I upgraded to Julia 1.8.0, so I'm not sure it's completely necessary, but I thought I would mention it in case anyone has issues. I also had to manually point my RStudio session to my Julia install; for some reason it just wasn't finding it automatically.

Code
options(JULIA_HOME = "/Applications/Julia-1.8.app/Contents/Resources/julia/bin")

Once you create your .qmd file, all you need to do to run it with all three languages is make sure to specify the correct language in each code chunk. It’s that easy!
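
For example, one .qmd file can mix chunks like these, and each chunk runs in the engine its label names (Python via reticulate, Julia via JuliaCall):

Code
```{r}
library(tidyverse)
```

```{python}
import pandas as pd
```

```{julia}
using DataFrames
```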

Packages

Load the packages for all three languages. These are mostly related to data cleaning and/or text analysis, plus wordcloud packages of course.

Code
library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.3.6     ✔ purrr   0.3.4
✔ tibble  3.1.7     ✔ dplyr   1.0.9
✔ tidyr   1.2.0     ✔ stringr 1.4.0
✔ readr   2.1.2     ✔ forcats 0.5.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
Code
library(tidytext)
library(wordcloud2)
library(cld3)
Code
using CSV, DataFrames, Random, Languages, WordCloud
Code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import string
import random
from wordcloud import WordCloud
from matplotlib.colors import ListedColormap

# Import packages for sentiment analysis
import nltk
import ssl

# Work around SSL certificate errors that can come up when
# downloading NLTK data on macOS
try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context

nltk.download([
    "stopwords",
    "twitter_samples",
    "averaged_perceptron_tagger",
    "vader_lexicon",
    "punkt"
])
True

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/katiepress/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package twitter_samples to
[nltk_data]     /Users/katiepress/nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/katiepress/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/katiepress/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/katiepress/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
Code
from nltk.tokenize import RegexpTokenizer
from langdetect import detect

Read Data

The data is a set of 100k+ tweets that I scraped using the Tweepy library in Python. They all have hashtags related to days of the week, for example "#mondayvibes", "#tuesdaythoughts", and "#fridayfeeling", and each tweet has several hashtags. The "day" column refers to the day of the hashtag used to scrape the data, not the actual day the tweet was created. Reading the data in is pretty simple, and similar in all three languages. I'm also creating a color palette so I can use the same colors for all three wordclouds.
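
The scraping script isn't part of this post, but with Tweepy's v2 client the collection step looks roughly like this. This is just a sketch; the bearer token and query are placeholders, not my actual setup:

Code
import tweepy

#hypothetical credentials, substitute your own bearer token
client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN")

#pull recent tweets for one of the day-of-week hashtags
tweets = client.search_recent_tweets(query="#mondayvibes", max_results=100)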

Using tidyverse.

Code
tweet_data <- read_csv("final_tweet_data.csv", show_col_types = FALSE)
head(tweet_data)
# A tibble: 6 × 13
  created_at          day      author_id      id clean…¹ hasht…² num_m…³ num_u…⁴
  <dttm>              <chr>        <dbl>   <dbl> <chr>   <chr>     <dbl>   <dbl>
1 2022-08-11 19:17:07 Thursday      4977 1.56e18 happy … monday…       1       3
2 2022-08-07 23:35:46 Monday       11364 1.56e18 i foun… monday…       0       3
3 2022-08-11 18:17:31 Sunday       12305 1.56e18 the ro… monday…       1       0
4 2022-08-08 02:57:50 Sunday       35603 1.56e18 the sa… signup…       1       0
5 2022-08-08 16:58:22 Sunday      290883 1.56e18 in cas… monday…       2       0
6 2022-08-13 18:26:00 Saturday    483223 1.56e18 happen… musebo…       1       0
# … with 5 more variables: num_annotations <dbl>, num_hashtags <dbl>,
#   num_emojis <dbl>, num_words <dbl>, text <chr>, and abbreviated variable
#   names ¹​clean_text, ²​hashtags, ³​num_mentions, ⁴​num_urls
# ℹ Use `colnames()` to see all variable names
Code
#colors for charts
tableau10 = c("#4e79a7","#f28e2c","#e15759",
              "#76b7b2","#59a14f","#edc949",
              "#af7aa1","#ff9da7","#9c755f","#bab0ab")

Using DataFrames.

Code
tweet_data = CSV.read("final_tweet_data.csv", DataFrame);
first(tweet_data, 6)
6×13 DataFrame
 Row │ created_at            day       author_id  id          clean_text       ⋯
     │ String31              String15  Float64    Float64     String           ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ 2022-08-11T19:17:07Z  Thursday     4977.0  1.55781e18  happy afternoon  ⋯
   2 │ 2022-08-07T23:35:46Z  Monday      11364.0  1.55642e18  i found this hil
   3 │ 2022-08-11T18:17:31Z  Sunday      12305.0  1.55779e18  the romance you
   4 │ 2022-08-08T02:57:50Z  Sunday      35603.0  1.55647e18  the saddleback c
   5 │ 2022-08-08T16:58:22Z  Sunday     290883.0  1.55669e18  in case you miss ⋯
   6 │ 2022-08-13T18:26:00Z  Saturday   483223.0  1.55852e18  happening now is
                                                               9 columns omitted
Code

#colors for charts
tableau10 = ("#4e79a7","#f28e2c","#e15759",
              "#76b7b2","#59a14f","#edc949",
              "#af7aa1","#ff9da7","#9c755f","#bab0ab")
("#4e79a7", "#f28e2c", "#e15759", "#76b7b2", "#59a14f", "#edc949", "#af7aa1", "#ff9da7", "#9c755f", "#bab0ab")

Using Pandas.

Code
tweet_data = pd.read_csv("final_tweet_data.csv")
tweet_data.head()

             created_at  ...                                               text
0  2022-08-11T19:17:07Z  ...  Happy #thursdayvibes afternoon #LongBeachCA ne...
1  2022-08-07T23:35:46Z  ...  I found this hilarious #MondayMood 😂😂 https://...
2  2022-08-11T18:17:31Z  ...  RT @KonnectShail: The romance you can excel at...
3  2022-08-08T02:57:50Z  ...  RT @Rainmaker1973: The saddleback caterpillar ...
4  2022-08-08T16:58:22Z  ...  RT @terryelaineh1: 👋In case you missed it👇\n\n...

[5 rows x 13 columns]
Code
#colors for charts
tableau10 = ["#4e79a7","#f28e2c","#e15759",
              "#76b7b2","#59a14f","#edc949",
              "#af7aa1","#ff9da7","#9c755f","#bab0ab"]

tableau10 = ListedColormap(tableau10)

Detect Tweet Language

I have over 100k tweets in this dataset, and I know a good portion of them probably aren’t in English. Since this is a text analysis, I’m going to get rid of as many of those tweets as possible. The speed of this analysis varies by language/package (see specific tabs for info).

The cld3 package allows me to use Google's Compact Language Detector 3, which according to the documentation is a neural network model for language identification. It's very easy: just apply the detect_language function to a vector (the dataframe column), create a new column with the result, and then filter on that column. This code only took about three seconds to run.

Code
tweet_data$language <- detect_language(text = tweet_data$clean_text)
tweet_data <- tweet_data |> filter(language == "en")

The Languages.jl package provides a trigram-based model that, according to the documentation, is heavily based on the Rust package whatlang-rs. For Julia, I'm just going to write a little function and then use transform!, which will apply the function to the dataframe column and create another column with the result. Then I filter to the English-language tweets. This code took a little over one minute to run.

Code
detector = LanguageDetector() 
LanguageDetector()
Code

# Function to detect a tweet's language, returning "unknown" if detection fails
detect_my_language(x) = try
    string(detector(x)[1])
catch
    "unknown"
end
detect_my_language (generic function with 1 method)
Code
    
#apply the function to each tweet, storing the result in a new column
transform!(tweet_data, :clean_text => ByRow(detect_my_language) => :lang_detected);

#keep only the English-language tweets
subset!(tweet_data, :lang_detected => ByRow(lang_detected -> lang_detected == "Languages.English()"));

Python is going to be similar to Julia: I'll write a function, use apply, and then filter on the resulting column. The Python code took the longest, about five minutes to run!

Code
# Function to detect a tweet's language, returning 'unknown' if detection fails
def detect_my_language(x):
    try:
        return detect(x)
    except:
        return 'unknown'

tweet_data['language'] = tweet_data['clean_text'].apply(detect_my_language)

tweet_data = tweet_data[tweet_data['language'] == 'en']

Removing Stopwords

The process is basically the same in all three languages. First, get stopwords and add custom stopwords. Then split into a list of words, and remove stopwords from the list.

Tidytext has a built-in dataset (tibble) instead of a list, so I will add custom words as a tibble.

Code
#get stopwords
stop_words <- stop_words |>
  bind_rows(tibble(
    word = c(
      "monday", "tuesday", "wednesday", "thursday",
      "friday", "saturday", "sunday", "day", "days"
    )))
  
#split words and remove stopwords
word_df <- tweet_data |>  
  unnest_tokens(word, clean_text) |> 
  anti_join(stop_words)
Joining, by = "word"

The reduce function with split unpacks the tweets into a string vector of single words. Then I can subset the list to take all the words that are NOT in the stopwords list.

Code
#get stopwords
stop_words = stopwords(Languages.English());
append!(stop_words, ["day", "days", "monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday"]);

#split words
tweet_words = reduce(vcat, split.(tweet_data.clean_text, ' ', keepempty=false));

#remove stopwords
tweet_stopped = tweet_words[(!in).(tweet_words,Ref(stop_words))];

Split/tokenize the words, then use apply with a lambda function to return only the words NOT in the stopwords list.

Code
#get stopwords
stopwords = nltk.corpus.stopwords.words("english")
stopwords.extend(['monday', 'tuesday', 'wednesday', 'thursday', 'friday',
                  'saturday', 'sunday', 'day', 'days'])

#split words into tokens
wordsplit = RegexpTokenizer(r'\w+')
tweet_data['text_token'] = tweet_data['clean_text'].apply(wordsplit.tokenize)

#remove stopwords
tweet_data['text_token'] = tweet_data['text_token'].apply(lambda x: [item for item in x if item not in stopwords])

Final Prep for Wordcloud

I want to get rid of any words fewer than three characters long, which will help remove some of the extra junk. Then I just have to finish processing the data into the format each language's wordcloud package needs.

My R data is going to stay in a dataframe, so I’ll get rid of the short words. I also need a column with the word counts.

Code
word_df <- word_df |> 
  filter(nchar(word) > 2) |> 
  count(word, sort = T) 

For Julia's wordcloud package, I need to filter to the top 100 words. The first time around I didn't do that, and it took forever and the resulting wordcloud was huge!

Code
#turn the list back into a dataframe
word_df = DataFrame(words = tweet_stopped);

#get rid of short words
subset!(word_df, :words => ByRow(words -> length(words) >= 3));

#final counts 
final_counts = sort(combine(groupby(word_df, "words"), nrow), Cols("nrow"), rev=true);

#get the top 100 words
top100_df = final_counts[1:100, [:words, :nrow]];

For the Python wordcloud function, I need all the words joined into one long string.

Code
#get rid of short words and rejoin each tweet's tokens into a string
tweet_data['text_token'] = tweet_data['text_token'].apply(lambda x: ' '.join([item for item in x if len(item) > 2]))

#join all the tweets into one long string
all_words = ' '.join([word for word in tweet_data['text_token']])

Final Wordclouds

R

I like using the wordcloud2 package in R because it automatically generates wordclouds you can hover over to see the counts of the words. And it's pretty easy to use.

Code
set.seed(123)

#wordcloud2 takes one color per word, so recycle the 10-color palette
wordcloud2(data = word_df, color = rep(tableau10, 3300))

Julia

I only found one wordcloud package for Julia, and I couldn’t figure out how to actually get it to print out inline in the Quarto doc. It works when I use a notebook in VSCode though. In any case, I’ll just add it as an image below.

Code
Random.seed!(222)

wc1 = wordcloud(top100_df.words, top100_df.nrow, colors = tableau10) |> generate!

Julia Wordcloud

Python

I think Python is my least favorite language for plotting.

Code
random.seed(123)

plt.subplots(figsize=(10,10))
(<Figure size 1000x1000 with 1 Axes>, <AxesSubplot:>)
Code
wordcloud = WordCloud(width = 2000, height = 2000, background_color="white", colormap=tableau10).generate(all_words)

plt.imshow(wordcloud)
plt.axis("off")
(-0.5, 1999.5, 1999.5, -0.5)
Code
plt.margins(x=0, y=0)
plt.show()

What’s next?

Hopefully soon I’ll move on to doing sentiment analysis, and maybe some prediction or clustering.
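
Since the vader_lexicon is already downloaded above, here's a minimal sketch of what the NLTK side of that might look like (the example tweet is made up):

Code
from nltk.sentiment import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()

#returns a dict of neg/neu/pos/compound scores
sia.polarity_scores("Happy #fridayfeeling everyone, what a great week!")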
