Analytics X3: Text Analysis in R, Python, and Julia
Part 1
By Katie Press
August 14, 2022
Hello Quarto!
In my last AnalyticsX3 blog post (actually the first one ever), I complained about not having a good way to present the code and output for all three languages in one document. Thankfully, Quarto has solved that for me, so this is the first time I’m blogging using Quarto! There are still some issues I need to work out — namely, getting the formatting of the Quarto blog posts to match the rest of the website, because I still want to use my Hugo Apero theme. Also, some of the code formatting is off, especially the Python and Julia functions. However, in addition to three languages in one post, I also have code folding and copy/paste for code chunks!
Instructions for Setting Up
First, make sure you have Python and Julia installed on your system. I am using a Macbook Pro with an M2 chip, and I had some issues with Julia until I installed the latest release (1.8.0).
- Install the reticulate package in RStudio for Python
- Install the JuliaCall package in RStudio for Julia
I ended up installing the development version of JuliaCall from GitHub before I upgraded to Julia 1.8.0, so I’m not sure if it’s completely necessary, but I thought I would mention it in case anyone has issues. I also had to manually point my RStudio session to my Julia install; for some reason it just wasn’t finding it automatically.
Code
options(JULIA_HOME = "/Applications/Julia-1.8.app/Contents/Resources/julia/bin")
Once you create your .qmd file, all you need to do to run it with all three languages is make sure to specify the correct language in each code chunk. It’s that easy!
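For example, a minimal .qmd set up this way might look like the sketch below (the titles and chunk contents are just placeholders, not from my actual post):

````markdown
---
title: "Three languages, one post"
---

```{r}
summary(cars)
```

```{python}
print("hello from Python")
```

```{julia}
println("hello from Julia")
```
````

Each chunk runs in its own engine, and Quarto stitches the output together in one rendered document.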
Packages
Load the packages for all three languages. These are mostly related to data cleaning and/or text analysis, plus wordcloud packages of course.
Code
library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.3.6 ✔ purrr 0.3.4
✔ tibble 3.1.7 ✔ dplyr 1.0.9
✔ tidyr 1.2.0 ✔ stringr 1.4.0
✔ readr 2.1.2 ✔ forcats 0.5.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
Code
library(tidytext)
library(wordcloud2)
library(cld3)
Code
using CSV, DataFrames, Random, Languages, WordCloud
Code
import pandas as pd, numpy as np, matplotlib.pyplot as plt, string, random
from wordcloud import WordCloud
from matplotlib.colors import ListedColormap
# Import packages for sentiment analysis
import nltk
import ssl
try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context
nltk.download(["stopwords",
"twitter_samples",
"averaged_perceptron_tagger",
"vader_lexicon",
"punkt"
])
True
[nltk_data] Downloading package stopwords to
[nltk_data] /Users/katiepress/nltk_data...
[nltk_data] Package stopwords is already up-to-date!
[nltk_data] Downloading package twitter_samples to
[nltk_data] /Users/katiepress/nltk_data...
[nltk_data] Package twitter_samples is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data] /Users/katiepress/nltk_data...
[nltk_data] Package averaged_perceptron_tagger is already up-to-
[nltk_data] date!
[nltk_data] Downloading package vader_lexicon to
[nltk_data] /Users/katiepress/nltk_data...
[nltk_data] Package vader_lexicon is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data] /Users/katiepress/nltk_data...
[nltk_data] Package punkt is already up-to-date!
Code
from nltk.tokenize import RegexpTokenizer
from langdetect import detect
Read Data
The data is a set of 100k+ tweets that I scraped using the Tweepy library in Python. They all have hashtags related to days of the week, for example, “#mondayvibes”, “#tuesdaythoughts”, “#fridayfeeling”. There are several hashtags for each tweet. The “day” column is for the day of the hashtag used to scrape the data, not the actual day the tweet was created. Reading the data in is pretty simple, and similar in all three languages. I’m also creating a color palette so I can use the same colors for all three wordclouds.
Using tidyverse.
Code
tweet_data <- read_csv("final_tweet_data.csv", show_col_types = FALSE)
head(tweet_data)
# A tibble: 6 × 13
created_at day author_id id clean…¹ hasht…² num_m…³ num_u…⁴
<dttm> <chr> <dbl> <dbl> <chr> <chr> <dbl> <dbl>
1 2022-08-11 19:17:07 Thursday 4977 1.56e18 happy … monday… 1 3
2 2022-08-07 23:35:46 Monday 11364 1.56e18 i foun… monday… 0 3
3 2022-08-11 18:17:31 Sunday 12305 1.56e18 the ro… monday… 1 0
4 2022-08-08 02:57:50 Sunday 35603 1.56e18 the sa… signup… 1 0
5 2022-08-08 16:58:22 Sunday 290883 1.56e18 in cas… monday… 2 0
6 2022-08-13 18:26:00 Saturday 483223 1.56e18 happen… musebo… 1 0
# … with 5 more variables: num_annotations <dbl>, num_hashtags <dbl>,
# num_emojis <dbl>, num_words <dbl>, text <chr>, and abbreviated variable
# names ¹clean_text, ²hashtags, ³num_mentions, ⁴num_urls
# ℹ Use `colnames()` to see all variable names
Code
#colors for charts
tableau10 = c("#4e79a7","#f28e2c","#e15759",
              "#76b7b2","#59a14f","#edc949",
              "#af7aa1","#ff9da7","#9c755f","#bab0ab")
Using DataFrames.
Code
tweet_data = CSV.read("final_tweet_data.csv", DataFrame);
first(tweet_data, 6)
6×13 DataFrame
Row │ created_at day author_id id clean_text ⋯
│ String31 String15 Float64 Float64 String ⋯
─────┼──────────────────────────────────────────────────────────────────────────
1 │ 2022-08-11T19:17:07Z Thursday 4977.0 1.55781e18 happy afternoon ⋯
2 │ 2022-08-07T23:35:46Z Monday 11364.0 1.55642e18 i found this hil
3 │ 2022-08-11T18:17:31Z Sunday 12305.0 1.55779e18 the romance you
4 │ 2022-08-08T02:57:50Z Sunday 35603.0 1.55647e18 the saddleback c
5 │ 2022-08-08T16:58:22Z Sunday 290883.0 1.55669e18 in case you miss ⋯
6 │ 2022-08-13T18:26:00Z Saturday 483223.0 1.55852e18 happening now is
9 columns omitted
Code
#colors for charts
tableau10 = ("#4e79a7","#f28e2c","#e15759",
             "#76b7b2","#59a14f","#edc949",
             "#af7aa1","#ff9da7","#9c755f","#bab0ab")
("#4e79a7", "#f28e2c", "#e15759", "#76b7b2", "#59a14f", "#edc949", "#af7aa1", "#ff9da7", "#9c755f", "#bab0ab")
Using Pandas.
Code
tweet_data = pd.read_csv("final_tweet_data.csv")
tweet_data.head()
created_at ... text
0 2022-08-11T19:17:07Z ... Happy #thursdayvibes afternoon #LongBeachCA ne...
1 2022-08-07T23:35:46Z ... I found this hilarious #MondayMood 😂😂 https://...
2 2022-08-11T18:17:31Z ... RT @KonnectShail: The romance you can excel at...
3 2022-08-08T02:57:50Z ... RT @Rainmaker1973: The saddleback caterpillar ...
4 2022-08-08T16:58:22Z ... RT @terryelaineh1: 👋In case you missed it👇\n\n...
[5 rows x 13 columns]
Code
#colors for charts
tableau10 = ["#4e79a7","#f28e2c","#e15759",
             "#76b7b2","#59a14f","#edc949",
             "#af7aa1","#ff9da7","#9c755f","#bab0ab"]
tableau10 = ListedColormap(tableau10)
Detect Tweet Language
I have over 100k tweets in this dataset, and I know a good portion of them probably aren’t in English. Since this is a text analysis, I’m going to get rid of as many of those tweets as possible. The speed of this analysis varies by language/package (see specific tabs for info).
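The timings I mention in each tab are just rough wall-clock observations on my machine. If you want to compare for yourself, a quick pattern like the one below works in any of the three languages (shown here in Python; the lambda is a toy stand-in for whichever detector you actually call):

```python
import time

def time_call(fn, *args):
    """Run fn(*args) once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

# stand-in workload: pretend each string gets "detected" as English
texts = ["happy #mondayvibes"] * 10_000
labels, elapsed = time_call(lambda xs: ["en" for _ in xs], texts)
print(f"{len(labels)} tweets in {elapsed:.3f}s")
```

For a single shot like this, `time.perf_counter` is enough; for careful comparisons you'd want repeated runs (e.g. the timeit module).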
The cld3 package allows me to use Google’s Compact Language Detector 3, which according to the documentation is a neural network model for language identification. It’s very easy just to apply the detect_language function to a vector (the dataframe column), create a new column with the result, and then filter based on that column. This code only took about three seconds to run.
Code
tweet_data$language = detect_language(text = tweet_data$clean_text)
tweet_data <- tweet_data |> filter(language == "en")
The Languages.jl package provides a “trigram-based model” which, according to the documentation, is heavily based on the Rust crate whatlang-rs. For Julia I’m just going to write a little function and then use transform, which will apply the function to the dataframe column and create another column with the result, then filter to English-language tweets. This code took a little over one minute to run.
Code
detector = LanguageDetector()
LanguageDetector()
Code
# Function to detect if text is English
detect_my_language(x) = try
    return string(detector(x)[1])
catch
    "unknown"
end
detect_my_language (generic function with 1 method)
Code
transform!(tweet_data, :clean_text => ByRow(x -> detect_my_language(x)) => :lang_detected);
subset!(tweet_data, :lang_detected => ByRow(lang_detected -> lang_detected == "Languages.English()"));
Python is going to be similar to Julia: I’ll write a function, use apply, and then filter on the resulting column. The Python code took the longest, about five minutes to run!
Code
# Function to detect if text is English
def detect_my_language(x):
    try:
        return detect(x)
    except:
        return 'unknown'

tweet_data['language'] = tweet_data['clean_text'].apply(detect_my_language)
tweet_data = tweet_data[tweet_data['language'] == 'en']
Removing Stopwords
The process is basically the same in all three languages. First, get stopwords and add custom stopwords. Then split into a list of words, and remove stopwords from the list.
Tidytext has a built-in dataset (tibble) instead of a list, so I will add custom words as a tibble.
Code
#get stopwords
stop_words <- stop_words |>
  bind_rows(tibble(
    word = c(
      "monday", "tuesday", "wednesday", "thursday",
      "friday", "saturday", "sunday", "day", "days"
    )))

#split words and remove stopwords
word_df <- tweet_data |>
  unnest_tokens(word, clean_text) |>
  anti_join(stop_words)
Joining, by = "word"
The reduce function with split unpacks the tweets into a string vector of single words. Then I can subset the list to take all the words that are NOT in the stopwords list.
Code
#get stopwords
stop_words = stopwords(Languages.English());
append!(stop_words, ["day", "days", "monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday"]);

#split words
tweet_words = reduce(vcat, split.(tweet_data.clean_text, ' ', keepempty=false));

#remove stopwords
tweet_stopped = tweet_words[(!in).(tweet_words, Ref(stop_words))];
Split/tokenize the words, then use apply with a lambda function to return only the words NOT in the stopwords list.
Code
#get stopwords
stopwords = nltk.corpus.stopwords.words("english")
stopwords.extend(['monday', 'tuesday', 'wednesday', 'thursday', 'friday', 'saturday', 'sunday',
                  'day', 'days'])

#split words
wordsplit = RegexpTokenizer(r'\w+')
tweet_data['text_token'] = tweet_data['clean_text'].apply(wordsplit.tokenize)

#remove stopwords
tweet_data['text_token'] = tweet_data['text_token'].apply(lambda x: [item for item in x if item not in stopwords])
Final Prep for Wordcloud
I want to get rid of any words that have fewer than three characters in length, which will help get rid of some of the extra junk. Then I just have to finish processing the data into a format I can use for each language’s wordcloud package.
My R data is going to stay in a dataframe, so I’ll get rid of the short words. I also need a column with the word counts.
Code
word_df <- word_df |>
  filter(nchar(word) > 2) |>
  count(word, sort = TRUE)
For Julia’s wordcloud package, I need to filter to the top 100 words, because the first time around I didn’t do that, and it took forever and the resulting wordcloud was huge!
Code
#turn the list back into a dataframe
word_df = DataFrame(words = tweet_stopped);

#get rid of short words
subset!(word_df, :words => ByRow(words -> length(words) >= 3));

#final counts
final_counts = sort(combine(groupby(word_df, "words"), nrow), Cols("nrow"), rev=true);

#get the top 100 words
top100_df = final_counts[1:100, [:words, :nrow]];
For the Python wordcloud function, I need one long list of words.
Code
#get rid of short words
tweet_data['text_token'] = tweet_data['text_token'].apply(lambda x: ' '.join([item for item in x if len(item) > 2]))

#turn all words into one long list
all_words = ' '.join([word for word in tweet_data['text_token']])
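As an aside, the Python WordCloud object can also take explicit counts via its generate_from_frequencies method instead of one long string, so counting the words yourself is another option. A small sketch with toy tokens (not the approach I used above):

```python
from collections import Counter

# toy tokens standing in for one tweet_data['text_token'] list
tokens = ["coffee", "monday", "coffee", "vibes", "coffee", "vibes"]
freqs = Counter(tokens)

# top words, analogous to the Julia top-100 step
print(freqs.most_common(2))  # [('coffee', 3), ('vibes', 2)]

# WordCloud(...).generate_from_frequencies(freqs) would accept this mapping
```

This also makes it easy to cap the cloud at a top-n, the same trick I needed for the Julia version below.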
Final Wordclouds
R
I like using the wordcloud2 package in R because it automatically generates wordclouds you can hover over to see the counts of the words. And it’s pretty easy to use.
Code
set.seed(123)
wordcloud2(data = word_df, color = rep(tableau10, 3300))
Julia
I only found one wordcloud package for Julia, and I couldn’t figure out how to actually get it to print out inline in the Quarto doc. It works when I use a notebook in VSCode though. In any case, I’ll just add it as an image below.
Random.seed!(222)
wc1 = wordcloud(top100_df.words, top100_df.nrow, colors = tableau10) |> generate!
Python
I think Python is my least favorite language for plotting.
Code
random.seed(123)
plt.subplots(figsize=(10,10))
(<Figure size 1000x1000 with 1 Axes>, <AxesSubplot:>)
Code
wordcloud = WordCloud(width=2000, height=2000, background_color="white", colormap=tableau10).generate(all_words)
plt.imshow(wordcloud)
plt.axis("off")
(-0.5, 1999.5, 1999.5, -0.5)
Code
plt.margins(x=0, y=0)
plt.show()
What’s next?
Hopefully soon I’ll move on to doing sentiment analysis, and maybe some prediction or clustering.