In my earlier post on binary text classification was one of the problems that occurred was the sheer size of the data when trying to fit a model. The bag of words method of having each column describe the occurrence of a specific word in each document (row) is appealing from a mathematical perspective but gives rise to large sparse matrices which aren’t handled well by some models in R. This leads to slow running code at best and crashing at worst.
We will try to combat this problem by using something called word embedding which is a general term for the process of mapping textural information to a lower-dimensional space. This is a special case of dimensionality reduction, and we will use the simple well-known method Principal component analysis for our word embedding today. We are essentially trying to squeeze as much information into as little space as possible such that our models can run in a reasonable time.
We will use the same data as in the earlier post, and the PCA procedure is very inspired by Julia Silge recent post Understanding PCA using Stack Overflow data which you should read if you haven’t already!!
Data prepossessing
We will use the standard tidyverse
toolset for this post. We will use randomForest
model as this approach should be much faster.
library(tidyverse)
library(tidytext)
library(broom)
library(randomForest)
The data we will be using for this demonstration will be some English1 social media disaster tweets discussed in this article. It consists of several tweets regarding accidents mixed in with a selection of control tweets (not about accidents). We start by loading the data.
<- read_csv("https://raw.githubusercontent.com/EmilHvitfeldt/blog/750dc28aa8d514e2c0b8b418ade584df8f4a8c92/data/socialmedia-disaster-tweets-DFE.csv") data
And for this exercise, we will only look at the body of the text. Furthermore, a handful of the tweets weren’t classified, marked "Can't Decide"
so we are removing those as well. Since we are working with tweet data we have the constraint that most tweets don’t have that much information in them as they are limited in characters and some only contain a couple of words.
We will at this stage remove what appears to be URLs using some regex and str_replace_all
, and we will select the columns id
, disaster
, and text
.
<- data %>%
data_clean filter(choose_one != "Can't Decide") %>%
mutate(id = `_unit_id`,
disaster = choose_one == "Relevant",
text = str_replace_all(text, " ?(f|ht)tp(s?)://(.*)[.][a-z]+", "")) %>%
select(id, disaster, text)
We then extract all unigrams, bigrams and remove stopwords.
<- map_df(1:2,
data_counts ~ unnest_tokens(data_clean, word, text,
token = "ngrams", n = .x)) %>%
anti_join(stop_words, by = "word")
We will only focus on the top 10000 most used words for the remainder of the analysis.
<- data_counts %>%
top10000 count(word, sort = TRUE) %>%
slice(1:10000) %>%
select(word)
we will then count the words again, but this time we will count the word occurrence within each document and remove the underused words.
<- data_counts %>%
unnested_words count(id, word, sort = TRUE) %>%
inner_join(top10000, by = "word")
We then cast the data.frame to a sparse matrix.
<- unnested_words %>%
sparse_word_matrix cast_sparse(id, word, n)
In the last post, we used this matrix for the modeling, but the size was quite an obstacle.
dim(sparse_word_matrix)
## [1] 10829 10000
We have a row for each document and a row for each of the top 10000 words, but most of the elements are empty so each of the variables doesn’t contain much information. We will do word embedding by applying PCA to the sparse word count matrix. Like Julia Silge we will use the wonderful irlba package that facilities PCA on sparse matrices. First, we scale the matrix and then we apply PCA where we request 64 columns.
This stage will take some time, but that is the trade-off we will be making when using word embedding. We take some computation time upfront in exchange for quick computation later down the line.
<- scale(sparse_word_matrix)
word_scaled <- irlba::prcomp_irlba(word_scaled, n = 64) word_pca
Then we will create a meta data.frame to take care of tweets that disappeared when we cleaned them earlier.
<- tibble(id = as.numeric(dimnames(sparse_word_matrix)[[1]])) %>%
meta left_join(data_clean[!duplicated(data_clean$id), ], by = "id")
Now we combine the PCA matrix with the proper response variable (disaster/non-disaster) with the addition of a training/testing split variable.
<- data.frame(word_pca$x) %>%
class_df mutate(response = factor(meta$disaster),
split = sample(0:1, NROW(meta), replace = TRUE, prob = c(0.2, 0.8)))
We now have a data frame with 64 explanatory variables instead of the 10000 we started with. This a huge reduction which hopefully should pay off. For this demonstration will we try using two kinds of models. Standard logistic regression and a random forest model. Logistic regression is a good baseline which should be blazing fast now since the reduction has taken place and the random forest model which generally was quite slow should be more manageable this time around.
<- glm(response ~ .,
model data = filter(class_df, split == 1),
family = binomial)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
<- predict(model,
y_pred type = "response",
newdata = filter(class_df, split == 0) %>% select(-response))
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
<- if_else(y_pred > 0.5, 1, 0)
y_pred_logical <- table(y_pred_logical, filter(class_df, split == 0) %>% pull(response)))
(con ##
## y_pred_logical FALSE TRUE
## 0 1163 591
## 1 70 356
sum(diag(con)) / sum(con)
## [1] 0.696789
it work fairly quickly and we get a decent accuracy of 70%. Remember this method isn’t meant to improve the accuracy but rather to improve the computational time.
<- randomForest(response ~ .,
model data = filter(class_df, split == 1))
<- predict(model,
y_pred type = "class",
newdata = filter(class_df, split == 0) %>% select(-response))
<- table(y_pred, filter(class_df, split == 0) %>% pull(response)))
(con ##
## y_pred FALSE TRUE
## FALSE 1086 383
## TRUE 147 564
sum(diag(con)) / sum(con)
## [1] 0.7568807
This one takes slightly longer to run due to the number of trees, but it does give us the nifty 76% accuracy which is pretty good considering we only look at tweets.
And this is all that there is to it! The dimensionality reduction method was able to reduce the number of variables while retaining most of the information within those variables such that we can run our procedures at a faster phase without much loss. There is still a lot of individual improvements to be done if this was to be used further, both in terms of hyper-parameter selection in the modeling choices but also the number of PCA variables that should be used in the final modeling. Remember that this is just one of the simpler methods, with more advanced word representation methods being glove and word2vec.
Data viz
Since Julia did most of the legwork for the visualizations so we will take a look at how each of the words contributes to the first four components.
<- bind_cols(Tag = colnames(word_scaled),
tidied_pca $rotation) %>%
word_pcagather(PC, Contribution, PC1:PC4)
%>%
tidied_pca filter(PC %in% paste0("PC", 1:4)) %>%
ggplot(aes(Tag, Contribution, fill = Tag)) +
geom_col(show.legend = FALSE) +
theme(axis.text.x = element_blank(),
axis.ticks.x = element_blank()) +
labs(x = "Words",
y = "Relative importance in each principal component") +
facet_wrap(~ PC, ncol = 2)
What we see is quite different then what Julia found in her study. We have just a few words doing most of the contributions in each of component. Lets zoom in to take a look at the words with the most influence on the different components:
map_df(c(-1, 1) * 20,
~ tidied_pca %>%
filter(PC == "PC1") %>%
top_n(.x, Contribution)) %>%
mutate(Tag = reorder(Tag, Contribution)) %>%
ggplot(aes(Tag, Contribution, fill = Tag)) +
geom_col(show.legend = FALSE, alpha = 0.8) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5),
axis.ticks.x = element_blank()) +
labs(x = "Words",
y = "Relative importance in principle component",
title = "PC1")
We would like to see some sensible separation between the positive words and the negative words (concerning contribution). However, I haven’t been able to come up with a meaningful full grouping for the first 3 components. The fourth on the other hand have all the positive influencing words containing numbers in one way or another.
map_df(c(-1, 1) * 20,
~ tidied_pca %>%
filter(PC == "PC4") %>%
top_n(.x, Contribution)) %>%
mutate(Tag = reorder(Tag, Contribution)) %>%
ggplot(aes(Tag, Contribution, fill = Tag)) +
geom_col(show.legend = FALSE, alpha = 0.8) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5),
axis.ticks.x = element_blank()) +
labs(x = "Words",
y = "Relative importance in principle component",
title = "PC4")
This is all I have for this time. Hope you enjoyed it!
session information
─ Session info ───────────────────────────────────────────────────────────────
setting value 4.1.0 (2021-05-18)
version R version 10.16
os macOS Big Sur .0
system x86_64, darwin17
ui X11 language (EN)
-8
collate en_US.UTF-8
ctype en_US.UTF/Los_Angeles
tz America2021-07-13
date
─ Packages ───────────────────────────────────────────────────────────────────* version date lib source
package 0.2.1 2019-03-21 [1] CRAN (R 4.1.0)
assertthat 1.2.1 2020-12-09 [1] CRAN (R 4.1.0)
backports 1.3.2 2021-06-09 [1] Github (rstudio/blogdown@00a2090)
blogdown 0.22 2021-04-22 [1] CRAN (R 4.1.0)
bookdown * 0.7.8 2021-06-24 [1] CRAN (R 4.1.0)
broom 0.2.5.1 2021-05-18 [1] CRAN (R 4.1.0)
bslib 1.1.0 2016-07-27 [1] CRAN (R 4.1.0)
cellranger 3.0.0 2021-06-30 [1] CRAN (R 4.1.0)
cli 0.7.1 2020-10-08 [1] CRAN (R 4.1.0)
clipr 0.2-18 2020-11-04 [1] CRAN (R 4.1.0)
codetools 2.0-2 2021-06-24 [1] CRAN (R 4.1.0)
colorspace 1.4.1 2021-02-08 [1] CRAN (R 4.1.0)
crayon 1.1.1 2021-01-15 [1] CRAN (R 4.1.0)
DBI 2.1.1 2021-04-06 [1] CRAN (R 4.1.0)
dbplyr 1.3.0 2021-03-05 [1] CRAN (R 4.1.0)
desc * 0.2.1 2020-01-12 [1] CRAN (R 4.1.0)
details 0.6.27 2020-10-24 [1] CRAN (R 4.1.0)
digest * 1.0.7 2021-06-18 [1] CRAN (R 4.1.0)
dplyr 0.3.2 2021-04-29 [1] CRAN (R 4.1.0)
ellipsis 0.14 2019-05-28 [1] CRAN (R 4.1.0)
evaluate 0.5.0 2021-05-25 [1] CRAN (R 4.1.0)
fansi 2.1.0 2021-02-28 [1] CRAN (R 4.1.0)
farver * 0.5.1 2021-01-27 [1] CRAN (R 4.1.0)
forcats 1.5.0 2020-07-31 [1] CRAN (R 4.1.0)
fs 0.1.0 2020-10-31 [1] CRAN (R 4.1.0)
generics * 3.3.5 2021-06-25 [1] CRAN (R 4.1.0)
ggplot2 1.4.2 2020-08-27 [1] CRAN (R 4.1.0)
glue 0.3.0 2019-03-25 [1] CRAN (R 4.1.0)
gtable 2.4.1 2021-04-23 [1] CRAN (R 4.1.0)
haven 0.9 2021-04-16 [1] CRAN (R 4.1.0)
highr 1.1.0 2021-05-17 [1] CRAN (R 4.1.0)
hms 0.5.1.1 2021-01-22 [1] CRAN (R 4.1.0)
htmltools 1.4.2 2020-07-20 [1] CRAN (R 4.1.0)
httr 0.1.5 2017-06-10 [1] CRAN (R 4.1.0)
janeaustenr 0.1.4 2021-04-26 [1] CRAN (R 4.1.0)
jquerylib 1.7.2 2020-12-09 [1] CRAN (R 4.1.0)
jsonlite * 1.33 2021-04-24 [1] CRAN (R 4.1.0)
knitr 0.4.2 2020-10-20 [1] CRAN (R 4.1.0)
labeling 0.20-44 2021-05-02 [1] CRAN (R 4.1.0)
lattice 1.0.0 2021-02-15 [1] CRAN (R 4.1.0)
lifecycle 1.7.10 2021-02-26 [1] CRAN (R 4.1.0)
lubridate 2.0.1 2020-11-17 [1] CRAN (R 4.1.0)
magrittr 1.3-3 2021-05-04 [1] CRAN (R 4.1.0)
Matrix 0.1.8 2020-05-19 [1] CRAN (R 4.1.0)
modelr 0.5.0 2018-06-12 [1] CRAN (R 4.1.0)
munsell 1.6.1 2021-05-16 [1] CRAN (R 4.1.0)
pillar 2.0.3 2019-09-22 [1] CRAN (R 4.1.0)
pkgconfig 0.1-7 2013-12-03 [1] CRAN (R 4.1.0)
png * 0.3.4 2020-04-17 [1] CRAN (R 4.1.0)
purrr 2.5.0 2020-10-28 [1] CRAN (R 4.1.0)
R6 * 4.6-14 2018-03-25 [1] CRAN (R 4.1.0)
randomForest 1.0.7 2021-07-07 [1] CRAN (R 4.1.0)
Rcpp * 1.4.0 2020-10-05 [1] CRAN (R 4.1.0)
readr 1.3.1 2019-03-13 [1] CRAN (R 4.1.0)
readxl 2.0.0 2021-04-02 [1] CRAN (R 4.1.0)
reprex 0.4.11 2021-04-30 [1] CRAN (R 4.1.0)
rlang 2.9 2021-06-15 [1] CRAN (R 4.1.0)
rmarkdown 2.0.2 2020-11-15 [1] CRAN (R 4.1.0)
rprojroot 0.13 2020-11-12 [1] CRAN (R 4.1.0)
rstudioapi 1.0.0 2021-03-09 [1] CRAN (R 4.1.0)
rvest 0.4.0 2021-05-12 [1] CRAN (R 4.1.0)
sass 1.1.1 2020-05-11 [1] CRAN (R 4.1.0)
scales 1.1.1 2018-11-05 [1] CRAN (R 4.1.0)
sessioninfo 0.7.0 2020-04-01 [1] CRAN (R 4.1.0)
SnowballC 1.6.2 2021-05-17 [1] CRAN (R 4.1.0)
stringi * 1.4.0 2019-02-10 [1] CRAN (R 4.1.0)
stringr * 3.1.2 2021-05-16 [1] CRAN (R 4.1.0)
tibble * 1.1.3 2021-03-03 [1] CRAN (R 4.1.0)
tidyr 1.1.1 2021-04-30 [1] CRAN (R 4.1.0)
tidyselect * 0.3.1 2021-04-10 [1] CRAN (R 4.1.0)
tidytext * 1.3.1 2021-04-15 [1] CRAN (R 4.1.0)
tidyverse 0.2.1 2018-03-29 [1] CRAN (R 4.1.0)
tokenizers 1.2.1 2021-03-12 [1] CRAN (R 4.1.0)
utf8 0.3.8 2021-04-29 [1] CRAN (R 4.1.0)
vctrs 2.4.2 2021-04-18 [1] CRAN (R 4.1.0)
withr 0.24 2021-06-15 [1] CRAN (R 4.1.0)
xfun 1.3.2 2020-04-23 [1] CRAN (R 4.1.0)
xml2 2.2.1 2020-02-01 [1] CRAN (R 4.1.0)
yaml
1] /Library/Frameworks/R.framework/Versions/4.1/Resources/library [