This code has been lightly revised to make sure it works as of 2018-12-16.
In this post Iβll go through how I created the data visualization I posted yesterday on Twitter:
Trying something new! Visualizing top trigrams in Jane Austen's Emma using #tidytext and #tidyverse! Blogpost coming soon! π€ #rstats #dataviz pic.twitter.com/Sy1fQJB5Ih
β Emil Hvitfeldt (@Emil_Hvitfeldt) January 23, 2018
What am I looking at?
So for this particular data-viz I took the novel Emma by Jane Austen, extracted all the trigrams (sentences of length 3), took the 150 most frequent ones, and visualized those.
This visualization is a layered horizontal tree graph where the 3 levels (vertical columns of words) correspond to words that appear at the nth place in the trigrams, e.g. first column has the first words of the trigram, the second column has middle words of trigrams, etc. Up to 20 words in each column are kept and they are ordered and sized according to occurrence in the data.
The curves represent how often two words co-occur, with the color representing starting word and transparency related to frequency.
All code is presented in the following gist.
Packages and parameters
We will be using the following packages:
library(tidyverse)
library(tidytext)
library(purrrlyr)
And the overall parameters outlined in description are defined here:
<- 20
n_word <- 150
n_top <- 3 n_gramming
Trigrams
If you have read Text Mining with R Iβm sure you have encountered the janeaustenr
package. We will use the Emma novel, and tidytext
βs unnest_tokens
to calculate the trigrams we need. We also specify the starting words.
<- tibble(text = janeaustenr::emma) %>%
trigrams unnest_tokens(trigram, text, token = "ngrams", n = n_gramming)
<- c("he", "she") start_words
next, we find the top 150 trigrams using count
and some regex magic. And we use those top words to filter such that we only will be looking at the top 150.
<- str_c("^", start_words, " ", collapse = "|")
pattern <- trigrams %>%
top_words filter(str_detect(trigram, pattern)) %>%
count(trigram, sort = TRUE) %>%
slice(seq_len(n_top)) %>%
pull(trigram)
<- trigrams %>%
trigrams filter(trigram %in% top_words)
Nodes
Since we know that each trigram has a sample format, we can create a simple function to extract the nth word in a string.
<- function(x, n, sep = " ") {
str_nth_word str_split(x, pattern = " ") %>%
map_chr(~ .x[n])
}
The following purrr::map_df
- Extracts the nth word in the trigram
- Counts and sorts the occurrences
- Grabs the top 20 words
- Equally space them along the y-axis
<- map_df(seq_len(n_gramming),
nodes ~ trigrams %>%
mutate(word = str_nth_word(trigram, .x)) %>%
count(word, sort = TRUE) %>%
slice(seq_len(n_word)) %>%
mutate(y = seq(from = n_word + 1, to = 0,
length.out = n() + 2)[seq_len(n()) + 1],
x = .x))
plot of node positions
Lets see the words so far:
%>%
nodes ggplot(aes(x, y, label = word)) +
geom_text()
Edges
When we look at the final visualization we see that the words are connected by curved lines. I achieved that by using a sigmoid curve and then transform it to match the starting and endpoints.
<- function(x_from, x_to, y_from, y_to, scale = 5, n = 100) {
sigmoid <- seq(-scale, scale, length = n)
x <- exp(x) / (exp(x) + 1)
y tibble(x = (x + scale) / (scale * 2) * (x_to - x_from) + x_from,
y = y * (y_to - y_from) + y_from)
}
The following function takes
- a list of trigrams
- a data.frame of βfromβ nodes
- a data.frame of βtoβ nodes
and returns a data.frame containing the data points for the curves we need to draw with correct starting and ending points.
<- function(trigram, from_word, to_word, scale = 5, n = 50,
egde_lines x_space = 0) {
<- from_word %>%
from_word select(-n) %>%
set_names(c("from", "y_from", "x_from"))
<- to_word %>%
to_word select(-n) %>%
set_names(c("to", "y_to", "x_to"))
<- crossing(from = from_word$from,
links to = to_word$to) %>%
mutate(word_pair = paste(from, to),
number = map_dbl(word_pair,
~ sum(str_detect(trigram$trigram, .x)))) %>%
left_join(from_word, by = "from") %>%
left_join(to_word, by = "to")
%>%
links by_row(~ sigmoid(x_from = .x$x_from + 0.2 + x_space,
x_to = .x$x_to - 0.05,
y_from = .x$y_from, y_to = .x$y_to,
scale = scale, n = n) %>%
mutate(word_pair = .x$word_pair,
number = .x$number,
from = .x$from)) %>%
pull(.out) %>%
bind_rows()
}
plot of the first set of edges
Letβs take a look at the first set of edges to see if it is working.
egde_lines(trigram = trigrams,
from_word = filter(nodes, x == 1),
to_word = filter(nodes, x == 2)) %>%
filter(number > 0) %>%
ggplot(aes(x, y, group = word_pair, alpha = number, color = from)) +
geom_line()
Calculating all egdes
For ease (and laziness) I have desired to calculate the edges in sections
- edges between the first and second column
- edges between the second and third column for words that start with βheβ
- edges between the second and third columns for words that start with βsheβ
and combine by the end.
# egdes between first and second column
<- egde_lines(trigram = trigrams,
egde1 from_word = filter(nodes, x == 1),
to_word = filter(nodes, x == 2),
n = 50) %>%
filter(number > 0) %>%
mutate(id = word_pair)
# Words in second colunm
## That start with he
<- nodes %>%
second_word_he filter(x == 2) %>%
select(-n) %>%
left_join(
%>%
trigrams filter(str_nth_word(trigram, 1) == start_words[1]) %>%
mutate(word = str_nth_word(trigram, 2)) %>%
count(word),
by = "word"
%>%
) replace_na(list(n = 0))
## That start with she
<- nodes %>%
second_word_she filter(x == 2) %>%
select(-n) %>%
left_join(
%>%
trigrams filter(str_nth_word(trigram, 1) == start_words[2]) %>%
mutate(word = str_nth_word(trigram, 2)) %>%
count(word),
by = "word"
%>%
) replace_na(list(n = 0))
# Words in third colunm
## That start with he
<- nodes %>%
third_word_he filter(x == 3) %>%
select(-n) %>%
left_join(
%>%
trigrams filter(str_nth_word(trigram, 1) == start_words[1]) %>%
mutate(word = str_nth_word(trigram, 3)) %>%
count(word),
by = "word"
%>%
) replace_na(list(n = 0))
## That start with she
<- nodes %>%
third_word_she filter(x == 3) %>%
select(-n) %>%
left_join(
%>%
trigrams filter(str_nth_word(trigram, 1) == start_words[2]) %>%
mutate(word = str_nth_word(trigram, 3)) %>%
count(word),
by = "word"
%>%
) replace_na(list(n = 0))
# egdes between second and third column that starts with he
<- egde_lines(filter(trigrams,
egde2_he str_detect(trigram, paste0("^", start_words[1], " "))),
n = 50) %>%
second_word_he, third_word_he, mutate(y = y + 0.05,
from = start_words[1],
id = str_c(from, word_pair, sep = " ")) %>%
filter(number > 0)
# egdes between second and third column that starts with she
<- egde_lines(filter(trigrams,
egde2_she str_detect(trigram, paste0("^", start_words[2], " "))),
n = 50) %>%
second_word_she, third_word_she, mutate(y = y - 0.05,
from = start_words[2],
id = str_c(from, word_pair, sep = " ")) %>%
filter(number > 0)
# All edges
<- bind_rows(egde1, egde2_he, egde2_she) edges
vizualisation
Now we just add it all together. All labels, change colors, adjust xlim
to fit words on the page.
<- nodes %>%
p ggplot(aes(x, y, label = word, size = n)) +
geom_text(hjust = 0, color = "#DDDDDD") +
theme_void() +
geom_line(data = edges,
aes(x, y, group = id, color = from, alpha = sqrt(number)),
inherit.aes = FALSE) +
theme(plot.background = element_rect(fill = "#666666", colour = 'black'),
text = element_text(color = "#EEEEEE", size = 15)) +
guides(alpha = "none", color = "none", size = "none") +
xlim(c(0.9, 3.2)) +
scale_color_manual(values = c("#5EF1F1", "#FA62D0")) +
labs(title = " Vizualizing trigrams in Jane Austen's, Emma") +
scale_size(range = c(3, 8))
p
Notes
There are a couple of differences between the Viz I posted online yesterday and the result here in this post due to a couple of mistakes found in the code during cleanup.
Extra vizualisations
<- 20
n_word <- 150
n_top <- 3
n_gramming
<- tibble(text = janeaustenr::emma) %>%
trigrams unnest_tokens(trigram, text, token = "ngrams", n = n_gramming)
<- c("i", "you") start_words
<- 20
n_word <- 150
n_top <- 3
n_gramming
library(rvest)
<- read_html("https://sherlock-holm.es/stories/plain-text/cnus.txt") %>%
sherlock_holmes html_text() %>%
str_split("\n") %>%
unlist()
<- tibble(text = sherlock_holmes) %>%
trigrams unnest_tokens(trigram, text, token = "ngrams", n = n_gramming)
<- c("holmes", "watson") start_words
session information
β Session info βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
setting value 4.0.5 (2021-03-31)
version R version 10.16
os macOS Big Sur .0
system x86_64, darwin17
ui X11 language (EN)
-8
collate en_US.UTF-8
ctype en_US.UTF/Honolulu
tz Pacific2021-07-05
date
β Packages βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ* version date lib source
package 0.2.1 2019-03-21 [1] CRAN (R 4.0.0)
assertthat 1.2.1 2020-12-09 [1] CRAN (R 4.0.2)
backports 1.3 2021-04-14 [1] CRAN (R 4.0.2)
blogdown 0.22 2021-04-22 [1] CRAN (R 4.0.2)
bookdown 0.7.6 2021-04-05 [1] CRAN (R 4.0.2)
broom 0.2.5.1 2021-05-18 [1] CRAN (R 4.0.2)
bslib 1.1.0 2016-07-27 [1] CRAN (R 4.0.0)
cellranger 3.0.0 2021-06-30 [1] CRAN (R 4.0.2)
cli 0.7.1 2020-10-08 [1] CRAN (R 4.0.2)
clipr 0.2-18 2020-11-04 [1] CRAN (R 4.0.5)
codetools 2.0-2 2021-06-24 [1] CRAN (R 4.0.2)
colorspace 1.4.1 2021-02-08 [1] CRAN (R 4.0.2)
crayon 4.3.2 2021-06-23 [1] CRAN (R 4.0.2)
curl 1.1.1 2021-01-15 [1] CRAN (R 4.0.2)
DBI 2.1.1 2021-04-06 [1] CRAN (R 4.0.2)
dbplyr 1.3.0 2021-03-05 [1] CRAN (R 4.0.2)
desc * 0.2.1 2020-01-12 [1] CRAN (R 4.0.0)
details 0.6.27 2020-10-24 [1] CRAN (R 4.0.2)
digest * 1.0.7 2021-06-18 [1] CRAN (R 4.0.2)
dplyr 0.3.2 2021-04-29 [1] CRAN (R 4.0.2)
ellipsis 0.14 2019-05-28 [1] CRAN (R 4.0.0)
evaluate 0.5.0 2021-05-25 [1] CRAN (R 4.0.2)
fansi 2.1.0 2021-02-28 [1] CRAN (R 4.0.2)
farver * 0.5.1 2021-01-27 [1] CRAN (R 4.0.2)
forcats 1.5.0 2020-07-31 [1] CRAN (R 4.0.2)
fs 0.1.0 2020-10-31 [1] CRAN (R 4.0.2)
generics * 3.3.5 2021-06-25 [1] CRAN (R 4.0.2)
ggplot2 1.4.2 2020-08-27 [1] CRAN (R 4.0.2)
glue 0.3.0 2019-03-25 [1] CRAN (R 4.0.0)
gtable 2.4.1 2021-04-23 [1] CRAN (R 4.0.2)
haven 0.9 2021-04-16 [1] CRAN (R 4.0.2)
highr 1.1.0 2021-05-17 [1] CRAN (R 4.0.2)
hms 0.5.1.1 2021-01-22 [1] CRAN (R 4.0.2)
htmltools 1.4.2 2020-07-20 [1] CRAN (R 4.0.2)
httr 0.1.5 2017-06-10 [1] CRAN (R 4.0.0)
janeaustenr 0.1.4 2021-04-26 [1] CRAN (R 4.0.2)
jquerylib 1.7.2 2020-12-09 [1] CRAN (R 4.0.2)
jsonlite * 1.33 2021-04-24 [1] CRAN (R 4.0.2)
knitr 0.4.2 2020-10-20 [1] CRAN (R 4.0.2)
labeling 0.20-41 2020-04-02 [1] CRAN (R 4.0.5)
lattice 1.0.0 2021-02-15 [1] CRAN (R 4.0.2)
lifecycle 1.7.10 2021-02-26 [1] CRAN (R 4.0.2)
lubridate 2.0.1 2020-11-17 [1] CRAN (R 4.0.2)
magrittr 1.3-2 2021-01-06 [1] CRAN (R 4.0.5)
Matrix 0.1.8 2020-05-19 [1] CRAN (R 4.0.0)
modelr 0.5.0 2018-06-12 [1] CRAN (R 4.0.0)
munsell 1.6.1 2021-05-16 [1] CRAN (R 4.0.2)
pillar 2.0.3 2019-09-22 [1] CRAN (R 4.0.0)
pkgconfig 0.1-7 2013-12-03 [1] CRAN (R 4.0.0)
png * 0.3.4 2020-04-17 [1] CRAN (R 4.0.0)
purrr * 0.0.7 2020-12-16 [1] CRAN (R 4.0.2)
purrrlyr 2.5.0 2020-10-28 [1] CRAN (R 4.0.2)
R6 1.0.6 2021-01-15 [1] CRAN (R 4.0.2)
Rcpp * 1.4.0 2020-10-05 [1] CRAN (R 4.0.2)
readr 1.3.1 2019-03-13 [1] CRAN (R 4.0.2)
readxl 2.0.0 2021-04-02 [1] CRAN (R 4.0.2)
reprex 0.4.11 2021-04-30 [1] CRAN (R 4.0.2)
rlang 2.9 2021-06-15 [1] CRAN (R 4.0.2)
rmarkdown 2.0.2 2020-11-15 [1] CRAN (R 4.0.2)
rprojroot 0.13 2020-11-12 [1] CRAN (R 4.0.2)
rstudioapi * 1.0.0 2021-03-09 [1] CRAN (R 4.0.2)
rvest 0.4.0 2021-05-12 [1] CRAN (R 4.0.2)
sass 1.1.1 2020-05-11 [1] CRAN (R 4.0.0)
scales 1.1.1 2018-11-05 [1] CRAN (R 4.0.0)
sessioninfo 0.7.0 2020-04-01 [1] CRAN (R 4.0.0)
SnowballC 1.6.2 2021-05-17 [1] CRAN (R 4.0.2)
stringi * 1.4.0 2019-02-10 [1] CRAN (R 4.0.0)
stringr * 3.1.2 2021-05-16 [1] CRAN (R 4.0.2)
tibble * 1.1.3 2021-03-03 [1] CRAN (R 4.0.2)
tidyr 1.1.1 2021-04-30 [1] CRAN (R 4.0.2)
tidyselect * 0.3.1 2021-04-10 [1] CRAN (R 4.0.2)
tidytext * 1.3.1 2021-04-15 [1] CRAN (R 4.0.2)
tidyverse 0.2.1 2018-03-29 [1] CRAN (R 4.0.0)
tokenizers 1.2.1 2021-03-12 [1] CRAN (R 4.0.2)
utf8 0.3.8 2021-04-29 [1] CRAN (R 4.0.2)
vctrs 2.4.2 2021-04-18 [1] CRAN (R 4.0.2)
withr 0.23 2021-05-15 [1] CRAN (R 4.0.2)
xfun 1.3.2 2020-04-23 [1] CRAN (R 4.0.0)
xml2 2.2.1 2020-02-01 [1] CRAN (R 4.0.0)
yaml
1] /Library/Frameworks/R.framework/Versions/4.0/Resources/library [