This code has been lightly revised to make sure it works as of 2018-12-19.
Text summarization
In the realm of text summarization there are two main paths:
- extractive summarization
- abstractive summarization
Extractive summarization scores words and sentences according to some metric and then uses that information to summarize the text. This is usually done by copying (extracting) the most informative parts of the text.
The abstractive methods aim to build a semantic representation of the text and then use natural language generation techniques to generate text describing the informative parts.
Extractive summarization is generally the simpler task, with a handful of algorithms that will do the scoring, while the advent of deep learning has given abstractive summarization methods a boost.
This post will focus on an example of an extractive summarization method called TextRank which is based on the PageRank algorithm that is used by Google to rank websites by their importance.
TextRank Algorithm
The TextRank algorithm is based on a graph-based ranking algorithm, of the kind generally used in web searches at Google, but which has many other applications. Graph-based ranking algorithms try to decide the importance of a vertex by taking into account information about the entire graph rather than only vertex-specific information. A typical piece of information is the relationships (edges) between the vertices.
In the NLP case, we need to define what we want to use as vertices and edges. In our case we will be using sentences as the vertices and words as the connecting edges. So sentences with words that appear in many other sentences are seen as more important.
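To make the graph-ranking idea concrete, here is a small base-R sketch of the PageRank-style power iteration on a toy four-vertex graph. The graph and the damping factor of 0.85 are illustrative choices for this sketch, not taken from the textrank package:

```r
# Toy directed graph: A[i, j] = 1 when vertex i links to vertex j
A <- matrix(c(0, 1, 1, 0,
              0, 0, 1, 0,
              1, 0, 0, 1,
              0, 0, 1, 0),
            nrow = 4, byrow = TRUE)

d <- 0.85                     # damping factor, as in the original PageRank paper
n <- nrow(A)
M <- A / rowSums(A)           # row-normalize into transition probabilities
scores <- rep(1 / n, n)       # start from a uniform score

# Power iteration: each vertex repeatedly receives score from its in-links
for (i in seq_len(50)) {
  scores <- (1 - d) / n + d * as.vector(t(M) %*% scores)
}

round(scores, 3)
```

Vertex 3, which every other vertex links to, ends up with the highest score; that is exactly the intuition TextRank applies to sentences.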
Data preparation
We start by loading the appropriate packages, which include tidyverse for general tasks, tidytext for text manipulations, textrank for the implementation of the TextRank algorithm, and finally rvest to scrape an article to use as an example. The GitHub repository for the textrank package can be found here.
library(tidyverse)
library(tidytext)
library(textrank)
library(rvest)
To showcase this method I have randomly (after EXTENSIVELY filtering out political and controversial pieces) selected an article as our guinea pig. The main body is selected using the html_nodes() function.
url <- "http://time.com/5196761/fitbit-ace-kids-fitness-tracker/"

article <- read_html(url) %>%
  html_nodes('div[class="padded"]') %>%
  html_text()
Next, we load the article into a tibble (since tidytext requires the input as a data.frame). We start by tokenizing according to sentences, which is done by setting token = "sentences" in unnest_tokens. The tokenization is not always perfect using this tokenizer, but it has a low number of dependencies and is sufficient for this showcase. Lastly, we add a sentence number column and switch the order of the columns (textrank_sentences prefers the columns in a certain order).
article_sentences <- tibble(text = article) %>%
  unnest_tokens(sentence, text, token = "sentences") %>%
  mutate(sentence_id = row_number()) %>%
  select(sentence_id, sentence)
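Note that unnest_tokens also lowercases the text by default, which is why the sentences in the output further down appear in lower case. A quick check on a made-up two-sentence string:

```r
library(dplyr)
library(tibble)
library(tidytext)

demo <- tibble(text = "TextRank is simple. It ranks sentences by importance.")

demo_sentences <- demo %>%
  unnest_tokens(sentence, text, token = "sentences")

demo_sentences
# a 2-row tibble, both sentences lowercased
```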
Next, we will tokenize again, but this time to get words. In doing this we will retain the sentence_id column in our data.
article_words <- article_sentences %>%
  unnest_tokens(word, sentence)
Now we have all the sufficient input for the textrank_sentences function. However, we will go one step further and remove the stop words in article_words, since they would appear in most of the sentences and don't really carry any information in themselves.
article_words <- article_words %>%
  anti_join(stop_words, by = "word")
Running TextRank
Running the TextRank algorithm is easy; the textrank_sentences function only requires 2 inputs:
- A data.frame with sentences
- A data.frame with tokens (in our case words) which are part of each sentence
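As a minimal check of this interface, a toy call can be as small as the following (the three sentences are invented for illustration; the input shapes mirror article_sentences and article_words):

```r
library(dplyr)
library(tibble)
library(tidytext)
library(textrank)

# Three invented sentences, in the same shape as article_sentences
toy_sentences <- tibble(
  sentence_id = 1:3,
  sentence = c("cats chase mice.",
               "dogs chase cats.",
               "mice eat cheese.")
)

# One row per word, keeping the sentence id, as for article_words
toy_words <- toy_sentences %>%
  unnest_tokens(word, sentence)

textrank_sentences(data = toy_sentences, terminology = toy_words)
```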
So we are ready to run:
article_summary <- textrank_sentences(data = article_sentences,
                                      terminology = article_words)
The output has its own printing method that displays the top 5 sentences:
article_summary
## Textrank on sentences, showing top 5 most important sentences found:
## 1. fitbit is launching a new fitness tracker designed for children called the fitbit ace, which will go on sale for $99.95 in the second quarter of this year.
## 2. fitbit says the tracker is designed for children eight years old and up.
## 3. the fitbit ace looks a lot like the company’s alta tracker, but with a few child-friendly tweaks.
## 4. like many of fitbit’s other products, the fitbit ace can automatically track steps, monitor active minutes, and remind kids to move when they’ve been still for too long.
## 5. the most important of which is fitbit’s new family account option, which gives parents control over how their child uses their tracker and is compliant with the children’s online privacy protection act, or coppa.
Which in itself is pretty good.
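Under the hood, textrank_sentences weighs the edge between two sentences by how much their word sets overlap; the package's default distance function (textrank_jaccard, per its documentation) amounts to the Jaccard measure sketched below. The two word vectors are invented for illustration:

```r
# Jaccard overlap between two sentences' word sets:
# shared words divided by all distinct words
jaccard <- function(words1, words2) {
  length(intersect(words1, words2)) / length(union(words1, words2))
}

s1 <- c("fitbit", "launching", "fitness", "tracker")
s2 <- c("fitbit", "tracker", "children")

jaccard(s1, s2)  # 2 shared words out of 5 distinct ones: 0.4
```

Removing stop words matters precisely here: without it, words like "the" would inflate the overlap between almost every pair of sentences.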
Digging deeper
While the printing method is good, we can extract the information to do some further analysis. The information about the sentences is stored in sentences. It includes the information from article_sentences plus the calculated TextRank score.
"sentences"]] article_summary[[
Let’s begin by extracting the top 3 and bottom 3 sentences to see how they differ.
"sentences"]] %>%
article_summary[[arrange(desc(textrank)) %>%
slice(1:3) %>%
pull(sentence)
## [1] "fitbit is launching a new fitness tracker designed for children called the fitbit ace, which will go on sale for $99.95 in the second quarter of this year."
## [2] "fitbit says the tracker is designed for children eight years old and up."
## [3] "the fitbit ace looks a lot like the company’s alta tracker, but with a few child-friendly tweaks."
As expected, these are the same sentences as we saw earlier. However, the bottom sentences don’t include the word Fitbit (probably a rather important word) and focus more on “other” things, like the reference to another product in the second sentence.
"sentences"]] %>%
article_summary[[arrange(textrank) %>%
slice(1:3) %>%
pull(sentence)
## [1] "conversations with the most influential leaders in business and tech."
## [2] "please try again later."
## [3] "click the link to confirm your subscription and begin receiving our newsletters."
If we look at the article over time, it would be interesting to see where the important sentences appear.
"sentences"]] %>%
article_summary[[ggplot(aes(textrank_id, textrank, fill = textrank_id)) +
geom_col() +
theme_minimal() +
scale_fill_viridis_c() +
guides(fill = "none") +
labs(x = "Sentence",
y = "TextRank score",
title = "4 Most informative sentences appear within first half of sentences",
subtitle = 'In article "Fitbits Newest Fitness Tracker Is Just for Kids"',
caption = "Source: http://time.com/5196761/fitbit-ace-kids-fitness-tracker/")
Working with books???
Summaries help cut down the reading when used on articles. Would the same approach work on books? Let’s see what happens when we exchange “sentence” in “article” with “chapter” in “book”. I’ll go to my old friend emma from the janeaustenr package. We will borrow some code from the Text Mining with R book to create the chapters. Remember that we want 1 chapter per row.
emma_chapters <- janeaustenr::emma %>%
  tibble(text = .) %>%
  mutate(chapter_id = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
                                                    ignore_case = TRUE)))) %>%
  filter(chapter_id > 0) %>%
  group_by(chapter_id) %>%
  summarise(text = paste(text, collapse = ' '))
We proceed as before to find the words and remove the stop words.
emma_words <- emma_chapters %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word")
We run the textrank_sentences function again. It should still be very quick, as the bottleneck of the algorithm is the number of vertices rather than their individual size.
emma_summary <- textrank_sentences(data = emma_chapters,
                                   terminology = emma_words)
We will be careful not to use the standard printing method as it would print 5 whole chapters!!
Instead, we will look at the bar chart again to see if the important chapters appear in any particular order.
"sentences"]] %>%
emma_summary[[ggplot(aes(textrank_id, textrank, fill = textrank_id)) +
geom_col() +
theme_minimal() +
scale_fill_viridis_c(option = "inferno") +
guides(fill = "none") +
labs(x = "Chapter",
y = "TextRank score",
title = "Chapter importance in the novel Emma by Jane Austen") +
scale_x_continuous(breaks = seq(from = 0, to = 55, by = 5))
This doesn’t appear to be the case in this particular text (which is probably good, since skipping a chapter would be discouraged in a book like Emma). However, it might prove helpful in non-chronological texts.
Session information
─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.0.5 (2021-03-31)
 os       macOS Big Sur 10.16
 system   x86_64, darwin17.0
 ui       X11
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       Pacific/Honolulu
 date     2021-07-05
─ Packages ───────────────────────────────────────────────────────────────────
 package     * version date       lib source
 assertthat    0.2.1   2019-03-21 [1] CRAN (R 4.0.0)
 backports     1.2.1   2020-12-09 [1] CRAN (R 4.0.2)
 blogdown      1.3     2021-04-14 [1] CRAN (R 4.0.2)
 bookdown      0.22    2021-04-22 [1] CRAN (R 4.0.2)
 broom         0.7.6   2021-04-05 [1] CRAN (R 4.0.2)
 bslib         0.2.5.1 2021-05-18 [1] CRAN (R 4.0.2)
 cellranger    1.1.0   2016-07-27 [1] CRAN (R 4.0.0)
 cli           3.0.0   2021-06-30 [1] CRAN (R 4.0.2)
 clipr         0.7.1   2020-10-08 [1] CRAN (R 4.0.2)
 codetools     0.2-18  2020-11-04 [1] CRAN (R 4.0.5)
 colorspace    2.0-2   2021-06-24 [1] CRAN (R 4.0.2)
 crayon        1.4.1   2021-02-08 [1] CRAN (R 4.0.2)
 curl          4.3.2   2021-06-23 [1] CRAN (R 4.0.2)
 data.table    1.14.0  2021-02-21 [1] CRAN (R 4.0.2)
 DBI           1.1.1   2021-01-15 [1] CRAN (R 4.0.2)
 dbplyr        2.1.1   2021-04-06 [1] CRAN (R 4.0.2)
 desc          1.3.0   2021-03-05 [1] CRAN (R 4.0.2)
 details     * 0.2.1   2020-01-12 [1] CRAN (R 4.0.0)
 digest        0.6.27  2020-10-24 [1] CRAN (R 4.0.2)
 dplyr       * 1.0.7   2021-06-18 [1] CRAN (R 4.0.2)
 ellipsis      0.3.2   2021-04-29 [1] CRAN (R 4.0.2)
 evaluate      0.14    2019-05-28 [1] CRAN (R 4.0.0)
 fansi         0.5.0   2021-05-25 [1] CRAN (R 4.0.2)
 farver        2.1.0   2021-02-28 [1] CRAN (R 4.0.2)
 forcats     * 0.5.1   2021-01-27 [1] CRAN (R 4.0.2)
 fs            1.5.0   2020-07-31 [1] CRAN (R 4.0.2)
 generics      0.1.0   2020-10-31 [1] CRAN (R 4.0.2)
 ggplot2     * 3.3.5   2021-06-25 [1] CRAN (R 4.0.2)
 glue          1.4.2   2020-08-27 [1] CRAN (R 4.0.2)
 gtable        0.3.0   2019-03-25 [1] CRAN (R 4.0.0)
 haven         2.4.1   2021-04-23 [1] CRAN (R 4.0.2)
 highr         0.9     2021-04-16 [1] CRAN (R 4.0.2)
 hms           1.1.0   2021-05-17 [1] CRAN (R 4.0.2)
 htmltools     0.5.1.1 2021-01-22 [1] CRAN (R 4.0.2)
 httr          1.4.2   2020-07-20 [1] CRAN (R 4.0.2)
 igraph        1.2.6   2020-10-06 [1] CRAN (R 4.0.2)
 janeaustenr   0.1.5   2017-06-10 [1] CRAN (R 4.0.0)
 jquerylib     0.1.4   2021-04-26 [1] CRAN (R 4.0.2)
 jsonlite      1.7.2   2020-12-09 [1] CRAN (R 4.0.2)
 knitr       * 1.33    2021-04-24 [1] CRAN (R 4.0.2)
 labeling      0.4.2   2020-10-20 [1] CRAN (R 4.0.2)
 lattice       0.20-41 2020-04-02 [1] CRAN (R 4.0.5)
 lifecycle     1.0.0   2021-02-15 [1] CRAN (R 4.0.2)
 lubridate     1.7.10  2021-02-26 [1] CRAN (R 4.0.2)
 magrittr      2.0.1   2020-11-17 [1] CRAN (R 4.0.2)
 Matrix        1.3-2   2021-01-06 [1] CRAN (R 4.0.5)
 modelr        0.1.8   2020-05-19 [1] CRAN (R 4.0.0)
 munsell       0.5.0   2018-06-12 [1] CRAN (R 4.0.0)
 pillar        1.6.1   2021-05-16 [1] CRAN (R 4.0.2)
 pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.0.0)
 png           0.1-7   2013-12-03 [1] CRAN (R 4.0.0)
 purrr       * 0.3.4   2020-04-17 [1] CRAN (R 4.0.0)
 R6            2.5.0   2020-10-28 [1] CRAN (R 4.0.2)
 Rcpp          1.0.6   2021-01-15 [1] CRAN (R 4.0.2)
 readr       * 1.4.0   2020-10-05 [1] CRAN (R 4.0.2)
 readxl        1.3.1   2019-03-13 [1] CRAN (R 4.0.2)
 reprex        2.0.0   2021-04-02 [1] CRAN (R 4.0.2)
 rlang         0.4.11  2021-04-30 [1] CRAN (R 4.0.2)
 rmarkdown     2.9     2021-06-15 [1] CRAN (R 4.0.2)
 rprojroot     2.0.2   2020-11-15 [1] CRAN (R 4.0.2)
 rstudioapi    0.13    2020-11-12 [1] CRAN (R 4.0.2)
 rvest       * 1.0.0   2021-03-09 [1] CRAN (R 4.0.2)
 sass          0.4.0   2021-05-12 [1] CRAN (R 4.0.2)
 scales        1.1.1   2020-05-11 [1] CRAN (R 4.0.0)
 selectr       0.4-2   2019-11-20 [1] CRAN (R 4.0.0)
 sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 4.0.0)
 SnowballC     0.7.0   2020-04-01 [1] CRAN (R 4.0.0)
 stringi       1.6.2   2021-05-17 [1] CRAN (R 4.0.2)
 stringr     * 1.4.0   2019-02-10 [1] CRAN (R 4.0.0)
 textrank    * 0.3.1   2020-10-12 [1] CRAN (R 4.0.2)
 tibble      * 3.1.2   2021-05-16 [1] CRAN (R 4.0.2)
 tidyr       * 1.1.3   2021-03-03 [1] CRAN (R 4.0.2)
 tidyselect    1.1.1   2021-04-30 [1] CRAN (R 4.0.2)
 tidytext    * 0.3.1   2021-04-10 [1] CRAN (R 4.0.2)
 tidyverse   * 1.3.1   2021-04-15 [1] CRAN (R 4.0.2)
 tokenizers    0.2.1   2018-03-29 [1] CRAN (R 4.0.0)
 utf8          1.2.1   2021-03-12 [1] CRAN (R 4.0.2)
 vctrs         0.3.8   2021-04-29 [1] CRAN (R 4.0.2)
 viridisLite   0.4.0   2021-04-13 [1] CRAN (R 4.0.2)
 withr         2.4.2   2021-04-18 [1] CRAN (R 4.0.2)
 xfun          0.24    2021-06-15 [1] CRAN (R 4.0.2)
 xml2          1.3.2   2020-04-23 [1] CRAN (R 4.0.0)
 yaml          2.2.1   2020-02-01 [1] CRAN (R 4.0.0)

[1] /Library/Frameworks/R.framework/Versions/4.0/Resources/library