This code has been lightly revised to make sure it works as of 2018-12-16.
ggpage version 0.2.0
In this post I will highlight a couple of the new features in the new update of my package ggpage.
First we load the packages we need: tidyverse for general tidy tools, ggpage for visualization, and finally rtweet and rvest for data collection.
library(tidyverse)
library(ggpage)
library(rtweet)
library(rvest)
The basics
The package includes two main functions, ggpage_build and ggpage_plot, which transform the data into the right shape and plot it, respectively. The functions are split so that additional transformations can be applied to the tokenized data before it is plotted.
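To make the idea of "tokenized data" concrete, here is a minimal base R sketch of a one-row-per-word table with an extra column added before any plotting. This is not how ggpage_build is implemented (it also computes the position of every word on the page); the two text lines are taken from the start of the tinderbox text, and the names tokens and long_word are made up for illustration.

```r
# Two lines of text standing in for the input data
text <- c("A soldier came marching along the high road",
          "had his knapsack on his back")

# Split each line into lowercase words
tokens_per_line <- strsplit(tolower(text), "[^a-z]+")

# One row per word, remembering which line the word came from
tokens <- data.frame(
  line = rep(seq_along(text), lengths(tokens_per_line)),
  word = unlist(tokens_per_line),
  stringsAsFactors = FALSE
)

# An extra column computed on the tokens; this is the kind of step that
# fits between building and plotting
tokens$long_word <- nchar(tokens$word) > 5

print(tokens)
```

Any column added this way can later be mapped to an aesthetic such as fill.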
The package includes an example data set containing the text of The Tinderbox by H.C. Andersen:
tinderbox %>%
  head()
## # A tibble: 6 x 2
## text book
## <chr> <chr>
## 1 "A soldier came marching along the high road: \"Left, right - le… The tinder-…
## 2 "had his knapsack on his back, and a sword at his side; he had b… The tinder-…
## 3 "and was now returning home. As he walked on, he met a very frig… The tinder-…
## 4 "witch in the road. Her under-lip hung quite down on her breast,… The tinder-…
## 5 "and said, \"Good evening, soldier; you have a very fine sword, … The tinder-…
## 6 "knapsack, and you are a real soldier; so you shall have as much… The tinder-…
This data set can be used directly with ggpage_build and ggpage_plot.
ggpage_build(tinderbox) %>%
ggpage_plot()
ggpage_build expects the column containing the text to be named “text”, which it is in the tinderbox object. This visualization gets exciting when you start combining it with other tools. We can show where the word “tinderbox” appears and add page numbers at the same time.
ggpage_build(tinderbox) %>%
mutate(tinderbox = word == "tinderbox") %>%
ggpage_plot(mapping = aes(fill = tinderbox), page.number = "top-left")
We see that the word “tinderbox” appears only 3 times, in the middle of the story.
Visualizing tweets
We can also use this to showcase a number of tweets. For this we will use the rtweet package. We will load in 100 tweets that contain the hashtag #rstats.
## whatever name you assigned to your created app
appname <- "********"
## api key (example below is not a real key)
key <- "**********"
## api secret (example below is not a real key)
secret <- "********"
## create token named "twitter_token"
twitter_token <- create_token(
  app = appname,
  consumer_key = key,
  consumer_secret = secret)
rstats_tweets <- rtweet::search_tweets("#rstats") %>%
  mutate(status_id = as.numeric(as.factor(status_id)))
Since each tweet is too long to fit on a single line, we will use the nest_paragraphs function to wrap the text within each tweet. By passing the tweet id to page.col we get one tweet per page. Additionally, we can indicate whether each tweet is a retweet by coloring the paper blue if it is and green if it isn’t. Lastly, we highlight where “rstats” is used.
rstats_tweets %>%
  select(status_id, text) %>%
  nest_paragraphs(text) %>%
  ggpage_build(page.col = "status_id", lpp = 4, ncol = 6) %>%
  mutate(rstats = word == "rstats") %>%
  ggpage_plot(mapping = aes(fill = rstats), paper.show = TRUE,
              paper.color = ifelse(rstats_tweets$is_retweet, "lightblue", "lightgreen")) +
  scale_fill_manual(values = c("grey60", "black")) +
  labs(title = "100 #rstats tweets",
       subtitle = "blue = retweet, green = original")
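The wrapping and per-tweet paper color steps can be illustrated with a small base R sketch. It uses strwrap as a stand-in for the wrapping idea and is not the nest_paragraphs implementation; the two tweets, the width of 30 and the names wrapped, lines and paper_colors are hypothetical.

```r
# Toy stand-in for the rtweet result (made-up data, not real tweets)
tweets <- data.frame(
  status_id = c(1, 2),
  text = c(
    "Just released a new version of ggpage with page numbers #rstats",
    "Reading about text visualization in R today #rstats"
  ),
  is_retweet = c(FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Wrap each tweet into short lines and keep track of which tweet
# (and therefore which page) every line belongs to
wrapped <- lapply(tweets$text, strwrap, width = 30)
lines <- data.frame(
  status_id = rep(tweets$status_id, lengths(wrapped)),
  text = unlist(wrapped),
  stringsAsFactors = FALSE
)

# One paper color per tweet: blue for retweets, green for originals
paper_colors <- ifelse(tweets$is_retweet, "lightblue", "lightgreen")

print(lines)
print(paper_colors)
```

Because the colors are computed from the tweet-level table, there is exactly one color per page, in the same order as the tweets.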
Visualizing documents
Next we will look at the Convention on the Rights of the Child, which we will scrape with rvest.
url <- "http://www.ohchr.org/EN/ProfessionalInterest/Pages/CRC.aspx"
rights_text <- read_html(url) %>%
  html_nodes('div[class="boxtext"]') %>%
  html_text() %>%
  str_split("\n") %>%
  unlist() %>%
  str_wrap() %>%
  str_split("\n") %>%
  unlist() %>%
  data.frame(text = ., stringsAsFactors = FALSE)
In this case we will remove the vertical space between the pages so that it appears as one long sheet of paper, like the website.
The wonderful case_when comes in very handy here when we want to highlight multiple different words.
For the “United Nations” highlight it was necessary to check that the words “united” and “nations” only appeared in pairs.
rights_text %>%
  ggpage_build(wtl = FALSE, y_space_pages = 0, ncol = 7) %>%
  mutate(highlight = case_when(word %in% c("child", "children") ~ "child",
                               word %in% c("right", "rights") ~ "rights",
                               word %in% c("united", "nations") ~ "United Nations",
                               TRUE ~ "other")) %>%
  ggpage_plot(mapping = aes(fill = highlight)) +
  scale_fill_manual(values = c("darkgreen", "grey", "darkblue", "darkred")) +
  labs(title = "Word highlights in the 'Convention on the Rights of the Child'",
       fill = NULL)
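The check that “united” and “nations” only appear in pairs is not shown in the post. One way it could be done is sketched below in base R, assuming the words are available as a character vector in reading order (as in the word column used above). The vector words is a made-up example, not the scraped text.

```r
# Hypothetical tokens in reading order, standing in for the `word` column
words <- c("the", "united", "nations", "convention", "on", "the",
           "rights", "of", "the", "child", "united", "nations")

# The word that follows and the word that precedes each token (NA at the ends)
next_word <- c(words[-1], NA)
prev_word <- c(NA, words[-length(words)])

# Every "united" must be followed by "nations",
# and every "nations" must be preceded by "united"
united_ok  <- isTRUE(all(next_word[words == "united"] == "nations"))
nations_ok <- isTRUE(all(prev_word[words == "nations"] == "united"))

print(united_ok && nations_ok)
```

If either check fails, matching on the single words would color occurrences that are not part of “United Nations”, and the highlight label would be misleading.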
These are just a couple of different ways to use this package. I look forward to seeing what you can come up with.
session information
─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.1.0 (2021-05-18)
 os       macOS Big Sur 10.16
 system   x86_64, darwin17.0
 ui       X11
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       America/Los_Angeles
 date     2021-07-13

─ Packages ───────────────────────────────────────────────────────────────────
 package     * version date       lib source
 assertthat    0.2.1   2019-03-21 [1] CRAN (R 4.1.0)
 backports     1.2.1   2020-12-09 [1] CRAN (R 4.1.0)
 blogdown      1.3.2   2021-06-09 [1] Github (rstudio/blogdown@00a2090)
 bookdown      0.22    2021-04-22 [1] CRAN (R 4.1.0)
 broom         0.7.8   2021-06-24 [1] CRAN (R 4.1.0)
 bslib         0.2.5.1 2021-05-18 [1] CRAN (R 4.1.0)
 cellranger    1.1.0   2016-07-27 [1] CRAN (R 4.1.0)
 cli           3.0.0   2021-06-30 [1] CRAN (R 4.1.0)
 clipr         0.7.1   2020-10-08 [1] CRAN (R 4.1.0)
 codetools     0.2-18  2020-11-04 [1] CRAN (R 4.1.0)
 colorspace    2.0-2   2021-06-24 [1] CRAN (R 4.1.0)
 crayon        1.4.1   2021-02-08 [1] CRAN (R 4.1.0)
 DBI           1.1.1   2021-01-15 [1] CRAN (R 4.1.0)
 dbplyr        2.1.1   2021-04-06 [1] CRAN (R 4.1.0)
 desc          1.3.0   2021-03-05 [1] CRAN (R 4.1.0)
 details     * 0.2.1   2020-01-12 [1] CRAN (R 4.1.0)
 digest        0.6.27  2020-10-24 [1] CRAN (R 4.1.0)
 dplyr       * 1.0.7   2021-06-18 [1] CRAN (R 4.1.0)
 ellipsis      0.3.2   2021-04-29 [1] CRAN (R 4.1.0)
 evaluate      0.14    2019-05-28 [1] CRAN (R 4.1.0)
 fansi         0.5.0   2021-05-25 [1] CRAN (R 4.1.0)
 farver        2.1.0   2021-02-28 [1] CRAN (R 4.1.0)
 forcats     * 0.5.1   2021-01-27 [1] CRAN (R 4.1.0)
 fs            1.5.0   2020-07-31 [1] CRAN (R 4.1.0)
 generics      0.1.0   2020-10-31 [1] CRAN (R 4.1.0)
 ggpage      * 0.2.3   2019-06-13 [1] CRAN (R 4.1.0)
 ggplot2     * 3.3.5   2021-06-25 [1] CRAN (R 4.1.0)
 glue          1.4.2   2020-08-27 [1] CRAN (R 4.1.0)
 gtable        0.3.0   2019-03-25 [1] CRAN (R 4.1.0)
 haven         2.4.1   2021-04-23 [1] CRAN (R 4.1.0)
 highr         0.9     2021-04-16 [1] CRAN (R 4.1.0)
 hms           1.1.0   2021-05-17 [1] CRAN (R 4.1.0)
 htmltools     0.5.1.1 2021-01-22 [1] CRAN (R 4.1.0)
 httr          1.4.2   2020-07-20 [1] CRAN (R 4.1.0)
 janeaustenr   0.1.5   2017-06-10 [1] CRAN (R 4.1.0)
 jquerylib     0.1.4   2021-04-26 [1] CRAN (R 4.1.0)
 jsonlite      1.7.2   2020-12-09 [1] CRAN (R 4.1.0)
 knitr       * 1.33    2021-04-24 [1] CRAN (R 4.1.0)
 labeling      0.4.2   2020-10-20 [1] CRAN (R 4.1.0)
 lattice       0.20-44 2021-05-02 [1] CRAN (R 4.1.0)
 lifecycle     1.0.0   2021-02-15 [1] CRAN (R 4.1.0)
 lubridate     1.7.10  2021-02-26 [1] CRAN (R 4.1.0)
 magrittr      2.0.1   2020-11-17 [1] CRAN (R 4.1.0)
 Matrix        1.3-3   2021-05-04 [1] CRAN (R 4.1.0)
 modelr        0.1.8   2020-05-19 [1] CRAN (R 4.1.0)
 munsell       0.5.0   2018-06-12 [1] CRAN (R 4.1.0)
 pillar        1.6.1   2021-05-16 [1] CRAN (R 4.1.0)
 pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.1.0)
 png           0.1-7   2013-12-03 [1] CRAN (R 4.1.0)
 purrr       * 0.3.4   2020-04-17 [1] CRAN (R 4.1.0)
 R6            2.5.0   2020-10-28 [1] CRAN (R 4.1.0)
 Rcpp          1.0.7   2021-07-07 [1] CRAN (R 4.1.0)
 readr       * 1.4.0   2020-10-05 [1] CRAN (R 4.1.0)
 readxl        1.3.1   2019-03-13 [1] CRAN (R 4.1.0)
 reprex        2.0.0   2021-04-02 [1] CRAN (R 4.1.0)
 rlang         0.4.11  2021-04-30 [1] CRAN (R 4.1.0)
 rmarkdown     2.9     2021-06-15 [1] CRAN (R 4.1.0)
 rprojroot     2.0.2   2020-11-15 [1] CRAN (R 4.1.0)
 rstudioapi    0.13    2020-11-12 [1] CRAN (R 4.1.0)
 rtweet      * 0.7.0   2020-01-08 [1] CRAN (R 4.1.0)
 rvest       * 1.0.0   2021-03-09 [1] CRAN (R 4.1.0)
 sass          0.4.0   2021-05-12 [1] CRAN (R 4.1.0)
 scales        1.1.1   2020-05-11 [1] CRAN (R 4.1.0)
 sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 4.1.0)
 SnowballC     0.7.0   2020-04-01 [1] CRAN (R 4.1.0)
 stringi       1.6.2   2021-05-17 [1] CRAN (R 4.1.0)
 stringr     * 1.4.0   2019-02-10 [1] CRAN (R 4.1.0)
 tibble      * 3.1.2   2021-05-16 [1] CRAN (R 4.1.0)
 tidyr       * 1.1.3   2021-03-03 [1] CRAN (R 4.1.0)
 tidyselect    1.1.1   2021-04-30 [1] CRAN (R 4.1.0)
 tidytext      0.3.1   2021-04-10 [1] CRAN (R 4.1.0)
 tidyverse   * 1.3.1   2021-04-15 [1] CRAN (R 4.1.0)
 tokenizers    0.2.1   2018-03-29 [1] CRAN (R 4.1.0)
 utf8          1.2.1   2021-03-12 [1] CRAN (R 4.1.0)
 vctrs         0.3.8   2021-04-29 [1] CRAN (R 4.1.0)
 withr         2.4.2   2021-04-18 [1] CRAN (R 4.1.0)
 xfun          0.24    2021-06-15 [1] CRAN (R 4.1.0)
 xml2          1.3.2   2020-04-23 [1] CRAN (R 4.1.0)
 yaml          2.2.1   2020-02-01 [1] CRAN (R 4.1.0)

[1] /Library/Frameworks/R.framework/Versions/4.1/Resources/library