This code has been lightly revised to make sure it works as of 2018-12-16.
After attending useR!2017 for the first time, with great pleasure and many new connections made, I decided to see if I could extract some of the information available in the public schedule. As with my last post, I’ll do a bit of scraping followed by a few visualizations.
Packages
library(tidyverse)
library(utils)
library(plotly)
library(ltm)
library(visNetwork)
Web scraping
I found this task easiest with the help of purrr::map(). First, we find the full schedules at the following links:
https://user2017.sched.com/2017-07-04/overview (Tuesday)
https://user2017.sched.com/2017-07-05/overview (Wednesday)
https://user2017.sched.com/2017-07-06/overview (Thursday)
https://user2017.sched.com/2017-07-07/overview (Friday)
Then we read each page in full into a tibble, along with a day variable.
day <- c("Tuesday", "Wednesday", "Thursday", "Friday")
link <- paste0("https://user2017.sched.com/2017-07-0", 4:7, "/overview")
event0 <- map2_df(link, day,
                  ~ tibble(text = readLines(.x),
                           day = .y))
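To see what map2_df is doing here without hitting the network, here is a minimal sketch. The two "pages" are made-up character vectors standing in for what readLines would return, and the toy_ names are mine, not part of the original code.

```r
library(purrr)
library(tibble)

# Made-up stand-ins for two downloaded pages (one character vector per page).
toy_pages <- list(
  c("<h3> 9:30am</h3>", "<span class='event'>Toy talk A</span>"),
  c("<h3> 1:30pm</h3>", "<span class='event'>Toy talk B</span>")
)
toy_days <- c("Tuesday", "Wednesday")

# map2_df() walks both inputs in parallel and row-binds the resulting tibbles,
# so every line of a page ends up tagged with that page's day.
toy_event0 <- map2_df(toy_pages, toy_days,
                      ~ tibble(text = .x, day = .y))
toy_event0
```

Each line of a page becomes one row, and the single day value is recycled across all rows of that page.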
Then, with the help of stringr, we extract the desired information from the document, following the idiom that “multiple simple regexes are better than one complicated one.” I also filtered out most non-talk events.
events <- event0 %>%
  filter(str_detect(text, "<span class='") | str_detect(text, "</h3>"),
         !str_detect(text, "REGISTRATION"),
         !str_detect(text, "COFFEE BREAK"),
         !str_detect(text, "LUNCH"),
         !str_detect(text, "WELCOME"),
         !str_detect(text, "Poster"),
         !str_detect(text, "RIOT SESSION"),
         !str_detect(text, "Buses"),
         !str_detect(text, "Dinner"),
         !str_detect(text, "CLOSING")) %>%
  mutate(time = str_extract(text, "<h3>.{1,7}"), # time
         time = str_replace(time, "<h3> *", ""),
         id = str_extract(text, "id='\\S{32}"), # id
         id = str_replace(id, "id='", ""),
         name = str_extract(text, str_c(id, ".*")), # name
         name = str_replace(name, str_c(id, "'>"), ""),
         name = str_extract(name, "^.*(?=( <span))"),
         room = str_extract(text, 'vs">(.*?)<'),
         room = str_replace(room, 'vs">', ""),
         room = str_replace(room, '<',"")) %>% # room
  fill(time) %>%
  filter(!str_detect(text, "<h3>")) %>%
  dplyr::select(-text)
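The step that is easiest to miss above is fill(time): the time only appears on the heading lines, so it has to be carried down onto the event lines before the headings are dropped. Here is a small sketch of just that part. The input lines are made up and only mimic the rough shape of the schedule HTML, so treat the markup as an assumption.

```r
library(dplyr)
library(stringr)
library(tidyr)

# Made-up lines: a time heading followed by the events in that slot.
toy <- tibble(text = c(
  "<h3> 9:30am</h3>",
  "<span class='event'>Toy talk A <span class=\"vs\">2.02</span></span>",
  "<span class='event'>Toy talk B <span class=\"vs\">4.02</span></span>",
  "<h3> 1:30pm</h3>",
  "<span class='event'>Toy talk C <span class=\"vs\">2.01</span></span>"
))

toy_events <- toy %>%
  mutate(time = str_extract(text, "<h3>.{1,7}"),   # only matches heading lines
         time = str_replace(time, "<h3> *", ""),
         room = str_extract(text, 'vs">(.*?)<'),   # only matches event lines
         room = str_replace(room, 'vs">', ""),
         room = str_replace(room, "<", "")) %>%
  fill(time) %>%                                   # carry each time downwards
  filter(!str_detect(text, "<h3>")) %>%            # headings are no longer needed
  select(-text)
toy_events
```

After fill, every event row knows its time slot, which is why the heading rows can be filtered out safely.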
Let’s take a look at what we have so far, to check that it is what we want.
head(events)
## # A tibble: 6 x 5
## day time id name room
## <chr> <chr> <chr> <chr> <chr>
## 1 Tuesday 9:30am 893eab219225a0990770… Data Carpentry: Open and Reproduc… 2.02
## 2 Tuesday 9:30am 30c0eebdc887f3ad3aef… Dose-response analysis using R 4.02
## 3 Tuesday 9:30am 57ce234e5ce9082da3cc… Geospatial visualization using R 4.03
## 4 Tuesday 9:30am 95b110146486b0a5f802… Introduction to Bayesian inferenc… 2.01
## 5 Tuesday 9:30am 7294f7df20ab1a7c37df… Introduction to parallel computin… 3.01
## 6 Tuesday 9:30am f15703fe51e89294f2b5… Rcpp: From Simple Examples to Mac… PLENA…
Now that we have all the information about the different events, we can scrape every event page to find its attendees. The following chunk of code might seem a little hard at first; it helps to notice that there is a second tibble inside the big tibble.
people <- map_df(events$id,
  ~ tibble(attendee = tibble(text = readLines(
             str_c("https://user2017.sched.com/event-goers/", .x))) %>%
             filter(str_detect(text, " +<li><a href=")) %>%
             .$text %>%
             str_split(., "li><li") %>%
             unlist(),
           id = .x) %>%
    mutate(attendee = str_replace(attendee, "(.*?)title=\"", ""),
           attendee = str_replace(attendee, "\"><(.*)", "")) %>%
    filter(!str_detect(attendee, "venue"),
           !str_detect(attendee, "Private")))
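The inner part of that chunk boils down to "split one long line into list items, then strip everything around the title attribute". A minimal sketch on a single made-up line follows. The markup is my guess at the general shape, not the actual page source.

```r
library(magrittr)
library(stringr)

# A made-up line in the rough shape of an attendee list.
toy_line <- '  <li><a href="/a" title="Alice"><img></a></li><li><a href="/b" title="Bob"><img></a></li>'

toy_attendees <- toy_line %>%
  str_split("li><li") %>%                   # one piece per list item
  unlist() %>%
  str_replace("(.*?)title=\"", "") %>%      # drop everything up to the title value
  str_replace("\"><(.*)", "")               # drop everything after it
toy_attendees
```

Pieces that have no title attribute pass through the two replacements unchanged, which is the reason the real code still needs a filter afterwards.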
Let’s again take a look at what we have so far, to check that it is what we want.
head(people)
## # A tibble: 6 x 2
## attendee id
## <chr> <chr>
## 1 " <li><a href=\"\">Schedule</a></li>" 893eab219225a09907704…
## 2 " … 893eab219225a09907704…
## 3 "lvaudor" 893eab219225a09907704…
## 4 "Alan Ponce" 893eab219225a09907704…
## 5 "bpiccolo" 893eab219225a09907704…
## 6 "Katharina Barzagar Nazari" 893eab219225a09907704…
Visualizations
A data set like this offers many possibilities, so here I’ll just show a few of the ones I found handy. First we make a simple bubble plot, built with a couple of left_joins and a count, and piped straight into ggplot.
left_join(events, people, by = "id") %>%
count(id) %>%
left_join(events, by = "id") %>%
filter(day == "Friday") %>%
ggplot(aes(time, room, size = n)) +
geom_point() +
theme_bw() +
scale_size(range = c(5, 20)) +
labs(title = "useR!2017 Friday schedule",
x = "")
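The join-then-count pattern in that pipeline can be sketched on toy data. The tibbles below are made up for illustration only.

```r
library(dplyr)

toy_events <- tibble(id = c("a", "b"),
                     name = c("Toy talk A", "Toy talk B"))
toy_people <- tibble(id = c("a", "a", "b"),
                     attendee = c("Ann", "Ben", "Ann"))

# One row per (event, attendee) pair, then count the rows per event id.
toy_counts <- left_join(toy_events, toy_people, by = "id") %>%
  count(id)
toy_counts
```

One thing to keep in mind with this pattern: because it is a left join, an event with no attendees still contributes one row (with a missing attendee), so it would be counted as 1 rather than 0.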
Since both room and time are plain character vectors, the ordering of the axes in this plot is not right. This can be fixed by turning them into factors with the levels set correctly. Here are the ordered vectors for both time and room.
time_levels <- c("9:15am", "9:30am", "11:00am", "11:18am", "11:30am", "11:36am",
                 "11:54am", "12:12pm", "1:15pm", "1:30pm", "1:48pm", "2:00pm",
                 "2:06pm", "2:24pm", "2:42pm", "3:30pm", "3:45pm", "4:00pm",
                 "4:45pm", "4:55pm", "5:00pm", "5:05pm", "5:30pm", "5:35pm",
                 "5:40pm", "5:45pm", "5:50pm", "5:55pm", "6:00pm", "6:05pm",
                 "6:10pm", "6:15pm", "6:20pm", "7:00pm")
room_levels <- c("PLENARY", "2.01", "2.02", "3.01", "3.02", "4.01", "4.02")
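As a quick illustration of why explicit levels help, here is a base R sketch with three of the time labels (the toy_ names are just for this example). Without levels, factor sorts the labels as text; with levels, the order is whatever we specify.

```r
# A few time labels, deliberately out of order.
toy_times <- c("1:30pm", "9:30am", "11:00am")
chronological <- c("9:30am", "11:00am", "1:30pm")

# Default levels: the labels sorted as text, which is not chronological.
levels(factor(toy_times))

# Explicit levels: the order we asked for, which ggplot2 then uses on the axis.
toy_factor <- factor(toy_times, levels = chronological)
levels(toy_factor)
sort(toy_factor)
```

Any value that is missing from the supplied levels becomes NA, so the level vectors need to cover every time and room that should appear in the plot.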
With time_levels and room_levels in place, a single mutate takes care of the ordering:
left_join(events, people, by = "id") %>%
count(id) %>%
left_join(events, by = "id") %>%
mutate(time = factor(time, time_levels),
room = factor(room, room_levels)) %>%
filter(day == "Friday") %>%
ggplot(aes(time, room, size = n)) +
geom_point() +
theme_bw() +
scale_size(range = c(5, 20)) +
labs(title = "useR!2017 Friday schedule",
x = "")
Another way to visualize it is with a stacked bar chart, like so:
p <- left_join(events, people, by = "id") %>%
  count(id) %>%
  left_join(events, by = "id") %>%
  filter(day == "Thursday") %>%
  mutate(time = factor(time, time_levels),
         room = factor(room, rev(room_levels))) %>%
  ggplot(aes(time, fill = room, text = name)) +
  geom_bar(aes(weight = n)) +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "useR!2017 Thursday schedule",
       x = "")
p
Or, for a bit of interactivity, plotly::ggplotly can be used so that it is possible to hover over each event to see its name and size.
ggplotly(p, tooltip = c("n", "name"), width = 700, height = 500)
Network graph
To make ourselves a simple network graph, I will be using the visNetwork package, which has a lovely vignette. First of all, to get a manageable graph, I’ll subset all the Wednesday talks in room 4.02, which hosted the “Shiny I” and “Text Mining” blocks.
sub_data <- left_join(events, people, by = "id") %>%
  filter(day == "Wednesday", room == "4.02") %>%
  dplyr::select(name, attendee, time)
In this graph, I will let each node be an event and let each edge show the degree to which two events share attendees. We start by counting the attendees of each event:
node_size <- sub_data %>%
  group_by(name, time) %>%
  summarize(n = n())
To find how many attendees the events share, we first find all the different pairs of events with the utils::combn function, and then use purrr and inner_join to find how many attendees each pair has in common. Since utils::combn gives its result as a matrix, we have to fiddle just a bit to find our way back to a tibble.
conn <- combn(node_size$name, 2) %>%
  as_tibble() %>%
  map_int(~ inner_join(sub_data %>% filter(name == .x[1]),
                       sub_data %>% filter(name == .x[2]), by = "attendee")
          %>% nrow()) %>%
  rbind(combn(node_size$name, 2)) %>% t() %>% as.tibble()
## Warning: `as.tibble()` was deprecated in tibble 2.0.0.
## Please use `as_tibble()` instead.
## The signature and semantics have changed, see `?as_tibble`.
## Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if `.name_repair` is omitted as of tibble 2.0.0.
## Using compatibility `.name_repair`.
names(conn) <- c("n", "from", "to")
conn
## # A tibble: 45 x 3
## n from to
## <chr> <chr> <chr>
## 1 21 A Tidy Data Model for Natural Lan… bradio: Add data music widgets to y…
## 2 57 A Tidy Data Model for Natural Lan… Developing and deploying large scal…
## 3 82 A Tidy Data Model for Natural Lan… How we built a Shiny App for 700 us…
## 4 84 A Tidy Data Model for Natural Lan… Interacting with databases from Shi…
## 5 84 A Tidy Data Model for Natural Lan… manifestoR - a tool for data journa…
## 6 99 A Tidy Data Model for Natural Lan… Neural Embeddings and NLP with R an…
## 7 83 A Tidy Data Model for Natural Lan… ShinyProxy
## 8 155 A Tidy Data Model for Natural Lan… Text Analysis and Text Mining Using…
## 9 168 A Tidy Data Model for Natural Lan… Text mining, the tidy way
## 10 46 bradio: Add data music widgets to… Developing and deploying large scal…
## # … with 35 more rows
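The pair-counting idea can be sketched in base R on toy data. Note that this sketch swaps inner_join for intersect, so it counts distinct shared names, whereas a join counts matching rows. The two agree as long as no attendee name is repeated within an event. All names and data below are made up.

```r
# Made-up attendance lists for three toy events.
toy_attendance <- list(
  A = c("Ann", "Ben", "Cat"),
  B = c("Ben", "Cat"),
  C = c("Cat", "Dan")
)

# combn() returns a matrix with one column per unordered pair of events.
toy_pairs <- combn(names(toy_attendance), 2)

# For each pair (column), count the attendees the two events have in common.
toy_shared <- apply(toy_pairs, 2, function(p) {
  length(intersect(toy_attendance[[p[1]]], toy_attendance[[p[2]]]))
})

toy_conn <- data.frame(from = toy_pairs[1, ],
                       to = toy_pairs[2, ],
                       n = toy_shared)
toy_conn
```

Building the data frame column by column keeps n numeric, which avoids the conversion back from character that the rbind and t route requires later on.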
For the node tibble, we need to supply an id column, but I will also supply a label (name of the event), size (number of people in the event), and color (which block the event is in: green = Shiny I, blue = Text Mining).
Shiny_I <- c("11:00am", "11:18am", "11:36am", "11:54am", "12:12pm")
Text_Mining <- c("1:30pm", "1:48pm", "2:06pm", "2:24pm", "2:42pm")
nodes <- node_size %>%
  mutate(id = name,
         label = str_wrap(name, width = 20),
         size = n / 20,
         color = case_when(
           time %in% Shiny_I ~ "lightgreen",
           time %in% Text_Mining ~ "lightblue"
         ))
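The coloring rule is just a lookup by time slot. A small sketch with made-up, shortened slot vectors shows how case_when behaves, including for a time that belongs to neither block.

```r
library(dplyr)

# Shortened, made-up versions of the two slot vectors.
toy_shiny <- c("11:00am", "11:18am")
toy_text <- c("1:30pm", "1:48pm")

toy_time <- c("11:00am", "1:48pm", "9:15am")

# The first matching condition wins; no match at all gives NA.
toy_color <- case_when(
  toy_time %in% toy_shiny ~ "lightgreen",
  toy_time %in% toy_text ~ "lightblue"
)
toy_color
```

Since unmatched times come out as NA, any event outside the two listed blocks would end up without an explicit color.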
The edge tibble needs from and to columns that match the id in the node tibble. I will also supply a constant color column (because if omitted, the edges would borrow from the node coloring) and a width column to indicate how many attendees the two events share. This is again done with a couple of left_joins, and the connectivity is the average of the two proportions of shared attendees, n / n_from and n / n_to. Lastly, we remove any edge with a connectivity of 0.5 or less to declutter the graph.
edges <- conn %>%
  left_join(node_size %>%
              dplyr::select(-time) %>%
              rename(n_from = n),
            by = c("from" = "name")) %>%
  left_join(node_size %>%
              dplyr::select(-time) %>%
              rename(n_to = n),
            by = c("to" = "name")) %>%
  mutate(n = as.numeric(n),
         n_to = as.numeric(n_to),
         n_from = as.numeric(n_from),
         connectivity = (n / n_from + n / n_to) / 2,
         width = connectivity * 10,
         color = "grey") %>%
  filter(connectivity > 0.5)
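To make the connectivity measure concrete, here is the same formula wrapped in a small helper and applied to made-up numbers. The function name toy_connectivity is hypothetical and only exists for this sketch.

```r
# Hypothetical helper: average of the two proportions of shared attendees.
toy_connectivity <- function(n_shared, n_from, n_to) {
  (n_shared / n_from + n_shared / n_to) / 2
}

# Made-up example: 30 shared attendees between events of size 40 and 60.
# That is 75% of the first event and 50% of the second.
toy_value <- toy_connectivity(30, 40, 60)
toy_value          # (0.75 + 0.5) / 2
toy_value * 10     # the edge width used in the graph
toy_value > 0.5    # whether the edge would survive the filter
```

Averaging the two proportions keeps the measure between 0 and 1 and symmetric in the two events, so a small event fully contained in a large one still only scores a little above 0.5.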
Plotting the nodes and edges with visNetwork gives us a wonderful graph, which shows a somewhat clear divide between the two blocks.
visNetwork(nodes, edges, width = "100%")
I hope you enjoyed this post, and I would love to see any visualization or analysis you might have regarding this data.
Session information
─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.0.5 (2021-03-31)
 os       macOS Big Sur 10.16
 system   x86_64, darwin17.0
 ui       X11
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       Pacific/Honolulu
 date     2021-07-04
─ Packages ───────────────────────────────────────────────────────────────────
 package     * version    date       lib source
 assertthat    0.2.1      2019-03-21 [1] CRAN (R 4.0.0)
 backports     1.2.1      2020-12-09 [1] CRAN (R 4.0.2)
 blogdown      1.3.2      2021-06-06 [1] Github (rstudio/blogdown@00a2090)
 bookdown      0.22       2021-04-22 [1] CRAN (R 4.0.2)
 broom         0.7.6      2021-04-05 [1] CRAN (R 4.0.2)
 bslib         0.2.4.9003 2021-05-05 [1] Github (rstudio/bslib@ba6a80d)
 cellranger    1.1.0      2016-07-27 [1] CRAN (R 4.0.0)
 cli           3.0.0      2021-06-30 [1] CRAN (R 4.0.2)
 clipr         0.7.1      2020-10-08 [1] CRAN (R 4.0.2)
 codetools     0.2-18     2020-11-04 [1] CRAN (R 4.0.5)
 colorspace    2.0-1      2021-05-04 [1] CRAN (R 4.0.2)
 crayon        1.4.1      2021-02-08 [1] CRAN (R 4.0.2)
 crosstalk     1.1.1      2021-01-12 [1] CRAN (R 4.0.2)
 data.table    1.14.0     2021-02-21 [1] CRAN (R 4.0.2)
 DBI           1.1.1      2021-01-15 [1] CRAN (R 4.0.2)
 dbplyr        2.1.1      2021-04-06 [1] CRAN (R 4.0.2)
 desc          1.3.0      2021-03-05 [1] CRAN (R 4.0.2)
 details     * 0.2.1      2020-01-12 [1] CRAN (R 4.0.0)
 digest        0.6.27     2020-10-24 [1] CRAN (R 4.0.2)
 dplyr       * 1.0.7      2021-06-18 [1] CRAN (R 4.0.2)
 ellipsis      0.3.2      2021-04-29 [1] CRAN (R 4.0.2)
 evaluate      0.14       2019-05-28 [1] CRAN (R 4.0.0)
 expm          0.999-6    2021-01-13 [1] CRAN (R 4.0.2)
 fansi         0.5.0      2021-05-25 [1] CRAN (R 4.0.2)
 farver        2.1.0      2021-02-28 [1] CRAN (R 4.0.2)
 forcats     * 0.5.1      2021-01-27 [1] CRAN (R 4.0.2)
 fs            1.5.0      2020-07-31 [1] CRAN (R 4.0.2)
 generics      0.1.0      2020-10-31 [1] CRAN (R 4.0.2)
 ggplot2     * 3.3.3      2020-12-30 [1] CRAN (R 4.0.2)
 glue          1.4.2      2020-08-27 [1] CRAN (R 4.0.2)
 gtable        0.3.0      2019-03-25 [1] CRAN (R 4.0.0)
 haven         2.4.1      2021-04-23 [1] CRAN (R 4.0.2)
 highr         0.9        2021-04-16 [1] CRAN (R 4.0.2)
 hms           1.1.0      2021-05-17 [1] CRAN (R 4.0.2)
 htmltools     0.5.1.1    2021-01-22 [1] CRAN (R 4.0.2)
 htmlwidgets   1.5.3      2020-12-10 [1] CRAN (R 4.0.2)
 httr          1.4.2      2020-07-20 [1] CRAN (R 4.0.2)
 jquerylib     0.1.4      2021-04-26 [1] CRAN (R 4.0.2)
 jsonlite      1.7.2      2020-12-09 [1] CRAN (R 4.0.2)
 knitr       * 1.33       2021-04-24 [1] CRAN (R 4.0.2)
 labeling      0.4.2      2020-10-20 [1] CRAN (R 4.0.2)
 lattice       0.20-41    2020-04-02 [1] CRAN (R 4.0.5)
 lazyeval      0.2.2      2019-03-15 [1] CRAN (R 4.0.0)
 lifecycle     1.0.0      2021-02-15 [1] CRAN (R 4.0.2)
 ltm         * 1.1-1      2018-04-17 [1] CRAN (R 4.0.2)
 lubridate     1.7.10     2021-02-26 [1] CRAN (R 4.0.2)
 magrittr      2.0.1      2020-11-17 [1] CRAN (R 4.0.2)
 MASS        * 7.3-53.1   2021-02-12 [1] CRAN (R 4.0.5)
 Matrix        1.3-2      2021-01-06 [1] CRAN (R 4.0.5)
 modelr        0.1.8      2020-05-19 [1] CRAN (R 4.0.0)
 msm         * 1.6.8      2019-12-16 [1] CRAN (R 4.0.2)
 munsell       0.5.0      2018-06-12 [1] CRAN (R 4.0.0)
 mvtnorm       1.1-1      2020-06-09 [1] CRAN (R 4.0.0)
 pillar        1.6.1      2021-05-16 [1] CRAN (R 4.0.2)
 pkgconfig     2.0.3      2019-09-22 [1] CRAN (R 4.0.0)
 plotly      * 4.9.2.9000 2020-12-29 [1] Github (ropensci/plotly@e741959)
 png           0.1-7      2013-12-03 [1] CRAN (R 4.0.0)
 polycor     * 0.7-10     2019-08-05 [1] CRAN (R 4.0.2)
 purrr       * 0.3.4      2020-04-17 [1] CRAN (R 4.0.0)
 R6            2.5.0      2020-10-28 [1] CRAN (R 4.0.2)
 Rcpp          1.0.6      2021-01-15 [1] CRAN (R 4.0.2)
 readr       * 1.4.0      2020-10-05 [1] CRAN (R 4.0.2)
 readxl        1.3.1      2019-03-13 [1] CRAN (R 4.0.2)
 reprex        2.0.0      2021-04-02 [1] CRAN (R 4.0.2)
 rlang         0.4.11     2021-04-30 [1] CRAN (R 4.0.2)
 rmarkdown     2.8.6      2021-06-06 [1] Github (rstudio/rmarkdown@9dc5d97)
 rprojroot     2.0.2      2020-11-15 [1] CRAN (R 4.0.2)
 rstudioapi    0.13       2020-11-12 [1] CRAN (R 4.0.2)
 rvest         1.0.0      2021-03-09 [1] CRAN (R 4.0.2)
 sass          0.3.1.9003 2021-05-05 [1] Github (rstudio/sass@6166162)
 scales        1.1.1      2020-05-11 [1] CRAN (R 4.0.0)
 sessioninfo   1.1.1      2018-11-05 [1] CRAN (R 4.0.0)
 stringi       1.6.2      2021-05-17 [1] CRAN (R 4.0.2)
 stringr     * 1.4.0      2019-02-10 [1] CRAN (R 4.0.0)
 survival      3.2-10     2021-03-16 [1] CRAN (R 4.0.5)
 tibble      * 3.1.2      2021-05-16 [1] CRAN (R 4.0.2)
 tidyr       * 1.1.3      2021-03-03 [1] CRAN (R 4.0.2)
 tidyselect    1.1.1      2021-04-30 [1] CRAN (R 4.0.2)
 tidyverse   * 1.3.1      2021-04-15 [1] CRAN (R 4.0.2)
 utf8          1.2.1      2021-03-12 [1] CRAN (R 4.0.2)
 vctrs         0.3.8      2021-04-29 [1] CRAN (R 4.0.2)
 viridisLite   0.4.0      2021-04-13 [1] CRAN (R 4.0.2)
 visNetwork  * 2.0.9      2019-12-06 [1] CRAN (R 4.0.0)
 withr         2.4.2      2021-04-18 [1] CRAN (R 4.0.2)
 xfun          0.23       2021-05-15 [1] CRAN (R 4.0.2)
 xml2          1.3.2      2020-04-23 [1] CRAN (R 4.0.0)
 yaml          2.2.1      2020-02-01 [1] CRAN (R 4.0.0)

[1] /Library/Frameworks/R.framework/Versions/4.0/Resources/library