This code has been lightly revised to make sure it works as of 2018-12-16.
Hello again! I this mini-series (of in-determined length) will I try as best as I can to recreate great visualizations in tidyverse. The recreation may be exact in terms of data or using data of a similar style.
The goal - An annual sunshine record report
I have recently read The Visual Display of Quantitative Information by Edward R Tufte, which I highly recommend. In the book the following chart was displayed which showed the sunshine record for each day of the year.
F.J. Monkhouse and H.R. Wilkinson, Maps and Diagrams (London, third edition 1971), 242-243.
The goal for the rest of this post is to create something similar. Since we don’t have direct access to the data, we will scrape some data for ourselves. All code will be shown together at the end of the post and this gist
R packages
First, we need some packages
library(rvest)
library(tidyverse)
library(lubridate)
library(glue)
library(ehlib) # devtools::install_github("EmilHvitfeldt/ehlib")
The last package is my personal R package ehlib where I store some frequently used functions. If you do not wish to install/load this package just run the following code:
<- function(string, start, end) {
str_between ::str_extract(string,
stringr::str_c(start, '(.*?)', end, collapse = '')) %>%
stringr::str_replace(start, "") %>%
stringr::str_replace(end, "")
stringr
}
<- function(string, pattern) {
str_before ::str_extract(string, stringr::str_c(".+?(?=", pattern, ")"))
stringr }
Data collection
So for this production, we need Weather information. But more specifically we need information about if the sun is shining for various times during the day, preferable for all days of the year. In addition, sunrise and sunset times are also needed.
We will be scraping weather history from wunderground. On the button of the page https://www.wunderground.com/history/airport/KCQT/2018/1/1/DailyHistory.html, we locate a table with “Time” and “Conditions”. Furthermore, both sunrise and sunset times are present on the page.
For the website, we need an airport code, year, month, and day. Airport codes will have to be found manually by browsing the website. For a vector of all the days in a given year, we use the following function that uses
<- function(year) {
all_dates_in if(ymd(glue::glue("{year}0101")) > as.Date(Sys.time())) {
stop("Please select a past or current year.")
}
<- ymd(glue::glue("{year}0101"))
start
if(as.Date(Sys.time()) > ymd(glue::glue("{year}1231"))) {
<- ymd(glue::glue("{year}1231"))
end else {
} <- as.Date(Sys.time())
end
}
seq(start, end, by = "day")
}
this function will work even if you pick a year that has not ended yet. As 2017 has just ended I thought it would be appropriate to look back on that year.
<- 2017
year <- all_dates_in(year)
dates head(dates)
## [1] "2017-01-01" "2017-01-02" "2017-01-03" "2017-01-04" "2017-01-05"
## [6] "2017-01-06"
next, we have a little function that creates a URL from the airport code and the date. For safety, we will wrap that function in purrr::safely
.
<- function(date, code) {
weather_data_html <- str_c("https://www.wunderground.com/history/airport/", code, "/",
url year(date), "/", month(date), "/", mday(date), "/DailyHistory.html")
<- read_html(url)
html_url
}
<- purrr::safely(weather_data_html) weather_data_html
For this code-though will be using airport code KCQT, which is placed in Los Angeles Downtown, CA.
We add some ‘crawl-delay’ of 5 seconds and let it run. Please remember that this will take over 30 minutes to run with a delay in place but we do it to be nice.
<- "KCQT"
airport_code
<- map(dates, ~{
full_data weather_data_html(.x, airport_code)
Sys.sleep(5)
cat(month(.x), "/", mday(.x), "\n", sep = "")
})
We can check whether all of the links went through.
map_lgl(full_data, ~ is.null(.x$error))
Data wrangling
Since we will be working with times quite a lot in the section we will use the lubridate
package for quite some time. In addition to that package, I have devised the following function to turn something of the form “2:51 PM” into the number of minutes after midnight.
<- function(x) {
ampm_minutes as.numeric(str_between(x, ":", " ")) +
as.numeric(str_replace(str_before(x, ":"), "12", "0")) * 60 +
60 * 12 * str_detect(x, "PM")
}
Next, we have the main wrangling function that takes the input, extracts the sunrise, sunset times, and adds them to the table that is also extracted.
<- function(html_url, date) {
data_wrangling
# Sun rise time
<- html_url %>%
sun_rise html_nodes('div[id="astronomy-mod"] table') %>%
html_text() %>%
1] %>%
.[str_between("Time\n\t\t", "\n\t\t")
# Sun set time
<- html_url %>%
sun_set html_nodes('div[id="astronomy-mod"] table') %>%
html_text() %>%
1] %>%
.[str_between("\n\t\t", "\n\t\tCivil")
# Table
<- html_url %>%
table html_nodes('table[id="obsTable"]') %>%
html_table() %>%
1]]
.[[
# Time column standardization
<- any("Time (PDT)" == names(table),
is_daylight "Time (MDT)" == names(table),
"Time (CDT)" == names(table),
"Time (EDT)" == names(table))
<- str_c("Time", c(" (PDT)", " (MDT)", " (CDT)", " (EDT)",
time_names " (PST)", " (MST)", " (CST)", " (EST)"))
names(table) <- if_else(names(table) %in% time_names,
"Time",
names(table))
%>%
table mutate(sun_set = sun_set,
sun_rise = sun_rise,
date = date,
yday = yday(date),
day_minutes = ampm_minutes(Time) - is_daylight * 60,
set_minutes = ampm_minutes(sun_set) - is_daylight * 60,
rise_minutes = ampm_minutes(sun_rise) - is_daylight * 60,
sun_up = day_minutes > (rise_minutes + 90) &
< (set_minutes - 30))
day_minutes }
In this function, we arbitrarily decide that the sun is up if it is 90 minutes after sunrise and 30 minutes before sunset. This is done because our future visualization is being made with rectangles and the lag
function, and to ensure that all the sunshine hours are within sunset and sunrise we have to put in some restrains.
It seems that the 30th of October doesn’t have hourly history data available so we will exclude it in the following:
<- map2_df(full_data[-303], dates[-303], ~ .x$result %>%
full_data2 data_wrangling(.y))
At this point, it would be wise to save our data.
save(full_data2, file = glue("{airport_code}-{year}.Rdata"))
Plotting data
Now that we have all the data we need it is time to turn our heads to ggplot2. But before we do that let us create some axis breaks that we will need.
<- dates %>% month() %>% table() %>% cumsum()
x_axis names(x_axis) <- month.abb[1:12]
<- 1:24 * 60
y_axis names(y_axis) <- str_c(c(12, rep(1:12, 2, length.out = 23)),
rep(c("AM", "PM"), each = 12))
So we start by creating a new condition for “Clear”, creating a new day_minutes variable to act as the other side for our sunshine rectangles and lastly remove all the observations where the sun isn’t up. Using geom_rect()
to create all the little rectangles and geom_line()
’s to show the sun set and sun rise, we lastly fiddle a little with the theme giving us the final result:
%>%
full_data2 mutate(con = Conditions == "Clear",
day_minutes2 = lag(day_minutes)) %>%
filter(sun_up) %>%
ggplot(aes(fill = con)) +
geom_rect(aes(xmin = yday, xmax = yday + 1,
ymin = day_minutes, ymax = day_minutes2)) +
geom_line(aes(yday, set_minutes)) +
geom_line(aes(yday, rise_minutes)) +
scale_fill_manual(values = c("grey40", NA)) +
theme_minimal() +
guides(fill = "none") +
theme(
panel.grid.major.y = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.text.x.bottom = element_text(hjust = 1.7)
+
) scale_x_continuous(breaks = x_axis, position = "right") +
scale_y_continuous(breaks = y_axis, limits = c(0, 24 * 60)) +
labs(x = NULL, y = NULL, title = "Sunshine report of Los Angeles 2017")
## Warning: Position guide is perpendicular to the intended axis. Did you mean to
## specify a different guide `position`?
Extra
## Warning: Position guide is perpendicular to the intended axis. Did you mean to
## specify a different guide `position`?
Code
library(rvest)
library(tidyverse)
library(lubridate)
library(glue)
#library(ehlib) # devtools::install_github("EmilHvitfeldt/ehlib")
<- function(string, start, end) {
str_between ::str_extract(string,
stringr::str_c(start, '(.*?)', end, collapse = '')) %>%
stringr::str_replace(start, "") %>%
stringr::str_replace(end, "")
stringr
}
<- function(string, pattern) {
str_before ::str_extract(string, stringr::str_c(".+?(?=", pattern, ")"))
stringr
}
<- function(year) {
all_dates_in if(ymd(glue::glue("{year}0101")) > as.Date(Sys.time())) {
stop("Please select a past or current year.")
}
<- ymd(glue::glue("{year}0101"))
start
if(as.Date(Sys.time()) > ymd(glue::glue("{year}1231"))) {
<- ymd(glue::glue("{year}1231"))
end else {
} <- as.Date(Sys.time())
end
}
seq(start, end, by = "day")
}
<- "KCQT"
airport_code
<- map(dates, ~{
full_data weather_data_html(.x, airport_code)
Sys.sleep(5)
cat(month(dates), "/", mday(dates), "\n", sep = "")
})
map_lgl(full_data, ~ is.null(.x$error))
<- function(x) {
ampm_minutes as.numeric(str_between(x, ":", " ")) +
as.numeric(str_replace(str_before(x, ":"), "12", "0")) * 60 +
60 * 12 * str_detect(x, "PM")
}
<- function(html_url, date) {
data_wrangling
# Sun rise time
<- html_url %>%
sun_rise html_nodes('div[id="astronomy-mod"] table') %>%
html_text() %>%
1] %>%
.[str_between("Time\n\t\t", "\n\t\t")
# Sun set time
<- html_url %>%
sun_set html_nodes('div[id="astronomy-mod"] table') %>%
html_text() %>%
1] %>%
.[str_between("\n\t\t", "\n\t\tCivil")
# Table
<- html_url %>%
table html_nodes('table[id="obsTable"]') %>%
html_table() %>%
1]]
.[[
# Time column standardization
<- any("Time (PDT)" == names(table),
is_daylight "Time (MDT)" == names(table),
"Time (CDT)" == names(table),
"Time (EDT)" == names(table))
<- str_c("Time", c(" (PDT)", " (MDT)", " (CDT)", " (EDT)",
time_names " (PST)", " (MST)", " (CST)", " (EST)"))
names(table) <- if_else(names(table) %in% time_names,
"Time",
names(table))
%>%
table mutate(sun_set = sun_set,
sun_rise = sun_rise,
date = date,
yday = yday(date),
day_minutes = ampm_minutes(Time) - is_daylight * 60,
set_minutes = ampm_minutes(sun_set) - is_daylight * 60,
rise_minutes = ampm_minutes(sun_rise) - is_daylight * 60,
sun_up = day_minutes > (rise_minutes + 90) &
< (set_minutes - 30))
day_minutes
}
<- map2_df(full_data[-303], dates[-303], ~ .x$result %>%
full_data2 data_wrangling(.y))
<- dates %>% month() %>% table() %>% cumsum()
x_axis names(x_axis) <- month.abb[1:12]
<- 1:24 * 60
y_axis names(y_axis) <- str_c(c(12, rep(1:12, 2, length.out = 23)),
rep(c("AM", "PM"), each = 12))
%>%
full_data2 mutate(con = Conditions == "Clear",
day_minutes2 = lag(day_minutes)) %>%
filter(sun_up) %>%
ggplot(aes(fill = con)) +
geom_rect(aes(xmin = yday, xmax = yday + 1,
ymin = day_minutes, ymax = day_minutes2)) +
geom_line(aes(yday, set_minutes)) +
geom_line(aes(yday, rise_minutes)) +
scale_fill_manual(values = c("grey40", NA)) +
theme_minimal() +
guides(fill = "none") +
theme(
panel.grid.major.y = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.text.x.bottom = element_text(hjust = 1.7)
+
) scale_x_continuous(breaks = x_axis, position = "right") +
scale_y_continuous(breaks = y_axis, limits = c(0, 24 * 60)) +
labs(x = NULL, y = NULL, title = "Sunshine report of Los Angeles 2017")
session information
─ Session info ───────────────────────────────────────────────────────────────
setting value 4.0.5 (2021-03-31)
version R version 10.16
os macOS Big Sur .0
system x86_64, darwin17
ui X11 language (EN)
-8
collate en_US.UTF-8
ctype en_US.UTF/Honolulu
tz Pacific2021-07-05
date
─ Packages ───────────────────────────────────────────────────────────────────* version date lib source
package 0.2.1 2019-03-21 [1] CRAN (R 4.0.0)
assertthat 1.2.1 2020-12-09 [1] CRAN (R 4.0.2)
backports 1.3 2021-04-14 [1] CRAN (R 4.0.2)
blogdown 0.22 2021-04-22 [1] CRAN (R 4.0.2)
bookdown 0.7.6 2021-04-05 [1] CRAN (R 4.0.2)
broom 0.2.5.1 2021-05-18 [1] CRAN (R 4.0.2)
bslib 1.1.0 2016-07-27 [1] CRAN (R 4.0.0)
cellranger 3.0.0 2021-06-30 [1] CRAN (R 4.0.2)
cli 0.7.1 2020-10-08 [1] CRAN (R 4.0.2)
clipr 2.0-2 2021-06-24 [1] CRAN (R 4.0.2)
colorspace 1.4.1 2021-02-08 [1] CRAN (R 4.0.2)
crayon 1.1.1 2021-01-15 [1] CRAN (R 4.0.2)
DBI 2.1.1 2021-04-06 [1] CRAN (R 4.0.2)
dbplyr 1.3.0 2021-03-05 [1] CRAN (R 4.0.2)
desc * 0.2.1 2020-01-12 [1] CRAN (R 4.0.0)
details 0.6.27 2020-10-24 [1] CRAN (R 4.0.2)
digest * 1.0.7 2021-06-18 [1] CRAN (R 4.0.2)
dplyr * 0.2.7 2021-07-05 [1] Github (emilhvitfeldt/ehlib@8c40172)
ehlib 0.3.2 2021-04-29 [1] CRAN (R 4.0.2)
ellipsis 0.14 2019-05-28 [1] CRAN (R 4.0.0)
evaluate 0.5.0 2021-05-25 [1] CRAN (R 4.0.2)
fansi * 0.5.1 2021-01-27 [1] CRAN (R 4.0.2)
forcats 1.5.0 2020-07-31 [1] CRAN (R 4.0.2)
fs 0.1.0 2020-10-31 [1] CRAN (R 4.0.2)
generics * 3.3.5 2021-06-25 [1] CRAN (R 4.0.2)
ggplot2 * 1.4.2 2020-08-27 [1] CRAN (R 4.0.2)
glue 0.3.0 2019-03-25 [1] CRAN (R 4.0.0)
gtable 2.4.1 2021-04-23 [1] CRAN (R 4.0.2)
haven 1.1.0 2021-05-17 [1] CRAN (R 4.0.2)
hms 0.5.1.1 2021-01-22 [1] CRAN (R 4.0.2)
htmltools 1.4.2 2020-07-20 [1] CRAN (R 4.0.2)
httr 0.1.4 2021-04-26 [1] CRAN (R 4.0.2)
jquerylib 1.7.2 2020-12-09 [1] CRAN (R 4.0.2)
jsonlite * 1.33 2021-04-24 [1] CRAN (R 4.0.2)
knitr 1.0.0 2021-02-15 [1] CRAN (R 4.0.2)
lifecycle * 1.7.10 2021-02-26 [1] CRAN (R 4.0.2)
lubridate 2.0.1 2020-11-17 [1] CRAN (R 4.0.2)
magrittr 0.1.8 2020-05-19 [1] CRAN (R 4.0.0)
modelr 0.5.0 2018-06-12 [1] CRAN (R 4.0.0)
munsell 1.6.1 2021-05-16 [1] CRAN (R 4.0.2)
pillar 2.0.3 2019-09-22 [1] CRAN (R 4.0.0)
pkgconfig 0.1-7 2013-12-03 [1] CRAN (R 4.0.0)
png * 0.3.4 2020-04-17 [1] CRAN (R 4.0.0)
purrr 2.5.0 2020-10-28 [1] CRAN (R 4.0.2)
R6 1.0.6 2021-01-15 [1] CRAN (R 4.0.2)
Rcpp * 1.4.0 2020-10-05 [1] CRAN (R 4.0.2)
readr 1.3.1 2019-03-13 [1] CRAN (R 4.0.2)
readxl 2.0.0 2021-04-02 [1] CRAN (R 4.0.2)
reprex 0.4.11 2021-04-30 [1] CRAN (R 4.0.2)
rlang 2.9 2021-06-15 [1] CRAN (R 4.0.2)
rmarkdown 2.0.2 2020-11-15 [1] CRAN (R 4.0.2)
rprojroot 0.13 2020-11-12 [1] CRAN (R 4.0.2)
rstudioapi * 1.0.0 2021-03-09 [1] CRAN (R 4.0.2)
rvest 0.4.0 2021-05-12 [1] CRAN (R 4.0.2)
sass 1.1.1 2020-05-11 [1] CRAN (R 4.0.0)
scales 1.1.1 2018-11-05 [1] CRAN (R 4.0.0)
sessioninfo 1.6.2 2021-05-17 [1] CRAN (R 4.0.2)
stringi * 1.4.0 2019-02-10 [1] CRAN (R 4.0.0)
stringr * 3.1.2 2021-05-16 [1] CRAN (R 4.0.2)
tibble * 1.1.3 2021-03-03 [1] CRAN (R 4.0.2)
tidyr 1.1.1 2021-04-30 [1] CRAN (R 4.0.2)
tidyselect * 1.3.1 2021-04-15 [1] CRAN (R 4.0.2)
tidyverse 1.2.1 2021-03-12 [1] CRAN (R 4.0.2)
utf8 0.3.8 2021-04-29 [1] CRAN (R 4.0.2)
vctrs 2.4.2 2021-04-18 [1] CRAN (R 4.0.2)
withr 0.23 2021-05-15 [1] CRAN (R 4.0.2)
xfun 1.3.2 2020-04-23 [1] CRAN (R 4.0.0)
xml2 2.2.1 2020-02-01 [1] CRAN (R 4.0.0)
yaml
1] /Library/Frameworks/R.framework/Versions/4.0/Resources/library [