This blog post is going to describe how to write a customizable profiling function. If you are not familiar with profiling read the Profiling section of Advanced R to familiarize yourself, I’ll wait.
…
Welcome back!
Packages
While these packages aren’t strictly needed since most of what we are doing is happening in base R, am I still loading in tidyverse
to do some easier string manipulations and plotting.
library(tidyverse)
Profiling basics
You have properly used the profvis package. It is an amazing package and I use it on a daily basis. However, the amount of information you get can be overwhelming at times depending on your profiling goals.
Let’s propose in this scenario that we take in some data, scale and center it, apply PCA while only keeping the components that explain 90% of the variance and lastly apply CLARA clustering and return the classification.
The code to do that is contained in the following chunk.
<- function(x, threshold) {
pca_threshold <- prcomp(x, scale. = TRUE)
data_pca <- sum(data_pca$sdev ^ 2)
total_var <- which.max(cumsum(data_pca$sdev ^ 2 / total_var) >= threshold)
num_comp $x[, seq_len(num_comp)]
data_pca
}
<- function(x, threshold = 0.9, centers = 2) {
pca_kmeans <- as.matrix(x)
data_matrix <- pca_threshold(data_matrix, threshold = threshold)
data_pca <- cluster::clara(data_pca, k = centers)
data_kmeans $cluster
data_kmeans }
Now we create some data and run profvis on it
<- diamonds %>%
large_data select_if(is.numeric) %>%
sample_n(100000, replace = TRUE)
::profvis({
profvispca_kmeans(large_data)
})
And we get the following information back.
The Problem
It is very informative, but it is also giving a LOT of information. Let’s propose we want to know the percentage of the computation time is used to do the PCA calculations. In the profvis
framework you would need to do the calculation manually. All while waiting for the HTML widget to load.
The Idea
profvis
uses the Rprof
function internally to inspect what is happening. By using Rprof
directly we can extract the profile and calculate our output/matrix.
The base profiling steps are
<- tempfile()
tmp Rprof(tmp)
##################
# Code goes here #
##################
Rprof(NULL)
<- readLines(tmp) profile
This chunk will set up a temporary file, start the profiler and set it to write to the temporary file, stop the profiler and read the result from the profiler.
Trying it with our code we get
<- tempfile()
tmp Rprof(tmp)
<- pca_kmeans(large_data)
x Rprof(NULL)
<- readLines(tmp)
profile
head(profile)
## [1] "sample.interval=20000"
## [2] "\"aperm.default\" \"aperm\" \"apply\" \"scale.default\" \"scale\" \"prcomp.default\" \"prcomp\" \"pca_threshold\" \"pca_kmeans\" "
## [3] "\"is.na\" \"FUN\" \"apply\" \"scale.default\" \"scale\" \"prcomp.default\" \"prcomp\" \"pca_threshold\" \"pca_kmeans\" "
## [4] "\"is.na\" \"FUN\" \"apply\" \"scale.default\" \"scale\" \"prcomp.default\" \"prcomp\" \"pca_threshold\" \"pca_kmeans\" "
## [5] "\"is.na\" \"FUN\" \"apply\" \"scale.default\" \"scale\" \"prcomp.default\" \"prcomp\" \"pca_threshold\" \"pca_kmeans\" "
## [6] "\"is.na\" \"FUN\" \"apply\" \"scale.default\" \"scale\" \"prcomp.default\" \"prcomp\" \"pca_threshold\" \"pca_kmeans\" "
Let’s see what these lines mean. First, we notice that the first line is just denoting the sample interval, so we can ignore that for now. Let’s look at the next line
## [1] "\"aperm.default\" \"aperm\" \"apply\" \"scale.default\" \"scale\" \"prcomp.default\" \"prcomp\" \"pca_threshold\" \"pca_kmeans\" "
This is a snapshot of the “call-stack”, and it reads inside-out. So we have that aperm.default
is called inside aperm
which is called inside apply
which is called inside scale.default
and so on and so forth all the way up to pca_kmeans
.
Now that we know how Rprof
works, we can write some code that checks whether “pca_threshold” appears in the call stack and then find the percentage.
The Solution
We can now create a function that will calculate the percentage of the time is being spent in a certain function.
<- function(expr, pattern) {
prof_procentage <- tempfile()
tmp Rprof(tmp)
exprRprof(NULL)
<- readLines(tmp)[-1]
profile
mean(grepl(pattern, profile))
}
This function can now easily be used in our calculation.
prof_procentage(
<- pca_kmeans(large_data),
x pattern = "pca_threshold"
)## [1] 0.875
And this is how to create a custom profiler. Simply modify the last line in the skeleton function prof_procentage
to change its behavior.
the Extensions
The sky’s the limit! you are only limited by your regex abilities. You can also change the output. In the last example, I returned a numeric of the percentage, But we can also have the output be a plot
<- function(expr, pattern) {
prof_procentage_plot <- tempfile()
tmp Rprof(tmp)
exprRprof(NULL)
<- readLines(tmp)[-1]
profile
data.frame(x = grepl(pattern, profile)) %>%
ggplot(aes(x)) +
geom_bar()
}
prof_procentage_plot(
<- pca_kmeans(large_data),
x pattern = "pca_threshold"
)
The follow-up
After my initial announcement of this post, I got a helpful tweet from Hadley Wickham about the profvis::parse_rprof()
. In essence, it will help you parse the file you write with Rprof
to help you get to your answer faster and safer.
So some output like
## [1] "sample.interval=20000"
## [2] "\"aperm.default\" \"aperm\" \"apply\" \"scale.default\" \"scale\" \"prcomp.default\" \"prcomp\" \"pca_threshold\" \"pca_kmeans\" "
## [3] "\"is.na\" \"FUN\" \"apply\" \"scale.default\" \"scale\" \"prcomp.default\" \"prcomp\" \"pca_threshold\" \"pca_kmeans\" "
## [4] "\"is.na\" \"FUN\" \"apply\" \"scale.default\" \"scale\" \"prcomp.default\" \"prcomp\" \"pca_threshold\" \"pca_kmeans\" "
## [5] "\"is.na\" \"FUN\" \"apply\" \"scale.default\" \"scale\" \"prcomp.default\" \"prcomp\" \"pca_threshold\" \"pca_kmeans\" "
## [6] "\"is.na\" \"FUN\" \"apply\" \"scale.default\" \"scale\" \"prcomp.default\" \"prcomp\" \"pca_threshold\" \"pca_kmeans\" "
Will be transformed to a nice data.frame with profvis::parse_rprof()
## time depth label filenum linenum memalloc meminc filename
## 1 1 9 aperm.default NA NA 0 0 <NA>
## 2 1 8 aperm NA NA 0 0 <NA>
## 3 1 7 apply NA NA 0 0 <NA>
## 4 1 6 scale.default NA NA 0 0 <NA>
## 5 1 5 scale NA NA 0 0 <NA>
## 6 1 4 prcomp.default NA NA 0 0 <NA>
## 7 1 3 prcomp NA NA 0 0 <NA>
## 8 1 2 pca_threshold NA NA 0 0 <NA>
## 9 1 1 pca_kmeans NA NA 0 0 <NA>
## 10 2 9 is.na NA NA 0 0 <NA>
## 11 2 8 FUN NA NA 0 0 <NA>
## 12 2 7 apply NA NA 0 0 <NA>
## 13 2 6 scale.default NA NA 0 0 <NA>
## 14 2 5 scale NA NA 0 0 <NA>
## 15 2 4 prcomp.default NA NA 0 0 <NA>
## 16 2 3 prcomp NA NA 0 0 <NA>
## 17 2 2 pca_threshold NA NA 0 0 <NA>
## 18 2 1 pca_kmeans NA NA 0 0 <NA>
## 19 3 9 is.na NA NA 0 0 <NA>
## 20 3 8 FUN NA NA 0 0 <NA>
## 21 3 7 apply NA NA 0 0 <NA>
## 22 3 6 scale.default NA NA 0 0 <NA>
## 23 3 5 scale NA NA 0 0 <NA>
## 24 3 4 prcomp.default NA NA 0 0 <NA>
## 25 3 3 prcomp NA NA 0 0 <NA>
## 26 3 2 pca_threshold NA NA 0 0 <NA>
## 27 3 1 pca_kmeans NA NA 0 0 <NA>
## 28 4 9 is.na NA NA 0 0 <NA>
## 29 4 8 FUN NA NA 0 0 <NA>
## 30 4 7 apply NA NA 0 0 <NA>
## 31 4 6 scale.default NA NA 0 0 <NA>
## 32 4 5 scale NA NA 0 0 <NA>
## 33 4 4 prcomp.default NA NA 0 0 <NA>
## 34 4 3 prcomp NA NA 0 0 <NA>
## 35 4 2 pca_threshold NA NA 0 0 <NA>
## 36 4 1 pca_kmeans NA NA 0 0 <NA>
## 37 5 9 is.na NA NA 0 0 <NA>
## 38 5 8 FUN NA NA 0 0 <NA>
## 39 5 7 apply NA NA 0 0 <NA>
## 40 5 6 scale.default NA NA 0 0 <NA>
## 41 5 5 scale NA NA 0 0 <NA>
## 42 5 4 prcomp.default NA NA 0 0 <NA>
## 43 5 3 prcomp NA NA 0 0 <NA>
## 44 5 2 pca_threshold NA NA 0 0 <NA>
## 45 5 1 pca_kmeans NA NA 0 0 <NA>
session information
─ Session info ───────────────────────────────────────────────────────────────
setting value 4.1.0 (2021-05-18)
version R version 10.16
os macOS Big Sur .0
system x86_64, darwin17
ui X11 language (EN)
-8
collate en_US.UTF-8
ctype en_US.UTF/Los_Angeles
tz America2021-07-15
date
─ Packages ───────────────────────────────────────────────────────────────────* version date lib source
package 0.2.1 2019-03-21 [1] CRAN (R 4.1.0)
assertthat 1.2.1 2020-12-09 [1] CRAN (R 4.1.0)
backports 1.3.2 2021-06-09 [1] Github (rstudio/blogdown@00a2090)
blogdown 0.22 2021-04-22 [1] CRAN (R 4.1.0)
bookdown 0.7.8 2021-06-24 [1] CRAN (R 4.1.0)
broom 0.2.5.1 2021-05-18 [1] CRAN (R 4.1.0)
bslib 1.1.0 2016-07-27 [1] CRAN (R 4.1.0)
cellranger 3.0.0 2021-06-30 [1] CRAN (R 4.1.0)
cli 0.7.1 2020-10-08 [1] CRAN (R 4.1.0)
clipr 2.1.2 2021-04-17 [1] CRAN (R 4.1.0)
cluster 0.2-18 2020-11-04 [1] CRAN (R 4.1.0)
codetools 2.0-2 2021-06-24 [1] CRAN (R 4.1.0)
colorspace 1.4.1 2021-02-08 [1] CRAN (R 4.1.0)
crayon 1.1.1 2021-01-15 [1] CRAN (R 4.1.0)
DBI 2.1.1 2021-04-06 [1] CRAN (R 4.1.0)
dbplyr 1.3.0 2021-03-05 [1] CRAN (R 4.1.0)
desc * 0.2.1 2020-01-12 [1] CRAN (R 4.1.0)
details 0.6.27 2020-10-24 [1] CRAN (R 4.1.0)
digest * 1.0.7 2021-06-18 [1] CRAN (R 4.1.0)
dplyr 0.3.2 2021-04-29 [1] CRAN (R 4.1.0)
ellipsis 0.14 2019-05-28 [1] CRAN (R 4.1.0)
evaluate 0.5.0 2021-05-25 [1] CRAN (R 4.1.0)
fansi 2.1.0 2021-02-28 [1] CRAN (R 4.1.0)
farver * 0.5.1 2021-01-27 [1] CRAN (R 4.1.0)
forcats 1.5.0 2020-07-31 [1] CRAN (R 4.1.0)
fs 0.1.0 2020-10-31 [1] CRAN (R 4.1.0)
generics * 3.3.5 2021-06-25 [1] CRAN (R 4.1.0)
ggplot2 1.4.2 2020-08-27 [1] CRAN (R 4.1.0)
glue 0.3.0 2019-03-25 [1] CRAN (R 4.1.0)
gtable 2.4.1 2021-04-23 [1] CRAN (R 4.1.0)
haven 0.9 2021-04-16 [1] CRAN (R 4.1.0)
highr 1.1.0 2021-05-17 [1] CRAN (R 4.1.0)
hms 0.5.1.1 2021-01-22 [1] CRAN (R 4.1.0)
htmltools 1.4.2 2020-07-20 [1] CRAN (R 4.1.0)
httr 0.1.4 2021-04-26 [1] CRAN (R 4.1.0)
jquerylib 1.7.2 2020-12-09 [1] CRAN (R 4.1.0)
jsonlite * 1.33 2021-04-24 [1] CRAN (R 4.1.0)
knitr 0.4.2 2020-10-20 [1] CRAN (R 4.1.0)
labeling 1.0.0 2021-02-15 [1] CRAN (R 4.1.0)
lifecycle 1.7.10 2021-02-26 [1] CRAN (R 4.1.0)
lubridate 2.0.1 2020-11-17 [1] CRAN (R 4.1.0)
magrittr 0.1.8 2020-05-19 [1] CRAN (R 4.1.0)
modelr 0.5.0 2018-06-12 [1] CRAN (R 4.1.0)
munsell 1.6.1 2021-05-16 [1] CRAN (R 4.1.0)
pillar 2.0.3 2019-09-22 [1] CRAN (R 4.1.0)
pkgconfig 0.1-7 2013-12-03 [1] CRAN (R 4.1.0)
png * 0.3.4 2020-04-17 [1] CRAN (R 4.1.0)
purrr 2.5.0 2020-10-28 [1] CRAN (R 4.1.0)
R6 1.0.7 2021-07-07 [1] CRAN (R 4.1.0)
Rcpp * 1.4.0 2020-10-05 [1] CRAN (R 4.1.0)
readr 1.3.1 2019-03-13 [1] CRAN (R 4.1.0)
readxl 2.0.0 2021-04-02 [1] CRAN (R 4.1.0)
reprex 0.4.11 2021-04-30 [1] CRAN (R 4.1.0)
rlang 2.9 2021-06-15 [1] CRAN (R 4.1.0)
rmarkdown 2.0.2 2020-11-15 [1] CRAN (R 4.1.0)
rprojroot 0.13 2020-11-12 [1] CRAN (R 4.1.0)
rstudioapi 1.0.0 2021-03-09 [1] CRAN (R 4.1.0)
rvest 0.4.0 2021-05-12 [1] CRAN (R 4.1.0)
sass 1.1.1 2020-05-11 [1] CRAN (R 4.1.0)
scales 1.1.1 2018-11-05 [1] CRAN (R 4.1.0)
sessioninfo 1.6.2 2021-05-17 [1] CRAN (R 4.1.0)
stringi * 1.4.0 2019-02-10 [1] CRAN (R 4.1.0)
stringr * 3.1.2 2021-05-16 [1] CRAN (R 4.1.0)
tibble * 1.1.3 2021-03-03 [1] CRAN (R 4.1.0)
tidyr 1.1.1 2021-04-30 [1] CRAN (R 4.1.0)
tidyselect * 1.3.1 2021-04-15 [1] CRAN (R 4.1.0)
tidyverse 1.2.1 2021-03-12 [1] CRAN (R 4.1.0)
utf8 0.3.8 2021-04-29 [1] CRAN (R 4.1.0)
vctrs 2.4.2 2021-04-18 [1] CRAN (R 4.1.0)
withr 0.24 2021-06-15 [1] CRAN (R 4.1.0)
xfun 1.3.2 2020-04-23 [1] CRAN (R 4.1.0)
xml2 2.2.1 2020-02-01 [1] CRAN (R 4.1.0)
yaml
1] /Library/Frameworks/R.framework/Versions/4.1/Resources/library [