Date last run: 14Mar2020
In an RStudio Community message someone asked how to retrieve a table from a webpage when that table is generated by JavaScript. The problem was that the page source did not contain the table itself but only a reference to the JavaScript code. Because I was busy with a similar project, I decided to see if I could solve it.
A suggested solution was described in a Stack Overflow entry, but it worked neither for the questioner nor for me. In that entry camile mentioned Selenium.
Therefore I decided to use the R package RSelenium. The following code extracts the table. The only problem is that it does not free the port: after running the code it is necessary to restart RStudio. Restarting the R session or closing the R project (when more sessions are open) is not enough to free the port. In my latest experiments I could no longer create this blog entry until I restarted the computer. I begin to see the attraction of Docker for these use cases.
HOQCutil::silent_library(c('RSelenium','rvest'))
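# start a Selenium server and a Firefox browser session
rD <- rsDriver(browser = 'firefox', port = 4567L, verbose = FALSE)
remDr <- rD[["client"]]
pest.name <- "saperda+tridentata"
url <- paste0("https://gd.eppo.int/search?k=", pest.name)
remDr$navigate(url)
# make sure the top-level document is selected
remDr$switchToFrame(NULL)
# parse the rendered page source and take the first table
doc <- xml2::read_html(remDr$getPageSource()[[1]])
df <- rvest::html_table(doc)[[1]]
# close the browser session
remDr$close()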
# stop the selenium server
rD[["server"]]$stop()
#> [1] TRUE
rm(rD)
gc(verbose = FALSE)
#>           used (Mb) gc trigger (Mb) max used (Mb)
#> Ncells  799500 42.7    1561221 83.4  1134935 60.7
#> Vcells 1421543 10.9    8388608 64.0  2309339 17.7
# the port is still in use (it becomes available again only after restarting RStudio)
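A workaround that is often suggested for a port that stays occupied is to end the Java process that runs the Selenium server. The sketch below was not verified for this post and rests on assumptions: that a leftover `java.exe` process is what keeps port 4567 in use, and that the code runs on Windows (as in the session info of this post). `free_selenium_port` is a hypothetical helper name, not part of RSelenium. Be aware that `taskkill /im java.exe /f` ends every Java process on the machine, not only the Selenium one.

```r
# Hypothetical helper (not part of RSelenium): try to free the Selenium port
# by ending the Java process that runs the server.
# Assumption: a leftover java.exe process is what keeps the port occupied.
free_selenium_port <- function() {
  if (.Platform$OS.type != "windows") {
    message("taskkill is only available on Windows; nothing done")
    return(invisible(FALSE))
  }
  # /im selects processes by image name, /f forces termination;
  # this ends ALL running Java processes, so use it with care
  status <- system("taskkill /im java.exe /f",
                   ignore.stdout = TRUE, ignore.stderr = TRUE)
  # TRUE only when taskkill reports success (exit status 0)
  invisible(status == 0)
}
```

The helper would be called as `free_selenium_port()` after `rD[["server"]]$stop()`. Whether it makes the restart of RStudio unnecessary has to be checked on the system at hand.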
The table:
knitr::kable(df)
EPPOCode | Name | Type | Language | Preferred |
---|---|---|---|---|
SAPETR | Saperda tridentata | animal | Scientific | NA |
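The parsing step of the code (`read_html` followed by `html_table`) can be tried without Selenium. The sketch below is only an illustration: the string `page_source` is a hand-written stand-in for the result of `remDr$getPageSource()[[1]]`, reduced to two columns of the table above, and `df_demo` is a name chosen for this example.

```r
library(xml2)
library(rvest)

# Stand-in for the page source returned by Selenium after the JavaScript
# has run: the table is now present in the HTML itself.
page_source <- paste0(
  "<html><body><table>",
  "<tr><th>EPPOCode</th><th>Name</th></tr>",
  "<tr><td>SAPETR</td><td>Saperda tridentata</td></tr>",
  "</table></body></html>"
)

# parse the HTML text into a document
doc_demo <- xml2::read_html(page_source)
# html_table() on a whole document returns a list with one element per table
tables <- rvest::html_table(doc_demo)
# the first (and here only) table
df_demo <- tables[[1]]
```

With the original page, the same two calls applied to the unrendered source find no such table, which is why the browser session is needed first.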
Session Info
This document was produced on 14Mar2020 with the following R environment:
#> R version 3.6.0 (2019-04-26)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 18363)
#>
#> Matrix products: default
#>
#> locale:
#> [1] LC_COLLATE=English_United States.1252
#> [2] LC_CTYPE=English_United States.1252
#> [3] LC_MONETARY=English_United States.1252
#> [4] LC_NUMERIC=C
#> [5] LC_TIME=English_United States.1252
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] rvest_0.3.5 xml2_1.2.5 RSelenium_1.7.7
#>
#> loaded via a namespace (and not attached):
#> [1] Rcpp_1.0.3 knitr_1.28 magrittr_1.5 rappdirs_0.3.1
#> [5] HOQCutil_0.1.19 R6_2.4.1 rlang_0.4.5 highr_0.8
#> [9] httr_1.4.1 stringr_1.4.0 caTools_1.17.1.1 tools_3.6.0
#> [13] xfun_0.10 binman_0.1.1 selectr_0.4-1 semver_0.2.0
#> [17] htmltools_0.4.0 askpass_1.1 yaml_2.2.0 openssl_1.4.1
#> [21] digest_0.6.23 assertthat_0.2.1 processx_3.4.1 purrr_0.3.3
#> [25] ps_1.3.0 bitops_1.0-6 curl_4.3 glue_1.3.1
#> [29] evaluate_0.14 wdman_0.2.5 rmarkdown_2.1 stringi_1.4.6
#> [33] compiler_3.6.0 XML_3.98-1.20 jsonlite_1.6.1