Date last run: 17Oct2019
Introduction
While browsing the internet I found 21 Recipes for Mining Twitter Data with rtweet by Bob Rudis and Paul Campbell. The corresponding GitHub repository points to a blog entry with some background material. In this article I try to replay some of the recipes.
In two other articles I described how to find out which URLs are generated (url generation in the rtweet package) and how the program flow in the rtweet functions works (Program flow in the rtweet package).
Recipe 1 Using OAuth to Access Twitter APIs
After installing the package rtweet I use the silent_library function (see appendix) to load the package. Following the vignette and Recipe 1 I created a Twitter application that I called HOQC_31415. See (url generation in the rtweet package).
`%>%` <- magrittr::`%>%`
silent_library('rtweet')
token <- rtweet::create_token(
  app = "HOQC_31415",
  ...                      # consumer and access keys omitted here
)
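Once created, the token is cached by rtweet so that later sessions can reuse it. A minimal check (hedged: get_token should return the stored token):
# hedged sanity check: the cached token should be found automatically
rtweet::get_token()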
Recipe 2 Looking Up the Trending Topics
Recipe 2 describes how we can look up the trending topics in certain regions. The following functions exist:
- rtweet::trends_available() : to find the names and WOEID codes of regions for which trending information is available
- rtweet::get_trends() : to retrieve the trending information for a region
To see which calls to the Twitter API are made we can use the debug_httr_get function:
# devtools::install_github("HanOostdijk/HOQCutil")
HOQCutil::debug_httr_get(
get_trends("united states")
)
#> [1] "https://api.twitter.com/1.1/trends/available.json"
#> [2] "https://api.twitter.com/1.1/trends/place.json?id=23424977"
HOQCutil::debug_httr_get(
get_trends("23424977",exclude=T)
)
#> [1] "https://api.twitter.com/1.1/trends/place.json?id=23424977&exclude=hashtags"
From this we see that get_trends calls the trends/place API directly when a WOEID code is specified. Otherwise (when a region name is specified) it first calls the trends/available API to retrieve all codes, looks up the WOEID code for that name, and then calls the trends/place API with that code.
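That lookup step can be mimicked directly. A minimal sketch, assuming (as the first URL above indicates) that trends_available returns the table with name and woeid columns:
# hedged sketch: look up the WOEID for a region name ourselves
tr <- rtweet::trends_available()
tr$woeid[tr$name == "United States"]  # should match the id=23424977 seen above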
In the following section we look at the trending information the get_trends function delivers. When using parse=F we would expect to get json output, but as we will see it is a nested data.frame:
t1=rtweet::get_trends("23424977",exclude=F)
class(t1)
#> [1] "tbl_df" "tbl" "data.frame"
names(t1)
#> [1] "trend" "url" "promoted_content"
#> [4] "query" "tweet_volume" "place"
#> [7] "woeid" "as_of" "created_at"
dim(t1)
#> [1] 50 9
dplyr::select(t1,trend,tweet_volume) %>%
dplyr::arrange(desc(tweet_volume)) %>%
dplyr::top_n(tweet_volume,n=5)
#> # A tibble: 5 x 2
#> trend tweet_volume
#> <chr> <int>
#> 1 Rest In Peace 134205
#> 2 #BrexitDeal 68493
#> 3 #ThursdayThoughts 61793
#> 4 #GOT7Comeback 60690
#> 5 #SaveLOONA 51865
# now without parsing:
t2=rtweet::get_trends("23424977",exclude=F,parse=F)
class(t2)
#> [1] "data.frame"
names(t2)
#> [1] "trends" "as_of" "created_at" "locations"
dim(t2)
#> [1] 1 4
t3 = t2$trends[[1]]
class(t3)
#> [1] "data.frame"
names(t3)
#> [1] "name" "url" "promoted_content"
#> [4] "query" "tweet_volume"
dim(t3)
#> [1] 50 5
Using debug_httr_get again, now with the argument ret='json', we retrieve ‘real’ json output. The json data is only partly presented (see suppressing output in the appendix for how this is done):
jsondata=HOQCutil::debug_httr_get( rtweet::get_trends("23424977"), ret='json' )
print(jsondata) # output partly suppressed
#> 001 [
#> 002 {
#> 003 "trends": [
#> 004 {
#> 005 "name": "#RIPElijah",
#> 006 "url": "http://twitter.com/search?q=%23RIPElijah",
#> 007 "promoted_content": null,
#> 008 "query": "%23RIPElijah",
#> 009 "tweet_volume": 38367
#> 010 },
#> ....
#> 347 {
#> 348 "name": "#elijaamericanhero",
#> 349 "url": "http://twitter.com/search?q=%23elijaamericanhero",
#> 350 "promoted_content": null,
#> 351 "query": "%23elijaamericanhero",
#> 352 "tweet_volume": null
#> 353 }
#> 354 ],
#> 355 "as_of": "2019-10-17T16:20:22Z",
#> 356 "created_at": "2019-10-17T16:16:14Z",
#> 357 "locations": [
#> 358 {
#> 359 "name": "United States",
#> 360 "woeid": 23424977
#> 361 }
#> 362 ]
#> 363 }
#> 364 ]
Looking at the code in rtweet:::get_trends_ we see that rtweet:::from_js is also called when parse=F. The full process is:
- retrieve the data in raw format from the API (via a call to rtweet:::TWIT)
- convert the raw format to standard json format (in a call to rtweet:::from_js)
- convert the json format to list format (also in that call to rtweet:::from_js)
- convert the list format to data frame format (in rtweet:::get_trends_ when parse=T is specified)
So the conversion from json to list format is always done. I conclude that ‘parsing’ is the name for the process that converts the list format to a data frame format.
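The last two conversions can be replayed with jsonlite directly. A minimal sketch, assuming jsondata still holds the json text retrieved above (fromJSON is assumed here to be the workhorse behind rtweet:::from_js):
# json -> list: no simplification, comparable to what from_js produces
trends_list <- jsonlite::fromJSON(jsondata, simplifyVector = FALSE)
# json -> nested data.frame: with simplification, comparable to the parsed result
trends_df <- jsonlite::fromJSON(jsondata)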
Recipes 3 and 4 Extracting and Searching Tweet Entities
Recipe 3 shows the data that is available when using the rtweet::search_tweets() function to retrieve information about a hashtag (in this case #rstats). Recipe 4 shows how such a request can be refined by adding custom search operators.
Let us follow the example given in Recipe 4 that restricts tweets with the hashtag #rstats to those that have a github reference and no #datascience hashtag.
numtweets = 300
rstats <- rtweet::search_tweets("#rstats url:github -#datascience", n=numtweets)
dim(rstats)
#> [1] 300 90
names(rstats)
#> [1] "user_id" "status_id"
#> [3] "created_at" "screen_name"
#> [5] "text" "source"
#> [7] "display_text_width" "reply_to_status_id"
#> [9] "reply_to_user_id" "reply_to_screen_name"
#> [11] "is_quote" "is_retweet"
#> [13] "favorite_count" "retweet_count"
#> [15] "quote_count" "reply_count"
#> [17] "hashtags" "symbols"
#> [19] "urls_url" "urls_t.co"
#> [21] "urls_expanded_url" "media_url"
#> [23] "media_t.co" "media_expanded_url"
#> [25] "media_type" "ext_media_url"
#> [27] "ext_media_t.co" "ext_media_expanded_url"
#> [29] "ext_media_type" "mentions_user_id"
#> [31] "mentions_screen_name" "lang"
#> [33] "quoted_status_id" "quoted_text"
#> [35] "quoted_created_at" "quoted_source"
#> [37] "quoted_favorite_count" "quoted_retweet_count"
#> [39] "quoted_user_id" "quoted_screen_name"
#> [41] "quoted_name" "quoted_followers_count"
#> [43] "quoted_friends_count" "quoted_statuses_count"
#> [45] "quoted_location" "quoted_description"
#> [47] "quoted_verified" "retweet_status_id"
#> [49] "retweet_text" "retweet_created_at"
#> [51] "retweet_source" "retweet_favorite_count"
#> [53] "retweet_retweet_count" "retweet_user_id"
#> [55] "retweet_screen_name" "retweet_name"
#> [57] "retweet_followers_count" "retweet_friends_count"
#> [59] "retweet_statuses_count" "retweet_location"
#> [61] "retweet_description" "retweet_verified"
#> [63] "place_url" "place_name"
#> [65] "place_full_name" "place_type"
#> [67] "country" "country_code"
#> [69] "geo_coords" "coords_coords"
#> [71] "bbox_coords" "status_url"
#> [73] "name" "location"
#> [75] "description" "url"
#> [77] "protected" "followers_count"
#> [79] "friends_count" "listed_count"
#> [81] "statuses_count" "favourites_count"
#> [83] "account_created_at" "verified"
#> [85] "profile_url" "profile_expanded_url"
#> [87] "account_lang" "profile_banner_url"
#> [89] "profile_background_url" "profile_image_url"
We use the function count_hashtags (see appendix) to display the 5 hashtags (other than #rstats) that occur most often in these tweets.
NB: 6 rows are returned because more than one count is equal to 12:
count_hashtags(rstats,blacklist='rstats',top=5)
#> # A tibble: 6 x 2
#> hashtags n
#> <chr> <int>
#> 1 tidytuesday 20
#> 2 sparkaisummit 17
#> 3 ggplot2 15
#> 4 xaringan 15
#> 5 dataviz 12
#> 6 machinelearning 12
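When exactly five rows are wanted regardless of ties, a simple (hedged) alternative is to take the head of the tied result:
# the result of count_hashtags is sorted, so head() drops the tied sixth row
count_hashtags(rstats, blacklist = 'rstats', top = 5) %>% head(5)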
Another way to select only tweets without the #datascience tag is to omit the -#datascience clause from the API request and use the remove_tweets_with_hashtags function (see appendix). Of course this gives a different result:
rstats <- rtweet::search_tweets("#rstats url:github", n=numtweets)
rstats2 = remove_tweets_with_hashtags(rstats,hashtags='datascience')
dim(rstats)
#> [1] 300 90
dim(rstats2)
#> [1] 216 90
count_hashtags(rstats2,blacklist='rstats',top=5)
#> # A tibble: 5 x 2
#> hashtags n
#> <chr> <int>
#> 1 sparkaisummit 17
#> 2 ggplot2 15
#> 3 xaringan 15
#> 4 tidytuesday 13
#> 5 machinelearning 12
In the same way we can look at the raw json output of a search_tweets request:
jsondata=HOQCutil::debug_httr_get( rtweet::search_tweets("#rstats", n=2), ret='json' )
print(jsondata) # output partly suppressed
#> 001 {
#> 002 "statuses": [
#> 003 {
#> 004 "created_at": "Thu Oct 17 16:20:06 +0000 2019",
#> 005 "id": 1184866687213924352,
#> 006 "id_str": "1184866687213924352",
#> 007 "full_text": "You know when you're working with a large #rstat
#> s <U+0001F4E6> when R CMD check gives you a NOTE about the size of the i
#> nstalled R and doc folders <U+0001F631> #vegan",
#> 008 "truncated": false,
#> 009 "display_text_range": [
#> 010 0,
#> ....
#> 536 "favorite_count": 0,
#> 537 "favorited": false,
#> 538 "retweeted": false,
#> 539 "lang": "en"
#> 540 }
#> 541 ],
#> 542 "search_metadata": {
#> 543 "completed_in": 0.028,
#> 544 "max_id": 1184866687213924352,
#> 545 "max_id_str": "1184866687213924352",
#> 546 "next_results": "?max_id=1184866574450221061&q=%23rstats&count=2
#> &include_entities=1&result_type=recent",
#> 547 "query": "%23rstats",
#> 548 "refresh_url": "?since_id=1184866687213924352&q=%23rstats&result
#> _type=recent&include_entities=1",
#> 549 "count": 2,
#> 550 "since_id": 0,
#> 551 "since_id_str": "0"
#> 552 }
#> 553 }
API calls used by rtweet::search_tweets
url = HOQCutil::debug_httr_get(
rtweet::search_tweets("#rstats url:github -#datascience", n=numtweets)
)
purrr::map_chr(url,URLdecode)
#> [1] "https://api.twitter.com/1.1/search/tweets.json?q=#rstats url:github -#datascience&result_type=recent&count=100&tweet_mode=extended"
#> [2] "https://api.twitter.com/1.1/search/tweets.json?q=#rstats url:github -#datascience&result_type=recent&count=100&max_id=1184590069652475906&tweet_mode=extended"
#> [3] "https://api.twitter.com/1.1/search/tweets.json?q=#rstats url:github -#datascience&result_type=recent&count=100&max_id=1184420239594930175&tweet_mode=extended"
To our surprise the rtweet::search_tweets function made not one but three API requests. Apparently the full request for 300 tweets is split into 3 API requests of 100 tweets each, with the max_id parameter used to page through the results.
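We can imitate this paging by hand. A minimal sketch, assuming (as the URLs above suggest) that search_tweets accepts a max_id argument and returns the tweets in reverse chronological order:
# hedged sketch of manual paging: the last row holds the oldest status_id
page1 <- rtweet::search_tweets("#rstats", n = 100)
page2 <- rtweet::search_tweets("#rstats", n = 100,
                               max_id = page1$status_id[nrow(page1)])
# note: max_id is inclusive, so expect one overlapping tweet between pages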
Recipe 5 Extracting a Retweet’s Origins
In Recipe 5 we see how tweets are collected and checked for being retweets: in that case the variable retweet_count is greater than 0. For these tweets the first part of the text of the retweet and the ‘screen names’ (including, presumably, the author of the original tweet) are given together with the number of retweets of the original tweet.
rstats <- rtweet::search_tweets("#rstats -rstudio", n=500)
dplyr::filter(rstats, retweet_count > 0) %>%
dplyr::select(text, mentions_screen_name, retweet_count) %>%
dplyr::mutate(text = substr(text, 1, 30)) %>%
tidyr::unnest(cols = c(mentions_screen_name)) %>%
dplyr::arrange(desc(retweet_count)) %>%
head(5)
#> # A tibble: 5 x 3
#> text mentions_screen_name retweet_count
#> <chr> <chr> <int>
#> 1 The remarkable decline in infa toddrjones 251
#> 2 The remarkable decline in infa toddrjones 251
#> 3 The remarkable decline in infa toddrjones 251
#> 4 The remarkable decline in infa toddrjones 251
#> 5 The remarkable decline in infa toddrjones 251
Multiple retweets of the same original tweet each produce a row, hence the duplicates above. To see which original tweets are most retweeted the following code (note the unique()) can be used:
dplyr::filter(rstats, retweet_count > 0) %>%
dplyr::select(retweet_status_id, retweet_text,
retweet_screen_name, retweet_count) %>%
dplyr::mutate(retweet_text = substr(retweet_text, 1, 30)) %>%
unique() %>%
dplyr::arrange(desc(retweet_count)) %>%
head(5)
#> # A tibble: 5 x 4
#> retweet_status_id retweet_text retweet_screen_n~ retweet_count
#> <chr> <chr> <chr> <int>
#> 1 11841815983439093~ The remarkable declin~ toddrjones 251
#> 2 11844808762913423~ Global population pro~ neilrkaye 168
#> 3 11698012134094110~ "Bayesian Linear Mixe~ Rbloggers 102
#> 4 11844843918521507~ [use case] Using rorc~ rOpenSci 58
#> 5 11822613635838320~ I'm delighted to anno~ coolbutuseless 58
An alternative is to use the retweets filter as indicated in custom search operators. However, sometimes there appears to be a difference in the number of rows of rstats and rstats2.
rstats <- rtweet::search_tweets("#rstats -rstudio filter:retweets", n=500)
(nrow(rstats))
#> [1] 500
rstats2 <- dplyr::filter(rstats, retweet_count > 0)
(nrow(rstats2))
#> [1] 500
rstats2 %>%
dplyr::select(retweet_status_id, retweet_text,
retweet_screen_name, retweet_count) %>%
dplyr::mutate(retweet_text = substr(retweet_text, 1, 30)) %>%
unique() %>%
dplyr::arrange(desc(retweet_count)) %>%
head(5)
#> # A tibble: 5 x 4
#> retweet_status_id retweet_text retweet_screen_n~ retweet_count
#> <chr> <chr> <chr> <int>
#> 1 11841815983439093~ The remarkable declin~ toddrjones 251
#> 2 11844808762913423~ Global population pro~ neilrkaye 168
#> 3 11698012134094110~ "Bayesian Linear Mixe~ Rbloggers 102
#> 4 11840994444191047~ 24 Free #DataScience ~ KirkDBorne 68
#> 5 11844843918521507~ [use case] Using rorc~ rOpenSci 58
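The same selection can also be made without a filter clause: search_tweets exposes an include_rts argument. A hedged sketch for excluding retweets that way:
# hedged sketch: exclude retweets via the include_rts argument
norts <- rtweet::search_tweets("#rstats -rstudio", n = 100, include_rts = FALSE)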
Retrieve fields from tweets of a timeline
We created the function force_json (see appendix) to retrieve the output of a Twitter request in json format. With the function fields_from_tweets we can select fields from this output.
At first I did not include the check argument and therefore used the default check=T. The following makes clear why this resulted in a list of length 2: check=T does an additional call to the API to check for resource availability:
force_json(rtweet::get_timeline("xieyihui",n=5),ret='url')
#> [1] "https://api.twitter.com/1.1/application/rate_limit_status.json"
#> [2] "https://api.twitter.com/1.1/statuses/user_timeline.json?screen_name=xieyihui&count=5&tweet_mode=extended"
force_json(rtweet::get_timeline("xieyihui",n=5,check=F),ret='url')
#> [1] "https://api.twitter.com/1.1/statuses/user_timeline.json?screen_name=xieyihui&count=5&tweet_mode=extended"
timeline_tweets = force_json(rtweet::get_timeline("xieyihui",n=5,check=F) )
names = c('created_at','id_str','full_text','in_reply_to_status_id_str',
'in_reply_to_screen_name','user_name')
fields = list('created_at','id_str','full_text','in_reply_to_status_id_str',
'in_reply_to_screen_name',list('user','name'))
fields_from_tweets(timeline_tweets,names,fields) %>%
dplyr::mutate(full_text=stringr::str_sub(full_text,1,30))
#> # A tibble: 5 x 6
#> created_at id_str full_text in_reply_to_sta~ in_reply_to_scr~ user_name
#> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 Wed Oct 16~ 118453~ @robinso~ 118453062386580~ xieyihui Yihui Xie
#> 2 Wed Oct 16~ 118453~ @robinso~ 118453000420352~ xieyihui Yihui Xie
#> 3 Wed Oct 16~ 118453~ @robinso~ 118241720999547~ robinson_es Yihui Xie
#> 4 Mon Oct 14~ 118359~ @nj_tier~ 117972004833361~ nj_tierney Yihui Xie
#> 5 Wed Oct 02~ 117947~ @jtrnyc ~ 117722894170009~ jtrnyc Yihui Xie
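Because the fields are specified as pluck paths, nested fields are just as easy to reach. A minimal sketch with a hypothetical column name (user_screen_name is made up here), assuming the statuses carry the standard user object:
# 'user_screen_name' is a hypothetical label for this example
fields_from_tweets(timeline_tweets, c('user_screen_name'),
                   list(list('user','screen_name')))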
URLs for rtweet functions
force_json(rtweet::get_timeline("xieyihui",n=5),ret='url')
#> [1] "https://api.twitter.com/1.1/application/rate_limit_status.json"
#> [2] "https://api.twitter.com/1.1/statuses/user_timeline.json?screen_name=xieyihui&count=5&tweet_mode=extended"
force_json(rtweet::lookup_statuses("1182709902197825537"),ret='url')
#> [1] "https://api.twitter.com/1.1/statuses/lookup.json?id=1182709902197825537&tweet_mode=extended"
force_json(rtweet::get_followers("HanOostdijk",n= 10),ret='url')
#> [1] "https://api.twitter.com/1.1/application/rate_limit_status.json"
#> [2] "https://api.twitter.com/1.1/followers/ids.json?screen_name=HanOostdijk&count=10&cursor=-1&stringify_ids=TRUE"
force_json(rtweet::lookup_users("HanOostdijk"),ret='url')
#> [1] "https://api.twitter.com/1.1/users/lookup.json?screen_name=HanOostdijk"
Appendix
function silent_library
# load a package while suppressing warnings and startup messages
silent_library <- function (package_name, mywarnings = FALSE) {
  suppressWarnings({
    suppressPackageStartupMessages({
      library(
        package_name,
        character.only = TRUE,
        warn.conflicts = mywarnings,
        quietly = !mywarnings,
        verbose = mywarnings
      )
    })
  })
}
suppressing output
The next code is an example of how to show only the relevant part of output. Instead of print(jsondata) use this to show only part (the first 10 and the last 18 lines) of the output:
d = HOQCutil::cap.out(print(jsondata),lines=1:10,line_numbering=T)
cat('\t ....')
numlines = length(strsplit(jsondata,'\\n')[[1]])
d = HOQCutil::cap.out(print(jsondata),lines=(numlines-17):numlines,line_numbering=T)
function count_hashtags
count_hashtags <- function (df, blacklist = NULL, top = 10) {
  `%>%` <- magrittr::`%>%`
  dplyr::select(df, hashtags) %>%
    tidyr::unnest(cols = c(hashtags)) %>%            # one row per hashtag
    dplyr::mutate(hashtags = tolower(hashtags)) %>%
    dplyr::count(hashtags, sort = TRUE) %>%
    dplyr::filter(!is.na(hashtags)) %>%
    dplyr::filter(!(hashtags %in% !! blacklist)) %>% # drop blacklisted tags
    dplyr::top_n(!! top, n)                          # keeps ties
}
function remove_tweets_with_hashtags
remove_tweets_with_hashtags <- function (df, hashtags = NULL) {
  `%>%` <- magrittr::`%>%`
  tidyr::unnest(df, cols = c(hashtags)) %>%              # one row per hashtag
    dplyr::mutate(hashtags = tolower(hashtags)) %>%
    dplyr::group_by(status_id) %>%
    dplyr::filter(!any(hashtags %in% !! hashtags)) %>%   # drop the whole tweet on a match
    tidyr::nest(hashtags = c(hashtags)) %>%              # restore the list column
    dplyr::mutate(hashtags = list(unlist(hashtags, use.names = F))) %>%
    dplyr::ungroup()
}
function force_json
force_json <- function (cmd,
                        tolist = T,
                        simplifyVector = F,
                        ret = 'json',
                        ...) {
  # capture the request made by cmd and return its raw result
  jsondata = HOQCutil::debug_httr_get(cmd, ret = ret)
  if (ret == 'json' && tolist == T) {
    if (inherits(jsondata, "json")) {        # a single request was made
      jsondata = jsonlite::fromJSON(jsondata, simplifyVector = simplifyVector, ...)
    } else if (inherits(jsondata, "list")) { # multiple requests were made
      jsondata = purrr::map(jsondata,
        ~jsonlite::fromJSON(.x, simplifyVector = simplifyVector, ...))
    } else {
      jsondata = NULL
    }
  }
  jsondata
}
function fields_from_tweets
fields_from_tweets <- function(tweets, names, fields) {
  # extract one named field from one tweet (NA when absent)
  field_from_tweet <- function(tweet, name, field) {
    x = list(do.call(purrr::pluck, c(list(tweet), field, .default = NA)))
    names(x) = name
    x
  }
  # extract all fields from one tweet as a one-row tibble
  fields_from_tweet <- function(tweet, names, fields) {
    purrr::map2_dfc(names, fields, ~field_from_tweet(tweet, .x, .y))
  }
  # one row per tweet
  purrr::map_dfr(tweets, ~fields_from_tweet(., names, fields))
}
SessionInfo
#> R version 3.6.0 (2019-04-26)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 18362)
#>
#> Matrix products: default
#>
#> locale:
#> [1] LC_COLLATE=English_United States.1252
#> [2] LC_CTYPE=English_United States.1252
#> [3] LC_MONETARY=English_United States.1252
#> [4] LC_NUMERIC=C
#> [5] LC_TIME=English_United States.1252
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] httr_1.4.1 rtweet_0.6.9
#>
#> loaded via a namespace (and not attached):
#> [1] Rcpp_1.0.2 pillar_1.4.2 compiler_3.6.0
#> [4] prettyunits_1.0.2 tools_3.6.0 progress_1.2.0
#> [7] zeallot_0.1.0 digest_0.6.20 lifecycle_0.1.0
#> [10] jsonlite_1.6 evaluate_0.14 tibble_2.1.3
#> [13] pkgconfig_2.0.2 rlang_0.4.0 cli_1.1.0
#> [16] curl_4.0 xfun_0.8 stringr_1.4.0
#> [19] dplyr_0.8.3 knitr_1.25 vctrs_0.2.0
#> [22] askpass_1.1 hms_0.4.2 tidyselect_0.2.5
#> [25] glue_1.3.1 R6_2.4.0 fansi_0.4.0
#> [28] rmarkdown_1.16 tidyr_1.0.0 purrr_0.3.2
#> [31] magrittr_1.5 backports_1.1.4 htmltools_0.3.6
#> [34] assertthat_0.2.1 utf8_1.1.4 stringi_1.4.3
#> [37] openssl_1.4.1 HOQCutil_0.1.13 crayon_1.3.4