Date last run: 17Oct2019
Introduction
While browsing the internet I found 21 Recipes for Mining Twitter Data with rtweet by Bob Rudis and Paul Campbell. The corresponding GitHub repository points to a blog entry with some background material. In this article I try to replay some of the recipes.
In two other articles I described how to find out which URLs are generated (url generation in the rtweet package) and how the program flow in the rtweet functions works (Program flow in the rtweet package).
Recipe 1 Using OAuth to Access Twitter APIs
After installing the package rtweet I use the silent_library function (see appendix) to load the package. Following the vignette and Recipe 1 I created a Twitter application that I called HOQC_31415. See (url generation in the rtweet package).
`%>%` <- magrittr::`%>%`
silent_library('rtweet')
token <- rtweet::create_token(
  app = "HOQC_31415",
  ...                      # consumer and access keys omitted here
)
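Once created, the token is cached by rtweet so that later sessions can reuse it. A minimal check (hedged: get_token should return the stored token):
# hedged sanity check: the cached token should be found automatically
rtweet::get_token()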
Recipe 2 Looking Up the Trending Topics
Recipe 2 describes how we can look up the trending topics in certain regions. The following functions exist:
- rtweet::trends_available() : to find the names and WOEID codes of regions for which trending information is available
- rtweet::get_trends() : to retrieve the trending information for a region
To see which calls to the Twitter API are made we can use the debug_httr_get function:
# devtools::install_github("HanOostdijk/HOQCutil")
HOQCutil::debug_httr_get(
get_trends("united states")
)
#> [1] "https://api.twitter.com/1.1/trends/available.json"
#> [2] "https://api.twitter.com/1.1/trends/place.json?id=23424977"
HOQCutil::debug_httr_get(
get_trends("23424977",exclude=T)
)
#> [1] "https://api.twitter.com/1.1/trends/place.json?id=23424977&exclude=hashtags"
From this we see that get_trends calls the trends/place API directly when a WOEID code is specified. Otherwise (when a region name is specified) it first calls the trends/available API to retrieve all codes, looks up the WOEID code for that name, and then calls the trends/place API with that code.
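That lookup step can be mimicked directly. A minimal sketch, assuming (as the first URL above indicates) that trends_available returns the table with name and woeid columns:
# hedged sketch: look up the WOEID for a region name ourselves
tr <- rtweet::trends_available()
tr$woeid[tr$name == "United States"]  # should match the id=23424977 seen above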
In the following section we look at the trending information the get_trends function delivers. When using parse=F we would expect to get json output, but as we will see it is a nested data.frame:
t1=rtweet::get_trends("23424977",exclude=F)
class(t1)
#> [1] "tbl_df" "tbl" "data.frame"
names(t1)
#> [1] "trend" "url" "promoted_content"
#> [4] "query" "tweet_volume" "place"
#> [7] "woeid" "as_of" "created_at"
dim(t1)
#> [1] 50 9
dplyr::select(t1,trend,tweet_volume) %>%
dplyr::arrange(desc(tweet_volume)) %>%
dplyr::top_n(tweet_volume,n=5)
#> # A tibble: 5 x 2
#> trend tweet_volume
#> <chr> <int>
#> 1 Rest In Peace 134205
#> 2 #BrexitDeal 68493
#> 3 #ThursdayThoughts 61793
#> 4 #GOT7Comeback 60690
#> 5 #SaveLOONA 51865
# now without parsing:
t2=rtweet::get_trends("23424977",exclude=F,parse=F)
class(t2)
#> [1] "data.frame"
names(t2)
#> [1] "trends" "as_of" "created_at" "locations"
dim(t2)
#> [1] 1 4
t3 = t2$trends[[1]]
class(t3)
#> [1] "data.frame"
names(t3)
#> [1] "name" "url" "promoted_content"
#> [4] "query" "tweet_volume"
dim(t3)
#> [1] 50 5
Using debug_httr_get again, now with the argument ret='json', we retrieve ‘real’ json output. The json data is only partly presented (see suppressing output in the appendix for how this is done):
jsondata=HOQCutil::debug_httr_get( rtweet::get_trends("23424977"), ret='json' )
print(jsondata) # output partly suppressed
#> 001 [
#> 002 {
#> 003 "trends": [
#> 004 {
#> 005 "name": "#RIPElijah",
#> 006 "url": "http://twitter.com/search?q=%23RIPElijah",
#> 007 "promoted_content": null,
#> 008 "query": "%23RIPElijah",
#> 009 "tweet_volume": 38367
#> 010 },
#> ....
#> 347 {
#> 348 "name": "#elijaamericanhero",
#> 349 "url": "http://twitter.com/search?q=%23elijaamericanhero",
#> 350 "promoted_content": null,
#> 351 "query": "%23elijaamericanhero",
#> 352 "tweet_volume": null
#> 353 }
#> 354 ],
#> 355 "as_of": "2019-10-17T16:20:22Z",
#> 356 "created_at": "2019-10-17T16:16:14Z",
#> 357 "locations": [
#> 358 {
#> 359 "name": "United States",
#> 360 "woeid": 23424977
#> 361 }
#> 362 ]
#> 363 }
#> 364 ]
Looking at the code in rtweet:::get_trends_ we see that rtweet:::from_js is also called when parse=F. The full process is:
- retrieve the data in raw format from the API (via a call to rtweet:::TWIT)
- convert the raw format to standard json format (in a call to rtweet:::from_js)
- convert the json format to list format (also in that call to rtweet:::from_js)
- convert the list format to data frame format (in rtweet:::get_trends_ when parse=T is specified)
So the conversion from json to list format is always done. I conclude that ‘parsing’ is the name for the process that converts the list format to a data frame format.
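The last two conversions can be replayed with jsonlite directly. A minimal sketch, assuming jsondata still holds the json text retrieved above (fromJSON is assumed here to be the workhorse behind rtweet:::from_js):
# json -> list: no simplification, comparable to what from_js produces
trends_list <- jsonlite::fromJSON(jsondata, simplifyVector = FALSE)
# json -> nested data.frame: with simplification, comparable to the parsed result
trends_df <- jsonlite::fromJSON(jsondata)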
Recipes 3 and 4 Extracting and Searching Tweet Entities
Recipe 3 shows the data that is available when using the rtweet::search_tweets() function to retrieve information about a hashtag (in this case #rstats). Recipe 4 shows how such a request can be refined by adding custom search operators.
Let us follow the example given in Recipe 4 that restricts tweets with the hashtag #rstats to those that have a github reference and no #datascience hashtag.
numtweets = 300
rstats <- rtweet::search_tweets("#rstats url:github -#datascience", n=numtweets)
dim(rstats)
#> [1] 300 90
names(rstats)
#> [1] "user_id" "status_id"
#> [3] "created_at" "screen_name"
#> [5] "text" "source"
#> [7] "display_text_width" "reply_to_status_id"
#> [9] "reply_to_user_id" "reply_to_screen_name"
#> [11] "is_quote" "is_retweet"
#> [13] "favorite_count" "retweet_count"
#> [15] "quote_count" "reply_count"
#> [17] "hashtags" "symbols"
#> [19] "urls_url" "urls_t.co"
#> [21] "urls_expanded_url" "media_url"
#> [23] "media_t.co" "media_expanded_url"
#> [25] "media_type" "ext_media_url"
#> [27] "ext_media_t.co" "ext_media_expanded_url"
#> [29] "ext_media_type" "mentions_user_id"
#> [31] "mentions_screen_name" "lang"
#> [33] "quoted_status_id" "quoted_text"
#> [35] "quoted_created_at" "quoted_source"
#> [37] "quoted_favorite_count" "quoted_retweet_count"
#> [39] "quoted_user_id" "quoted_screen_name"
#> [41] "quoted_name" "quoted_followers_count"
#> [43] "quoted_friends_count" "quoted_statuses_count"
#> [45] "quoted_location" "quoted_description"
#> [47] "quoted_verified" "retweet_status_id"
#> [49] "retweet_text" "retweet_created_at"
#> [51] "retweet_source" "retweet_favorite_count"
#> [53] "retweet_retweet_count" "retweet_user_id"
#> [55] "retweet_screen_name" "retweet_name"
#> [57] "retweet_followers_count" "retweet_friends_count"
#> [59] "retweet_statuses_count" "retweet_location"
#> [61] "retweet_description" "retweet_verified"
#> [63] "place_url" "place_name"
#> [65] "place_full_name" "place_type"
#> [67] "country" "country_code"
#> [69] "geo_coords" "coords_coords"
#> [71] "bbox_coords" "status_url"
#> [73] "name" "location"
#> [75] "description" "url"
#> [77] "protected" "followers_count"
#> [79] "friends_count" "listed_count"
#> [81] "statuses_count" "favourites_count"
#> [83] "account_created_at" "verified"
#> [85] "profile_url" "profile_expanded_url"
#> [87] "account_lang" "profile_banner_url"
#> [89] "profile_background_url" "profile_image_url"
We use the function count_hashtags (see appendix) to display the 5 hashtags (other than #rstats) that occur most often in these tweets.
NB: 6 rows are returned because more than one count is equal to 12:
count_hashtags(rstats,blacklist='rstats',top=5)
#> # A tibble: 6 x 2
#> hashtags n
#> <chr> <int>
#> 1 tidytuesday 20
#> 2 sparkaisummit 17
#> 3 ggplot2 15
#> 4 xaringan 15
#> 5 dataviz 12
#> 6 machinelearning 12
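When exactly five rows are wanted regardless of ties, a simple (hedged) alternative is to take the head of the tied result:
# the result of count_hashtags is sorted, so head() drops the tied sixth row
count_hashtags(rstats, blacklist = 'rstats', top = 5) %>% head(5)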
Another way to select only tweets without the #datascience tag is to omit the -#datascience clause from the API request and use the remove_tweets_with_hashtags function (see appendix). Of course this gives a different result:
rstats <- rtweet::search_tweets("#rstats url:github", n=numtweets)
rstats2 = remove_tweets_with_hashtags(rstats,hashtags='datascience')
dim(rstats)
#> [1] 300 90
dim(rstats2)
#> [1] 216 90
count_hashtags(rstats2,blacklist='rstats',top=5)
#> # A tibble: 5 x 2
#> hashtags n
#> <chr> <int>
#> 1 sparkaisummit 17
#> 2 ggplot2 15
#> 3 xaringan 15
#> 4 tidytuesday 13
#> 5 machinelearning 12
In the same way we can look at the raw json output of a search_tweets request:
jsondata=HOQCutil::debug_httr_get( rtweet::search_tweets("#rstats", n=2), ret='json' )
print(jsondata) # output partly suppressed
#> 001 {
#> 002 "statuses": [
#> 003 {
#> 004 "created_at": "Thu Oct 17 16:20:06 +0000 2019",
#> 005 "id": 1184866687213924352,
#> 006 "id_str": "1184866687213924352",
#> 007 "full_text": "You know when you're working with a large #rstat
#> s <U+0001F4E6> when R CMD check gives you a NOTE about the size of the i
#> nstalled R and doc folders <U+0001F631> #vegan",
#> 008 "truncated": false,
#> 009 "display_text_range": [
#> 010 0,
#> ....
#> 536 "favorite_count": 0,
#> 537 "favorited": false,
#> 538 "retweeted": false,
#> 539 "lang": "en"
#> 540 }
#> 541 ],
#> 542 "search_metadata": {
#> 543 "completed_in": 0.028,
#> 544 "max_id": 1184866687213924352,
#> 545 "max_id_str": "1184866687213924352",
#> 546 "next_results": "?max_id=1184866574450221061&q=%23rstats&count=2
#> &include_entities=1&result_type=recent",
#> 547 "query": "%23rstats",
#> 548 "refresh_url": "?since_id=1184866687213924352&q=%23rstats&result
#> _type=recent&include_entities=1",
#> 549 "count": 2,
#> 550 "since_id": 0,
#> 551 "since_id_str": "0"
#> 552 }
#> 553 }
API calls used by rtweet::search_tweets
url = HOQCutil::debug_httr_get(
rtweet::search_tweets("#rstats url:github -#datascience", n=numtweets)
)
purrr::map_chr(url,URLdecode)
#> [1] "https://api.twitter.com/1.1/search/tweets.json?q=#rstats url:github -#datascience&result_type=recent&count=100&tweet_mode=extended"
#> [2] "https://api.twitter.com/1.1/search/tweets.json?q=#rstats url:github -#datascience&result_type=recent&count=100&max_id=1184590069652475906&tweet_mode=extended"
#> [3] "https://api.twitter.com/1.1/search/tweets.json?q=#rstats url:github -#datascience&result_type=recent&count=100&max_id=1184420239594930175&tweet_mode=extended"
To our surprise the rtweet::search_tweets function made not one but three API requests. Apparently the full request for 300 tweets is split into 3 API requests of 100 tweets each, with the max_id parameter used to page through the results.
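We can imitate this paging by hand. A minimal sketch, assuming (as the URLs above suggest) that search_tweets accepts a max_id argument and returns the tweets in reverse chronological order:
# hedged sketch of manual paging: the last row holds the oldest status_id
page1 <- rtweet::search_tweets("#rstats", n = 100)
page2 <- rtweet::search_tweets("#rstats", n = 100,
                               max_id = page1$status_id[nrow(page1)])
# note: max_id is inclusive, so expect one overlapping tweet between pages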
Recipe 5 Extracting a Retweet’s Origins
In Recipe 5 we see how tweets are collected and checked for being retweets: in that case the variable retweet_count is greater than 0. For these tweets the first part of the text of the retweet and the ‘screen names’ (including, presumably, the author of the original tweet) are given together with the number of retweets of the original tweet.
rstats <- rtweet::search_tweets("#rstats -rstudio", n=500)
dplyr::filter(rstats, retweet_count > 0) %>%
dplyr::select(text, mentions_screen_name, retweet_count) %>%
dplyr::mutate(text = substr(text, 1, 30)) %>%
tidyr::unnest(cols = c(mentions_screen_name)) %>%
dplyr::arrange(desc(retweet_count)) %>%
head(5)
#> # A tibble: 5 x 3
#> text mentions_screen_name retweet_count
#> <chr> <chr> <int>
#> 1 The remarkable decline in infa toddrjones 251
#> 2 The remarkable decline in infa toddrjones 251
#> 3 The remarkable decline in infa toddrjones 251
#> 4 The remarkable decline in infa toddrjones 251
#> 5 The remarkable decline in infa toddrjones 251
Multiple retweets of the same original tweet each produce a row, hence the duplicates above. To see which original tweets are most retweeted the following code (note the unique()) can be used:
dplyr::filter(rstats, retweet_count > 0) %>%
dplyr::select(retweet_status_id, retweet_text,
retweet_screen_name, retweet_count) %>%
dplyr::mutate(retweet_text = substr(retweet_text, 1, 30)) %>%
unique() %>%
dplyr::arrange(desc(retweet_count)) %>%
head(5)
#> # A tibble: 5 x 4
#> retweet_status_id retweet_text retweet_screen_n~ retweet_count
#> <chr> <chr> <chr> <int>
#> 1 11841815983439093~ The remarkable declin~ toddrjones 251
#> 2 11844808762913423~ Global population pro~ neilrkaye 168
#> 3 11698012134094110~ "Bayesian Linear Mixe~ Rbloggers 102
#> 4 11844843918521507~ [use case] Using rorc~ rOpenSci 58
#> 5 11822613635838320~ I'm delighted to anno~ coolbutuseless 58
An alternative is to use the retweets filter as indicated in custom search operators. However, sometimes there appears to be a difference in the number of rows of rstats and rstats2.
rstats <- rtweet::search_tweets("#rstats -rstudio filter:retweets", n=500)
(nrow(rstats))
#> [1] 500
rstats2 <- dplyr::filter(rstats, retweet_count > 0)
(nrow(rstats2))
#> [1] 500
rstats2 %>%
dplyr::select(retweet_status_id, retweet_text,
retweet_screen_name, retweet_count) %>%
dplyr::mutate(retweet_text = substr(retweet_text, 1, 30)) %>%
unique() %>%
dplyr::arrange(desc(retweet_count)) %>%
head(5)
#> # A tibble: 5 x 4
#> retweet_status_id retweet_text retweet_screen_n~ retweet_count
#> <chr> <chr> <chr> <int>
#> 1 11841815983439093~ The remarkable declin~ toddrjones 251
#> 2 11844808762913423~ Global population pro~ neilrkaye 168
#> 3 11698012134094110~ "Bayesian Linear Mixe~ Rbloggers 102
#> 4 11840994444191047~ 24 Free #DataScience ~ KirkDBorne 68
#> 5 11844843918521507~ [use case] Using rorc~ rOpenSci 58
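The same selection can also be made without a filter clause: search_tweets exposes an include_rts argument. A hedged sketch for excluding retweets that way:
# hedged sketch: exclude retweets via the include_rts argument
norts <- rtweet::search_tweets("#rstats -rstudio", n = 100, include_rts = FALSE)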
Retrieve fields from tweets of a timeline
We created the function force_json (see appendix) to retrieve the output of a Twitter request in json format. With the function fields_from_tweets we can select fields from this output.
At first I did not include the check argument and therefore used the default check=T. The following makes clear why this resulted in a list of length 2: check=T does an additional call to the API to check for resource availability:
force_json(rtweet::get_timeline("xieyihui",n=5),ret='url')
#> [1] "https://api.twitter.com/1.1/application/rate_limit_status.json"
#> [2] "https://api.twitter.com/1.1/statuses/user_timeline.json?screen_name=xieyihui&count=5&tweet_mode=extended"
force_json(rtweet::get_timeline("xieyihui",n=5,check=F),ret='url')
#> [1] "https://api.twitter.com/1.1/statuses/user_timeline.json?screen_name=xieyihui&count=5&tweet_mode=extended"
timeline_tweets = force_json(rtweet::get_timeline("xieyihui",n=5,check=F) )
names = c('created_at','id_str','full_text','in_reply_to_status_id_str',
'in_reply_to_screen_name','user_name')
fields = list('created_at','id_str','full_text','in_reply_to_status_id_str',
'in_reply_to_screen_name',list('user','name'))
fields_from_tweets(timeline_tweets,names,fields) %>%
dplyr::mutate(full_text=stringr::str_sub(full_text,1,30))
#> # A tibble: 5 x 6
#> created_at id_str full_text in_reply_to_sta~ in_reply_to_scr~ user_name
#> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 Wed Oct 16~ 118453~ @robinso~ 118453062386580~ xieyihui Yihui Xie
#> 2 Wed Oct 16~ 118453~ @robinso~ 118453000420352~ xieyihui Yihui Xie
#> 3 Wed Oct 16~ 118453~ @robinso~ 118241720999547~ robinson_es Yihui Xie
#> 4 Mon Oct 14~ 118359~ @nj_tier~ 117972004833361~ nj_tierney Yihui Xie
#> 5 Wed Oct 02~ 117947~ @jtrnyc ~ 117722894170009~ jtrnyc Yihui Xie
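Because the fields are specified as pluck paths, nested fields are just as easy to reach. A minimal sketch with a hypothetical column name (user_screen_name is made up here), assuming the statuses carry the standard user object:
# 'user_screen_name' is a hypothetical label for this example
fields_from_tweets(timeline_tweets, c('user_screen_name'),
                   list(list('user','screen_name')))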
URLs for rtweet functions
force_json(rtweet::get_timeline("xieyihui",n=5),ret='url')
#> [1] "https://api.twitter.com/1.1/application/rate_limit_status.json"
#> [2] "https://api.twitter.com/1.1/statuses/user_timeline.json?screen_name=xieyihui&count=5&tweet_mode=extended"
force_json(rtweet::lookup_statuses("1182709902197825537"),ret='url')
#> [1] "https://api.twitter.com/1.1/statuses/lookup.json?id=1182709902197825537&tweet_mode=extended"
force_json(rtweet::get_followers("HanOostdijk",n= 10),ret='url')
#> [1] "https://api.twitter.com/1.1/application/rate_limit_status.json"
#> [2] "https://api.twitter.com/1.1/followers/ids.json?screen_name=HanOostdijk&count=10&cursor=-1&stringify_ids=TRUE"
force_json(rtweet::lookup_users("HanOostdijk"),ret='url')
#> [1] "https://api.twitter.com/1.1/users/lookup.json?screen_name=HanOostdijk"
Appendix
function silent_library
# load a package while suppressing warnings and startup messages
silent_library <- function (package_name, mywarnings = FALSE) {
  suppressWarnings({
    suppressPackageStartupMessages({
      library(
        package_name,
        character.only = TRUE,
        warn.conflicts = mywarnings,
        quietly = !mywarnings,
        verbose = mywarnings
      )
    })
  })
}
suppressing output
The next code is an example of how to show only the relevant part of output. Instead of print(jsondata) use this to show only part (the first 10 and the last 18 lines) of the output:
d = HOQCutil::cap.out(print(jsondata),lines=1:10,line_numbering=T)
cat('\t ....')
numlines = length(strsplit(jsondata,'\\n')[[1]])
d = HOQCutil::cap.out(print(jsondata),lines=(numlines-17):numlines,line_numbering=T)
function count_hashtags
count_hashtags <- function (df, blacklist = NULL, top = 10) {
  `%>%` <- magrittr::`%>%`
  dplyr::select(df, hashtags) %>%
    tidyr::unnest(cols = c(hashtags)) %>%            # one row per hashtag
    dplyr::mutate(hashtags = tolower(hashtags)) %>%
    dplyr::count(hashtags, sort = TRUE) %>%
    dplyr::filter(!is.na(hashtags)) %>%
    dplyr::filter(!(hashtags %in% !! blacklist)) %>% # drop blacklisted tags
    dplyr::top_n(!! top, n)                          # keeps ties
}
function remove_tweets_with_hashtags
remove_tweets_with_hashtags <- function (df, hashtags = NULL) {
  `%>%` <- magrittr::`%>%`
  tidyr::unnest(df, cols = c(hashtags)) %>%              # one row per hashtag
    dplyr::mutate(hashtags = tolower(hashtags)) %>%
    dplyr::group_by(status_id) %>%
    dplyr::filter(!any(hashtags %in% !! hashtags)) %>%   # drop the whole tweet on a match
    tidyr::nest(hashtags = c(hashtags)) %>%              # restore the list column
    dplyr::mutate(hashtags = list(unlist(hashtags, use.names = F))) %>%
    dplyr::ungroup()
}
function force_json
force_json <- function (cmd,
                        tolist = T,
                        simplifyVector = F,
                        ret = 'json',
                        ...) {
  # capture the request made by cmd and return its raw result
  jsondata = HOQCutil::debug_httr_get(cmd, ret = ret)
  if (ret == 'json' && tolist == T) {
    if (inherits(jsondata, "json")) {        # a single request was made
      jsondata = jsonlite::fromJSON(jsondata, simplifyVector = simplifyVector, ...)
    } else if (inherits(jsondata, "list")) { # multiple requests were made
      jsondata = purrr::map(jsondata,
        ~jsonlite::fromJSON(.x, simplifyVector = simplifyVector, ...))
    } else {
      jsondata = NULL
    }
  }
  jsondata
}
function fields_from_tweets
fields_from_tweets <- function(tweets, names, fields) {
  # extract one named field from one tweet (NA when absent)
  field_from_tweet <- function(tweet, name, field) {
    x = list(do.call(purrr::pluck, c(list(tweet), field, .default = NA)))
    names(x) = name
    x
  }
  # extract all fields from one tweet as a one-row tibble
  fields_from_tweet <- function(tweet, names, fields) {
    purrr::map2_dfc(names, fields, ~field_from_tweet(tweet, .x, .y))
  }
  # one row per tweet
  purrr::map_dfr(tweets, ~fields_from_tweet(., names, fields))
}
SessionInfo
#> R version 3.6.0 (2019-04-26)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 18362)
#>
#> Matrix products: default
#>
#> locale:
#> [1] LC_COLLATE=English_United States.1252
#> [2] LC_CTYPE=English_United States.1252
#> [3] LC_MONETARY=English_United States.1252
#> [4] LC_NUMERIC=C
#> [5] LC_TIME=English_United States.1252
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] httr_1.4.1 rtweet_0.6.9
#>
#> loaded via a namespace (and not attached):
#> [1] Rcpp_1.0.2 pillar_1.4.2 compiler_3.6.0
#> [4] prettyunits_1.0.2 tools_3.6.0 progress_1.2.0
#> [7] zeallot_0.1.0 digest_0.6.20 lifecycle_0.1.0
#> [10] jsonlite_1.6 evaluate_0.14 tibble_2.1.3
#> [13] pkgconfig_2.0.2 rlang_0.4.0 cli_1.1.0
#> [16] curl_4.0 xfun_0.8 stringr_1.4.0
#> [19] dplyr_0.8.3 knitr_1.25 vctrs_0.2.0
#> [22] askpass_1.1 hms_0.4.2 tidyselect_0.2.5
#> [25] glue_1.3.1 R6_2.4.0 fansi_0.4.0
#> [28] rmarkdown_1.16 tidyr_1.0.0 purrr_0.3.2
#> [31] magrittr_1.5 backports_1.1.4 htmltools_0.3.6
#> [34] assertthat_0.2.1 utf8_1.1.4 stringi_1.4.3
#> [37] openssl_1.4.1 HOQCutil_0.1.13 crayon_1.3.4