Monday, September 18, 2017

Find neighboring words with a GloVe model and R's text2vec package - An analysis of tweets from followers of the Austrian alt-right movement "Die Identitären"

The alt-right movement seems to be on the rise, but are they really an alternative right or just plain old Nazis? To get more information on this topic I've decided to take a look at their social media footprint. Or, to be more precise, I want to look at the tweets of people who follow Identitaere_B (the official Twitter account of the Identitäre Bewegung Österreich = Austria's version of the alt-right movement).

But I don't want to read all these tweets just to find one really shocking tweet which I then exploit as pars pro toto. Mainly because I really, really, really don't want to read their tweets, but also because those guys are really good at dismissing such a result as just a bedauerlicher Einzelfall [woeful isolated case]. Instead, we will do what I like best - some machine learning. Hence, the result will be replicable and reproducible, so that nobody can reasonably claim it is fake news.

So, here is the plan:

  1. get followers of Identitaere_B
  2. get timelines of the followers
  3. use the R package text2vec to build a GloVe model
  4. use the GloVe model to look up which words in the Identitären-Tweet-Corpus are most similar (the nearest neighbors) to the German words
  • Hitler
  • Faschismus [fascism]
  • Nationalsozialismus [National Socialism]
  • Juden [jews]
  • Muslime [moslems]

Preparations

We load the packages we will need.

library(tidyverse)
library(magrittr)
library(twitteR)
library(text2vec)

Get the Tweets

Wire up R to connect to Twitter

To retrieve tweets you need to a) have a Twitter account and b) register as a Twitter developer and create an app.

After filling out some basic information about your app (name and how you plan to use it) you can get the needed OAuth credentials from Keys and Access Tokens.

Assign your credentials accordingly:

consumer_key <- "your_consumer_key"
consumer_secret <- "your_consumer_secret"
access_token <- "your_access_token"
access_secret <- "your_access_secret"

And use them to connect to Twitter:

setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)
#> [1] "Using direct authentication"

Get the Followers

We start by getting the official user account:

Identitaere_B <- getUser("Identitaere_B")

Next, we can retrieve all users who follow Identitaere_B:

i_followers <- Identitaere_B$getFollowers()

So how many followers does Identitaere_B have:

length(i_followers)
#> [1] 7694

Given that Twitter is quite a niche social medium in Austria and that Austria has a population of only about 8.8 million, that is more than I would have expected.

Unfortunately, we will not be able to get the timelines of all followers, because most likely some of them have protected accounts. Presumably, tweets from those accounts would be very interesting for my endeavor, but we can't see them unless we get manually approved by the owners of those accounts. Making up a fake profile and trying to get approved could solve this problem, but that is too much of a hassle for me (if you do it - please let me know what you find).

Consequently, we have to filter out the protected accounts to avoid getting an error when trying to access them. To do so we put the user vector into a tibble, create a column indicating whether an account is protected or not, and filter the data based on this column. Further, we add a column containing the user's id and number the users consecutively.

i_follow_data <- tibble(i_followers) %>%
  mutate(id = map_chr(i_followers, function(x) x$getId()),
         protected = map_lgl(i_followers, function(x) x$protected)) %>%
  filter(!protected) %>%
  mutate(nr = row_number())

Let's check how many users remain:

length(i_follow_data[["i_followers"]])
#> [1] 6996

Get the Timelines

To get a user's timeline we can use the function userTimeline(). The problem is that if we want to get the timelines of all users, we have to call it repeatedly, and that can cause problems. E.g., we might exceed the rate limits or have problems with our internet connection. Both would throw an error, and errors normally stop a running R process.

Since I am in no hurry (and hopefully neither are you), we can circumvent those problems with a little wrapper: it adds a pause between requests and retries a failed request a certain number of times. Because, depending on the pause, this function might take quite a long time to process all users, we also print a progress message every couple of hundred users. You can probably come up with a more elegant solution, but for the time being it should work.

userTimeline_slow <- function(x, 
                              index = NULL, 
                              index_report = 250, 
                              n=100, 
                              sleep=1, 
                              maxRetry = 3){

  #repeat until timeline was downloaded or maxRetry exceeded
  tweet <- NULL
  attempt <- 1
  while (is.null(tweet) && attempt <= maxRetry){
    attempt <- attempt+1
    
    #pause
    Sys.sleep(sleep)
    
    #report progress
    if (!is.null(index) && index %% index_report == 0){
      print(paste("processing user nr.", index))
    }
    
    #try to get timeline
    try(
      tweet <- userTimeline(x, n=n),
      silent = TRUE
    )
  }
  
  if (is.null(tweet)) tweet <- list("error")
  
  return(tweet)
}
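Before launching the long-running loop, it is a good idea to test the wrapper on a single user. A minimal sketch (picking the first follower in the filtered data is just an example):

##sanity check: fetch a few tweets for one follower before the long run
test_tweets <- userTimeline_slow(i_follow_data[["i_followers"]][[1]], index = 1, n = 10, sleep = 1)
length(test_tweets)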

So let’s run this function for all users. We limit the number of tweets to 100 per user so that a few heavy users cannot skew the overall result.

Warning: This will run for hours! If you don't need the most up-to-date results, then you can download mine and load them with readr::read_rds().
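A minimal sketch of loading the downloaded results instead of collecting them yourself (assuming the file is called by_user.rds and sits in your working directory):

##load previously collected tweets instead of re-downloading them
##(file name and location are assumptions)
by_user <- readr::read_rds("by_user.rds")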

userTimeline() returns status objects. Like the user object, a status object contains much more information than we require. Information per se is great, but if we want to save and load an R data frame that consists of complex objects, it becomes a problem: serialization of (large) hierarchical R objects requires large amounts of memory and time. To avoid the accompanying hassle we drop the user object and keep only its id. Likewise we won't keep the whole status object, but only its text. To do so we write a small wrapper …

extract_text <- function(status_list){
  unlist(status_list) %>% map_chr(function(x) x$text)
}

… which we can plug into our tweets retrieval pipeline:

system.time(
  by_user <- i_follow_data %>%
    ##retrieve tweets
    mutate(tweets = map2(i_followers, nr, userTimeline_slow, index_report=100, n=100, sleep=5)) %>% 
    ##extract text from tweets
    mutate(text = map(tweets, possibly(extract_text, NA_character_))) %>%
    ##remove large objects
    select(-i_followers, -tweets)
)
#> [1] "processing user nr. 100"
#> [1] "processing user nr. 200"
#> [1] "processing user nr. 300"
#> [1] "processing user nr. 400"
#> [1] "processing user nr. 500"
#> [1] "processing user nr. 600"
#> [1] "processing user nr. 700"
#> [1] "processing user nr. 800"
#> [1] "processing user nr. 900"
#> [1] "processing user nr. 1000"
#> [1] "processing user nr. 1100"
#> [1] "processing user nr. 1200"
#> [1] "processing user nr. 1300"
#> [1] "processing user nr. 1400"
#> [1] "processing user nr. 1500"
#> [1] "processing user nr. 1600"
#> [1] "processing user nr. 1700"
#> [1] "processing user nr. 1800"
#> [1] "processing user nr. 1900"
#> [1] "processing user nr. 2000"
#> [1] "processing user nr. 2100"
#> [1] "processing user nr. 2200"
#> [1] "processing user nr. 2300"
#> [1] "processing user nr. 2400"
#> [1] "processing user nr. 2500"
#> [1] "processing user nr. 2600"
#> [1] "processing user nr. 2700"
#> [1] "processing user nr. 2800"
#> [1] "processing user nr. 2900"
#> [1] "processing user nr. 3000"
#> [1] "processing user nr. 3100"
#> [1] "processing user nr. 3200"
#> [1] "processing user nr. 3300"
#> [1] "processing user nr. 3400"
#> [1] "processing user nr. 3500"
#> [1] "processing user nr. 3600"
#> [1] "processing user nr. 3700"
#> [1] "processing user nr. 3800"
#> [1] "processing user nr. 3900"
#> [1] "processing user nr. 4000"
#> [1] "processing user nr. 4100"
#> [1] "processing user nr. 4200"
#> [1] "processing user nr. 4300"
#> [1] "processing user nr. 4400"
#> [1] "processing user nr. 4500"
#> [1] "processing user nr. 4600"
#> [1] "processing user nr. 4700"
#> [1] "processing user nr. 4800"
#> [1] "processing user nr. 4900"
#> [1] "processing user nr. 5000"
#> [1] "processing user nr. 5100"
#> [1] "processing user nr. 5200"
#> [1] "processing user nr. 5300"
#> [1] "processing user nr. 5400"
#> [1] "processing user nr. 5500"
#> [1] "processing user nr. 5600"
#> [1] "processing user nr. 5700"
#> [1] "processing user nr. 5800"
#> [1] "processing user nr. 5900"
#> [1] "processing user nr. 6000"
#> [1] "processing user nr. 6100"
#> [1] "processing user nr. 6200"
#> [1] "processing user nr. 6300"
#> [1] "processing user nr. 6400"
#> [1] "processing user nr. 6500"
#> [1] "processing user nr. 6600"
#> [1] "processing user nr. 6700"
#> [1] "processing user nr. 6800"
#> [1] "processing user nr. 6900"
#>     user   system  elapsed 
#>  3051.78    19.52 40538.52

Let’s check the size of by_user:

object.size(by_user)
#> 42460000 bytes

Well, although not tiny it is small enough. Let's save it to disk:

write_rds(by_user, path="by_user.rds")

Analyze the Tweets

Get to know the data a little

At the moment each row in the data frame represents a user and the text cell contains not a single value but a list of tweets. Let's convert this data structure into something a little more accessible.

tweets <- unnest(by_user, text)

Now the data is in a long format and looks like this. Note: The code below returns an HTML table which the normal R console can't display properly - but R Markdown can.

tweets %>%
  sample_n(20) %>%
  knitr::kable(format="html") %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
id protected nr text
3779853317 FALSE 2131 https://t.co/tNwbMO9nF7
771808488913825792 FALSE 4227 Travel to Africa – Best African Travel Company https://t.co/28CIeZUOPj
816777136111493120 FALSE 3301 #phnotizen US-Grenze-5) Am Ende wird die Nationalgarde an der Grenze stehen. Mit Schiessbefehl.
4872861754 FALSE 5697 “The Great Disorientation” https://t.co/J3clPJ0NVw
2778549199 FALSE 5126 Niin kauhia pumppi, että kädet meinaa räjähtää. Hiilaritankkaus tekee välillä hyvää <ed><U+00A0><U+00BE><ed><U+00B4><U+0093><ed><U+00A0><U+00BD><ed><U+00B2><U+00AA>
4228230554 FALSE 3490 Einzellfälle?? https://t.co/rqZ0GIaUpl
3613555641 FALSE 2059 I support more (proper) referendums, including popularly initiated (and binding) ones. https://t.co/BXBw2pk1JM
127316382 FALSE 3988 Never Forget, #ZionismIsTerrorism: https://t.co/zcChjcuasF
3254930815 FALSE 5123 For another time.
826485879334137856 FALSE 2389 #WiegehtsDeutschland Ich brauche diese TV-Shows mit Politikern nicht, die vernebeln nur Massen von Menschen, lenken einfach ab von……..!
3846087802 FALSE 1033 Alle #TV_Anstalten helfen an der Gelddruckmaschine im #Fußballgeschäft weil sie selbst davon provitieren. https://t.co/3v3vEy3Dpa
1846072760 FALSE 3789 na super https://t.co/VcJwE4qmpr https://t.co/siZgM7Nw5D
3364647988 FALSE 4843 @ainyrockstar Respekt, meine Hochachtung für Ihren Schneid auch zu den “Schmuddelkindern” der Republik zu gehen…… https://t.co/yJZCMS5eiA
487488283 FALSE 3966 Freedland talks about an ideology that killed 10s of millions?? He’s talking about Jewish Bolshevik Communism, make… https://t.co/0bXe9PMKcu
1068706021 FALSE 6603 genau! https://t.co/DfhZPyHUiU
1278184561 FALSE 6468 @Miquwarchar mich hat der schon vor drei Wochen blockiert nachdem er auf Argumente nicht rausgeben konnte !
872001924534587392 FALSE 1890 Für wen stimmen eigentlich Parteilose? Oder würden die aus Prinzip niemanden wählen?
543294811 FALSE 3795 @Junge_Freiheit Dabei wird IMMER vergessen, dass es um die Opfer geht, die betrauert werden sollen. Das sind in der… https://t.co/M7W6naQLRS
1389881041 FALSE 4483 So geht man,gegen Terroristen vor. https://t.co/fKJDVstVVb
387001916 FALSE 1552 @business As a Finn I know this man will not even try in any case.

Next, we remove all users who have not tweeted anything.

tweets <- tweets %>%
  filter(!is.na(text))

That leaves us with …

tweets %$% unique(id) %>% length()
#> [1] 6257

… users.

Reading the tweet examples above we might realize that not all tweets are in German; however, German tweets are the only ones we are interested in. To extract them we first have to classify each tweet's language. We can do this with the help of the textcat package. Note: This will take some time. If you have a multicore processor (which you most likely have), then you can speed things up with multidplyr - or with the base parallel package, as sketched after the code below.

tweets <- tweets %>%
  mutate(
    lang = textcat::textcat(text)
  )
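If the serial call above takes too long, here is a minimal sketch of parallelising the language detection with the base parallel package instead of multidplyr (the cluster size of four and the split into 100 chunks are assumptions - adjust them to your machine):

library(parallel)

##split the tweet texts into chunks and classify them on several cores in parallel
cl <- makeCluster(4)
chunks <- split(tweets[["text"]], cut(seq_along(tweets[["text"]]), 100))
lang_chunks <- parLapply(cl, chunks, function(x) textcat::textcat(x))
stopCluster(cl)

##stitch the chunks back together in the original order
tweets <- tweets %>%
  mutate(lang = unlist(lang_chunks, use.names = FALSE))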

Let’s take a look at what languages we have found and how often: Note: We transform the y-axis to log10 to see the variation among the less frequent languages.

tweets %>%
  group_by(lang) %>%
  count() %>%
  ungroup() %>%
  arrange(desc(n)) %>%
  mutate(language = factor(lang),
         language = forcats::fct_reorder(language, n, .desc=T)) %>%
  ggplot(aes(x=language, y=n, fill = language)) +
    geom_bar(stat="identity") +
    scale_y_continuous(trans = "log10" ) +
    guides(fill = FALSE) +
    ggthemes::theme_tufte() + 
    theme(axis.text.x = element_text(angle = 90,
                                     hjust = 1,
                                     vjust = 0.25)) 

For a nationalist movement its followers are quite international.

Only …

tweets %>%
  group_by(lang) %>%
  summarise (n = n()) %>%
  mutate(freq = n / sum(n)) %>%
  filter(lang=="german") %$%
  freq * 100
#> [1] 59.57656

… percent of all collected tweets are German. That is less than I hoped for, but still a sufficient amount.

Next, we remove all non-German tweets:

tweets_de <- tweets %>%
  filter(lang == "german")

This leaves us with …

tweets_de %$% unique(id) %>% length()
#> [1] 5537

… users.

Let’s take a look at their tweet count distribution (remember: we set 100 as maximum per user):

tweets_de %>%
  group_by(id) %>%
  count() %>%
  ggplot(aes(n)) +
    geom_histogram(bins = 25) +
    ggthemes::theme_tufte()

Well, that looks like an organic tweet distribution to me. Many tweet a little and few tweet a lot. Frequent tweeters will have a stronger impact on the model because they provide more data. However, the differences are not so large that the results will be unusably skewed.

Preprocess Tweets

The text2vec package needs the text data in a certain format, so we start by converting our data.

We begin by extracting the tokens. Choosing the right form of tokens is no easy feat. In many cases just extracting the words is a viable solution, but for tweets I am not so sure. Given the size limit of tweets, many people use non-letter characters (e.g., emojis) to convey their message with fewer characters. Those would be removed together with all other punctuation if we used words as tokens. To avoid that we treat everything surrounded by whitespace as a token, which leaves us with the problem that punctuation characters stay attached to words. Therefore, we remove single punctuation characters that trail a combination of letters (= a word). Further, we remove any letter capitalization to reduce the number of variants of one word.

Note: A further preprocessing step we could use to reduce the variability of words is to identify and lemmatize the words in the text. However, in this post we will skip this step. If you want to learn more about lemmatization, then please take a look at my post Cleaning Words with R: Stemming, Lemmatization & Replacing with More Common Synonym.

tweets_de <- tweets_de %>%
  #base::tolower does not like emojis --> use stringi
  mutate(clean = stringi::stri_trans_tolower(text),
         #drop a single punctuation character trailing a word; the replacement "\\1 "
         #keeps the space so that consecutive words do not get glued together
         clean = stringr::str_replace_all(clean, "([[:alpha:]]+)[[:punct:]]\\s", "\\1 "))

Next, we tokenize …

tokens <- space_tokenizer(tweets_de[["clean"]])

… and create an iterator for the tokens …

it = itoken(tokens, progressbar = FALSE)

… to create a vocabulary of all tokens (FYI: the text2vec_vocabulary object is actually a data.frame):

vocab <- create_vocabulary(it, 
                           stopwords = tokenizers::stopwords("de"))

To calculate a meaningful word vector for a token, the token should not be too uncommon. Here we remove all tokens with a frequency below five.

vocab <- prune_vocabulary(vocab, term_count_min = 5L)

Let’s take a look at a small sample of the vocabulary:

vocab %>%
  sample_n(20) %>%
  arrange(term_count) %>%
  knitr::kable(format="html") %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
term term_count doc_count
5 5
könnensist 5 5
versprechungen 6 6
läuft. 7 7
@z_eisberg 7 7
verschleiern 7 7
other 8 8
@marcboe 8 8
menschenverachtende 10 10
stock 10 10
cm 10 10
doe) 11 11
fragenswarum 12 12
zünden 15 15
ball 19 19
gipfel 30 30
verweigert 31 30
digitale 32 32
#aufschrei 46 46
damals 197 195

Let’s check whether the words we are interested in exist in the vocabulary. First we create a tibble with those words …

interest <- tibble(term = c("hitler", "faschismus", 
                            "nationalsozialismus",
                            "juden",                           
                            "muslime"
                            ),
                   term_english = c("hitler", "fascism",
                                    "National Socialism",
                                    "jews",
                                    "moslems"
                                    ))

… and then use semi_join to filter the vocabulary.

vocab %>%
  semi_join(interest) %>%
  knitr::kable(format="html") %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
#> Joining, by = "term"
term term_count doc_count
hitler 108 107
faschismus 52 52
nationalsozialismus 21 21
juden 145 139
muslime 329 328

OK, those frequencies should suffice. Let’s build the model!

Building a GloVe Model

We start by constructing a term-co-occurrence matrix (TCM). But first we declare how to transform our list of tokens into vector space …

vectorizer <- vocab_vectorizer(vocab)

… before we create the TCM, using a symmetric skip-gram window of 2^3 = 8 tokens:

tcm <- create_tcm(it, vectorizer, skip_grams_window = 2^3)

Now we can fit the GloVe word-embedding model:

glove <- GlobalVectors$new(word_vectors_size = 2^3, vocabulary = vocab, x_max = 2^4)
word_vectors <- glove$fit_transform(tcm, n_iter = 2^5)
#> INFO [2017-09-15 10:55:12] 2017-09-15 10:55:12 - epoch 1, expected cost 0.0445
#> INFO [2017-09-15 10:55:13] 2017-09-15 10:55:13 - epoch 2, expected cost 0.0349
#> INFO [2017-09-15 10:55:13] 2017-09-15 10:55:13 - epoch 3, expected cost 0.0313
#> INFO [2017-09-15 10:55:14] 2017-09-15 10:55:14 - epoch 4, expected cost 0.0293
#> INFO [2017-09-15 10:55:14] 2017-09-15 10:55:14 - epoch 5, expected cost 0.0278
#> INFO [2017-09-15 10:55:15] 2017-09-15 10:55:15 - epoch 6, expected cost 0.0267
#> INFO [2017-09-15 10:55:15] 2017-09-15 10:55:15 - epoch 7, expected cost 0.0259
#> INFO [2017-09-15 10:55:15] 2017-09-15 10:55:15 - epoch 8, expected cost 0.0252
#> INFO [2017-09-15 10:55:16] 2017-09-15 10:55:16 - epoch 9, expected cost 0.0246
#> INFO [2017-09-15 10:55:16] 2017-09-15 10:55:16 - epoch 10, expected cost 0.0241
#> INFO [2017-09-15 10:55:17] 2017-09-15 10:55:17 - epoch 11, expected cost 0.0236
#> INFO [2017-09-15 10:55:17] 2017-09-15 10:55:17 - epoch 12, expected cost 0.0233
#> INFO [2017-09-15 10:55:18] 2017-09-15 10:55:18 - epoch 13, expected cost 0.0229
#> INFO [2017-09-15 10:55:18] 2017-09-15 10:55:18 - epoch 14, expected cost 0.0227
#> INFO [2017-09-15 10:55:19] 2017-09-15 10:55:19 - epoch 15, expected cost 0.0224
#> INFO [2017-09-15 10:55:19] 2017-09-15 10:55:19 - epoch 16, expected cost 0.0222
#> INFO [2017-09-15 10:55:19] 2017-09-15 10:55:19 - epoch 17, expected cost 0.0220
#> INFO [2017-09-15 10:55:20] 2017-09-15 10:55:20 - epoch 18, expected cost 0.0218
#> INFO [2017-09-15 10:55:20] 2017-09-15 10:55:20 - epoch 19, expected cost 0.0216
#> INFO [2017-09-15 10:55:21] 2017-09-15 10:55:21 - epoch 20, expected cost 0.0215
#> INFO [2017-09-15 10:55:21] 2017-09-15 10:55:21 - epoch 21, expected cost 0.0213
#> INFO [2017-09-15 10:55:22] 2017-09-15 10:55:22 - epoch 22, expected cost 0.0212
#> INFO [2017-09-15 10:55:22] 2017-09-15 10:55:22 - epoch 23, expected cost 0.0211
#> INFO [2017-09-15 10:55:23] 2017-09-15 10:55:23 - epoch 24, expected cost 0.0210
#> INFO [2017-09-15 10:55:23] 2017-09-15 10:55:23 - epoch 25, expected cost 0.0209
#> INFO [2017-09-15 10:55:23] 2017-09-15 10:55:23 - epoch 26, expected cost 0.0208
#> INFO [2017-09-15 10:55:24] 2017-09-15 10:55:24 - epoch 27, expected cost 0.0207
#> INFO [2017-09-15 10:55:24] 2017-09-15 10:55:24 - epoch 28, expected cost 0.0206
#> INFO [2017-09-15 10:55:25] 2017-09-15 10:55:25 - epoch 29, expected cost 0.0205
#> INFO [2017-09-15 10:55:25] 2017-09-15 10:55:25 - epoch 30, expected cost 0.0205
#> INFO [2017-09-15 10:55:25] 2017-09-15 10:55:25 - epoch 31, expected cost 0.0204
#> INFO [2017-09-15 10:55:26] 2017-09-15 10:55:26 - epoch 32, expected cost 0.0203
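GloVe actually learns two sets of word vectors - main and context. The text2vec vignette suggests that summing them can give slightly better embeddings; an optional sketch (assuming your text2vec version exposes the context vectors via glove$components, as the vignette describes):

##optional: combine main and context word vectors
##(assumes glove$components holds the context vectors)
word_vectors <- word_vectors + t(glove$components)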

Use the GloVe Model

Now we can use the word vectors to find their nearest neighbors in the text corpus of Identitären follower tweets:

result <- interest %>%
  mutate(
    #create similarity matrix
    cos_sim = map(term, 
                function (x) sim2(word_vectors, y=word_vectors[x,,drop=FALSE], method = "cosine", norm = "l2")),
    #convert matrix to tibble
    cos_sim_tbl = map(cos_sim, function(x) tibble(term = row.names(x), distance = x[,1]) %>%
                        #sort distance
                        arrange(desc(distance))),
    #find neighbors (the first row is the word itself, so the nearest neighbors start at row 2)
    n1 = map_chr(cos_sim_tbl, function (x) x[[2,1]]),
    n2 = map_chr(cos_sim_tbl, function (x) x[[3,1]]),
    n3 = map_chr(cos_sim_tbl, function (x) x[[4,1]]),
    n4 = map_chr(cos_sim_tbl, function (x) x[[5,1]]),
    n5 = map_chr(cos_sim_tbl, function (x) x[[6,1]])
                       )
#> Warning: package 'bindrcpp' was built under R version 3.3.3

And print them as a table:

result %>%
  select(-starts_with("cos")) %>%
  knitr::kable(format="html") %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
term term_english n1 n2 n3 n4 n5
hitler hitler kultur werte slowakei leitkultur männer
faschismus fascism fan schatten lassen! werdensda nichtsman
nationalsozialismus National Socialism sollt reichs dienst galerieshttps://t.co/2bfblhn0tp niederlage
juden jews 10% gesucht sicherheit gefallenen mitglieder
muslime moslems moslems terror gewalt aktuelles 2002
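If you just want to inspect the neighbors of a single word without rebuilding the whole table, a minimal sketch (using hitler as an example; any other term from the vocabulary works the same way):

##quick check: the ten nearest cosine-similarity neighbors of a single word
hitler_sim <- sim2(x = word_vectors,
                   y = word_vectors["hitler", , drop = FALSE],
                   method = "cosine", norm = "l2")
head(sort(hitler_sim[, 1], decreasing = TRUE), 10)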

Note: The GloVe model is not fitted deterministically, hence your results might differ slightly. Unfortunately, set.seed() does not seem to work.
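Given that non-determinism, it may be worth saving the fitted word vectors to disk so that later neighbor lookups stay reproducible - a minimal sketch (the file name is an assumption):

##persist the fitted word vectors and reload them later instead of refitting
write_rds(word_vectors, path = "word_vectors.rds")
word_vectors <- read_rds("word_vectors.rds")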

For those of you who do not speak German we translate the neighbors to English with the Oxford Dictionaries API. To be able to use the API you have to register yourself and create an app. After confirming the terms of use you will get an application id and an application key, which we need to authenticate our queries to the API.

Next, we write a little wrapper function to get the English translation of a German word from the API …

de2en <- function(word,app_id,app_key){
  require(httr)
  require(jsonlite)
  
  
  ##create query
  url <- paste0("https://od-api.oxforddictionaries.com:443/api/v1/entries/de/",
                word,
                "/translations=en")
  
  ##query
  res <- GET(url,
              add_headers(app_key = app_key,
                          app_id = app_id))

  ##if no result return word
  if (res$status_code != 200){
    return(word)
  }

  ##dig into the resulting json for
  ##the information we need
  results <- fromJSON(content(res, "text"))$results
  entries <- results$lexicalEntries[[1]][,1][[1]]
  firstTranslation <- entries$senses[[1]]$translations[1][[1]]$text

  ##return result
  ##return only first word
  if (length(firstTranslation) >= 1){
    return (firstTranslation[1])
  } else {
    ##sometimes res$status is 200
    ##and there is still no result
    ##--> return original
    return (word)
  }
  
}

and apply it to all columns containing neighboring words (replace your_app_key and your_app_id with the respective values):

result_en  <- result %>%
  select(-starts_with("cos")) %>%
  mutate_at(vars(starts_with("n")), 
            ##do not translate words
            ##containing non-alphabetical
            ##characters
            function(x) (ifelse(stringr::str_detect(x,"^[[:alpha:]]+$"),
                               map_chr(x,de2en, your_app_id, your_app_key),
                               x)))

Let’s take a look at the result:

result_en %>%
  knitr::kable(format="html") %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
term term_english n1 n2 n3 n4 n5
hitler hitler culture werte Slovakia primary culture männer
faschismus fascism fan shadow lassen! werdensda nichtsman
nationalsozialismus National Socialism sollt reichs work galerieshttps://t.co/2bfblhn0tp defeat
juden jews 10% gesucht safety gefallenen mitglieder
muslime moslems moslems terrorism power aktuelles 2002

Well, that worked … somehow, but not all words were translated. Among those, werte is maybe of interest; it translates to merits. Further, dienst, which was automatically translated to work, could also be rendered as service or duty. Lastly, I would not translate Gewalt to power but to violence.

All in all, the results are quite stunning to me. Primary culture being a neighbor of Hitler, and fan being a neighbor of fascism, does not support their claim of having nothing to do with Nazis. On the bright side, the neighbors of jews do not hint at antisemitism, but their view of moslems as violent terrorists seems clear.

Closing Remarks

I hope the given example was illustrative and can help you to carry out your own GloVe model analyses.

Concerning the content there are a few things to keep in mind:

  1. The sample was not huge, so the resulting model will not be overly stable.
  2. We used tweets from all people following Identitaere_B - some of those probably do not agree with alt-right ideology.
  3. Many followers had protected accounts - their tweets were not included in the analysis although they are probably even more interesting.

Anyway, the results seem quite plausible to me, but maybe this is just confirmation bias. I am interested in your thoughts on the technique and the topic.


If you would like to learn more about getting data from Twitter, then I can recommend Mining the Social Web: Data Mining Facebook, Twitter, LinkedIn, Google+, GitHub, and More. The book is focused on data mining with Python, but the concepts are very well explained and can be easily translated to other platforms (such as R).


If something is not working as outlined here, please check the package versions you are using. The system I used was:

sessionInfo()
#> R version 3.3.2 (2016-10-31)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 7 x64 (build 7601) Service Pack 1
#> 
#> locale:
#> [1] LC_COLLATE=German_Austria.1252  LC_CTYPE=German_Austria.1252   
#> [3] LC_MONETARY=German_Austria.1252 LC_NUMERIC=C                   
#> [5] LC_TIME=German_Austria.1252    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#>  [1] bindrcpp_0.2     text2vec_0.5.0   twitteR_1.1.9    magrittr_1.5    
#>  [5] dplyr_0.7.2      purrr_0.2.2.2    readr_1.1.1      tidyr_0.6.3     
#>  [9] tibble_1.3.3     ggplot2_2.2.1    tidyverse_1.1.1  kableExtra_0.3.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] reshape2_1.4.2       haven_1.1.0          lattice_0.20-34     
#>  [4] colorspace_1.3-2     htmltools_0.3.5      yaml_2.1.14         
#>  [7] rlang_0.1.1          foreign_0.8-67       glue_1.1.1          
#> [10] DBI_0.7              bit64_0.9-7          lambda.r_1.1.9      
#> [13] modelr_0.1.1         readxl_1.0.0         foreach_1.4.3       
#> [16] bindr_0.1            plyr_1.8.4           stringr_1.2.0       
#> [19] futile.logger_1.4.3  munsell_0.4.3        gtable_0.2.0        
#> [22] cellranger_1.1.0     rvest_0.3.2          codetools_0.2-15    
#> [25] psych_1.6.12         evaluate_0.10        knitr_1.16          
#> [28] forcats_0.2.0        parallel_3.3.2       highr_0.6           
#> [31] broom_0.4.2          Rcpp_0.12.12         backports_1.1.0     
#> [34] scales_0.4.1         RcppParallel_4.3.20  jsonlite_1.2        
#> [37] bit_1.1-12           mnormt_1.5-5         rjson_0.2.15        
#> [40] hms_0.3              digest_0.6.12        stringi_1.1.5       
#> [43] grid_3.3.2           rprojroot_1.2        tools_3.3.2         
#> [46] lazyeval_0.2.0       futile.options_1.0.0 pkgconfig_2.0.1     
#> [49] Matrix_1.2-8         data.table_1.10.4    xml2_1.1.1          
#> [52] lubridate_1.6.0      iterators_1.0.8      assertthat_0.2.0    
#> [55] rmarkdown_1.6        httr_1.2.1           R6_2.2.0            
#> [58] nlme_3.1-131
