Tuesday, May 16, 2017

Topic models (LDA and CTM) with R and topicmodels

In the previous three posts I have 1. downloaded Haiku from DailyHaiku, 2. performed data cleaning on their texts, and 3. assigned sentiments to them. Today I would like to perform one last analysis on the Haiku: I will try to sort them into groups with similar topics.

In machine learning and natural language processing this is normally done with topic models. Those statistical models try to discover the hidden semantic structures, represented by abstract topics, within a collection of documents. In this post I will try out Latent Dirichlet Allocation (LDA), the most common topic model currently in use, and an extension of LDA: the Correlated Topic Model (CTM).

Both model types assume that:
  1. Every document is a mixture of topics. E.g., document 1 is 60% topic A, 30% topic B, and 10% topic C, while document 2 is 99% topic B and half a percent each of topics A and C. This is in contrast to many other clustering algorithms (e.g., k-means), which assign each object to one distinct group only (a toy illustration follows below).
  2. Every topic is a mixture of words. E.g., in topic A the words “data”, “machine”, and “algorithm” are the most common, while in topic C the most common words are “homework”, “grade”, and “task” - the word “solution” is equally likely in both topics.
In contrast to LDA, CTM allows the topics to be correlated. Both model types are implemented in the R package topicmodels.
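To make the two assumptions above a little more tangible, here is a purely hypothetical toy representation with made-up numbers; it mirrors the gamma (document-topic) and beta (topic-word) matrices we will extract from the fitted models later on:
## made-up numbers, only to illustrate the two mixtures
doc_topics <- tibble::tribble(   # document-topic mixture ("gamma"), sums to 1 per document
  ~document, ~topic, ~gamma,
          1,    "A",   0.60,
          1,    "B",   0.30,
          1,    "C",   0.10,
          2,    "A",  0.005,
          2,    "B",  0.990,
          2,    "C",  0.005
)

topic_words <- tibble::tribble(  # topic-word mixture ("beta"); only a few words shown,
  ~topic,      ~term, ~beta,     # over the full vocabulary each topic's betas sum to 1
     "A",     "data",  0.09,
     "A",  "machine",  0.08,
     "A", "solution",  0.04,
     "C", "homework",  0.10,
     "C",    "grade",  0.07,
     "C", "solution",  0.04
)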

Although both model types are widely used and quite powerful, I have doubts whether they will be able to group Haiku by topic. The first problem is that Haiku are pretty short (the traditional form has only 17 syllables and the free form is even shorter). So there is not much information per document to work with but a lot of sparsity. Secondly, Haiku normally concentrate on the natural world - e.g., changing seasons, falling leaves, blossoming flowers and melting snow. Even when focusing on other themes like love or sadness, they normally use images from the natural world. In terms of finding different topics this might pose a serious problem, because most Haiku are quite similar in this regard.

Anyways, let’s give it a try. All those how-tos and illustrative examples in which everything works perfectly (while transferring them to one’s own problem is destined for frustration) are boring anyhow.

Preparation

First, we will load the central libraries for this post. In some code blocks I will use functions from other libraries, or I will want to emphasize which package a function comes from; in those cases I will call the respective function directly via package_name::function_name().
library(magrittr) # for %$% operator 
library(tidyverse) # R is better when it is tidy
library(tidytext)  # for easy handling of text
library(ldatuning) # for mathematical hint of number of topics
library(topicmodels) # nomen est omen
library(kableExtra)  # for styling the html tables below
As mentioned, we will use data from previous posts. If you have not followed those, you can download the needed data here and load() it or use the code below for your convenience.
if (!exists("haiku_clean") || !exists("lemma_unique")){
  if (!file.exists("text_lemma.RData")){
    res <- tryCatch(download.file("http://bit.ly/haiku_lemma",
                         "text_lemma.RData", mode = "wb"),
                error=function(e) 1)
  }
  load("text_lemma.RData")
}
Next, we add the row number as an ID for the Haiku.
haiku_clean <- haiku_clean %>% 
  mutate (h_number = row_number())
As a last general preparation step, we merge the Haiku with the list of lemmatized words (see Cleaning Words with R: Stemming, Lemmatization & Replacing with More Common Synonym).
haiku_lemma <- haiku_clean %>% 
  unnest_tokens(word, text) %>% 
  select (h_number, word) %>%
  left_join (lemma_unique %>% 
               select(word:lemma, synonym))

Latent Dirichlet Allocation

To use text data in the topicmodels package, we have to rearrange it into a document-term matrix.
dtm_lemma <- haiku_lemma %>%
  count(h_number, lemma) %>%
  filter(!is.na(lemma)) %>%
  cast_dtm(h_number, lemma, n)
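If you want to see the sparsity problem mentioned above in numbers, a quick (optional) sanity check is to look at the matrix itself; its print method reports the number of documents, the number of terms, and the sparsity.
dtm_lemma        # reports documents, terms, and sparsity
dim(dtm_lemma)   # documents x terms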
Basically, we would now be ready to conduct the LDA, if only we knew how many topics to extract.

Number of Topics

Determining the number of topics is no simple task, and not one with an unambiguously right answer either. From my point of view the most important criterion for selecting a solution is that it is useful. In most use cases documents are grouped by topics so that humans can filter and select documents based on those topics. Accordingly, the topics must make sense to humans. This can only be achieved by a real human comparing different solutions and deciding which is the most sensible one. Further, different use cases might need different levels of granularity. While in some situations a few coarse topics suffice, other scenarios might call for a more fine-grained differentiation.

That said, of course algorithms have been proposed to relieve us humans from the time consuming endeavor of skimming through a multitude of solutions to find the optimal number of topics. The ldatuning package implements four of them. Because I have no idea how many topics to expect and no requirements regarding granularity, I will try them out.

The topicmodels package allows quite a lot of fine-tuning. Although the defaults are sensible, we will tweak the settings a little for demonstration purposes. Concretely, we discard the first 2500 iterations as burnin, because they are most likely not good anyway. Further, we increase the number of iterations to 5000 and request 5 different starts, from which we keep only the run with the best result. To make everything reproducible, we set an individual seed for each start. For ease of use we put those hyper-parameters in a list.
control_list_gibbs <- list(
  burnin = 2500,
  iter = 5000,
  seed = 0:4,
  nstart = 5,
  best = TRUE
)
ldatuning has implemented parallel processing. To save time I set mc.cores to 4 so that the search for the optimal number of topics is performed on all four of my CPU cores.
system.time(
  topic_number_lemma <- FindTopicsNumber(
    dtm_lemma,
    topics = c(seq(from = 2, to = 9, by = 1), seq(10, 20, 2), seq(25, 50, 5)),
    metrics = c( "Griffiths2004", "CaoJuan2009", "Arun2010", "Deveaud2014"),
    method = "Gibbs",
    control = control_list_gibbs,
    mc.cores = 4L,
    verbose = TRUE
  )
)
#> fit models... done.
#> calculate metrics:
#>   Griffiths2004... done.
#>   CaoJuan2009... done.
#>   Arun2010... done.
#>   Deveaud2014... done.
#>    user  system elapsed 
#>    3.71    0.17  848.88
To quickly get an understanding of the results we take a look at their visualization.
FindTopicsNumber_plot(topic_number_lemma)
Deveaud et al. (2014) and Cao Juan et al. (2009) opt for 2 topics, and 3 seems fine as well. This is opposed by Arun et al. (2010), which suggests 50 (the maximum number of tested topics). In between, the metric by Griffiths et al. (2004) indicates 12 topics. And finally, 9 topics seems to be a good compromise between Griffiths et al. (2004) and Cao Juan et al. (2009). That can hardly be considered unambiguous. However, let’s not forget that Haiku pose a hard problem for LDA and consequently for deciding on the number of topics as well. As I lack the Haiku domain knowledge to come up with a good knowledge-guided alternative, I will try all of the mentioned topic numbers.
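If you prefer raw numbers over the plot, the object returned by FindTopicsNumber() is a plain data frame with one row per candidate number of topics and one column per metric (at least in ldatuning 0.2.0, where the candidate counts sit in a column called topics), so you can inspect it directly:
topic_number_lemma %>% 
  arrange(topics) %>% 
  head()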

To iterate over the different topic numbers we will use purrr::map().
para <- tibble(k = c(2,3,9,12,50))

system.time(
  lemma_tm <- para %>%
    mutate(lda = map(k, 
                   function(k) LDA(
                     k=k, 
                     x=dtm_lemma, 
                     method="Gibbs", 
                     control=control_list_gibbs
                     )
                   )
         )
)
#>    user  system elapsed 
#>  337.60    0.02  338.87
Next, we extract the gamma matrices containing the topic probabilities per document.
lemma_tm <- lemma_tm %>%
  mutate(lda_gamma = map(.x=lda, 
                         .f=tidytext::tidy, 
                         matrix="gamma"))
To quickly get an overview of the results we visualize the distribution of the maximum gamma value per document. To put the values into perspective we add a vertical line at p = 1/number of topics. This is the lowest possible value of the per-document maximum, marking a uniform distribution of topics within a document.
lemma_tm %>% 
  unnest(lda_gamma) %>% 
  group_by(k, document) %>%
  arrange(desc(gamma)) %>%
  slice(1) %>%
  #top_n(1, gamma) %>%
  ungroup() %>% 
  ggplot(aes(x=gamma, fill=factor(k))) +
    geom_histogram(bins = 20) +
    scale_fill_discrete(name = "Number of\nTopics") + 
    xlab("maximum gamma per document") +
    facet_wrap(~k) +
    geom_vline(aes(xintercept = 1/k), 
               tibble(k=lemma_tm %$% unique(k)),
               color="darkred")
Unfortunately, none of the different topic numbers produced an informative solution. All maximum values are in the vicinity of the uniform distribution threshold. So basically, our solutions tell us that every document consists of all topics in almost equal proportions. That is not useful, and I doubt that any other topic number would result in a dramatically better result. Hence, we give up on LDA for this task and move on to CTM.

Correlated Topic Model

The topicmodels package provides a wide array of options to fine tune CTMs, but we will only increase the number of starts and stay with the defaults for the rest.
control_list_ctm <- list(
  seed = 5:9,
  nstart = 5,
  best = TRUE
)
Lacking any clue about the number of topics, we stick with the algorithm-derived suggestions from the LDA step. For convenience, we add the corresponding CTMs to the lemma_tm object to have everything in one place.
system.time(
  lemma_tm <- lemma_tm %>%
    mutate(ctm = map(k, 
                     function(k) CTM(
                       k=k, 
                       x=dtm_lemma,
                       control=control_list_ctm
                       )
                     )
           )
)
#>     user   system  elapsed 
#> 26400.05     6.18 27683.50
This took some time! If you plan something similar, then it is probably a good idea to use parallel processing. With the multidplyr package this shouldn’t be too hard.
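I have not tried a multidplyr version here, but as a rough, untested sketch, the same idea with base R’s parallel package could look like this (one worker per candidate number of topics, assuming four cores and reusing the objects defined above):
library(parallel)

cl <- makeCluster(4)                                   # one PSOCK worker per core
clusterExport(cl, c("dtm_lemma", "control_list_ctm"))  # ship data and settings to the workers
ctm_list <- parLapply(cl, para$k, function(k) {
  topicmodels::CTM(x = dtm_lemma, k = k, control = control_list_ctm)
})
stopCluster(cl)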

Again, we first extract the gamma matrices. Unfortunately, tidytext does not offer tidy() for CTM objects, so we have to do some reshaping on our own. For reusability we encapsulate the modification steps in a function.
tidy_ctm_gamma  <- function(CTM_object){
  CTM_object %>% 
    slot("gamma")  %>% 
    as_data_frame()  %>% 
    mutate (document = row_number()) %>% 
    gather(topic, gamma, -document) %>%
    mutate(topic = strtoi(stringr::str_sub(topic,2)))
}
And we use purrr::map() to apply the function to all CTM models.
lemma_tm <- lemma_tm %>%
  mutate(ctm_gamma = map(.x=ctm, .f=tidy_ctm_gamma))
Again, we use visualization to quickly get an overview of the results.
lemma_tm %>% 
  unnest(ctm_gamma) %>% 
  group_by(k, document) %>%
  arrange(desc(gamma)) %>%
  slice(1) %>%
  #top_n(1, ctm_gamma) %>%
  ungroup() %>% 
  ggplot(aes(x=gamma, fill=factor(k))) +
    geom_histogram(bins = 15) +
    scale_fill_discrete(name = "Number of\nTopics") +
    facet_wrap(~k) +
    geom_vline(aes(xintercept = 1/k), 
               tibble(k=lemma_tm %$% unique(k)),
               color="darkred")
This looks a little bit better than the LDA results. The models for 2 and 3 topics still don’t differ much from a uniform topic assignment, but the models with higher topic counts seem to perform better in this regard.

To get a better understanding of the topics we have to look at the beta matrices. Those contain the probability of each word within each topic (every topic includes every word, some are just very, very unlikely). For tidy beta matrices of LDA models we can use the tidytext::tidy() function.
lemma_tm <- lemma_tm %>%
  mutate(lda_beta = map(.x=lda, .f=tidytext::tidy, matrix = "beta"))
For analogous matrices of CTM objects we have to use our own little function for reshaping the original output.
tidy_ctm_beta  <- function(CTM_object){
  Terms  <- CTM_object %>% 
    slot("terms") 
  
  CTM_object %>% 
    slot("beta")  %>% 
    as_data_frame() %>%
    setNames(Terms) %>%
    mutate (topic = row_number()) %>% 
    gather(term, beta, -topic) %>%
    mutate(beta = exp(beta))
}
Next, we apply this function to all our CTMs.
lemma_tm <- lemma_tm %>%
  mutate(ctm_beta = map(.x=ctm, .f=tidy_ctm_beta))
Now that we have added the beta matrices to our lemma_tm object, let’s take a look at the top 5 terms per topic of the model with 9 topics.
lemma_tm  %>%
  unnest(ctm_beta) %>%
  filter(k==9) %>%
  group_by(topic) %>%
  top_n(5, beta) %>%
  arrange(topic, -beta)  %>%
  mutate(rank = paste0("rank ",row_number())) %>%
  ungroup() %>%
  select(-c(k, beta)) %>%
  spread(rank, term) %>%
  knitr::kable(format="html") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
topic  rank 1  rank 2  rank 3   rank 4  rank 5
1      day     dog     hand     wing    watch
2      snow    autumn  white    breeze  scent
3      moon    night   window   late    nest
4      spring  sky     evening  blue    home
5      rain    leave   morning  song    empty
6      fall    leaf    rise     garden  grass
7      wind    red     branch   bird    mountain
8      light   fog     water    air     crow
9      winter  shadow  cloud    sun     tree
Interesting as those rankings are, at least I am unable to come up with good labels for these topics right away. In a useful solution a human should recognize coherence within and separability between topics. That does not seem to be the case here, and I cannot find a distinct theme per topic (if you do, please let me know in the comment section).

But maybe, if we only dig deep enough, these topics will start to make sense after all. To facilitate the further inspection of the 9-topic CTM, we extract its gamma matrix as a separate object and name the topics after their most common term.
k9_ctm_gamma <- lemma_tm %>% 
  unnest(ctm_gamma) %>% 
  filter(k==9) %>%
  left_join(lemma_tm  %>%
              unnest(ctm_beta) %>%
              filter(k==9) %>%
              group_by(topic) %>%
              top_n(1, beta) %>% 
              select(topic, term)) %>% 
  rename(`top term`=term)
We can use this 9-topic object to take a look at the distribution of the dominant topic within our Haiku corpus.
k9_ctm_gamma %>% 
  select (-k)  %>%
  group_by(document) %>%
  arrange(desc(gamma)) %>%
  slice(1) %>% 
  ungroup() %>%
  group_by(topic, `top term`) %>%
  summarize(n=n()) %>%
  ungroup() %>%
  mutate(percent=round(n/sum(n)*100, 1)) %>%
  knitr::kable(format="html") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
topic  top term  n    percent
1      day       356  11.1
2      snow      479  14.9
3      moon      476  14.8
4      spring    483  15.0
5      rain      388  12.1
6      fall      356  11.1
7      wind      377  11.7
8      light     294   9.1
9      winter     10   0.3
Only 0.3% of the Haiku are primarily assigned to the winter topic. Compared to the other proportions this seems negligible. Otherwise nothing specific catches my eye.

Next, let’s take a look at the correlation of topics (the C in CTM stands for correlated, after all). Scanning through the numbers of a large correlation matrix is tedious, so we visualize it as a heat-map.
## helper function to sort 
## correlation matrix
reorder_cor <- function(x){
  ord <- corrplot::corrMatOrder(x)
  x[ord,ord]
}

## helper function to extract 
## lower triangle of matrix
get_lower_tri<-function(x){
  x[upper.tri(x)] <- NA
  return(x)
}


## prepare data
cor_data <- k9_ctm_gamma %>% 
  select(-topic) %>% 
  spread(`top term`, gamma) %>%
  select(-c(k, document)) %>%
  cor() %>%
  reorder_cor() %>%
  get_lower_tri() %>%
  as_data_frame() %>%
  mutate(topic1 = forcats::as_factor(paste(names(.)))) %>%
  gather(topic2, correlation, - topic1) %>%
  mutate(topic2 = factor(topic2, levels=levels(topic1)))

## create plot  
cor_data %>% 
  ggplot(aes(as.numeric(topic1), as.numeric(topic2), fill=correlation)) +
    geom_tile(color="white") +
    scale_fill_gradient2(low = "firebrick4", high = "dodgerblue4", 
      midpoint = 0, limit = c(-1,1), space = "Lab",
      na.value = "white",
      name="Pearson\nCorrelation") +
    scale_x_continuous(
      breaks=1:9, labels = levels(cor_data$topic1), name="")+
    scale_y_continuous(
      breaks=1:9, labels = levels(cor_data$topic2), name="")+
    theme_minimal()+
    theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
    coord_fixed()
Unfortunately, this isn’t overly informative either. Except for winter, which is quite rare in our corpus, all topics are weakly negatively correlated with each other, and at least I can’t spot any pattern.
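As a side note (untested, and from memory): if I recall the topicmodels internals correctly, the fitted CTM itself also stores the topics’ covariance matrix from the logistic normal prior in a slot called Sigma, which could be converted into a correlation matrix directly and compared with the heat-map above.
## untested sketch - assumes the fitted CTM object has a "Sigma" slot
k9_ctm <- lemma_tm$ctm[lemma_tm$k == 9][[1]]
cov2cor(slot(k9_ctm, "Sigma"))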

So let’s try one more thing. Maybe we can get a better grasp of the topics by looking at the major differences between them, e.g., which words are much more probable in one topic compared to another. The log of the word probability ratio gives us this information for pairwise comparisons. This metric is symmetric around zero: large positive values indicate that the given word is more probable in the first of the two topics, while negative values point to the opposite. We will try this out for the topics day and light as an example. To avoid inconsequential results, we filter for words which are frequent in at least one of the two topics. Finally, we visualize the 10 biggest differences in occurrence probability.
beta_day_light <- lemma_tm  %>%
  unnest(ctm_beta) %>%
  filter(k==9) %>%
  left_join(lemma_tm  %>%
              unnest(ctm_beta) %>%
              filter(k==9) %>%
              group_by(topic) %>%
              top_n(1, beta) %>% 
              rename(`top term`=term) %>% 
              select(topic, `top term`)) %>%
  filter(`top term` %in% c("day", "light")) %>%
  select(-c(k, topic)) %>%
  spread(`top term`, beta) %>%
  filter(day > .001 | light > .001) %>%
  mutate(log_ratio = log2(day / light))



beta_day_light %>% 
  top_n(10, abs(log_ratio)) %>%
  mutate(term = forcats::fct_reorder(term,log_ratio)) %>%
  ggplot(aes(x=term, y=log_ratio, fill=log_ratio<0)) +
    scale_fill_manual(values=c("gold2", "deepskyblue3"),
                      name="topic",
                      # FALSE (log_ratio >= 0) means the word is more probable in "day"
                      labels=c("day", "light")) +
    geom_col() + 
    coord_flip()
Well, at least for me this was not overly enlightening. Let’s try a different approach and look directly at light and day Haiku. First we have to merge the topics’ gamma values with the original Haiku in the haiku_clean object. We will do this for all topics and additionally identify the dominant topic for each Haiku.
dom_top <- k9_ctm_gamma %>%
  group_by(document) %>%
  top_n(1, gamma) %>%
  rename(dom_top = `top term`) %>%
  select(document, dom_top)

k9_ctm_gamma_wide <- k9_ctm_gamma %>%
  select(-c(k, topic)) %>%
  spread(`top term`, gamma) 

haiku_gamma <- haiku_clean %>%
  left_join(dom_top, by = c("h_number" = "document")) %>%
  left_join(k9_ctm_gamma_wide, by = c("h_number" = "document"))
Now, we can take a look at the top 3 light Haiku:
haiku_gamma %>% 
  filter (dom_top == "light") %>%
  top_n (3, light) %>%
  mutate(light = round(light,2))  %>%
  select (text, author, light) %>%
  knitr::kable(format="html") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
text author light
collared doves cooing again the cold rain Polona Oblak 0.68
a blown kiss— pampas grass sways towards the river Rita Odeh 0.68
ah, grasshopper, such a robust click of wings for such tawny hills Richard Stevenson 0.67
And the top 3 day Haiku:
haiku_gamma %>% 
  filter (dom_top == "day") %>%
  top_n (3, day) %>%
  mutate(day = round(day,2))  %>%
  select (text, author, day) %>%
  knitr::kable(format="html") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
text author day
second marriage bigger                    balloons LeRoy Gorman 0.62
busker at the market my coins sing in his guitar case Catherine McLaughlin 0.63
a kindred magpie we hoard all the shiny bits with raucous laughter Rebecca Traquair 0.66
Unfortunately, this did not help either: the 9-topic solution still doesn’t make much sense to me, and I doubt that it ever will. Normally, this would be a good point to change the number of topics and evaluate other solutions. For today’s post, however, I will call it quits. If you further explore the Haiku corpus and come up with a viable solution, then please let me know in the comment section.

Closing Remarks

Although my attempt to group Haiku into topics ultimately failed, I hope the given example was illustrative and can help you to carry out your own topic model analyses. I wish you all the luck with more interpretable results. Chances are certainly in your favor, because, with their brevity and their limitation to images from the natural world, Haiku are a tough case for topic models. However, if your texts are similarly short, then maybe it is a good idea to check out Biterm Topic Models (BTM). You can find a BTM implementation for Python 3 here.

If you have any questions or comments please post them in the comments section.

If something is not working as outlined here, please check the package versions you are using. The system I have used was:
sessionInfo()
#> R version 3.3.1 (2016-06-21)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 14393)
#> 
#> locale:
#> [1] LC_COLLATE=German_Austria.1252  LC_CTYPE=German_Austria.1252   
#> [3] LC_MONETARY=German_Austria.1252 LC_NUMERIC=C                   
#> [5] LC_TIME=German_Austria.1252    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#>  [1] gdtools_0.1.4     topicmodels_0.2-6 ldatuning_0.2.0  
#>  [4] tidytext_0.1.2    dplyr_0.5.0       purrr_0.2.2      
#>  [7] readr_1.1.0       tidyr_0.6.2       tibble_1.3.0     
#> [10] ggplot2_2.2.1     tidyverse_1.1.1   magrittr_1.5     
#> [13] kableExtra_0.1.0 
#> 
#> loaded via a namespace (and not attached):
#>  [1] modeltools_0.2-21 NLP_0.1-10        slam_0.1-40      
#>  [4] corrplot_0.77     reshape2_1.4.2    haven_1.0.0      
#>  [7] lattice_0.20-33   colorspace_1.3-2  htmltools_0.3.6  
#> [10] SnowballC_0.5.1   stats4_3.3.1      yaml_2.1.14      
#> [13] XML_3.98-1.7      foreign_0.8-68    DBI_0.6-1        
#> [16] selectr_0.3-1     modelr_0.1.0      readxl_1.0.0     
#> [19] plyr_1.8.4        stringr_1.2.0     munsell_0.4.3    
#> [22] gtable_0.2.0      cellranger_1.1.0  rvest_0.3.2      
#> [25] psych_1.7.5       evaluate_0.10     labeling_0.3     
#> [28] knitr_1.15.1      forcats_0.2.0     tm_0.7-1         
#> [31] parallel_3.3.1    highr_0.6         broom_0.4.2      
#> [34] tokenizers_0.1.4  Rcpp_0.12.10      scales_0.4.1     
#> [37] backports_1.0.5   jsonlite_1.4      mnormt_1.5-5     
#> [40] hms_0.3           digest_0.6.12     stringi_1.1.5    
#> [43] grid_3.3.1        rprojroot_1.2     tools_3.3.1      
#> [46] lazyeval_0.2.0    janeaustenr_0.1.4 Matrix_1.2-10    
#> [49] xml2_1.1.1        lubridate_1.6.0   svglite_1.2.0    
#> [52] assertthat_0.2.0  rmarkdown_1.5     httr_1.2.1       
#> [55] R6_2.2.1          nlme_3.1-131

4 comments:

  1. Hi Bernhard, thanks for this great post!

    I was just wondering why you used 5,000 iterations in your analysis. If I got it right, Griffiths & Steyvers (2004, PNAS https://www.pnas.org/content/101/suppl_1/5228) showed that Gibbs sampling stabilises around 250-300 iterations.

    Given that LDA analysis can take a long time I was wondering if setting the number of iterations to, say, 300 is a good idea, in your opinion.

    1. Hm.. I wasn't in a hurry, so it didn't matter too much to me. I guess I have used more iterations than necessary, but I am not sure that 300 would be enough. I would take a subsample of your data and test how stable the result is as iterations increase. Best, Bernhard

    2. Thanks! I'll try that approach.

