Thursday, April 27, 2017

Cleaning Words with R: Stemming, Lemmatization & Replacing with More Common Synonym

Last week we saw how to assign sentiment to words. Although it worked reasonably well, there were some problems. For instance, many words in our text had no counterpart in the sentiment lexicons. There are multiple reasons for this:

  • not every word has a sentiment
  • the lexicons were created for other types of text

and finally

  • we haven’t cleaned our data!

Data cleaning is always an important part of every data analysis - this applies to dealing with words as well.

Today I want to show you three cleaning techniques for words in R:

  1. Stemming

  2. Lemmatization

  3. Replacing with more common synonym

But first we need some data to experiment on. For simplicity I will use the haiku_tidy object from my last post - if you have missed that one, you can download the needed R object here and use load() to add it to your environment, or you can run the code-block below to achieve the same.

if (!exists("haiku_tidy")){
  if (!file.exists("haiku_tidy.RData")){
    res <- tryCatch(download.file("http://bit.ly/haiku_tidy",
                         "haiku_tidy.RData", mode = "wb"),
                error=function(e) 1)
  }
  load("haiku_tidy.RData")
}

Next we need some basic R packages for the workflow. Further packages for the particular cleaning steps will be added when needed.

library(tidyverse) # R is better when it is tidy
library(stringr)  # for string manipulation

To speed things up, we will not work on every word instance but keep only the unique ones. If needed, the results for the unique word instances can easily be mapped back to the larger data-frame. Some of the techniques I am going to present are quite computationally expensive, so if you have a much larger data set, then maybe they are not feasible. I have added system.time() to the particular work steps so that you can see and decide for yourself.
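
If you later need the cleaned versions for every word instance, a simple join brings them back. A minimal sketch (it assumes lemma_unique already contains the cleaned columns produced in the steps below):

haiku_tidy_clean <- haiku_tidy %>%
  left_join(lemma_unique, by = "word")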

Further, we apply some basic cleaning:

  • removing the possessive ending: ’s
  • removing all words containing non-alphabetic characters (depending on the task at hand this might be a bad idea - e.g., in social media, emoticons can be very informative)
  • removing missing values

lemma_unique <- haiku_tidy %>%
  select(word) %>%
  mutate(word_clean = str_replace_all(word, "\u2019s|'s", "")) %>%                          # strip possessive endings
  mutate(word_clean = ifelse(str_detect(word_clean, "[^[:alpha:]]"), NA, word_clean)) %>%   # mark words with non-alphabetic characters
  filter(!duplicated(word_clean)) %>%                                                       # keep each cleaned word only once
  filter(!is.na(word_clean)) %>%
  arrange(word)


Table 1: head(lemma_unique)

word        word_clean
я’s         я
abandon     abandon
abandoned   abandoned
abandoning  abandoning
absent      absent
absently    absently

Stemming

We can see in Table 1 that many words are very similar, e.g.,

  • abandon, abandoned, abandoning
  • add, added, adding
  • apologies, apologize, apology

Based on language-specific rules, these words can be reduced to their (word) stems. This process is called stemming. In R this can be done with the SnowballC package.

library(SnowballC)
system.time(
  lemma_unique<-lemma_unique %>%
    mutate(word_stem = wordStem(word_clean, language="english"))
)
#>    user  system elapsed 
#>    0.02    0.00    0.02

Positive points for stemming are:

  • It is super fast (just take a look at the system.time())
  • Algorithms exist for many languages
  • It groups together related words

Negative points for stemming are:

  • Stems are not always words themselves (very problematic if you plan to work with a lexicon - see the short demo after this list)
  • Sometimes unrelated words are grouped together
  • Sometimes related words are not grouped together
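
To see the first point in our own data, we can look at a few stems that differ from the original word - several of them are truncated forms that are not real words (a quick illustrative check, not part of the main workflow):

lemma_unique %>%
  filter(word_clean != word_stem) %>%   # words that were changed by stemming
  select(word_clean, word_stem) %>%
  head(10)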

Lemmatization

In contrast to stemming, lemmatization is much more sophisticated. Lemmatization is the process of grouping together the inflected forms of a word. The resulting lemma can be analyzed as a single item.

In R itself there is no package for lemmatization. However, the package koRpus is a wrapper for the free third-party software TreeTagger. Several wrappers for other programming languages (e.g., Java, Ruby, Python, …) exist as well. Before you continue, download TreeTagger and install it on your computer. Don’t forget to download the parameter file for the language you need as well.

TreeTagger has to be installed on your system for the next step to work!

library(koRpus)
system.time(
  lemma_tagged <- treetag(lemma_unique$word_clean, treetagger="manual", 
                          format="obj", TT.tknz=FALSE , lang="en",
                          TT.options=list(
                            path="c:/BP/TreeTagger", preset="en")
                          )
)
#>    user  system elapsed 
#>    0.64    0.17    1.50

This took considerably longer than stemming, but even for larger text corpora it should finish in a reasonable time, especially if you lemmatize only the unique words and map the results back to all instances.

Note: If you input each word on its own (like we just did) instead of entering whole sentences, then TreeTagger’s wclass (word-class) tag might be wrong. Depending on the job at hand this can be a problem: does it matter to you whether, e.g., love is a noun or a verb?

From the lemma_tagged object we need the TT.res table.

lemma_tagged_tbl <- tbl_df(lemma_tagged@TT.res)

We join this table with the data-frame of unique words and skip words with no identified lemma.

lemma_unique <- lemma_unique %>% 
  left_join(lemma_tagged_tbl %>%
              filter(lemma != "<unknown>") %>%
              select(token, lemma, wclass),
             by = c("word_clean" = "token")
            ) %>%
  arrange(word)

Positive points for lemmatization are:

  • It overcomes the three major problems of stemming (results are always real words; it neither groups unrelated words together nor fails to group related words together) - the snippet after these lists pulls a few examples from our data
  • If you have to link your data to further lexicons, lemmatized versions often exist; they are smaller, so data joins are faster
  • The TreeTagger supports many languages

Negative points for lemmatization are:

  • It is computationally more expensive than stemming
  • Different words meaning the same thing (synonyms) are not grouped together
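
To get a feeling for the difference between the two techniques, we can pull a few words from our data where the stem and the lemma disagree (purely illustrative):

lemma_unique %>%
  filter(!is.na(lemma), word_stem != lemma) %>%   # words where stem and lemma differ
  select(word, word_stem, lemma) %>%
  head(10)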

Replacing with more common synonym

Using lemmata instead of arbitrary word inflections helps to group together all forms of one word. However, we are often interested in the meaning of the word and not in the meaning’s particular representation.

To solve this problem we can look up the most common synonym of the words.

For English words we can use the famous WordNet and its R wrapper in the wordnet package.

WordNet has to be installed on your system for the next step to work!

library(wordnet)
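
As a quick check that the package finds your WordNet installation, you can ask for the synonyms of a single word. If WordNet is not found automatically, point setDict() at your local dict directory first - the path below is only a placeholder:

# setDict("C:/Program Files/WordNet/dict")   # placeholder path - adjust to your installation
synonyms("company", "NOUN")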

The synonyms() function does not support all TreeTagger word-classes, so we will use a little wrapper that simply returns the original word in those cases instead of throwing an error.

synonyms_failsafe <- function(word, pos){
  tryCatch({
    syn_list <- list(syn = synonyms(word, toupper(pos)))
    # if WordNet knows no synonyms, fall back to the word itself
    if (length(syn_list[["syn"]]) == 0) syn_list[["syn"]][[1]] <- word
    syn_list
  },
  error = function(err){
    # unsupported word-class (or any other error): fall back to the word itself
    return(list(syn = word))
  })
}
system.time(
  lemma_unique <- lemma_unique %>%
    mutate(word_synonym = map2(lemma, wclass,synonyms_failsafe))
)
#>    user  system elapsed 
#>  515.99   35.60  379.38

Finding all the synonyms really took a long time!

To identify the most common synonym we have to use a word frequency list. In this case we will rely on one extracted from the British National Corpus. Several lists were compiled by Adam Kilgarriff - we can use the lemmatized version (all synonyms returned by WordNet are lemmata).

The word-classes in the list are abbreviated, so we have to modify them a little to match the wclass values from TreeTagger.

if (!exists("word_frequencies")){
  if (!file.exists("lemma.num")){
    res <- tryCatch(download.file("http://www.kilgarriff.co.uk/BNClists/lemma.num",
                         "lemma.num", mode = "wb"),
                error=function(e) 1)
  }
  word_frequencies <- 
    readr::read_table2("lemma.num",
                    col_names=c("sort_order", "frequency", "word", "wclass"))
  
  # harmonize wclass types with existing
  word_frequencies <- word_frequencies %>%
    mutate(wclass = case_when(.$wclass == "conj" ~ "conjunction",
                              .$wclass == "adv" ~ "adverb",
                              .$wclass == "v" ~ "verb",
                              .$wclass == "det" ~ "determiner",
                              .$wclass == "pron" ~ "pronoun",
                              .$wclass == "a" ~ "adjective",
                              .$wclass == "n" ~ "noun",
                              .$wclass == "prep" ~ "preposition")
           )
    
}

We build a little function that returns the most frequent synonym. In case none of the synonyms is in the frequency list, it returns NA. We then replace those NAs with the original lemma.

frequent_synonym <- function(syn_list, pos=NA, word_frequencies){
  syn_vector <- syn_list$syn
  
  if (!is.na(pos) && pos %in% unique(word_frequencies$wclass)){
    syn_tbl <- tibble(word = syn_vector,
                      wclass = pos)
  } else {
    syn_tbl <- tibble(word = syn_vector)
  }
  
  suppressMessages(
    syn_tbl <- syn_tbl %>%
        inner_join(word_frequencies) %>%
        arrange(desc(frequency))   # sort so that the most frequent synonym comes first
  )
    
  return(ifelse(nrow(syn_tbl)==0,NA,syn_tbl$word[[1]]))
}

Note: The frequent_synonym() function can exploit knowledge about the word-class. However, I didn’t use this feature, as the word-class was extracted from single words, not from words in a sentence, and is therefore unreliable.

system.time(
  lemma_unique <- lemma_unique %>%
    mutate(synonym = map_chr(word_synonym, frequent_synonym, 
                             word_frequencies = word_frequencies)) %>%
    mutate(synonym = ifelse(is.na(synonym), lemma, synonym))
)
#>    user  system elapsed 
#>   17.96    0.00   17.95

Well, this took some time as well.

Positive points for replacing with more common synonym are:

  • Words with the same meaning are grouped together

Negative points for replacing with more common synonym are:

  • It is computationally expensive!

  • Loss of more fine-grained information

  • At least in R there is no multi-language wrapper as far as I know. However, following the given example it should be easy enough to create your own solution, as long as you have a synonym list and a frequency list. If no sensible frequency list is available, then you should be able to compile your own from Google Ngram data using the wrapper offered by the ngramr package (see the sketch after this list).
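
As a rough idea of how such a frequency lookup could start, the ngramr package queries the Google Ngram viewer directly from R. The following is a hypothetical sketch - the candidate words and the year range are only placeholders, and it is not part of the workflow above:

library(ngramr)

# fetch relative frequencies for a few candidate synonyms from Google Ngram
ng <- ngram(c("abandon", "empty", "forsake"), year_start = 1990)
head(ng)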

Comparing the Results

To get a feeling of how the techniques work I encourage you to take a little time and flip through the whole results. If you have not tried out the given code yourself, then you can download the results as a csv file here. For a snippet, see Table 2.

Table 2: head() of cleaned word versions in lemma_unique

word        word_clean  word_stem  lemma      synonym
я’s         я           я
abandon     abandon     abandon    abandon    empty
abandoned   abandoned   abandon    abandon    empty
abandoning  abandoning  abandon    abandon    empty
absent      absent      absent     absent     absent
absently    absently    absent     absently   absently

To quickly get a hint of the usefulness of the cleaning techniques, we can check how many words in our list can be linked to a sentiment in the Bing lexicon.

Note: The stemming results are not included, because the word stems are not a good match for the given sentiment lexicon. In other analyses that do not rely on lexicons, e.g., topic models, this is not a problem.

n_orig <- lemma_unique %>% 
  inner_join(tidytext::get_sentiments("bing"),
             by=c("word" = "word")) %>% 
  nrow()

n_orig
#> [1] 519
n_clean <- lemma_unique %>% 
  inner_join(tidytext::get_sentiments("bing"),
             by=c("word_clean" = "word")) %>% 
  nrow()

n_clean
#> [1] 525
n_lemma <- lemma_unique %>% 
  inner_join(tidytext::get_sentiments("bing"),
             by=c("lemma" = "word")) %>% 
  nrow()

n_lemma
#> [1] 672
n_synonym <- lemma_unique %>% 
  inner_join(tidytext::get_sentiments("bing"),
             by=c("synonym" = "word")) %>% 
  nrow()

n_synonym
#> [1] 819

Out of the original 5222 unique words, a sentiment was assigned to only 519. Simply removing possessive endings increased the number to 525. The jump from lemmatization is considerably larger and yields 672 assignments. After replacing the lemmata with their most common synonyms, the number of assignments even soars to 819.
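
For a quick side-by-side view, we can collect these counts in one small table:

tibble(version = c("word", "word_clean", "lemma", "synonym"),
       words_with_sentiment = c(n_orig, n_clean, n_lemma, n_synonym))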

Closing Remarks

Data cleaning is essential. I hope the given example was illustrative and that you have seen how it can benefit your analyses.

Of course this overview was by no means exhaustive. For example, it might be a good idea to check for typos and correct them. Did TreeTagger fail to find a lemma for some words simply because they were misspelled? The hunspell package might be a good starting point for solving this problem. If you try it out, then please share your experiences in the comment section.
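
A possible starting point could look like this - a rough sketch using the hunspell package, which I have not tested on this data:

library(hunspell)

# take the words for which TreeTagger found no lemma ...
unknown_words <- lemma_unique$word_clean[is.na(lemma_unique$lemma)]

# ... and look at spelling suggestions for those that fail the spell check
misspelled <- unknown_words[!hunspell_check(unknown_words)]
hunspell_suggest(head(misspelled))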

Another idea is to use WordNet for replacing words with hypernyms (a broader category that includes the original word) instead of synonyms. This would allow us to condense words to their concepts. Again, I’m very interested in your exploits, so don’t hesitate to share them.
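
If you want to experiment with this, the lower-level functions of the wordnet package can retrieve related synsets. A rough sketch - it assumes that "@" is the WordNet pointer symbol for hypernyms, so please double-check against the package documentation:

# look up hypernyms of the noun "haiku"
term_filter <- getTermFilter("ExactMatchFilter", "haiku", TRUE)
index_terms <- getIndexTerms("NOUN", 1L, term_filter)
synsets     <- getSynsets(index_terms[[1]])
hypernyms   <- getRelatedSynsets(synsets[[1]], "@")   # "@" = hypernym pointer
sapply(hypernyms, getWord)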

If you have any questions or comments, please post them in the comments section.

If something is not working as outlined here, please check the package versions you are using. The system I have used was:

sessionInfo()
#> R version 3.3.2 (2016-10-31)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 7 x64 (build 7601) Service Pack 1
#> 
#> locale:
#> [1] LC_COLLATE=German_Austria.1252  LC_CTYPE=German_Austria.1252   
#> [3] LC_MONETARY=German_Austria.1252 LC_NUMERIC=C                   
#> [5] LC_TIME=German_Austria.1252    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#>  [1] wordnet_0.1-11    koRpus_0.10-2     data.table_1.10.4
#>  [4] SnowballC_0.5.1   stringr_1.2.0     dplyr_0.5.0      
#>  [7] purrr_0.2.2       readr_1.1.0       tidyr_0.6.1      
#> [10] tibble_1.3.0      ggplot2_2.2.1     tidyverse_1.1.1  
#> [13] kableExtra_0.1.0 
#> 
#> loaded via a namespace (and not attached):
#>  [1] reshape2_1.4.2     rJava_0.9-8        haven_1.0.0       
#>  [4] lattice_0.20-34    colorspace_1.3-2   htmltools_0.3.5   
#>  [7] tidytext_0.1.2.900 yaml_2.1.14        XML_3.98-1.5      
#> [10] foreign_0.8-67     DBI_0.6-1          selectr_0.3-1     
#> [13] modelr_0.1.0       readxl_1.0.0       plyr_1.8.4        
#> [16] munsell_0.4.3      gtable_0.2.0       cellranger_1.1.0  
#> [19] rvest_0.3.2        psych_1.6.12       evaluate_0.10     
#> [22] knitr_1.15.1       forcats_0.2.0      parallel_3.3.2    
#> [25] highr_0.6          tokenizers_0.1.4   broom_0.4.2       
#> [28] Rcpp_0.12.10       scales_0.4.1       backports_1.0.5   
#> [31] jsonlite_1.2       mnormt_1.5-5       hms_0.3           
#> [34] digest_0.6.12      stringi_1.1.5      grid_3.3.2        
#> [37] rprojroot_1.2      tools_3.3.2        magrittr_1.5      
#> [40] lazyeval_0.2.0     janeaustenr_0.1.4  Matrix_1.2-8      
#> [43] xml2_1.1.1         lubridate_1.6.0    assertthat_0.2.0  
#> [46] rmarkdown_1.5      httr_1.2.1         R6_2.2.0          
#> [49] nlme_3.1-131

