Friday, April 21, 2017

Sentiment Analysis with R and tidytext

In my last post I demonstrated how R can be used to scrape information from the internet. In particular, I scraped the archives of DailyHaiku for their poetic treasures.

Today I want to show how to perform a sentiment analysis on those Haiku. Sentiment analysis is a very popular sub-area of natural language processing that is used to systematically identify, extract, and quantify affective states from text. In its most basic form it tells you whether a statement, be it a word, a sentence, a paragraph, or even a whole book, is positive or negative.

For the demonstration I will use R and the tidytext package, because I just love how tidytext integrates into the R tidyverse. However, several alternative packages exist, both for R and for other programming languages. For example, if you are into Python you should really check out the Natural Language Toolkit.
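To make the basic idea concrete before turning to the Haiku, here is a minimal sketch of word-level sentiment scoring with tidytext. The toy sentence and the object names (toy, toy_sentiment) are made up for illustration, and the Bing lexicon is just one of several lexicons that get_sentiments() can return; it is used here only because it ships with tidytext.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Hypothetical one-sentence "corpus", not taken from the Haiku data
toy <- tibble(text = "What a wonderful morning, but the rain is terrible")

toy_sentiment <- toy %>%
  unnest_tokens(word, text) %>%                        # one lower-cased word per row
  inner_join(get_sentiments("bing"), by = "word") %>%  # keep only words found in the lexicon
  count(sentiment)                                     # tally words per sentiment label

toy_sentiment
```

Words that do not appear in the lexicon are silently dropped by the inner join, so the counts only reflect the words the lexicon knows about.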


Before we begin we have to do some groundwork. As mentioned, I will use data from a previous post. If you missed that one, you can download the needed R object here and use load() to add it to your environment, or you can run the code block below to achieve the same.

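if (!exists("haiku_clean")) {
  if (!file.exists("haiku_clean.RData")) {
    # NOTE: the download URL is empty here; insert the link to the .RData file before running
    res <- tryCatch(download.file("",
                                  "haiku_clean.RData", mode = "wb"),
                    error = function(e) 1)
  }
  # load the object only if the file is actually available
  if (file.exists("haiku_clean.RData")) load("haiku_clean.RData")
}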

Now, there should be a haiku_clean object in your environment (if you are unsure you can test it with exists("haiku_clean")).

To make the individual Haiku easily identifiable, we number them consecutively.

# requires dplyr for %>%, mutate() and row_number()
haiku_clean <- haiku_clean %>%
  mutate(h_number = row_number())

Next, we load the required packages for today’s task.
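A minimal sketch of the loading step, assuming that dplyr (for the pipe, mutate() and row_number()) and tidytext (for tokenising and sentiment lexicons) cover what is needed; this selection is an assumption, and loading the complete tidyverse instead would work just as well.

```r
# Assumed package set, not a definitive list
library(dplyr)     # %>%, mutate(), row_number()
library(tidytext)  # tidy text mining tools
```

If either package is missing, install.packages("dplyr") or install.packages("tidytext") will fetch it from CRAN.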


Note: Today’s code examples will rely heavily on piping with the “%>%” operator. Although piping is, IMHO, much more human-readable than traditional R code, some readers might still be overwhelmed by multiple consecutive pipes. One of the beauties of pipelines, however, is that you can break them before each “%>%” and inspect the intermediate result up to that point. This way you can easily trace the changes in data and data organisat