Monday, April 17, 2017

Webscraping with R and rvest - How can I get all Haiku?

Motivation

I love the internet - all this information only a fingertip away. Unfortunately, most information is provided in unstructured text. Ready-made tabular data, as needed for most analytic purposes, is a rare exception.
For example, I enjoy reading Haiku, and DailyHaiku supplies my daily dose. They even have a huge archive of all previously published Haiku. What interests me is what percentage of these Haiku follow the traditional form: 5-7-5 syllables.
Note: Haiku is a Japanese form of poetry, and Japanese sound units carry less content than English syllables. Hence, many believe that 17 English syllables are too verbose and aim for around 11 syllables instead. These shorter Haiku are called free form Haiku.
Counting them one by one would take ages and wouldn't be very reproducible, so this is a good job for web scraping, i.e., the automated extraction of information from a website. This could be done in various environments and programming languages. For this example I am using R, because it allows for a wide range of follow-up NLP analyses without me having to implement them myself (Python would be a good alternative as well).

Preparations

Get the needed R packages
library(tidyverse)  # because I love tidy data and %>%
library(rvest)  # for the actual webscraping
library(lubridate)  # for date handling
library(stringr)  # for working with strings
Save URL of first page as a string
start_page <- "http://www.dailyhaiku.org/haiku/"

Step by Step

Get the CSS-selector/XPath

From the webpage we only want to extract the Haiku and the link to the previous archive page, so we need some way to identify these elements. The browser add-on SelectorGadget comes in handy for this task. By pointing and clicking, it extracts the CSS-selector or XPath of the selected object(s), which are ways to address elements within an HTML document.
Let’s start with the Haiku blocks.
Figure 1: Selection of the Haiku blocks
We see in Figure 1 that the first Haiku block, the one I clicked on, is highlighted in green. The others are highlighted in yellow, which means they will be selected as well. If we didn't want that, we could click on one of them to remove the others from the selection. In the present case, however, the default behavior is exactly what we want. At the bottom we see the CSS-selector for the Haiku blocks (if we prefer XPath, we can change the address format). Let's store it for later use.
css_haiku <- ".haiku-archive-list"
Next we need the link for previous Haiku.
Figure 2: Selection of the link to previous Haiku
This time the first click selected several links (e.g., to the authors), so I had to deselect them as indicated by the red highlight (see Fig. 2). Let’s store the resulting CSS-selector.
css_link <- ".prev-next a"

Extract the Information

Now that we have the CSS-selectors we can try to extract their corresponding information from the webpage. Let’s start by downloading the first page of the archive.

Extract Haiku Information

first_page <- read_html(start_page)
From the page we can extract the Haiku blocks.
blocks <- first_page %>% html_nodes(css_haiku)
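A quick check of how many nodes the selector matched tells us whether it worked at all; the exact number depends on how many Haiku the archive page currently lists.
# number of Haiku blocks matched on the first page
length(blocks)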
A look into the first block shows us that it contains three elements. Their class names are
- date
- haiku
- author
which can be used as CSS-selectors.
# peek into the first block
blocks[[1]]
#> {xml_node}
#> <div class="haiku-archive-list">
#> [1] <h3 class="date"><a href="http://www.dailyhaiku.org/haiku/2016-march ...
#> [2] <p class="haiku">balmy breeze<br>\r\nswarming bees circle<br>\r\nthe ...
#> [3] <p class="author"><a href="/contributors/Cycle-20-October-2015-to-Ma ...
Let's create functions that extract this information from a block.
Beginning with date.
extract_haiku_date <- function(block) {
    # test whether OS is Windows
    windows <- Sys.info()["sysname"] == "Windows"
    
    # test whether you have an English locale-setting
    loc <- sessionInfo()[["locale"]]
    is_english <- str_detect(loc, "LC_TIME=English")
    
    # extract date with english locale
    if (is_english) {
        date <- block %>% html_node(".date") %>% xml_text() %>% mdy()
    } else if (windows) {
        date <- block %>% html_node(".date") %>% xml_text() %>% mdy(locale = "English_United States.1252")
    } else {
        date <- block %>% html_node(".date") %>% xml_text() %>% mdy(locale = "en_US.utf8")
    }
    
    return(date)
}
Next, a function to extract the text.
extract_haiku_text <- function(block) {
    block %>% html_node(".haiku") %>% xml_text()
}
Next, a function to extract the author’s name.
extract_haiku_author <- function(block) {
    block %>% html_node(".author") %>% xml_text()
}
Finally, put them all together.
extract_haiku <- function(block) {
    list(date = extract_haiku_date(block), text = extract_haiku_text(block), 
        author = extract_haiku_author(block))
}
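Before mapping over all blocks, we can sanity-check the combined function on the first block; the result should be a list with the entries date, text, and author (the concrete values depend on the current state of the archive).
# quick sanity check on the first block
extract_haiku(blocks[[1]])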
Now, try it on the whole block-set.
block_info <- blocks %>% map(extract_haiku) %>% map_df(~as.list(.))

knitr::kable(block_info, format = "html", booktabs = TRUE)
| date       | text                                                        | author       |
|------------|-------------------------------------------------------------|--------------|
| 2016-03-26 | balmy breeze swarming bees circle the river bank            | Polona Oblak |
| 2016-03-25 | whitecaps on the bay the overhead cries of migrating birds  | Polona Oblak |
| 2016-03-24 | deep within the lettuce a slug shaped hole                  | Polona Oblak |
| 2016-03-23 | siskins’ trill sunlight floods the bare larch               | Polona Oblak |
| 2016-03-22 | drifting fog an acquaintance i’d rather not meet            | Polona Oblak |
| 2016-03-21 | ferry harbour customs area swifts dart through the heat     | Polona Oblak |
| 2016-03-20 | frost lingers i clean brussels sprouts in the kitchen sink  | Polona Oblak |

That looks good! If you look at block_info in an unformatted table you might notice several “\r\n” in the text - those are line breaks. Depending on the task at hand we might need to remove them. For the time being I just leave them untouched.
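If you do want to remove them at some point, a minimal sketch with stringr could look like this; it turns the line breaks into single spaces and trims leading and trailing whitespace (block_info_clean is just an illustrative name).
# sketch: replace line breaks by single spaces and trim the result
block_info_clean <- block_info %>%
    mutate(text = str_replace_all(text, "\\s*[\r\n]+\\s*", " "),
           text = str_trim(text))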

Get all Haiku in the Archive

We build a little wrapper to parse all pages. Go get yourself a coffee after you start this chunk of code - it will take a while. By reducing or removing the Sys.sleep you can speed things up, but on the downside this might get you banned. Furthermore, IMHO it is common courtesy not to hog the server all to yourself. Alternatively, you can download the result here. After downloading the file into your working directory, just type load("haiku.RData") in your R console.
# run only if haiku data is not already present
if (!file.exists("haiku.RData")) {
    page_url <- start_page
    
    haiku_list <- list()
    counter <- 0
    
    while (page_url != "stop") {
        counter <- counter + 1
        
        page <- read_html(page_url)
        
        
        haiku_list[[counter]] <- page %>% html_nodes(css_haiku) %>% map(extract_haiku) %>% 
            map_df(~as.list(.))
        
        
        # little break between page calls to avoid getting banned
        Sys.sleep(5)
        
        # get the next page's url
        links <- page %>% html_nodes(css_link)
        older_nr <- which(str_detect(links, "Older"))
        if (length(older_nr) == 0) {
            page_url <- "stop"
        } else {
            page_url <- links[older_nr] %>% html_attr("href")
        }
        
    }
    # combine list
    haiku <- bind_rows(haiku_list)
    
    # save for later use
    save(haiku, file = "haiku.RData")
    
} else load("haiku.RData")
Note: If you are downloading larger amounts of data, it might be a good idea to regularly save intermediate results to disk. Otherwise you have to start anew every time there is a problem with your internet connection.
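Sticking with the loop above, a minimal way to do that is to dump the intermediate list to a temporary file every few pages; the file name haiku_partial.RData is just a placeholder.
# sketch: inside the while loop, e.g. right after filling haiku_list[[counter]]
if (counter %% 10 == 0) {
    # overwrite a snapshot of the results collected so far
    save(haiku_list, file = "haiku_partial.RData")
}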
To check how many Haiku we got overall, we can look at the number of rows in the resulting data-frame.
n <- nrow(haiku)
n
#> [1] 3633
Wow - so there were 3633 Haiku in the archive. That is more than I was expecting. But how many of them are in the traditional style? To classify the Haiku we need to count the number of syllables: 17 means traditional, fewer means free form. For the syllable counting - a non-trivial task - we make use of the syllable package.
haiku <- haiku %>% mutate(syllables = syllable::compute_syllable_counts(text))
Let’s take a look at the distribution of syllable count per Haiku.
haiku %>% ggplot(aes(x = syllables)) + geom_histogram(color = "black", fill = "white")

This result was somewhat unexpected. Some entries are too short (I would consider 2 + 3 + 2 = 7 as the lower bound even for the free form) and some are too long (as mentioned earlier the long traditional form has 17 syllables). So what happened?
For the short ones, multiple explanations are possible:
1. something went wrong in the web scraping
2. errors on the website
3. the entries are not Haiku
4. errors in the syllable counting
5. digits and special characters were used as words (and were not recognized for syllable counting)
6. something else
Best we take a quick look at the shorter entries.
haiku %>% filter(syllables < 6) %>% knitr::kable(format = "html", booktabs = TRUE)
| date       | text                                        | author                   | syllables |
|------------|---------------------------------------------|--------------------------|-----------|
| 2015-05-23 | t h e    puck   !@#    $%&   *+“?    drops  | LeRoy Gorman             | 3 |
| 2015-05-22 | snowinterubric                              | LeRoy Gorman             | 5 |
| 2013-06-12 | goinggoinggoinggon      e                   | Alan S. Bridges          | 4 |
| 2013-04-12 | starving the darkness                       | LeRoy Gorman             | 5 |
| 2012-03-17 | full bloom, full stop!                      | Rafal Zabratynski        | 4 |
| 2012-01-31 | ki    3… 2… 1… 2012!    ss                  | Rafal Zabratynski        | 1 |
| 2009-04-08 | pond frogs     moon bats                    | George Swede             | 4 |
| 2008-12-30 | the           smell           of snow       | Carol Pearce-Worthington | 4 |
| 2007-02-04 | Just Sitting Around                         | Bryak Webster            | 5 |

Well, I'm no expert and happy to be convinced otherwise, but these entries don't seem like Haiku to me.
The potential explanations for the overly long entries are similar. Best to check them as well.
haiku %>% filter(syllables >= 20) %>% head() %>% knitr::kable(format = "html", 
    booktabs = TRUE)
| date       | text                                                                                        | author            | syllables |
|------------|---------------------------------------------------------------------------------------------|-------------------|-----------|
| 2016-03-03 | another glass of champagne— remembering what i choose to remember                          | Sondra J. Byrnes  | 20 |
| 2014-11-15 | ice in the driveway she slips into something more comfortable                              | kjmunro           | 20 |
| 2014-01-08 | altocumulus undulating above the town the chime of church bells                            | J. Zimmerman      | 20 |
| 2013-06-11 | shuttle-bus-driver our daily conversation about the lottery numbers                        | Alan S. Bridges   | 20 |
| 2011-12-21 | Dear Malvina,It’s been a long time since we It’s already autumn here . . . lonely evening  | Rafal Zabratynski | 27 |
| 2009-08-12 | gallery opening the couple in the lobby kissing artistically                               | Megan Arkenberg   | 20 |

Well, for my taste these are too long as well. For the given task I will exclude both the extremely short entries (fewer than 7 syllables) and those with more than 17 syllables.
Note: This is real-world data, and therefore messy and irregular. Some kind of clean-up is necessary in most scenarios. In this case I simply remove the irregularities, but in other scenarios a different approach might be preferable. For example, if the short Haiku were only too short because digits were used instead of written-out numbers, then replacing those digits with proper words would be a better choice (see the sketch below).
haiku_clean <- haiku %>% filter(syllables >= 7) %>% filter(syllables <= 17)
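As an aside, here is a minimal sketch of the digit-replacement alternative mentioned in the note above. The helper spell_out_digits() is hypothetical, only covers single digits via a small lookup table, and would need to be extended for real use.
# hypothetical helper: spell out single digits before counting syllables
digit_words <- c("0" = "zero", "1" = "one", "2" = "two", "3" = "three", "4" = "four",
                 "5" = "five", "6" = "six", "7" = "seven", "8" = "eight", "9" = "nine")
spell_out_digits <- function(text) {
    # names act as regex patterns, values as replacements
    str_replace_all(text, digit_words)
}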
Next, we identify those in the traditional form.
haiku_clean <- haiku_clean %>% mutate(form = ifelse(syllables == 17, "traditional", 
    "free"))
Finally, we can try to answer the question: “Which percentage of Haiku is written in the traditional form?”
haiku_clean %>% group_by(form) %>% summarise(n = n()) %>% mutate(percentage = n/sum(n) * 
    100)
#> # A tibble: 2 × 3
#>          form     n percentage
#>         <chr> <int>      <dbl>
#> 1        free  3041  94.294574
#> 2 traditional   184   5.705426
Only about 5.7% of the Haiku on DailyHaiku follow the traditional form. Apparently, the contributors prefer the English free form Haiku.

Closing Remarks

Of course, this was only a first peek into web scraping, but I hope it helps you start building your own solution. If you want to dive deeper into web scraping, the rvest documentation is a good place to continue.
If you have any questions or comments please post them in the comments section.
If something is not working as outlined here, please check the package versions you are using. The system I have used was:
sessionInfo()
#> R version 3.3.2 (2016-10-31)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 7 x64 (build 7601) Service Pack 1
#> 
#> locale:
#> [1] LC_COLLATE=German_Austria.1252  LC_CTYPE=German_Austria.1252   
#> [3] LC_MONETARY=German_Austria.1252 LC_NUMERIC=C                   
#> [5] LC_TIME=German_Austria.1252    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#>  [1] stringr_1.2.0   lubridate_1.6.0 rvest_0.3.2     xml2_1.1.1     
#>  [5] dplyr_0.5.0     purrr_0.2.2     readr_1.1.0     tidyr_0.6.1    
#>  [9] tibble_1.2      ggplot2_2.2.1   tidyverse_1.1.1
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_0.12.9      formatR_1.4      highr_0.6        plyr_1.8.4      
#>  [5] forcats_0.2.0    tools_3.3.2      digest_0.6.12    jsonlite_1.2    
#>  [9] evaluate_0.10    nlme_3.1-131     gtable_0.2.0     lattice_0.20-34 
#> [13] psych_1.6.12     DBI_0.5-1        yaml_2.1.14      parallel_3.3.2  
#> [17] haven_1.0.0      httr_1.2.1       knitr_1.15.1     hms_0.3         
#> [21] rprojroot_1.2    grid_3.3.2       R6_2.2.0         readxl_0.1.1    
#> [25] foreign_0.8-67   rmarkdown_1.4    modelr_0.1.0     reshape2_1.4.2  
#> [29] magrittr_1.5     codetools_0.2-15 backports_1.0.5  scales_0.4.1    
#> [33] htmltools_0.3.5  assertthat_0.1   mnormt_1.5-5     colorspace_1.3-2
#> [37] stringi_1.1.2    lazyeval_0.2.0   munsell_0.4.3    broom_0.4.2

Comments

1. That was a great deal of information. In respect to the same, is it possible to extract information based on a selected author from the dropdown (without having to extract all archived data and then subset it by author)?

   Reply: Sure, you just have to adjust the URL. Make your selection, look at the new address, and change your start URL accordingly.

2. That was a quick reply. Actually, I got stuck there. Trying to change the selection does not change the URL; the URL remains the same but the data changes. In addition, the selection field is not within a form that would have allowed the use of rvest::html_form(). I think the solution lies within the html_session() context, but there is no guide on that.

   Reply: Hm, that's strange. When I select the first author in the list, the URL changes to http://www.dailyhaiku.org/?s=haiku&c=Aaron-Marko. The parameter c=Name-Surname seems to be what you are looking for.

3. Sorry, I realized I was trying to apply the functions to the URL "https://tradingeconomics.com/" instead of the URL you applied, though with correct parameters.
