Thursday, November 30, 2017

k-Nearest Neighbors from Scratch in R

Hi folks! In this post I will continue the work from a previous post, starting with my rationale for using k-nearest neighbors. If you are only interested in a how-to for this method, you can skip the next few paragraphs.

OK, so you are here for the whole story - excellent. In my last post Finding Vienna’s House Price Regions. Regionalization with R and ClustGeo we created spatially constrained clusters of single-family and duplex houses sold in Vienna, Austria in 2016, based on their price per square meter. But for 20% of the homes we did not assign any cluster, because we put those aside as a test set. The test set will come in handy when we want to evaluate how well the models work on data that is new to them. To use the cluster membership as a feature/predictor in any of those future models, we first have to come up with a way to allocate new cases to those clusters.

If we had used clustering based not on soft spatial constraints, which allow a little overlap between clusters, but on hard constraints with clear-cut borders, then we could have described the clusters as polygons and simply checked into which polygon a new case falls (see the sketch below). But in the absence of clear cluster boundaries we have to come up with another way. Still, the clusters are quite spatially coherent, with not too much overlap, so some kind of assignment based on proximity to other cluster members seems reasonable.
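
Just to illustrate the road not taken: with hard borders, a point-in-polygon test would do the trick, for instance with sp::point.in.polygon(). A purely hypothetical sketch - all four input vectors are made up; poly_lon and poly_lat would hold one cluster's boundary vertices:

library(sp)

# hypothetical: which new cases fall inside one cluster's boundary polygon?
# returns 0 (outside), 1 (inside), 2 (on an edge) or 3 (on a vertex)
inside <- point.in.polygon(point.x = new_lon, point.y = new_lat,
                           pol.x = poly_lon, pol.y = poly_lat)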

For example, we could calculate the distances between a new case and all homes that already have a cluster membership and look up the shortest ones. Then we can either assign the new case the membership of the nearest known home, or choose the majority membership of the k nearest known homes (perhaps weighted by their distance). In a nutshell, this is what k-nearest neighbors (k-NN) classification does. Well, and it is what we will try to accomplish in today’s post.
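
As a preview, the whole idea fits into a few lines of base R - a minimal sketch, assuming a data frame train with columns lon, lat and a cluster label c_coarse (the names we will use below):

# distance-weighted k-NN vote for a single new point (naive sketch)
knn_one <- function(new_lon, new_lat, train, k = 5) {
  d <- sqrt((train$lat - new_lat)^2 + (train$lon - new_lon)^2)
  nn <- order(d)[1:k]                                  # the k nearest neighbors
  votes <- tapply(1 / d[nn], train$c_coarse[nn], sum)  # distance-weighted votes
  as.integer(names(which.max(votes)))                  # winning cluster
}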

Preparations

Before the fun begins, some preparations are needed.

Packages

First, we will load the packages we will work with.

Load Basic Data-Wrangling Packages

As you probably already know, I am a huge tidyverse fan-boy, so …

library(tidyverse)
library(magrittr)

Data

Next, we need data to work with.

Get the Data

We will use the result from my last post - if you missed that one, you can either download the data here and load it with readr::read_csv() or use the code below for your convenience.

if (!exists("dat_raw")){
  if (!file.exists("data_result.csv")){
    res <- tryCatch(download.file(
        "https://dl.dropboxusercontent.com/s/4jq1vjbxyt8dy7t/data_result.csv?dl=0",
        "data_result.csv", mode = "wb"),
      error=function(e) 1)
  }

  
  dat_raw <- read_csv("data_result.csv")
}

Let’s quickly take a look at the data to see whether loading and parsing were successful.

dat_raw
#> # A tibble: 6,749 x 48
#>    KGCode Katastralgemeinde    EZ   PLZ                  Straße    ON
#>     <chr>             <chr> <int> <int>                   <chr> <int>
#>  1  01205          Hietzing   290  1130 hietzinger hauptstrasse    31
#>  2  01512      Unterdöbling   355  1190           fuerfanggasse     5
#>  3  01512      Unterdöbling    33  1190           nusswaldgasse     7
#>  4  01502          Grinzing  1471  1190            schreiberweg    31
#>  5  01503     Heiligenstadt   928  1190            amalgergasse     7
#>  6  01511      Salmannsdorf    71  1190  salmannsdorfer strasse    26
#>  7  01215     Unter St.Veit   409  1130            larochegasse    33
#>  8  01209      Ober St.Veit  2462  1130         am meisenbuehel     5
#>  9  01502          Grinzing  1120  1190           hocheneggasse     5
#> 10  01514           Währing  2342  1180        hasenauerstrasse    63
#> # ... with 6,739 more rows, and 42 more variables: Gst <chr>, GstFl <int>,
#> #   ErwArt <chr>, Widmung <chr>, Bauklasse <chr>, Gebäudehöhe <dbl>,
#> #   Bauweise <chr>, Zusatz <chr>, Schutzzone <chr>, Wohnzone <chr>,
#> #   öZ <chr>, Bausperre <chr>, seitbis <chr>, zuordnung <chr>,
#> #   Geschoße <int>, parz <chr>, Erwerbsdatum <chr>, Anteile <chr>,
#> #   Zähler <int>, Nenner <int>, BJ <int>, TZ <chr>, Kaufpreis <dbl>,
#> #   mGfl <dbl>, Baureifgest <chr>, percentWidmung <int>, Baurecht <chr>,
#> #   Bis <chr>, aufEZ <int>, Stammeinlage <chr>, sonstwid <chr>,
#> #   sonstwidprz <chr>, Bauzins <chr>, address <chr>, one_more_time <lgl>,
#> #   lon <dbl>, lat <dbl>, price_m2 <dbl>, price_log <dbl>, train <int>,
#> #   c_fine <int>, c_coarse <int>

Seems fine.

We will add an ID to identify the individual cases.

dat_raw <- dat_raw %>%
  mutate(id = row_number())

Select and Split the Data

Today we will only use longitude (lon) and latitude (lat) for k-NN, so we can reduce our data set a little. In addition to the coordinates we will need the cluster membership. For now we will use the coarser assignment in c_coarse and save c_fine for later. Lastly, we will split the data into two sets: one with cluster assignments and one without.

dat_select <- dat_raw %>%
  select(id, lon, lat, c_coarse)

dat_train <- dat_select %>%
  filter(!is.na(c_coarse))

dat_new <- dat_select %>%
  filter(is.na(c_coarse))
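
A quick sanity check on the sizes of the two sets never hurts:

# how many cases already have a cluster membership, and how many still need one?
nrow(dat_train)
nrow(dat_new)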

k-NN

From Scratch

k-NN (done naively, without optimization to speed things up) is quite simple. To understand it a little better we will try to implement it ourselves.

Calculate Distances

First, we have to calculate the distances between the new cases without class memberships and the cases with class memberships. For simplicity we will use Euclidean distance on the raw coordinates, although this is not quite correct (one degree of longitude does not cover the same ground distance as one degree of latitude). There are several R packages which could do this for us (e.g., pdist), but we want to do it from scratch.
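
As an aside: if we wanted geographically honest distances, the haversine formula gives the great-circle distance between two lon/lat points. A from-scratch sketch - we will stick with plain Euclidean distance below, which is good enough at the scale of a single city:

# great-circle distance in km between two lon/lat points (haversine formula)
haversine <- function(lon1, lat1, lon2, lat2, r = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(sqrt(pmin(a, 1)))  # pmin guards against rounding errors above 1
}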

We start by expanding the data, creating all combinations of dat_train and dat_new (note that the duplicated column names get a 1 suffix):

dat_expand <- dat_train %>% 
  crossing(dat_new)

Let’s take a quick glance:

dat_expand
#> # A tibble: 7,300,788 x 8
#>       id      lon      lat c_coarse   id1     lon1     lat1 c_coarse1
#>    <int>    <dbl>    <dbl>    <int> <int>    <dbl>    <dbl>     <int>
#>  1     1 16.29557 48.18678        1     5 16.38846 48.20699        NA
#>  2     1 16.29557 48.18678        1    14 16.34483 48.25785        NA
#>  3     1 16.29557 48.18678        1    16 16.34175 48.26051        NA
#>  4     1 16.29557 48.18678        1    26 16.34530 48.24970        NA
#>  5     1 16.29557 48.18678        1    28 16.29597 48.18167        NA
#>  6     1 16.29557 48.18678        1    29 16.26796 48.15112        NA
#>  7     1 16.29557 48.18678        1    39 16.26407 48.18614        NA
#>  8     1 16.29557 48.18678        1    40 16.26320 48.13987        NA
#>  9     1 16.29557 48.18678        1    60 16.30946 48.24223        NA
#> 10     1 16.29557 48.18678        1    61 16.33922 48.23340        NA
#> # ... with 7,300,778 more rows

Next, we calculate the Euclidean distances:

dat_dist <- dat_expand %>%
  mutate(dist = sqrt((lat-lat1)^2 + (lon-lon1)^2))

Find k-NN

Let’s choose k = 5 and keep the 5 nearest homes with known cluster membership for each case without one. Passing -5 as the n argument to top_n() gives us the 5 smallest values.

Notice: %T>% print() prints the tibble without having any impact on the remaining pipeline. For more information check out the marvellous magrittr package.

dat_5nn <- dat_dist %>%
  group_by(id1) %>%
  top_n(-5, dist)  %>%
  arrange(id1) %T>%
  print()
#> # A tibble: 6,765 x 9
#> # Groups:   id1 [1,353]
#>       id      lon      lat c_coarse   id1     lon1     lat1 c_coarse1
#>    <int>    <dbl>    <dbl>    <int> <int>    <dbl>    <dbl>     <int>
#>  1  1223 16.38580 48.20733        3     5 16.38846 48.20699        NA
#>  2  2039 16.38585 48.20376        3     5 16.38846 48.20699        NA
#>  3  2603 16.38925 48.20900        3     5 16.38846 48.20699        NA
#>  4  5705 16.39098 48.20675        3     5 16.38846 48.20699        NA
#>  5  6564 16.38459 48.20334        3     5 16.38846 48.20699        NA
#>  6    62 16.34506 48.25786        2    14 16.34483 48.25785        NA
#>  7   146 16.34387 48.25823        2    14 16.34483 48.25785        NA
#>  8   230 16.34576 48.25751        2    14 16.34483 48.25785        NA
#>  9   420 16.34444 48.25850        2    14 16.34483 48.25785        NA
#> 10   551 16.34512 48.25760        2    14 16.34483 48.25785        NA
#> # ... with 6,755 more rows, and 1 more variables: dist <dbl>

Classify

Now we are ready for the actual classification. However, we will decide the class membership not by a simple majority vote, but by weighting the individual votes by distance; let’s calculate the weight as 1/distance. Further, we will remove some variables which are no longer needed.

dat_weights <- dat_5nn %>%
  select(-c(lon, lat, id))  %>%
  mutate(dist_w = 1/dist)  
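
One edge case worth mentioning: if a new case shares its exact coordinates with a training case, dist is 0 and 1/dist becomes Inf. Adding a tiny constant would be one way to guard against that - a sketch:

# guard against division by zero for exact coordinate matches
eps <- 1e-9
dat_weights <- dat_5nn %>%
  select(-c(lon, lat, id)) %>%
  mutate(dist_w = 1 / (dist + eps))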

To count the votes we add an additional group_by layer, sum the weights per candidate cluster, and then choose the cluster with the maximum. Let’s vote:

dat_classified <- dat_weights  %>%
  group_by(id1, c_coarse) %>%
  summarize(votes = sum(dist_w)) %>% 
  top_n(1, votes) %T>%
  print() 
#> # A tibble: 1,353 x 3
#> # Groups:   id1 [1,353]
#>      id1 c_coarse     votes
#>    <int>    <int>     <dbl>
#>  1     5        3  1659.149
#>  2    14        2 10228.327
#>  3    16        2  3984.920
#>  4    26        2  3460.397
#>  5    28        1  3211.209
#>  6    29        1  2142.197
#>  7    39        1  3384.799
#>  8    40        1  3777.423
#>  9    60        2  7262.791
#> 10    61        2  3218.572
#> # ... with 1,343 more rows

That’s about it. The only thing left to do is copy the new cluster assignments into the original data set:

dat_result <- dat_raw %>%
  full_join(dat_classified %>%
              select(-votes),
            by=c("id" = "id1")) %>%
  mutate(c_coarse = if_else(
    is.na(c_coarse.x),
    c_coarse.y,
    c_coarse.x
  )) %>%
  select(-c(c_coarse.x, c_coarse.y))

Notice: I am not happy with the code above and would be more than happy to hear about a more beautiful pipeline solution from you.
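
One candidate might be dplyr::coalesce(), which picks the first non-missing value element-wise; combined with the suffix argument of full_join() the pipeline gets a little shorter - a sketch, not tested against the data above:

dat_result <- dat_raw %>%
  full_join(dat_classified %>% select(-votes),
            by = c("id" = "id1"),
            suffix = c("", "_new")) %>%
  mutate(c_coarse = coalesce(c_coarse, c_coarse_new)) %>%
  select(-c_coarse_new)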

dat_result
#> # A tibble: 6,749 x 49
#>    KGCode Katastralgemeinde    EZ   PLZ                  Straße    ON
#>     <chr>             <chr> <int> <int>                   <chr> <int>
#>  1  01205          Hietzing   290  1130 hietzinger hauptstrasse    31
#>  2  01512      Unterdöbling   355  1190           fuerfanggasse     5
#>  3  01512      Unterdöbling    33  1190           nusswaldgasse     7
#>  4  01502          Grinzing  1471  1190            schreiberweg    31
#>  5  01503     Heiligenstadt   928  1190            amalgergasse     7
#>  6  01511      Salmannsdorf    71  1190  salmannsdorfer strasse    26
#>  7  01215     Unter St.Veit   409  1130            larochegasse    33
#>  8  01209      Ober St.Veit  2462  1130         am meisenbuehel     5
#>  9  01502          Grinzing  1120  1190           hocheneggasse     5
#> 10  01514           Währing  2342  1180        hasenauerstrasse    63
#> # ... with 6,739 more rows, and 43 more variables: Gst <chr>, GstFl <int>,
#> #   ErwArt <chr>, Widmung <chr>, Bauklasse <chr>, Gebäudehöhe <dbl>,
#> #   Bauweise <chr>, Zusatz <chr>, Schutzzone <chr>, Wohnzone <chr>,
#> #   öZ <chr>, Bausperre <chr>, seitbis <chr>, zuordnung <chr>,
#> #   Geschoße <int>, parz <chr>, Erwerbsdatum <chr>, Anteile <chr>,
#> #   Zähler <int>, Nenner <int>, BJ <int>, TZ <chr>, Kaufpreis <dbl>,
#> #   mGfl <dbl>, Baureifgest <chr>, percentWidmung <int>, Baurecht <chr>,
#> #   Bis <chr>, aufEZ <int>, Stammeinlage <chr>, sonstwid <chr>,
#> #   sonstwidprz <chr>, Bauzins <chr>, address <chr>, one_more_time <lgl>,
#> #   lon <dbl>, lat <dbl>, price_m2 <dbl>, price_log <dbl>, train <int>,
#> #   c_fine <int>, id <int>, c_coarse <int>

k-NN with a ready-made implementation

Well, doing it from scratch was not hard, but let’s try it again with one of the numerous k-NN packages out there. Today we will use kknn because it was the first that came up in my Google search.

Let’s load the package:

library(kknn)

And conduct the classification:

fit_kknn <- kknn(
  as.factor(c_coarse) ~ lon + lat,
  train = dat_train,
  test = dat_new,
  k = 5,
  distance = 2,
  kernel = "inv",
  scale = FALSE
)

Notice 1: distance = 2 sets the parameter of the Minkowski distance to 2, which equals the previously used Euclidean distance, and kernel = "inv" should be equivalent to our weighting by 1/distance.

Notice 2: We needed to wrap c_coarse in as.factor() to get a classification. Without it, a k-NN regression would have been conducted, treating c_coarse as a continuous variable.

Now, we can easily extract the fitted values:

dat_imp <- dat_new
# as.numeric() on a factor returns the level index, not the label; that is
# safe here because the cluster labels 1, 2, 3, ... match their level order
dat_imp$c_coarse <- fit_kknn$fitted.values %>% as.numeric()
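
If the cluster labels were arbitrary numbers rather than 1, 2, 3, …, a safer general-purpose conversion would go through character first:

# factor-to-numeric conversion that preserves the labels themselves
dat_imp$c_coarse <- fit_kknn$fitted.values %>% as.character() %>% as.numeric()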

While we are at it, let’s get the assignments for c_fine too.

fit_fine <- kknn(
  as.factor(c_fine) ~ lon + lat,
  train = dat_result %>% filter(train==1),
  test = dat_result %>% filter(train==0),
  k = 5,
  distance = 2,
  kernel = "inv",
  scale = FALSE
)

And save them to dat_result, again converting the factor explicitly:

dat_result$c_fine[dat_result$train==0] <- fit_fine$fitted.values %>% as.numeric()

Check the results

Compare From Scratch with Ready-Made Implementation

Let’s check whether the results match:

all(dat_classified$c_coarse == dat_imp$c_coarse)
#> [1] TRUE

That’s nice - we have not messed up.

Visualize the new cluster assignments

We will need the ggmap package. Let’s load it:

library(ggmap)

Next, we create a bounding box of our coordinates …

bbox <- make_bbox(lon, lat, dat_result)

… and download the required map data.

vienna <- get_stamenmap(bbox, zoom = 13, maptype = c("toner-lite"))

Now, we can create plots of the cluster membership for the known …

plot_train <- ggmap(vienna, extent = "device") +
  coord_cartesian() + 
  geom_point(data = dat_result %>%
               filter(train==1),
             aes(x=lon, y=lat, colour = as.factor(c_coarse)),
             alpha=.5) +
  ggtitle("Train")+
  theme(legend.position="none")

… and new cases.

plot_new <- ggmap(vienna, extent = "device") +
  coord_cartesian() + 
  geom_point(data = dat_result %>%
               filter(train==0),
             aes(x=lon, y=lat, colour = as.factor(c_coarse)),
             alpha=.5)+
  ggtitle("New") +
  theme(legend.position="none")

Let’s put both into one figure:

gridExtra::grid.arrange(plot_train, plot_new, ncol=1)

That looks good to me - the color regions in both plots align, and therefore the new cases have been given sensible cluster memberships.

Closing Remarks

I hope you enjoyed our short digression on k-nearest neighbors. Of course you can use it for more than just coordinates, but I believe that its application to spatial data is especially illustrative.

Before we end for today, just a quick warning: k-NN does NOT scale well. Our naive approach already produced 7,300,788 train/new pairs for this moderately sized data set. Although there are some amazingly optimized packages out there, k-NN is not a good fit for large amounts of data.

If you have any questions or comments please post them in the comments section.


If something is not working as outlined here, please check the package versions you are using. The system I used was:

sessionInfo()
#> R version 3.4.2 (2017-09-28)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 7 x64 (build 7601) Service Pack 1
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=German_Austria.1252  LC_CTYPE=German_Austria.1252   
#> [3] LC_MONETARY=German_Austria.1252 LC_NUMERIC=C                   
#> [5] LC_TIME=German_Austria.1252    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#>  [1] ggmap_2.7        kknn_1.3.1       bindrcpp_0.2     magrittr_1.5    
#>  [5] forcats_0.2.0    stringr_1.2.0    dplyr_0.7.4      purrr_0.2.4     
#>  [9] readr_1.1.1      tidyr_0.7.2      tibble_1.3.4     ggplot2_2.2.1   
#> [13] tidyverse_1.2.1  kableExtra_0.6.1
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_0.12.13      lubridate_1.7.1   lattice_0.20-35  
#>  [4] png_0.1-7         assertthat_0.2.0  rprojroot_1.2    
#>  [7] digest_0.6.12     psych_1.7.8       R6_2.2.2         
#> [10] cellranger_1.1.0  plyr_1.8.4        backports_1.1.1  
#> [13] evaluate_0.10.1   httr_1.3.1        RgoogleMaps_1.4.1
#> [16] rlang_0.1.4       lazyeval_0.2.1    readxl_1.0.0     
#> [19] rstudioapi_0.7    geosphere_1.5-7   Matrix_1.2-11    
#> [22] rmarkdown_1.7     proto_1.0.0       labeling_0.3     
#> [25] foreign_0.8-69    igraph_1.1.2      munsell_0.4.3    
#> [28] broom_0.4.2       compiler_3.4.2    modelr_0.1.1     
#> [31] pkgconfig_2.0.1   mnormt_1.5-5      htmltools_0.3.6  
#> [34] gridExtra_2.3     viridisLite_0.2.0 crayon_1.3.4     
#> [37] bitops_1.0-6      grid_3.4.2        nlme_3.1-131     
#> [40] jsonlite_1.5      gtable_0.2.0      scales_0.5.0     
#> [43] cli_1.0.0         stringi_1.1.5     mapproj_1.2-5    
#> [46] reshape2_1.4.2    sp_1.2-5          xml2_1.1.1       
#> [49] rjson_0.2.15      tools_3.4.2       glue_1.2.0       
#> [52] maps_3.2.0        hms_0.3           jpeg_0.1-8       
#> [55] parallel_3.4.2    yaml_2.1.14       colorspace_1.3-2 
#> [58] rvest_0.3.2       knitr_1.17        bindr_0.1        
#> [61] haven_1.1.0
