## Sunday, October 29, 2017

### Finding Vienna's House Price Regions. Regionalization with R and ClustGeo

Real estate is all about location! At least this is what I always hear and as far as I have seen it seems to be true. This means prices within a neighborhood should be quite similar and vice versa adjacent properties with similar prices are probably part of the same neighborhood or Grätzel like we call it in Austria.
And that is exactly what I want to try with you today: Identify distinct neighborhoods based on their price per square meter
As data we will use all transactions of single-family and duplex houses within Vienna in 2016. Thanks to Vienna’s open data policy we can download those from the land charge register at Kaufpreissammlung Liegenschaften - Wien. Unfortunately, no transaction data of apartments are freely available. Of course the data exists, but is hidden behind a massive pay-wall (a few Euros per transaction) and therefore completely out of reach for the budget of this blog.
As a technique we will use Regionalization, which is the process of dividing an area into smaller regions that are homogeneous with respect to a specified metric (in our case the square meter price) aka spatially constrained clustering. For this example we will use the R-package ClustGeo, but check out other packages like HiClimR and SPODT for your own project.

# Preparations

## Packages

As you probably already know I am a huge tidyverse fan-boy, so …
library(tidyverse)
library(magrittr)
library(stringr)

## Data

### Get the Data

You can download the 2016 data (labeled 2017) either download here and load it with readr::read_csv2() or use the code below for your convenience.
if (!exists("dat_raw")){
if (!file.exists("kpreis2017.csv")){
"kpreis2017.csv", mode = "wb"),
error=function(e) 1)
}

dat_raw <- read_csv2("kpreis2017.csv", locale = locale(encoding = "Latin1"))
}
#> Using ',' as decimal and '.' as grouping mark. Use read_delim() for more control.
#> Parsed with column specification:
#> cols(
#>   .default = col_character(),
#>   EZ = col_integer(),
#>   PLZ = col_integer(),
#>   Gst.Fl. = col_integer(),
#>   Gebäudehöhe = col_double(),
#>   Geschoße = col_integer(),
#>   Zähler = col_integer(),
#>   Nenner = col_integer(),
#>   BJ = col_integer(),
#>   Kaufpreis <U+0080> = col_double(),
#>   <U+0080>/m² Gfl. = col_double(),
#>   % Widmung = col_integer(),
#>   auf EZ = col_integer()
#> )
#> See spec(...) for full column specifications.

### Filter the Data

The column names are a little troublesome (e.g., they partly contain the special character “%”), so we have to modify them a little.
names(dat_raw) <- dat_raw %>%
names() %>%
str_replace_all("%", "percent") %>%
str_replace_all("[^[:alpha:]]", "")
Next, we filter out very small (less than 100 square meters) and very large (more than 3000 square meters) properties as well as properties that have prices below 100.000€, because those seem fishy to me. Even for 100.000€ you will not get a house in Vienna. Maybe, those were sold within family or were tax evasion ploys. In any case they do not reflect the real free house market.
dat_clean <- dat_raw %>%
filter(ErwArt == "Kaufvertrag",
zuordnung == "Ein-, Zweifamilienhaus",
GstFl >= 100, GstFl <=3000,
Kaufpreis >= 100000,
str_detect(ON, "^[[:digit:]]*$")) #> Warning: package 'bindrcpp' was built under R version 3.3.3 ### Find Coordinates The data includes only the addresses, but not the coordinates. To add those we will use the Google Maps API via the R-package ggmap. First, we load the package. library(ggmap) Next, we have to change some specific German characters (umlauts and sharp S) to their international form and create a column which contains the whole address to pipe it into geocode(). dat_clean <- dat_clean %>% mutate(Straße = tolower(Straße), Straße = str_replace_all(Straße, "ß", "ss"), Straße = str_replace_all(Straße, "ä", "ae"), Straße = str_replace_all(Straße, "ö", "oe"), Straße = str_replace_all(Straße, "ü", "ue"), address = str_c(Straße, ON, "vienna, austria", sep=", ")) Finally, we retrieve the coordinates.Unfortunately, the number of free queries per day and IP is limited to 2500. Hence, we have to do it in several steps over multiple days. Of course, if you are not cheap like me, then you can pay for the extra queries and be done in a few moments. Another approach would be to switch IP via a VPN service once the query limit is reached. However, this is only a hobby to me and my wife surely is happy, when I’ll take a break from it for a few days so I will do it the long way. WARNING: This code will run for days. You can skip it and download its result here. res1 <- dat_clean %>% filter(row_number() <= 2000) %>% mutate(coord = purrr::map(address, geocode, messaging = FALSE)) Sys.sleep(3600*24) res2 <- dat_clean %>% filter(row_number() > 2000, row_number() <= 4000) %>% mutate(coord = purrr::map(address, geocode, messaging = FALSE)) Sys.sleep(3600*24) res3 <- dat_clean %>% filter(row_number() > 4000, row_number() <= 6000) %>% mutate(coord = purrr::map(address, geocode, messaging = FALSE)) Sys.sleep(3600*24) res4 <- dat_clean %>% filter(row_number() > 6000) %>% mutate(coord = purrr::map(address, geocode, messaging = FALSE)) dat_coord <- bind_rows(res1, res2, res3, res4) write_rds(dat_coord, "dat_coord.rds") Next, we check whether we got all coordinates. dat_coord_temp <- dat_coord %>% unnest() %>% mutate(one_more_time = is.na(lon)) dat_coord_temp %$%  one_more_time %>% sum
#> [1] 2199
Well, that is sobering. Let’s mark all missing coordinates …
dat_coord[["one_more_time"]] <- dat_coord_temp %$% one_more_time … create a function that only queries the coordinates if they are not already present … geocode_cont <- function(address, cont, alt){ if(cont){ geo <- geocode(address, messaging = FALSE) } else { geo <- alt } return (geo) } … and run the search again: dat_coord <- dat_coord %>% mutate(coord = pmap( list(address, one_more_time, coord), geocode_cont)) Let’s check if the rerun worked. dat_coord_temp2 <- dat_coord %>% unnest() %>% mutate(one_more_time = is.na(lon)) dat_coord_temp2 %$%  one_more_time %>% sum
#> [1] 39
OK, I will consider it a success. If you look at the addresses still missing coordinates, then you will realize that those are severely abbreviated. There are not many of these cases so we will not invest any further effort, but simply discard them.

### Some final clean ups and preparations

Remove cases without coordinates
data_coord_noNa <- dat_coord %>%
unnest() %>%
filter(!is.na(lon))
Are there any duplicated cases?
data_coord_noNa %>%
select(lat,lon) %>%
duplicated %>%
sum
#> [1] 1110
Well, that are quite a lot. Further inspection would reveal that some houses were sold multiple times in a year and in other cases multiple parts of one house were sold in the same year. To make things easy will keep only the entry with the highest price per coordinates, but of course you could be more clever about it.
data_coord_noDup <- data_coord_noNa %>%
arrange (desc(Kaufpreis)) %>%
distinct(lat, lon, .keep_all = T)
Next, we calculate the price in Euro per square meter and take a look at its distribution (we have to correct this value by the bought proportion, which is not 100% in all cases ).
Notice: Although price per square meter seems like a good feature at first glance, it is in fact rather troublesome in this data-set. We only have the overall area of the sold property, but not the size of the living area. Hence, we need to treat garden and house the same way, although they most certainly have a different impact on the price. If you have an easy solution for this problem, then please let me know (estimating the area of the house via the area of the roof in google maps does not qualify as easy).
data_coord_price_m2 <- data_coord_noDup %>%
mutate(price_m2 = Kaufpreis/GstFl *
#correct by portion
(Zähler/Nenner))

data_coord_price_m2 %>%
ggplot(aes(x=price_m2)) +
geom_histogram() +
ggthemes::theme_tufte()
Obviously, there are some extremely expansive houses in our data (this is indicated by the seemingly much wider than needed x-axis). Since we are not interested in special cases we remove those and try again.
data_coord_low <- data_coord_price_m2 %>%
filter(price_m2 <= 10000)

data_coord_low %>%
ggplot(aes(x=price_m2)) +
geom_histogram() +
ggthemes::theme_tufte()
This looks more like it, but still a few high price single cases are in the data. Filtering out everything above 3000€/m² should get rid of those. However, the remaining cases will be right skewed. This is something we often see when money is concerned. There are many with few and few with many. For the planned clustering it might be sensible to weight differences in the lower price range more heavily than the same difference in higher monetary region. At least I believe it to be, but that is one thing we can surely argue about. To do so we will log-transform the prices.
data_coord_lower <- data_coord_low %>%
filter(price_m2 <= 3000) %>%
mutate(price_log = log2(price_m2))

data_coord_lower %>%
ggplot(aes(x=price_log)) +
geom_histogram() +
ggthemes::theme_tufte()
OK, this revealed some outliers, which are too cheap. After we remove them we should be good to go.
data_coord_clean <- data_coord_lower %>%
filter(price_log > 5)
Finally, we will add an indicator to separate our sample into a training and a test set (we will split 80:20).
set.seed(1234)

data_coord_final <- data_coord_clean %>%
mutate(train = rbinom(n(), 1, 0.8))

data_coord_final %$% table(train) #> train #> 0 1 #> 1353 5396 # Spatially Constrained Clustering As laid out in the intro we will use ClustGeo for the spatially constrained clustering. In fact, ClustGeo offers only soft spatial constraints. This means that the resulting clusters are not necessarily 100% spatially cohesive, but might be intertwined to a certain degree. We start by loading the package. library(ClustGeo) #> Warning: package 'ClustGeo' was built under R version 3.3.3 ## Calculate Distances hclustgeo, the main function of ClustGeo, needs two distance matrices D0 and D1. D0 represents the distances for the criteria in question (in our case €/m²) and D1 the spatial distances. Let’s calculate both. #D0 = D_price D_price <- dist(data_coord_final %>% filter(as.logical(train)) %$% price_log)

#D1 = D_spatial
D_spatial <- dist(data_coord_final %>%  filter(as.logical(train)) %>% select(lon,lat)) 

Notice: Changes in one degree longitude and one degree latitude do not necessarily reflect the same distance in meters and hence the spatial distance matrix is not perfectly correct. Do get more precise estimations we could use the package geosphere. Or, given that we only have coordinates within a small region, we could center all values and scale the longitude to match the latitude. To keep it simple we will forgo both options.

## Select Mixing Parameter

To create spatially constrained clusters hclustgeo combines both distances. The function choicealpha helps to find the optimal mix by calculating the proportion of explained inertia of the partitions in K clusters. When the proportion of explained inertia based on D0 decreases, the proportion of explained inertia based on D1 increases. In a nutshell we want to find the mixing parameter α that maximizes the explained inertia of both distance Matrices. Because we do not know K yet we explore a range of values in case the optimal α changes with the number of clusters.
We start by calculating choicealpha for a range of α and K:
Note: This will probably take some time.
range.alpha <- seq(0,1,0.1)
alpha <- tibble(K = c(5, 10, 25, 50))

alpha <- alpha %>%
mutate(cr = map(K, function(x) choicealpha(D_price, D_spatial, range.alpha, x ,graph=FALSE)))

alpha
#> # A tibble: 4 x 2
#>       K                cr
#>   <dbl>            <list>
#> 1     5 <S3: choicealpha>
#> 2    10 <S3: choicealpha>
#> 3    25 <S3: choicealpha>
#> 4    50 <S3: choicealpha>
Next, we extract the tables of the proportion of explained pseudo inertia (Q), calculate the sum of both proportions, scale the sum with its maximum per K.
alpha_q <- alpha %>%
mutate(q = map(cr, function(x) as_tibble(x$Q))) %>% select(-cr) %>% unnest() %>% mutate(alpha = seq(0,1,0.1) %>% rep(4), q_sum = Q0 + Q1) %>% group_by(K) %>% mutate(q_scaled = q_sum/max(q_sum)) %>% ungroup() %>% mutate_all(function(x) round(x,2)) alpha_q #> # A tibble: 44 x 6 #> K Q0 Q1 alpha q_sum q_scaled #> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> #> 1 5 0.90 0.04 0.0 0.95 0.68 #> 2 5 0.76 0.63 0.1 1.39 1.00 #> 3 5 0.69 0.70 0.2 1.39 1.00 #> 4 5 0.71 0.69 0.3 1.40 1.00 #> 5 5 0.56 0.76 0.4 1.32 0.94 #> 6 5 0.37 0.81 0.5 1.18 0.84 #> 7 5 0.10 0.88 0.6 0.97 0.70 #> 8 5 0.08 0.86 0.7 0.95 0.68 #> 9 5 0.10 0.86 0.8 0.96 0.69 #> 10 5 0.08 0.87 0.9 0.95 0.68 #> # ... with 34 more rows We restructure the data for readability of proportion sum per α - K combination: alpha_q %>% select(-Q0, -Q1, -q_sum) %>% spread (alpha, q_scaled) #> # A tibble: 4 x 12 #> K 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 #> * <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> #> 1 5 0.68 1.00 1 1 0.94 0.84 0.70 0.68 0.69 0.68 0.70 #> 2 10 0.64 0.99 1 1 0.97 0.93 0.85 0.77 0.76 0.68 0.67 #> 3 25 0.58 1.00 1 1 0.99 0.97 0.94 0.91 0.88 0.74 0.63 #> 4 50 0.56 1.00 1 1 0.99 0.98 0.97 0.95 0.92 0.85 0.62 Both, α = .2 and α = .3 are excellent choices for all tested K values. However, for now we will choose α = .6 because it performs very well and in this case spatial separation (explained inertia based on D1, hence a larger α) is more important than overly homogeneous prices within the clusters (explained inertia based on D0 - small α). ## Get the Clusters Let’s perform the cluster analysis: tree <- hclustgeo(D_price,D_spatial,alpha=0.6) ### Select the number of clusters Now, we have the tree produced by the clustering process. The question is where we should cut it and therefore how many clusters we should extract. This is all but an easy task and no clear “this is the way you do it” exists. Lastly, everything boils down to interpretability and usefulness of your solution. That said a great number of statistical methods to determine the number of clusters exists. Check out the package NbClust for R implementations of many of those. Since we use two distance matrices I am not sure which of those are applicable and sensible for our case (please let me know if you know more). So I will assume that cuts are sensible before big jumps in the clustering height, because those represent agglomerations of more dissimilar clusters than in the steps before. First, we calculate the ration of the new height per step in relation to the previous tree height: growth <- tibble(height = tree$height)%>%
mutate(growth_factor = height/lag(height)) 
Next, we look for big jumps. As an arbitrary value I have chosen steps that increase the height by more than 25%.
growth %>%
arrange(desc(height)) %>%
mutate(clusters = row_number()) %>%
filter(growth_factor >1.25)
#> # A tibble: 6 x 3
#>         height growth_factor clusters
#>          <dbl>         <dbl>    <int>
#> 1 3.386179e-02      8.178521        1
#> 2 2.670426e-03      1.816653        4
#> 3 1.469971e-03      1.404072        5
#> 4 5.414649e-04      1.415405       11
#> 5 5.831840e-11      1.640507     5387
#> 6 3.017663e-11      1.522418     5393
This returns 6 jumps that fit the description. The first two are pretty early and cutting the tree at that points would result in over 5000 cluster which is not useful. The next one is more reasonable and indicates 12 clusters. This seems like a realistic number of price regions for single-family and duplex houses within Vienna. In addition to this fine grained clustering we will extract the coarser solution of 5 clusters as well.
clust_fine <- cutree(tree, 12)
clust_coarse <- cutree(tree, 5)
Let’s save the cluster memberships back to the original data.
clust_member <- bind_cols(
data_coord_final %>%
filter(as.logical(train)) %>%
select(lat, lon),
c_fine = clust_fine,
c_coarse = clust_coarse
)

data_result <- data_coord_final %>%
left_join(clust_member)
#> Joining, by = c("lon", "lat")

## Get to know the Clusters

By exploring the clusters we can check whether the spatially constrained clustering worked and the solution makes sense. We will do it exemplary for the 5 clusters solution.

### Cluster descriptives

We start by calculating some summary statistics.
clust_summary <- data_result %>%
filter(as.logical(train)) %>%
group_by(c_coarse)  %>%
summarize(
price_mean = mean(price_m2),
price_sd = sd(price_m2),
center_lon = mean(lon),
center_lat = mean(lat),
count = n()
) %>%
mutate_at(vars(starts_with("price")),
function(x) round(x))

clust_summary
#> # A tibble: 5 x 6
#>   c_coarse price_mean price_sd center_lon center_lat count
#>      <int>      <dbl>    <dbl>      <dbl>      <dbl> <int>
#> 1        1        638      420   16.26699   48.17557  1920
#> 2        2        844      566   16.31388   48.23865   660
#> 3        3        458      250   16.35561   48.15771   874
#> 4        4        435      282   16.49902   48.22482  1154
#> 5        5        493      364   16.41976   48.27057   788
We see that cluster 2 is the most expansive followed by cluster 1. Between the remaining three clusters there is less of a price difference.

### Visualize the clusters

To visualize our result we will once again use the ggmap package, which we have already loaded for the coordinate look-up.
We start by creating a bounding box of our coordinates.
bbox <- make_bbox(lon, lat, data_result)
vienna <- get_stamenmap(bbox, zoom = 14, maptype = c("toner-lite"))
Finally, we plot the map and add the houses as points mapping cluster membership to color and size to the price per square meter. If everything went right, then equally sized and colored bubbles should be grouped together.
Let’s try if it checks out:
Notice: Plotting all more than 5000 houses would be a mess, hence we will sample only a few cases per cluster.
set.seed(12345)
ggmap(vienna, extent = "device") +
coord_cartesian() +
geom_point(data = data_result %>%
filter(as.logical(train)) %>%
group_by(c_coarse) %>%
sample_n(200),
aes(x=lon, y=lat, size = price_m2, colour = as.factor(c_coarse)),
alpha=.6) +
guides(size=guide_legend(title="€/m²"),
colour=guide_legend(title="Cluster"))
At least to me the result seems pretty reasonable. The most expensive cluster 2 reflects the districts 19 and 18 as well as the outskirts of 17, 16, and 15. Next in the price comes the 14th and 13th district represented by cluster 1. The clusters 4 and 5 are more or less equal to the districts 21 and 22, respectively, with cluster 5 being slightly more expensive. Finally, cluster 3 seems to represent the 10th district and its neighbors. As was to be expected hardly any single-family or duplex houses are sold in the center districts, because there are not many of those. What puts me a little off is that seemingly prices of such homes in the center of Vienna are not that high and are therefore included in the mid-price cluster 3. Although it aligns with the data (those bubbles really are not that big), it contradicts the little experiences with real estate prices in Vienna I have. I am thankful for every reasonable explanation of this conundrum.

# Closing Remarks

I hope you enjoyed our short digression on regionalization. In future blog posts we will explore how we can use the clusters and whether they are worth the effort. So please come back soon or subscribe to this blog to never miss a post.

If something is not working as outlined here, please check the package versions you are using. The system I used was:
sessionInfo()
#> R version 3.3.2 (2016-10-31)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 7 x64 (build 7601) Service Pack 1
#>
#> locale:
#> [1] LC_COLLATE=German_Austria.1252  LC_CTYPE=German_Austria.1252
#> [3] LC_MONETARY=German_Austria.1252 LC_NUMERIC=C
#> [5] LC_TIME=German_Austria.1252
#>
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base
#>
#> other attached packages:
#>  [1] ClustGeo_2.0     ggmap_2.7        bindrcpp_0.2     stringr_1.2.0
#>  [5] magrittr_1.5     dplyr_0.7.2      purrr_0.2.2.2    readr_1.1.1
#>  [9] tidyr_0.6.3      tibble_1.3.3     ggplot2_2.2.1    tidyverse_1.1.1
#> [13] kableExtra_0.3.0
#>
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_0.12.12      lubridate_1.6.0   lattice_0.20-34
#>  [4] deldir_0.1-12     gtools_3.5.0      png_0.1-7
#>  [7] assertthat_0.2.0  rprojroot_1.2     digest_0.6.12
#> [10] psych_1.6.12      R6_2.2.0          cellranger_1.1.0
#> [13] plyr_1.8.4        backports_1.1.0   evaluate_0.10
#> [19] rlang_0.1.1       lazyeval_0.2.0    spdep_0.6-10
#> [25] gmodels_2.16.2    Matrix_1.2-8      rmarkdown_1.6
#> [28] proto_1.0.0       labeling_0.3      splines_3.3.2
#> [31] foreign_0.8-67    munsell_0.4.3     broom_0.4.2
#> [34] modelr_0.1.1      pkgconfig_2.0.1   mnormt_1.5-5
#> [37] htmltools_0.3.5   expm_0.999-1      MASS_7.3-45
#> [40] bitops_1.0-6      grid_3.3.2        nlme_3.1-131
#> [43] jsonlite_1.2      gtable_0.2.0      scales_0.4.1
#> [46] stringi_1.1.5     mapproj_1.2-5     reshape2_1.4.2
#> [49] LearnBayes_2.15   ggthemes_3.4.0    sp_1.2-4
#> [52] xml2_1.1.1        boot_1.3-18       rjson_0.2.15
#> [55] tools_3.3.2       forcats_0.2.0     glue_1.1.1
#> [58] maps_3.1.1        hms_0.3           jpeg_0.1-8
#> [61] parallel_3.3.2    yaml_2.1.14       colorspace_1.3-2
#> [64] rvest_0.3.2       knitr_1.16        bindr_0.1
#> [67] haven_1.1.0