## Friday, June 23, 2017

### Visualizing Vienna's Apartment Prices

Hi folks!

I’m sorry for making you wait for this post. At the moment my wife and I are looking for an apartment to buy and this is hard work in Vienna (the city we live in), so I have to cut my time on blogging short. Still, there is something I would like to share with you - how you can use data science to get a quick overview on prices.

To be concrete, I will

1. retrieve apartment prices from a real estate platform (ImmobilienScout24) and
2. visualize prices on a map of Vienna

In doing so, we will learn how to use an API to retrieve data, how to visualize spatial data with the R package ggmap, and lastly, how much apartments cost in which parts of Vienna.

# Preparation

First, we will load the central libraries for this post. In some code blocks I will use functions from other libraries, or I will want to emphasize the package - in those cases I will call the respective function directly via package_name::function_name().

## for general data wrangling
library(tidyverse)

## for getting data
library(httr)
library(jsonlite)

## for visualizing spatial data
library(ggmap)

# Using an API to get data from the net

## Getting to know the API and the returned data structures

An API (application programming interface) is a set of clearly defined methods of communication between various software components. Some content providers on the internet offer APIs, which makes web scraping so much easier (for comparison, see my post on scraping unstructured data).

When it comes to Austrian real estate websites, I have only found an API for ImmobilienScout24. To gain access to it, I had to apply with a short email outlining what I planned to do with the data and agree to their terms of usage (e.g., include their logo). Since they were OK with my plans, I received an API key and a short manual for the API.

The instructions in the manual are easy enough. Basically, we just need to customize a base URL for our specific query and attach the key for authorization in the header. Further, the manual included an example query. We will start by trying out this example query to see what typical results look like. Note: I can’t give you my key, but I will provide example data later on, so that you can follow along on your own. If that data is not enough and you want to try the API yourself, just contact ImmobilienScout24 - my contact was always very friendly and helpful (and after all, you would not offer an API if you didn’t want people to use it).

So, let’s try it out!

## Assign your key as a character string!
##
## You don't have to store it in a separate file
## in your own private projects

api_key <- readChar("key.txt", file.info("key.txt")$size)

url <- "https://api.immobilienscout24.at/v2/resultlist?propertyTypes=OFFICE&marketingTypes=RENT&price.max=1000&searchRadius.center.latitude=48.215523&searchRadius.center.longitude=16.376381&searchRadius.distance=2"

res <- GET(url, add_headers(Authorization = api_key))
stop_for_status(res)

stop_for_status() didn’t inform us about any connection error, so the query seems to work. Let’s see what the results look like to get a feeling for them. We will start with the most basic information - the names of the content objects.

names(content(res))
#> [1] "listings"              "totalNumberOfListings" "searchCriteria"
#> [4] "totalNumberOfPages"

We see that four objects are included. What we are looking for (prices and geo locations) is most likely in listings, as the other objects seem to represent meta-data of our query. So let’s peek into listings.

listings <- fromJSON(content(res, "text"))$listing
names(listings)
#>  [1] "address"                "area"
#>  [3] "description"            "geocoordinate"
#>  [5] "commercialPropertyType" "id"
#>  [7] "marketingType"          "price"
#>  [9] "priceDetails"           "propertyType"
#> [11] "title"                  "url"
#> [13] "pictureId"              "rooms"
#> [15] "titlePicture"

Spot on! This seems like what we are looking for. However, just one column for address seems a little suspicious to me, so we will look at it a little closer.

listings[["address"]]
#>    countryCode               city showAddress zipCode    country state              street
#> 1           AT               Wien        TRUE    1030 Österreich
#> 2           AT               Wien        TRUE    1010 Österreich   Wien
#> 3           AT Wien, Leopoldstadt        TRUE    1020 Österreich   Wien Walcherstra&szlig;e
#> 4           AT Wien, Leopoldstadt        TRUE    1020 Österreich   Wien Walcherstra&szlig;e
#> 5           AT Wien, Leopoldstadt        TRUE    1020 Österreich   Wien Walcherstra&szlig;e
#> 6           AT Wien, Leopoldstadt        TRUE    1020 Österreich   Wien Walcherstra&szlig;e
#> 7           AT Wien, Leopoldstadt        TRUE    1020 Österreich   Wien Walcherstra&szlig;e
#> 8           AT Wien, Leopoldstadt        TRUE    1020 Österreich   Wien Walcherstra&szlig;e
#> 9           AT Wien, Leopoldstadt        TRUE    1020 Österreich   Wien Walcherstra&szlig;e
#> 10          AT Wien, Leopoldstadt        TRUE    1020 Österreich   Wien Walcherstra&szlig;e

OK, what we have here is a nested structure - a data frame within a data frame. This is quite typical for JSON files, which are a very common data format in browser/server communication. To make the data more easily accessible with R we will flatten it.
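If flattening is new to you, here is a small self-contained sketch (with made-up data, not from the API) of what jsonlite::flatten() does to such a nested structure:

```r
library(jsonlite)

## a made-up two-row result with a nested "address" data frame,
## mimicking the structure the API returns
nested <- fromJSON('[
  {"price": 100, "address": {"city": "Wien", "zipCode": "1010"}},
  {"price": 200, "address": {"city": "Wien", "zipCode": "1020"}}
]')

## the nested column is itself a data frame ...
names(nested)
#> [1] "price"   "address"

## ... flatten() pulls its columns up to the top level
names(jsonlite::flatten(nested))
#> [1] "price"           "address.city"    "address.zipCode"
```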

listings_flat <- jsonlite::flatten(listings) %>%
  tbl_df()
names(listings_flat)
#>  [1] "area"
#>  [2] "description"
#>  [3] "commercialPropertyType"
#>  [4] "id"
#>  [5] "marketingType"
#>  [6] "price"
#>  [7] "propertyType"
#>  [8] "title"
#>  [9] "url"
#> [10] "pictureId"
#> [11] "rooms"
#> [12] "address.countryCode"
#> [13] "address.city"
#> [14] "address.showAddress"
#> [15] "address.zipCode"
#> [16] "address.country"
#> [17] "address.state"
#> [18] "address.street"
#> [19] "geocoordinate.latitude"
#> [20] "geocoordinate.longitude"
#> [21] "geocoordinate.distanceFromReference"
#> [22] "priceDetails.priceOnRequest"
#> [23] "titlePicture.url"
#> [24] "titlePicture.titlePicture"
#> [25] "titlePicture.id"

This looks good. Let’s take a look at the first row of data to get a better feeling for what each variable represents. Note: for legibility we will transpose the output, i.e. we will convert the first row into a column.

t(listings_flat[1,])
variable 1
area 387.07
description

bestehend aus ca. 387 m&sup2; Universalfl&auml;chen (Gesch&auml;fts-/Ausstellungsfl&auml;chen, B&uuml;rofl&auml;chen usw.), sehr gro&szlig;e Schaufenster, bester Gesamtzustand, beste Verkehrsanbindung, nahe Fasanplatz, unmittelbare Nachbarschaft zu renommierten Einkaufs- und Verbraucherm&auml;rkten, &ouml;ffentliche Anbindung: U Bahn, Stra&szlig;enbahn, Miete

Objekt: 050/001706

Preis auf Anfrage!

Weitere Objekte finden Sie auch auf unserer Webseite: http://www.kanzlei-novo.at/objekt/050-01706
commercialPropertyType OFFICE
id 594925233cdc5a7f1738604b
marketingType RENT
price 0.0
propertyType OFFICE
title Büroflächen/Firmensitz 1030 Wien
url https://www.immobilienscout24.at/expose/594925233cdc5a7f1738604b
pictureId
rooms
geocoordinate.latitude 48.20174
geocoordinate.longitude 16.39159
geocoordinate.distanceFromReference 1.9009722
priceDetails.priceOnRequest TRUE
titlePicture.url
titlePicture.titlePicture
titlePicture.id

Well, that looks good. The only thing bugging me is that the encoding of the string variables seems to be a little off. At first glance, however, I would guess it is only the HTML encoding causing problems, which could be resolved quickly. But since we will not need these variables, we will just let them be.
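In case you do need those string variables, one quick way to resolve the HTML encoding is a simple lookup-and-replace over the entities that actually occur. The entity table below is hand-made for the German characters we saw, not a complete solution:

```r
## map the HTML entities we encountered to their characters
decode_entities <- function(x) {
  entities <- c("&szlig;" = "ß", "&auml;" = "ä", "&ouml;" = "ö",
                "&uuml;" = "ü", "&sup2;" = "²", "&amp;" = "&")
  for (ent in names(entities)) {
    x <- gsub(ent, entities[[ent]], x, fixed = TRUE)
  }
  x
}

decode_entities("Walcherstra&szlig;e")
#> [1] "Walcherstraße"
```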

Now, that we have a better understanding of the API and the data it provides, let’s get the data we are looking for.

## Get the data we need

We start by adjusting the query accordingly. We have to

• change propertyTypes to APARTMENT (we are looking for an apartment, not an office),
• change marketingTypes to BUY (we want to buy, not rent),
• remove price.max=1000,
• increase searchRadius.distance to 20 (that is the approximate radius of the state of Vienna), and
• set paging.pageSize to 100 (the number of results per query; the default is 10).

The geo coordinates (latitude and longitude) already point to the center of Vienna and can remain unchanged.
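As a side note, instead of editing the long URL by hand, the query string could also be assembled from a named vector of parameters. A small sketch (the parameter names are the ones from the example query above):

```r
## build the query string from named parameters instead of
## editing one long URL by hand
params <- c("propertyTypes"                 = "APARTMENT",
            "marketingTypes"                = "BUY",
            "searchRadius.center.latitude"  = "48.215523",
            "searchRadius.center.longitude" = "16.376381",
            "searchRadius.distance"         = "20",
            "paging.pageSize"               = "100")

built_url <- paste0("https://api.immobilienscout24.at/v2/resultlist?",
                    paste(names(params), params, sep = "=", collapse = "&"))
```

This yields exactly the base_url string written out in the next code block.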

Let’s query with these changes.

base_url <- "https://api.immobilienscout24.at/v2/resultlist?propertyTypes=APARTMENT&marketingTypes=BUY&searchRadius.center.latitude=48.215523&searchRadius.center.longitude=16.376381&searchRadius.distance=20&paging.pageSize=100"

base_res <- GET(base_url, add_headers(Authorization = api_key))
base_res
#> Response [https://api.immobilienscout24.at/v2/resultlist?propertyTypes=APARTMENT&marketingTypes=BUY&searchRadius.center.latitude=48.215523&searchRadius.center.longitude=16.376381&searchRadius.distance=20&paging.pageSize=100]
#>   Date: 2017-06-21 13:07
#>   Status: 200
#>   Content-Type: application/json; charset=utf-8
#>   Size: 358 kB

base_res now includes the first 100 results for our query, but we need all of them. So we have to look at the meta-data, which tells us how many result pages are available for our query.

pages <- content(base_res)$totalNumberOfPages
pages
#> [1] 108

We put the whole page sequence into a data frame …

vienna_buy <- tibble(page = seq_len(pages))

… so that we can iterate over it with the additional query attribute paging.pageNumber. Note: I have put the query into an if clause to avoid causing unnecessary traffic by querying multiple times while I write this blog. Note: If you don’t have your own API key, then you can download the RData file here to follow along with the rest of the post.

if (!file.exists("vienna_buy.RData")) {
  vienna_buy <- vienna_buy %>%
    mutate(res = map(page,
                     function(x) GET(paste0(base_url, "&paging.pageNumber=", x),
                                     add_headers(Authorization = api_key))))
  save(vienna_buy, file = "vienna_buy.RData")
} else {
  load("vienna_buy.RData")
}

Next, we have to flatten the JSON structure.

vienna_buy <- vienna_buy %>%
  mutate(listing = map(res, function(x) fromJSON(content(x, "text"))$listing),
         listing_flat = map(listing,
                            function(x) jsonlite::flatten(x) %>% tbl_df()))

Further, we have now created our own nested structure: each result page is its own little data frame within a common data frame. To combine them all we use tidyr::unnest().

tbl_vienna_buy <- vienna_buy %>%
unnest(listing_flat)

Now, the ImmobilienScout24 data has an easily usable structure.
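In case unnest() is new to you, here is a minimal, self-contained illustration (with made-up numbers) of what it does with such page-wise data frames:

```r
library(tidyverse)

## two "result pages", each carrying its own little data frame
nested_pages <- tibble(page = 1:2,
                       listing_flat = list(tibble(price = c(100, 200)),
                                           tibble(price = c(300, 400))))

## unnest() stacks the per-page data frames into one flat table
flat_pages <- nested_pages %>% unnest(listing_flat)
nrow(flat_pages)
#> [1] 4
```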

# Visualize the spatial data

## Preparations

For the visualization some further data wrangling is required. For example, the query yielded all locations within the given radius, i.e., our locations lie within a circular border - our graphs, however, will be rectangular. Hence, we will crop our data a little. In Google Maps (just click on a location and the coordinates will show up) I have found the (approximate) outermost coordinates of the state of Vienna. Those values are probably not 100% exact, but it was fast, and being a few meters off will not matter too much. If more precision or more coordinates were required, then an automated approach would certainly be preferable.
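For completeness, such an automated approach could, for example, derive the borders from the data itself by trimming the outermost share of coordinates. A hypothetical helper, not used in the rest of this post:

```r
## derive rectangular borders from the coordinates themselves by
## cutting off the outermost 1% on each side
auto_borders <- function(lat, lon, trim = 0.01) {
  list(lat = unname(quantile(lat, c(trim, 1 - trim))),
       lon = unname(quantile(lon, c(trim, 1 - trim))))
}

## with made-up coordinates scattered around the center of Vienna
set.seed(1)
auto_borders(lat = rnorm(1000, mean = 48.21, sd = 0.05),
             lon = rnorm(1000, mean = 16.37, sd = 0.08))
```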

We assign the values I have found to variables.

lat_north <- 48.323121
lat_south <- 48.119429

long_west <- 16.182603
long_east <- 16.577489

Next, we use them to filter our data. Since we are already filtering, we take the opportunity and apply some further data cleaning. We limit the results to plausible apartment sizes and prices (I know, I have included prices of up to 5,000,000 €, but plausible and affordable are not necessarily the same thing). Further, we create the variable price_m2, which represents the price in Euro per square meter.

data_select <- tbl_vienna_buy %>%
  select(area, price,
         geocoordinate.latitude,
         geocoordinate.longitude) %>%
  filter(complete.cases(.)) %>%
  filter(geocoordinate.latitude <= lat_north,
         geocoordinate.latitude >= lat_south,
         geocoordinate.longitude >= long_west,
         geocoordinate.longitude <= long_east) %>%
  filter(area >= 25,
         area <= 150,
         price >= 50000,
         price <= 5000000) %>%
  mutate(price_m2 = price / area)

To use our borders with the ggmap package we put them into a bbox.

box <- make_bbox(geocoordinate.longitude, geocoordinate.latitude, data_select)

To comply with the terms of usage of ImmobilienScout24 we have to include their logo into our graph. Therefore we download the png file and render it as a raster object so that ggplot2 can use it.

download.file("https://api.immobilienscout24.de/content/dam/is24/images/api/api-logo-immobilienscout24_120x85_font_blau.png",
              "logo.png", mode = "wb")

img <- png::readPNG("logo.png")
logo <- grid::rasterGrob(img, interpolate = TRUE)

vienna <- get_stamenmap(box, zoom = 13,
maptype = c("toner-lite"))
vienna
#> 1900x2352 toner-lite map image from Stamen Maps.  see ?ggmap to plot it.

## Density heatmap

Now we can start to visualize the spatial data. Before we take a look at the prices, I am interested in where apartments are for sale. To answer this question, we overlay the map of Vienna with a density heatmap. The greater the density (color-coded: blue = low, red = high), the more apartments are for sale in the given region. For better readability we map the density not only to color, but also to contours and the alpha channel (transparency). Lastly, we add the logo in the lower right corner.

density_map <- ggmap(vienna, extent = "device") +
  coord_cartesian() +
  geom_density2d(data = data_select,
                 aes(x = geocoordinate.longitude,
                     y = geocoordinate.latitude),
                 size = 0.3) +
  stat_density2d(data = data_select,
                 aes(x = geocoordinate.longitude,
                     y = geocoordinate.latitude,
                     fill = ..level.., alpha = ..level..),
                 size = 0.01,
                 bins = 16,
                 geom = "polygon") +
  scale_fill_gradient2(low = "blue", mid = "green", high = "red",
                       midpoint = 80, guide = FALSE) +
  scale_alpha(range = c(0.1, 0.4), guide = FALSE) +
  annotation_custom(logo, xmin = 16.5375, ymax = 48.14)

density_map

We see that there are some hot spots when it comes to selling apartments in Vienna. In other districts, on the other hand, only a few apartments are for sale.

## Visualizing price

Next we are interested in the price. Before we look at its spatial distribution, let’s look at its general distribution.

price_dist <- data_select %>%
  ggplot(aes(x = price_m2)) +
  geom_histogram(bins = 25,
                 fill = "white", color = "black")

price_dist

As was to be expected, we find a long tail on the right side. The prices per square meter are clearly not normally distributed (some are just too expensive). This could pose a problem for the next step, in which we will visualize the mean price given the location, because the mean is not a good statistic for skewed distributions (one really high value could obfuscate a low-price region). In this case, however, it is no problem, because the location itself is the key factor behind the distortion we see. Although I’m not going to show it here: the high prices lie mostly within high-price regions, within which the prices are approximately normal. A quick and dirty way to check this claim is to use the median instead of the mean. If the result is more or less the same (and it is), then you should be fine.
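To make the mean-versus-median point concrete, here is a toy example with made-up prices: nine ordinary values and one extreme outlier.

```r
prices <- c(rep(3000, 9), 30000)  # nine ordinary prices, one outlier

mean(prices)    # pulled up strongly by the single outlier
#> [1] 5700
median(prices)  # barely affected
#> [1] 3000
```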

Anyways, let’s create the plot.

price_map <- ggmap(vienna, extent = "device") +
  coord_cartesian() +
  stat_summary_hex(data = data_select,
                   aes(x = geocoordinate.longitude,
                       y = geocoordinate.latitude,
                       z = price_m2),
                   fun = function(x) mean(x), alpha = 0.75) +
  scale_fill_gradient2(low = "blue", mid = "green", high = "red",
                       midpoint = 7500) +
  annotation_custom(logo, xmin = 16.5375, ymax = 48.14)

price_map

The results are not overly surprising but interesting anyhow. As was to be expected, we find the highest prices in the center and in the more wealthy neighborhoods.

Given that my wife and I are not looking for an apartment in the entire state of Vienna, but in the city itself, we will zoom in a little more.

At first we define the new borders and filter our data …

lat_north2 <- 48.28
lat_south2 <- 48.15

long_west2 <- 16.25
long_east2 <- 16.50

data_select2 <- data_select %>%
  filter(geocoordinate.latitude <= lat_north2,
         geocoordinate.latitude >= lat_south2,
         geocoordinate.longitude >= long_west2,
         geocoordinate.longitude <= long_east2)

box2 <- make_bbox(geocoordinate.longitude,
                  geocoordinate.latitude,
                  data_select2)

vienna2 <- get_stamenmap(box2, zoom = 14, maptype = c("toner-lite"))
vienna2
#> 2359x3129 toner-lite map image from Stamen Maps.  see ?ggmap to plot it.

… and plot a more detailed map.

price_map2 <- ggmap(vienna2, extent = "device") +
  coord_cartesian() +
  stat_summary_hex(data = data_select2,
                   aes(x = geocoordinate.longitude,
                       y = geocoordinate.latitude,
                       z = price_m2),
                   fun = function(x) mean(x), alpha = 0.75) +
  scale_fill_gradient2(low = "blue", mid = "green", high = "red",
                       midpoint = 7500) +
  annotation_custom(logo, xmin = 16.48, ymax = 48.16)

price_map2

Hm … as interesting as this plot is, it makes me wonder whether spending my time with a second job would be wiser than blogging.

Just kidding, it is way too much fun.

# Closing Remarks

Thanks again to ImmobilienScout24 for providing the API!

If something is not working as outlined here, please check the package versions you are using. The system I have used was:

sessionInfo()
#> R version 3.3.1 (2016-06-21)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 15063)
#>
#> locale:
#> [1] LC_COLLATE=German_Austria.1252  LC_CTYPE=German_Austria.1252
#> [3] LC_MONETARY=German_Austria.1252 LC_NUMERIC=C
#> [5] LC_TIME=German_Austria.1252
#>
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base
#>
#> other attached packages:
#>  [1] hexbin_1.27.1    bindrcpp_0.2     ggmap_2.6.1      jsonlite_1.5
#>  [5] httr_1.2.1       dplyr_0.7.0      purrr_0.2.2.2    readr_1.1.1
#>  [9] tidyr_0.6.3      tibble_1.3.3     ggplot2_2.2.1    tidyverse_1.1.1
#> [13] kableExtra_0.2.1
#>
#> loaded via a namespace (and not attached):
#>  [1] reshape2_1.4.2    haven_1.0.0       lattice_0.20-33
#>  [4] colorspace_1.3-2  htmltools_0.3.6   yaml_2.1.14
#>  [7] rlang_0.1.1       foreign_0.8-68    glue_1.1.0
#> [13] bindr_0.1         jpeg_0.1-8        plyr_1.8.4
#> [16] stringr_1.2.0     munsell_0.4.3     gtable_0.2.0
#> [22] mapproj_1.2-5     psych_1.7.5       evaluate_0.10
#> [25] labeling_0.3      knitr_1.16        forcats_0.2.0
#> [28] parallel_3.3.1    highr_0.6         broom_0.4.2
#> [31] proto_1.0.0       Rcpp_0.12.11      geosphere_1.5-5
#> [34] scales_0.4.1      backports_1.1.0   mnormt_1.5-5
#> [37] rjson_0.2.15      hms_0.3           png_0.1-7
#> [40] digest_0.6.12     stringi_1.1.5     grid_3.3.1
#> [43] rprojroot_1.2     tools_3.3.1       magrittr_1.5
#> [46] maps_3.2.0        lazyeval_0.2.0    MASS_7.3-45
#> [49] xml2_1.1.1        lubridate_1.6.0   assertthat_0.2.0
#> [52] rmarkdown_1.6     R6_2.2.2          nlme_3.1-131