Thursday, March 15, 2018

Some basic data wrangling and feature engineering in R and Python


Hi folks! In this post I will continue the work from two previous posts - Finding Vienna’s House Price Regions. Regionalization with R and ClustGeo and k-Nearest Neighbors from Scratch in R - by adding features to the data. So this post will demonstrate some data wrangling and feature engineering, and to keep it interesting I will not only do it in R but also, in parallel, in python.

Because this post builds on previous posts, I will briefly summarize them and give a concise description of why we add further features in the present post. If you are only interested in the how-to, you can skip the next two paragraphs.

OK, so you are here for the whole story - excellent. In the initial post of this blog post arc we created spatially constrained clusters of single-family and duplex houses sold in Vienna, Austria in 2016, based on their price per square meter. For 20% of the homes we did not assign a cluster, because we put those aside as a test set. In the subsequent post we explored how to assign cluster membership to new cases. In future posts I plan to try out some methods to predict the house prices; however, at the moment we do not have many features to work with except the cluster membership. Hence, we will add some more. Since we began with the idea that house prices are all about location, we will roll with it and add location-specific features: the availability of public transportation.

Please note that this post is not an R vs. python race. In fact, my motivation for this post is to learn a little more python, although I am a long-term fan of R. Both languages have their merits in data science, and while I am quite happy with R, there are some areas where python clearly excels (e.g., deep learning). However, regardless of the models, the largest part of the work is most often slicing and dicing data - and that I have not done in python so far. Yet I like to keep my projects in one language, because IMO this facilitates the workflow and documentation (e.g., having everything in a single notebook). So here we are, wondering whether I can do in python what I do in R. Let’s check it out!

Preparations

Before the fun begins some preparations are needed.

Packages

First, we will load the packages we will work with.

R

As you probably already know, I am a huge tidyverse fan-boy, so …
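… the loading boils down to the tidyverse plus magrittr, whose %$% pipe we will need later on:

library(tidyverse)
library(magrittr)   # for the %$% pipe used further down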

python

In python we will work with pandas and numpy.
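The imports, with their usual aliases:

import pandas as pd
import numpy as np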

Data

Next, we need data to work with.

Get the Data

We will use the result from my k-NN post - if you have missed that one, you can either download the data here and load it, or use the code below for your convenience.

Further, we will download Vienna’s public transportation data from the Austrian open data platform https://www.data.gv.at.

R

Let’s start by creating a data directory.

Next, we get the result from my k-NN post …

… then the public transportation data
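A sketch of those three steps - the URLs are placeholders here (substitute the actual links), and the transportation archive is assumed to follow the usual GTFS layout:

data_dir <- "data"
if (!dir.exists(data_dir)) dir.create(data_dir)

# URLs are placeholders - substitute the actual download links
url_knn <- "https://example.com/dat_knn.rds"
url_pt  <- "https://example.com/wien_pt_gtfs.zip"

download.file(url_knn, file.path(data_dir, "dat_knn.rds"), mode = "wb")
dat <- read_rds(file.path(data_dir, "dat_knn.rds"))

download.file(url_pt, file.path(data_dir, "pt.zip"), mode = "wb")
unzip(file.path(data_dir, "pt.zip"), exdir = data_dir)
dat_pt_stops  <- read_csv(file.path(data_dir, "stops.txt")) %>%
  select(stop_id, stop_name, stop_lat, stop_lon)
dat_pt_routes <- read_csv(file.path(data_dir, "routes.txt"))
dat_pt_trips  <- read_csv(file.path(data_dir, "trips.txt"))
dat_pt_times  <- read_csv(file.path(data_dir, "stop_times.txt"))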

python

Create a data directory.

Get the k-NN data …
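… and the public transportation tables in one go. A sketch with placeholder URLs (the datpy* names are my choice; I assume a csv export of the k-NN result):

import os
import urllib.request
import zipfile

os.makedirs("data", exist_ok=True)

# URLs are placeholders - substitute the actual download links
urllib.request.urlretrieve("https://example.com/dat_knn.csv", "data/dat_knn.csv")
datpy = pd.read_csv("data/dat_knn.csv")

urllib.request.urlretrieve("https://example.com/wien_pt_gtfs.zip", "data/pt.zip")
with zipfile.ZipFile("data/pt.zip") as z:
    z.extractall("data")

datpy_pt_stops  = pd.read_csv("data/stops.txt")
datpy_pt_routes = pd.read_csv("data/routes.txt")
datpy_pt_trips  = pd.read_csv("data/trips.txt")
datpy_pt_times  = pd.read_csv("data/stop_times.txt")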

Quick check of the data

We already know how the k-NN data looks from the previous blog post, but the public transportation data is new to us. Let’s take a quick look at it.

Stops and routes in R:

## # A tibble: 4,362 x 4
##           stop_id                 stop_name stop_lat stop_lon
##             <chr>                     <chr>    <dbl>    <dbl>
##  1 at:43:3121:0:1         Baden Josefsplatz 48.00595 16.23370
##  2 at:43:3134:0:1             Baden Viadukt 48.00384 16.24097
##  3 at:43:3134:0:2             Baden Viadukt 48.00373 16.24091
##  4 at:43:3142:0:3                  Leesdorf 47.99951 16.25161
##  5 at:43:3142:0:4                  Leesdorf 47.99956 16.25165
##  6 at:43:3556:0:2 Großenzersdorf Busbahnhof 48.19983 16.54943
##  7 at:43:3558:0:1 Großenzersdorf Stadtmauer 48.20492 16.54939
##  8 at:43:3558:0:2 Großenzersdorf Stadtmauer 48.20540 16.54905
##  9 at:43:3615:0:3    Guntramsdorf Lokalbahn 48.04831 16.31208
## 10 at:43:3615:0:4    Guntramsdorf Lokalbahn 48.04827 16.31219
## # ... with 4,352 more rows
## # A tibble: 359 x 7
##        route_id agency_id route_short_name route_long_name route_type route_color route_text_color
##           <chr>     <int>            <chr>           <chr>      <int>       <chr>            <chr>
##  1   22-1-j18-1         1                1 Prater, Haup...          0        <NA>             <NA>
##  2   22-1-j17-1         1                1 Prater, Haup...          0      C00808           FFFFFF
##  3  22-10-j18-1         1               10 Dornbach - U...          0        <NA>             <NA>
##  4  22-10-j17-1         1               10 Dornbach - U...          0      C00808           FFFFFF
##  5 23-106-j18-1         1              106 Ringlinie Ze...          3        <NA>             <NA>
##  6 23-106-j17-1         1              106 Ringlinie Ze...          3      0A295D           FFFFFF
##  7 23-10A-j18-1         1              10A Niederhofstr...          3        <NA>             <NA>
##  8 23-10A-j17-1         1              10A Niederhofstr...          3      0A295D           FFFFFF
##  9 23-11A-j18-1         1              11A Heiligenstad...          3        <NA>             <NA>
## 10 23-11A-j17-1         1              11A Heiligenstad...          3      0A295D           FFFFFF
## # ... with 349 more rows

… and trips and times in python:

##               route_id service_id                     trip_id  \
## 0           22-2-j18-1         T0      5681.T0.22-2-j18-1.1.H   
## 1           22-2-j18-1         T0      5687.T0.22-2-j18-1.1.H   
## 2           22-2-j18-1         T0      5718.T0.22-2-j18-1.1.H   
## 3           22-2-j18-1         T0      5737.T0.22-2-j18-1.1.H   
## 4           22-2-j18-1         T0      5742.T0.22-2-j18-1.1.H   
## ...                ...        ...                         ...   
## 233370  23-48A-W-j17-1       UW#1  1741.UW.23-48A-W-j17-1.5.R   
## 233371  23-48A-W-j17-1       UW#1  1742.UW.23-48A-W-j17-1.5.R   
## 233372  23-48A-W-j17-1       UW#1  1939.UW.23-48A-W-j17-1.5.R   
## 233373  23-48A-W-j17-1       UW#1  1945.UW.23-48A-W-j17-1.5.R   
## 233374  23-48A-W-j17-1       UW#1  1932.UW.23-48A-W-j17-1.5.R   
## 
##                   shape_id                trip_headsign  direction_id  \
## 0           22-2-j18-1.1.H  Wien Friedrich-Engels-Platz             0   
## 1           22-2-j18-1.1.H  Wien Friedrich-Engels-Platz             0   
## 2           22-2-j18-1.1.H  Wien Friedrich-Engels-Platz             0   
## 3           22-2-j18-1.1.H  Wien Friedrich-Engels-Platz             0   
## 4           22-2-j18-1.1.H  Wien Friedrich-Engels-Platz             0   
## ...                    ...                          ...           ...   
## 233370  23-48A-W-j17-1.5.R           Wien Gutraterplatz             1   
## 233371  23-48A-W-j17-1.5.R           Wien Gutraterplatz             1   
## 233372  23-48A-W-j17-1.5.R           Wien Gutraterplatz             1   
## 233373  23-48A-W-j17-1.5.R           Wien Gutraterplatz             1   
## 233374  23-48A-W-j17-1.5.R           Wien Gutraterplatz             1   
## 
##         block_id  
## 0            NaN  
## 1            NaN  
## 2            NaN  
## 3            NaN  
## 4            NaN  
## ...          ...  
## 233370       NaN  
## 233371       NaN  
## 233372       NaN  
## 233373       NaN  
## 233374       NaN  
## 
## [233375 rows x 7 columns]
##                           trip_id arrival_time departure_time  \
## 0         276.UG.11-WLB-j18-1.1.H     05:20:00       05:20:00   
## 1         276.UG.11-WLB-j18-1.1.H     05:23:00       05:23:00   
## 2         276.UG.11-WLB-j18-1.1.H     05:25:00       05:25:00   
## 3         276.UG.11-WLB-j18-1.1.H     05:27:00       05:27:00   
## 4         276.UG.11-WLB-j18-1.1.H     05:28:00       05:28:00   
## ...                           ...          ...            ...   
## 4276513  3960.T0.24-N75-j17-1.2.R     04:34:00       04:34:00   
## 4276514  3960.T0.24-N75-j17-1.2.R     04:36:00       04:36:00   
## 4276515  3960.T0.24-N75-j17-1.2.R     04:38:00       04:38:00   
## 4276516  3960.T0.24-N75-j17-1.2.R     04:40:00       04:40:00   
## 4276517  3960.T0.24-N75-j17-1.2.R     04:42:00       04:42:00   
## 
##                  stop_id  stop_sequence  pickup_type  drop_off_type  \
## 0         at:49:1503:0:2              1            0              0   
## 1          at:49:267:0:5              2            0              0   
## 2         at:49:849:0:12              3            0              0   
## 3          at:49:689:0:1              4            0              0   
## 4          at:49:755:0:2              5            0              0   
## ...                  ...            ...          ...            ...   
## 4276513   at:49:1471:0:1             11            0              0   
## 4276514    at:49:743:0:7             12            0              0   
## 4276515    at:49:245:0:8             13            0              0   
## 4276516  at:49:1193:0:13             14            0              0   
## 4276517   at:49:975:0:17             15            0              0   
## 
##          shape_dist_traveled  
## 0                   0.000000  
## 1                6493.743482  
## 2               15214.604535  
## 3               21142.441074  
## 4               24910.733613  
## ...                      ...  
## 4276513         44929.882363  
## 4276514         48607.153466  
## 4276515         53194.793394  
## 4276516         60318.117479  
## 4276517         66658.625817  
## 
## [4276518 rows x 8 columns]

Creating New Features

Now that we have data about Vienna’s public transportation system, we can use it to create new, useful features. To do so, some domain knowledge is required (the more the better). Unless you are using deep learning, this is the general case. Even sophisticated models will produce garbage if you put garbage in. So do some brainstorming on what features might actually be predictive before throwing everything into an [insert algorithm of your choice].

In our case the features I would like to create are

  1. distance to nearest public transportation stop (all kinds)

  2. distance to nearest subway station (Vienna’s subway system is incredible)

  3. number of different public transportation routes within a 500 meter radius

I can think of many other feature variants, e.g., some that differentiate more between the different types of public transportation (bus, tram, subway) or aggregate the stops in the vicinity, but I believe those three are a good starting point.

Note: If this were a real project and the quality of the results really mattered, then, of course, it would be wise to create more features and then evaluate which of them actually work. Luckily, I am just doing this for fun, so we will not make it more tedious than necessary.

1. distance to nearest public transportation stop

The reasoning for this feature is that homes are probably worth more when there is a stop nearby. (On the other hand, having a bus stop directly next to your bedroom might not be desirable.)

Getting this feature is not that hard. We have the coordinates of the homes and the coordinates of all stops, so we only need to compute the distances and select the smallest per home. One problem, however, is that one degree of longitude does not cover the same distance as one degree of latitude. In the previous posts we neglected this problem, but now we will try to do it right by using Vincenty’s formulae.

R

In R we can use the geosphere package. The function distVincentyEllipsoid() gives us the distance between two points, and it would be easy to put it into a wrapper function that calculates all distances per home and selects the smallest. However, we will use distm(), which calculates all pairwise distances between two sets of coordinates, and then select the smallest per case. We do that because calculating all these distances is costly and we can reuse them when calculating the other two features.
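A sketch of that computation (the home coordinate columns lon and lat are assumptions on my part):

library(geosphere)

# geosphere wants matrices of longitude/latitude pairs
coords_homes <- dat %>% select(lon, lat) %>% as.matrix()
coords_stops <- dat_pt_stops %>% select(stop_lon, stop_lat) %>% as.matrix()

# one row per home, one column per stop, distances in meters
dist_mat <- distm(coords_homes, coords_stops, fun = distVincentyEllipsoid)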

Note: This will take time. To save you some time you can download the result as a rds file from here and load it with read_rds().

Given that this really took some time we save the result to disk.

I prefer tibbles over matrices so we convert the data accordingly and name the columns to avoid confusion later on.

Finally, we look up the smallest value per case.
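Put together, those three steps might look like this:

write_rds(dist_mat, "data/dist_mat.rds")

dist_tbl <- as_tibble(dist_mat)
colnames(dist_tbl) <- dat_pt_stops[["stop_id"]]

# row-wise minimum = distance to the nearest stop per home
min_dist_stop <- apply(dist_mat, 1, min)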

python

In python we can use vincenty from the geopy package, but first we create lists for the coordinates of homes and stops.

Then we calculate all pairwise distances …
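Both steps in a sketch (again, the column names lat and lon for the homes are assumptions):

from geopy.distance import vincenty

# geopy wants (latitude, longitude) tuples
coords_homes_py = list(zip(datpy["lat"], datpy["lon"]))
coords_stops_py = list(zip(datpy_pt_stops["stop_lat"], datpy_pt_stops["stop_lon"]))

# list of lists: one list of distances (in meters) to all stops per home
dist_list = [[vincenty(home, stop).meters for stop in coords_stops_py]
             for home in coords_homes_py]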

Note: This will take time. To save you some time you can download the results converted to a pandas dataframe as a pickle from here and load it with pickle.load().

… and convert the results (list of lists) to a pandas data frame.

Like in R, calculating all distances took some time so we save them to disk.

Finally, we look up the smallest value per case.
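In code, these steps might look like this:

dist_py = pd.DataFrame(dist_list, columns=datpy_pt_stops["stop_id"])

import pickle
with open("data/dist_py.pkl", "wb") as f:
    pickle.dump(dist_py, f)

# row-wise minimum = distance to the nearest stop per home
min_dist_stop_py = dist_py.min(axis=1)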

2. distance to nearest subway station

The reasoning for this feature is that homes are probably worth more when there is a subway station nearby. In Vienna subways are the fastest means of public transportation.

Unfortunately, getting this feature is a little harder than the previous one. In principle, we only need to filter our distance table for subway stops, but to get this information we need all four public transportation tables (stops, times, trips, and routes). Thanks to Vienna’s naming conventions, we find the type of transportation in the routes table in the column route_short_name. Trams get only a number with one or two digits (e.g., 5 and 33) or a single character (e.g., D and O); buses get a number with one or two digits followed by either “A” or “B” (e.g., 11A). Some bus routes also offer service at night (00:00 until 04:00); these so-called night lines are named with an “N” followed by a number with one or two digits (e.g., N29). Lastly, subway routes get a “U” followed by one digit (e.g., U1). We can use this to reduce the routes to those of subways. Next, we can use the resulting route_ids to filter the trips for subway trips. This gives us the trip_ids of all subway trips. We use those to filter times down to the stops of the subway routes. This gives us what we were looking for: the stop_ids of subway stops, which we finally use to filter our distance table.

R

In R we will use filter() and left_join() until we have the desired outcome. Please note that the last pipe operator is a %$% from the magrittr package and not a normal %>% pipe. Hence, the result is a vector and not a tibble.
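A sketch of that join chain (for the glance below, the ids were additionally matched to stop names via the stops table):

stops_subway <- dat_pt_routes %>%
  filter(str_detect(route_short_name, "^U\\d$")) %>%  # "U" plus one digit = subway
  left_join(dat_pt_trips, by = "route_id") %>%
  left_join(dat_pt_times, by = "trip_id") %>%
  distinct(stop_id) %$%
  stop_id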

Let’s have a quick glance at the first 10 subway stops:

##  [1] "Perfektastraße"      "Meidling"            "Pilgramgasse"       
##  [4] "Hütteldorfer Straße" "Praterstern"         "Spittelau"          
##  [7] "Rathaus"             "Reumannplatz"        "Rochusgasse"        
## [10] "Roßauer Lände"

As far as I know those really are subway stations. Next, we use stops_subway to filter the distance table and look for the closest distance to a subway station.
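Something along these lines (remember that the columns of dist_tbl carry the stop_ids):

min_dist_subway <- dist_tbl %>%
  select(one_of(stops_subway)) %>%  # keep only subway stop columns
  apply(1, min)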

python

In python we will try the same approach we used in R: identify the subway routes and then consecutively (left) join the data frames to filter down to the cases we are looking for. Like in R, we retrieve the resulting stop_ids not as a data frame but as a series.
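A sketch:

# "U" plus one digit = subway
is_subway = datpy_pt_routes["route_short_name"].str.match(r"U\d$", na=False)

stops_subway_py = (datpy_pt_routes[is_subway]
                   .merge(datpy_pt_trips, on="route_id", how="left")
                   .merge(datpy_pt_times, on="trip_id", how="left")
                   ["stop_id"]
                   .drop_duplicates())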

Next, we use stops_subway_py to filter the distance table and look for the closest distance to a subway station.
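For example:

# keep only subway stop columns, then take the row-wise minimum
min_dist_subway_py = dist_py.loc[:, dist_py.columns.isin(stops_subway_py)].min(axis=1)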

3. number of different public transportation routes within a 500 meter radius

The reasoning for this feature is that more distinct routes in the vicinity allow faster travel to arbitrary target locations and are therefore valuable.

Of the three features, this one is in my view the hardest to create. First, we need to identify all routes per stop. Then, per home, we filter for all stops in the vicinity and count the distinct routes that serve them.

R

We start by adding route_short_name from dat_pt_routes to dat_pt_trips (Note: in this case the short name is better than the route id, because both directions of the same route have different ids) and link the result to dat_pt_times. Now we have information about the route of each stop in the timetable. We reduce this table to the stops and routes and remove all duplicate cases. At last, we convert from long to wide, assigning a column to each stop and a row to each route. As values we use 1 and fill missing combinations of stop and route with 0.
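A sketch of that pipeline:

route_stop <- dat_pt_trips %>%
  left_join(select(dat_pt_routes, route_id, route_short_name), by = "route_id") %>%
  left_join(dat_pt_times, by = "trip_id") %>%
  distinct(route_short_name, stop_id) %>%
  mutate(served = 1) %>%
  spread(key = stop_id, value = served, fill = 0)  # rows = routes, columns = stops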

Now comes the hard part - each home has a different set of stops within a 500 meter radius, and for those we need the number of distinct routes.

Let’s start by looking up the nearest stops per home.
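One way to do it, reusing dist_mat from above:

# per home: the ids of all stops within a 500 meter radius
stop_500 <- map(seq_len(nrow(dist_mat)),
                ~ dat_pt_stops[["stop_id"]][dist_mat[.x, ] <= 500])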

Note: This takes some time (absolutely and compared to the python solution). Do you have a more efficient solution for this problem? Then please let me know how you would do it as a comment. More readable solutions are ALWAYS welcome, too.

Finally, we use those to sum up the routes per home. To do so we select, per case, those columns of route_stop which are included in stop_500. From the resulting data frame we calculate the row means. If a row mean is zero, then the corresponding route (see route_stop[["route_short_name"]]) does not include any of the stops within 500 meters. Consequently, we count the number of row means greater than zero to get the number of distinct routes per home. Lastly, some homes do not have any stops within close range; we replace their missing values with 0s.
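A sketch that sidesteps the missing values by checking for empty stop sets directly:

n_routes_500 <- map_int(stop_500, function(stops) {
  if (length(stops) == 0) return(0L)   # no stop in range -> no routes
  route_means <- route_stop %>% select(one_of(stops)) %>% rowMeans()
  sum(route_means > 0)                 # routes touching at least one nearby stop
})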

python

In python we proceed similarly to R. We start by adding the route_short_name to datpy_pt_trips, which we then join with datpy_pt_times. We reduce the result by selecting only the variables stop_id and route_short_name and removing all duplicated cases. At last we convert from long to wide, assigning a column to each stop and a row to each route. To do so via pandas’ unstack() we need to convert the corresponding columns to indices. As values we use 1 and fill missing combinations of stop and route with 0.
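A sketch:

route_stop_pd = (datpy_pt_trips
                 .merge(datpy_pt_routes[["route_id", "route_short_name"]],
                        on="route_id", how="left")
                 .merge(datpy_pt_times, on="trip_id", how="left")
                 [["route_short_name", "stop_id"]]
                 .drop_duplicates()
                 .assign(served=1)
                 .set_index(["route_short_name", "stop_id"])["served"]
                 .unstack(fill_value=0))   # rows = routes, columns = stops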

Next, we identify the near stops per home.
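For example:

# per home: the ids of all stops within a 500 meter radius
stop_500_py = [list(dist_py.columns[row.values <= 500])
               for _, row in dist_py.iterrows()]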

Finally, we sum up the routes per home. Like in R we select only those columns of route_stop_pd per home which are in stop_500_py. Row means larger than 0 (at least one 1) indicate that one of the nearby stops is part of the respective route. So we calculate the row means and count all non-zero values. Since some homes do not have any stops within close range, they produce missing values as row means. We have to remove those, because the resulting NaNs would otherwise be counted as non-zero values, too.
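A sketch that avoids the NaN issue by checking for empty stop sets first:

n_routes_500_py = []
for stops in stop_500_py:
    if len(stops) == 0:                  # no stop in range -> no routes
        n_routes_500_py.append(0)
        continue
    route_means = route_stop_pd.loc[:, route_stop_pd.columns.isin(stops)].mean(axis=1)
    n_routes_500_py.append(int((route_means > 0).sum()))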

Add features to data set

We have calculated all features as individual vectors; let’s bind those to our original data.

R
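In R a single mutate() does the job (the variable names are my choice):

dat <- dat %>%
  mutate(min_dist_stop   = min_dist_stop,
         min_dist_subway = min_dist_subway,
         n_routes_500    = n_routes_500)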

python
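In python, assign() does the same (again, column names are my choice):

datpy = datpy.assign(min_dist_stop=min_dist_stop_py.values,
                     min_dist_subway=min_dist_subway_py.values,
                     n_routes_500=n_routes_500_py)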

Evaluating Features

Now that we have the new features, we can evaluate whether they were worth the effort. In most cases we would already have one or more models in place and could check how much they improve when we add the features. Since this blog post arc has not progressed that far yet (but it will - pinky swear), we cannot simply extend an existing model. Instead we will check the pairwise association between our features and the target (price per m²). The problems with this approach are that 1.) a feature that works in a pairwise comparison might add no incremental value to a set of features because its explanatory value is already covered, 2.) there are many types of models, and just because we find no, e.g., linear association does not mean that there is no association, and 3.) a feature might function differently in different subgroups, and the effects cancel each other out if we only look at the whole sample. I am certain that there are more problems, but those are the first ones that come to my mind.

We cannot do much about 1.), because we do not have a functioning set of features yet. To tackle 3.), at least a little, we will check each of the regional price clusters that we identified in Finding Vienna’s House Price Regions. Regionalization with R and ClustGeo individually. To take on 2.) we will make scatter plots, because humans are really good at seeing associations of any kind, and calculate the distance correlation - a nonlinear dependence measure for monotonic and non-monotonic associations.

We start in R by evaluating distance to nearest public transportation stop, move to python to do the same for distance to nearest subway station, and skip number of different public transportation routes within a 500 meter radius for you to try on your own.

R - distance to nearest public transportation stop

Let’s start with the scatter plots …

Note: Because most minimum distances are rather short (Vienna’s public transportation system is really great), we log transform the corresponding axis for better legibility. The same is true for the price, with only a few really expensive objects, but instead of transforming the axis we use the pre-calculated log price variable. Further, we use transparency for the points to compensate for the overlap of many similar points.
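A sketch of the plot (log_price stands in for the pre-calculated log price variable, whose actual name I am assuming here):

dat %>%
  ggplot(aes(x = min_dist_stop, y = log_price)) +
  geom_point(alpha = 0.3) +     # transparency against overplotting
  scale_x_log10() +             # log transform the distance axis
  facet_wrap(~ c_coarse)        # one panel per price cluster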

Hm … that does not look like too much of a relationship between those two (maybe in cluster 1 there is some negative dependency). Let’s check whether the distance correlation confirms this impression. In R we can use dcor() from the energy package.
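Per cluster, that could look like this (log_price again assumed):

library(energy)

dat %>%
  group_by(c_coarse) %>%
  summarise(dcor = dcor(min_dist_stop, log_price))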

## # A tibble: 5 x 2
##   c_coarse       dcor
##      <int>      <dbl>
## 1        1 0.21780622
## 2        2 0.06799048
## 3        3 0.06291367
## 4        4 0.08961013
## 5        5 0.07088959

That doesn’t look like much either, except for the first cluster. Of course it might still be a good feature in combination with others, which we have not included in this analysis. Let’s check with python whether the distance to the nearest subway station is more promising.

python - distance to nearest subway station

Like in R, we log transform the distance and use transparency for the points. Note: Theoretically there is the option in seaborn to log transform the x-axis with .set(xscale='log'), but it did not work out for me, so I transformed the data instead.
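A sketch with seaborn’s lmplot (log_price again stands in for the assumed name of the log price column):

import seaborn as sns
import matplotlib.pyplot as plt

datpy["log_min_dist_subway"] = np.log(datpy["min_dist_subway"])

sns.lmplot(x="log_min_dist_subway", y="log_price", col="c_coarse",
           data=datpy, fit_reg=False, scatter_kws={"alpha": 0.3})
plt.show()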

Well, except for cluster 1 and maybe 4, that does not look too promising either. Let’s check the distance correlation. In python we can use distance_correlation() from the dcor package to calculate it. Note: I have found no way to aggregate in pandas with a function that operates over multiple columns, so I created a helper function. If you know a direct way - please let me know.
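The helper could look roughly like this (log_price assumed as before):

import dcor

def dcor_helper(df):
    # distance correlation between the two columns of one cluster
    return dcor.distance_correlation(df["min_dist_subway"].values.astype(float),
                                     df["log_price"].values.astype(float))

datpy.groupby("c_coarse").apply(dcor_helper)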

## c_coarse
## 1    0.253285
## 2    0.144423
## 3    0.133089
## 4    0.224432
## 5    0.064474
## dtype: float64
## 
## C:\PROGRA~3\ANACON~1\lib\site-packages\dcor\_utils.py:88: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
##   return ((np.issubdtype(x.dtype, float) and

Seems the impression about clusters 1 and 4 was not wrong. Although the plots looked quite similar, the distance to the nearest subway station seems to work a little better than pooling all means of public transportation together.

Closing Remarks

I hope you have enjoyed our short digression into feature engineering and the comparison of R and python. Looking back, in my opinion the given problems were similarly hard/easy in both languages, but please let me know if you think otherwise. Of course, the route we took is not the only feasible one. In both R and python there are different approaches that lead to the same results. If you have a better solution for one part or another, please do not hesitate to post it in the comment section.

If you want to learn more about data slice ‘n’ dice from real pros instead of a dilettante like me, then I can recommend:


If something is not working as outlined here, please check the package versions you are using. The system I used was:

R

## R version 3.4.2 (2017-09-28)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=German_Austria.1252  LC_CTYPE=German_Austria.1252   
## [3] LC_MONETARY=German_Austria.1252 LC_NUMERIC=C                   
## [5] LC_TIME=German_Austria.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] bindrcpp_0.2     magrittr_1.5     forcats_0.2.0    stringr_1.2.0   
##  [5] dplyr_0.7.4      purrr_0.2.4      readr_1.1.1      tidyr_0.7.2     
##  [9] tibble_1.3.4     ggplot2_2.2.1    tidyverse_1.2.1  reticulate_1.4  
## [13] kableExtra_0.6.1
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_0.2.3  reshape2_1.4.2    haven_1.1.0      
##  [4] lattice_0.20-35   colorspace_1.3-2  htmltools_0.3.6  
##  [7] viridisLite_0.2.0 yaml_2.1.14       rlang_0.1.4      
## [10] foreign_0.8-69    glue_1.2.0        modelr_0.1.1     
## [13] readxl_1.0.0      bindr_0.1         plyr_1.8.4       
## [16] munsell_0.4.3     gtable_0.2.0      cellranger_1.1.0 
## [19] rvest_0.3.2       psych_1.7.8       evaluate_0.10.1  
## [22] labeling_0.3      knitr_1.17        parallel_3.4.2   
## [25] broom_0.4.2       Rcpp_0.12.13      scales_0.5.0     
## [28] backports_1.1.1   jsonlite_1.5      mnormt_1.5-5     
## [31] hms_0.3           digest_0.6.12     stringi_1.1.5    
## [34] grid_3.4.2        rprojroot_1.2     cli_1.0.0        
## [37] tools_3.4.2       lazyeval_0.2.1    crayon_1.3.4     
## [40] pkgconfig_2.0.1   xml2_1.1.1        energy_1.7-2     
## [43] lubridate_1.7.1   assertthat_0.2.0  rmarkdown_1.7    
## [46] httr_1.3.1        rstudioapi_0.7    boot_1.3-20      
## [49] R6_2.2.2          nlme_3.1-131      compiler_3.4.2

python

python version

## 3.6.4 | packaged by conda-forge | (default, Dec 24 2017, 10:11:43) [MSC v.1900 64 bit (AMD64)]

package version

## pandas: 0.22.0
## numpy: 1.14.1
## seaborn: 0.8.1
## geopy: 1.11.0
## dcor: 0.1.5

