Sunday, October 21, 2018

Trying out R's keras API - Can we predict the ETH price with recurrent neural networks?

rnn.utf8.md
Hi folks!
In this post, I will combine two of my passions: data science and cryptocurrencies. To be more specific, we will try to predict the minimum and maximum price of Ethereum (ETH) traded on binance for Bitcoins (BTC) within the next 10 minutes. As data, we will use the kline/candlestick data of the previous 180 minutes. Analyzing candlestick data and using it for predictions is a science in its own right, but I do not plan to make a deep dive into that. Instead, we will try to make our predictions with recurrent neural networks (RNN). Like all neural networks, RNNs learn the features on their own and reduce the need for manual feature engineering.
We will try out different RNN architectures and do some hyperparameter training. We will do so using R - I know, some time ago I stated that I prefer python for deep learning. The reason why we still use R instead of python is that R finally got a keras API and I am more than thrilled to take it for a spin.
Before we start I like to mention that I have not seen too many working crypto price prediction models out there - maybe it cannot be done, perhaps it can, and people keep them for themselves to get rich. Anyway, let’s try our luck!

Preparations

General

library(tidyverse)
library(magrittr)
library(keras)
library(mlrMBO)
Next, we set a seed. I do this to make everything reproducible, but honestly, I doubt that it will work. The seed controls all “randomness” within R, but hardly anything will be computed in R but only piped through the API to keras which runs in python. Hence, I am not sure whether the seed has any effect at all (if you follow along on your own machine, then please let me know whether your results differ).
set.seed(1234)

Getting the data

I harvested some data a time ago that you can download from here. We will use it for model training. Later in this post, we will use binance’s API to get more current data and test our models.
After you have downloaded the data we load it into R. It comprises candlestick data of every minute of the trading pairs BTC-ETH and BTC-USDT. I have included BTC-USDT because BTC is still the leading cryptocurrency and its price in $(or USDT) for sure has an impact on the cost of other currencies like ETH. dat_eth <- read_csv("binance_eth.csv") %>% select(-ignore) dat_usdt <- read_csv("binance_btcusdt.csv") %>% select(-ignore) Let’s take a quick glimpse at the data to see with what we are going to work. dat_eth %>% glimpse() ## Observations: 324,977 ## Variables: 11 ##$ timeopen          <dbl> 1.509491e+12, 1.509491e+12, 1.509491e+12, 1....
## $open <dbl> 0.047613, 0.047819, 0.047700, 0.047703, 0.04... ##$ high              <dbl> 0.047729, 0.047819, 0.047700, 0.047703, 0.04...
## $low <dbl> 0.047613, 0.047819, 0.047700, 0.047636, 0.04... ##$ close             <dbl> 0.047686, 0.047819, 0.047700, 0.047684, 0.04...
## $volume <dbl> 22.662, 2.722, 1.859, 0.246, 10.838, 12.593,... ##$ timeclose         <dbl> 1.509491e+12, 1.509491e+12, 1.509491e+12, 1....
## $quotedVolume <dbl> 1.08098804, 0.13016330, 0.08867430, 0.011725... ##$ trades            <int> 22, 7, 1, 3, 6, 11, 37, 18, 38, 6, 14, 3, 19...
## $takerBuyVol <dbl> 15.052, 0.000, 1.859, 0.136, 8.826, 1.592, 2... ##$ takerBuyQuotedVol <dbl> 0.71822354, 0.00000000, 0.08867430, 0.006485...
We merge the data sets with an inner join on their timestamps (timeopen - we remove timeclose because it is just 60 seconds later and therefore redundant).
dat_raw <- inner_join(dat_eth, dat_usdt, by=c("timeopen")) %>%
select(-contains("timeclose"))

Data preparations

I know I have just told you that neural networks reduce the need for manual feature engineering, but a little is still necessary or at least sensible.
One problem is that in some cases the number of trades within the minute is missing. We replace those missings with 0s, and the network will learn how to deal with them. A further problem is that some variables (volumes, number of trades, and rates of BTC-USDT) have quite a positive skew. We normalize them a little by taking their log - this emphasizes differences between lower values while reducing the differences between high values. This step is not necessary for deep learning, but in my limited experience it pays off. One problem with this transformation, however, is that there is no log of values smaller or equal to 0. Hence, we need to replace all zeros with small positive values.
dat_log <- dat_raw %>%
mutate_at(vars(contains("trades")), function(x) ifelse(is.na(x), 0, x)) %>%
function(x) ifelse(x < 0.000001, (unique(sort(x))[2])/5, x) ) %>%
mutate_at(vars(contains("ume"), contains("Vol"), contains("trades"), ends_with("y")), log)
Finally, we remove the timestamp, because we want the model to learn how the momentary trading patterns effect the price in the near future and not just to remember when it has seen which price (if our model relied mainly on its “price memory”, then it would not be much use for predictions with data it has not seen). An alternative to removing the timestamp completely would be to convert it, e.g., in day of week and time of day. Maybe there are some regularities which our model can pick up (e.g., a big trader or trading bot that always buys/sells at the same time), but this might easily change in the future. Hence, relying only on the trading patterns contained in the candle stick data seems more robust to me.
data <- data.matrix(dat_log[,-1])

Train and validation data

Notice: Normally, we would split the data into three sets: training, validation, and test set. However, for now, we will only divide into training and validation set. For testing purposes, we will use data created in the time needed to train our models, which we will retrieve when our models are ready.
Let’s take a look at the overall dimension of the available data.
dim(data)
## [1] 301868     18
So we have a little over 300k minutes and 18 variables.
Let’s use the first 200 days (200*24*60 = 288000 minutes) for training and the rest for validation. Further, we perform one last data preparation step and standardize all variables to have mean = 0 and standard deviation = 1. Please notice that we use the mean and standard deviation of the training data.
Note: Each variable is standardized on its own. In most cases, this is the way to go, but in our case, it is a tiny error. Later we will predict the high AND low values, but only one loss value can be evaluated. Hence, we will pool the error on high and low predictions. The problem is that high and low have similar, but not equal standard deviations. Since both values are scaled to have a standard deviation of one, the error in one weights a little bit more than in the other. A more balanced solution, e.g., scaling both with their pooled SD, would be preferable, but this small error should not impact our results too much.
train_data <- data[1:288000,]
mean <- apply(train_data, 2, mean)
std <- apply(train_data, 2, sd)
data <- scale(data, center = mean, scale = std)
Even if 288k is not that much data (and would easily fit into the memory as a whole), we choose an approach applicable to more substantial amounts of data and slice it into nice little batches which will be processed one after another.
In keras, a generator achieves that. We define a general version that provides us with a batch of inputs and the associated outputs every time it is called.
generator <- function(data, lookback, delay, min_index, max_index,
shuffle = FALSE, batch_size = 640, steps=1){

if (is.null(max_index))
max_index <- nrow(data) - delay - 1
i <- min_index + lookback
function() {
if (shuffle) {
rows <- sample(c((min_index+lookback):max_index), size = batch_size)
} else {
if (i + batch_size >= max_index)
i <<- min_index + lookback
rows <- c(i:min(i+batch_size, max_index))
i <<- i + length(rows)
}

samples <- array(0, dim = c(length(rows),
lookback / steps,
dim(data)[[-1]]))
targets <- array(0, dim = c(length(rows),2))

for (j in 1:length(rows)) {
indices <- seq(rows[[j]] - lookback, rows[[j]],
length.out = dim(samples)[[2]])
samples[j,,] <- data[indices,]

##targets
##high
targets[j,1] <- max(data[(rows[[j]] + 1):(rows[[j]] + delay),2])
##low
targets[j,2] <- min(data[(rows[[j]] + 1):(rows[[j]] + delay),3])
}

list(samples, targets)
}
}
Next, we define a generator that can be used for new data and only includes inputs but no outputs.
generator_predict <- function(data, lookback, delay, min_index, max_index,
shuffle = FALSE, batch_size = 640, steps=1){
if (is.null(max_index))
max_index <- nrow(data) - delay - 1
i <- min_index + lookback
function() {
if (shuffle) {
rows <- sample(c((min_index+lookback):max_index), size = batch_size)
} else {
if (i + batch_size >= max_index)
i <<- min_index + lookback
rows <- c(i:min(i+batch_size, max_index))
i <<- i + length(rows)
}

samples <- array(0, dim = c(length(rows),
lookback / steps,
dim(data)[[-1]]))

for (j in 1:length(rows)) {
indices <- seq(rows[[j]] - lookback, rows[[j]],
length.out = dim(samples)[[2]])
samples[j,,] <- data[indices,]

}

list(samples)
}
}
At last, we define the specific generators that we will use. We use lookback = 180 setting the considered time to three hours, steps = 1 indicating that two adjacent three hours time series are only shifted by one minute, delay = 10 trying to predict 10 minutes ahead, and batch_size = 640 fixing the batch size to 640 three hours time series. train_gen will be used for training, val_gen for validating, and val_gen_pred is needed for predictions with the validation data.
lookback = 180
steps = 1
delay = 10
batch_size <- 640

train_gen <- generator(
data,
lookback = lookback,
delay = delay,
min_index = 1,
max_index = lookback*1600,
shuffle = TRUE,
steps = steps,
batch_size = batch_size
)
val_gen = generator(
data,
lookback = lookback,
delay = delay,
min_index = lookback*1600+1,
max_index = NULL,
steps = steps,
batch_size = batch_size
)

val_gen_pred = generator_predict(
data,
lookback = lookback,
delay = delay,
min_index = lookback*1600+1,
max_index = NULL,
steps = steps,
batch_size = batch_size
)

val_steps <- (nrow(data) - lookback*1600+1 - lookback) / batch_size

Benchmark

Before we can start building the models, we need one last thing: a benchmark against which we compare the predictions of the models.
We will use the mean absolute error resulting from just using the values of the last record as our predictions. This seemingly easy and naive approach is often quite hard to beat. On a practical level, we iterate through the validation set with val_gen, calculate all resulting absolute errors, and average them.
evaluate_naive_method <- function() {
batch_maes <- c()
for (step in 1:val_steps) {
c(samples, targets) %<-% val_gen()
preds <- samples[,dim(samples)[[2]],2:3]
mae <- mean(abs(preds - targets))
batch_maes <- c(batch_maes, mae)
}
mean(batch_maes)
}
(benchmark <- evaluate_naive_method())
## [1] 0.004545522
So our models need to have a MAE lower than 0.00455 to be any good at all.

Training

Now, all preparations are done, and the exciting part starts. We train models and try to optimize their performance.

Single layer RNN

We start with the most basic RNN architecture - a single layer net. But even with a simple RNN, there are some options from which we need to choose. E.g.:
• What is the optimal size?
• Which cell type should we use (LSTM or GRU)?
• Which optimization algorithm performs best?
Experience helps to find proper starting configurations, but you never really know what configuration works best in a given situation until you try. There are several approaches to find the optimal hyperparameters. The most exhaustive method is a grid search, which iterates through all parameter combinations and chooses the best in the end. The downside is that it can take quite a long time to iterate through all combinations (especially if several hyperparameters need optimization). Since we do not have too much data and the models will not be too complex, grid search would probably be viable in a bearable amount of time. Still, we choose a more efficient approach (with the downside that we probably do not end up with the optimal but with a good solution). We will use bayesian optimization provided by the mlrMBO package.
To apply mlrMBO’s optimization (or any other optimization provided by other packages) we need to wrap the model building into a function that takes the hyperparameters we like to tune as arguments and returns a single numeric value that we want to minimize or maximize.
As arguments, we choose the number of units, their type, and the optimization algorithm and return the mean absolute error in the validation sample.
Note: While the hyperparameter training tries to optimize the MAE, the model training per se is set to optimize the mean squared error. This is somewhat inconsistent, but to me both types of error are important, and I like the compromise of using both. If you disagree, then please let me know in the discussion section.
Note: The different cell types and optimization algorithms have further parameters on their own and including them into the hyperparameter training could be sensible, but we stick to the defaults for now because each additional parameter multiplies hyperparameter training time.
Note: For more information on model building and training check out the later section of this post where we will train models with “optimal” hyperparameters.
simpleRNN_fit <- function(unit_count, cell_type, opt_type){

uc <- unit_count*8

## define callbacks
callbacks_list <- list(
callback_early_stopping(
monitor = "val_loss",
patience = 5
),
callback_model_checkpoint(
filepath = "01_simpleRNN.h5",
monitor = "val_loss",
save_best_only = TRUE
)
)
## build model
model_simpleRNN <- keras_model_sequential()

if(cell_type == "gru"){
model_simpleRNN %<>%
layer_gru(units = uc,
input_shape = list(NULL, dim(data)[[-1]]))
} else if(cell_type == "lstm"){
model_simpleRNN %<>%
layer_lstm(units = uc,
input_shape = list(NULL, dim(data)[[-1]]))
}

model_simpleRNN %<>% layer_dense(units = 2)

## compile model
keras::compile(
model_simpleRNN,
loss = "mse",
metrics = c("mse", "mae")
)
} else if (opt_type == "nadam"){
keras::compile(
model_simpleRNN,
loss = "mse",
metrics = c("mse", "mae")
)
} else if (opt_type == "rmsprop"){
keras::compile(
model_simpleRNN,
optimizer = optimizer_rmsprop(),
loss = "mse",
metrics = c("mse", "mae")
)
keras::compile(
model_simpleRNN,
loss = "mse",
metrics = c("mse", "mae")
)
}

## fit model
model_simpleRNN %>% fit_generator(
train_gen,
callbacks = callbacks_list,
steps_per_epoch = 450,
epochs = 2000,
validation_data = val_gen,
validation_steps = val_steps
)

## evaluate model
eval_results <- evaluate_generator(model_simpleRNN, val_gen, val_steps)

evaluate_generator(model_simpleRNN,val_gen, val_steps)$mean_absolute_error %>% as.numeric() } Next, we define the hyperparameter space of possible values … simpleRNN_ps = makeParamSet( makeIntegerParam("unit_count", lower = 1, upper = 12), makeDiscreteParam("cell_type", values = c("gru", "lstm")), makeDiscreteParam("opt_type", values = c("adam", "nadam", "rmsprop", "adadelta")) ) … and randomly choose some points within it as initial data for the optimization. For the blog post, we take only 14 points, but a higher number might be sensible. Further, please be aware that you can manually add combinations as well. So, if you believe to have a good guess, then include it. simpleRNN_des = generateDesign(n = 14, par.set = simpleRNN_ps) Next, we evaluate the chosen configurations. simpleRNN_des %<>% mutate( y = pmap(list(unit_count, cell_type, opt_type), simpleRNN_fit) ) ## Error in mutate_impl(.data, dots): Evaluation error: AttributeError: 'Sequential' object has no attribute 'metrics'. OK, that is weird. But if the tidy approach is not working, then let’s do it the old-fashioned way (if you know what I have done wrong - please let me know). for (i in 1:dim(simpleRNN_des)[1]){ simpleRNN_des$y[i] <- simpleRNN_fit(simpleRNN_des$unit_count[i], simpleRNN_des$cell_type[i],
simpleRNN_des$opt_type[i]) } Let’s look at the resulting starting values. simpleRNN_des ## unit_count cell_type opt_type y ## 1 6 lstm adam 0.006573573 ## 2 1 lstm nadam 0.004850281 ## 3 7 gru rmsprop 0.006180876 ## 4 11 lstm rmsprop 0.007254474 ## 5 9 gru adam 0.005405560 ## 6 10 gru nadam 0.004455248 ## 7 2 gru adam 0.005086179 ## 8 2 gru nadam 0.005570578 ## 9 3 lstm adadelta 0.006913974 ## 10 5 gru adadelta 0.006262964 ## 11 8 lstm rmsprop 0.005818588 ## 12 10 lstm adadelta 0.004891048 ## 13 4 gru adam 0.004516485 ## 14 12 lstm adadelta 0.005799765 Next, we use the starting values to initialize an OptState object, which links the problem to the result and keeps track of the optimization progress. simpleRNN_ctrl = makeMBOControl() simpleRNN_ctrl = setMBOControlInfill(simpleRNN_ctrl, crit = crit.ei) simpleRNN_opt.state = initSMBO( par.set = simpleRNN_ps, design = simpleRNN_des, control = simpleRNN_ctrl, minimize = TRUE, noisy = TRUE) We update the OptState 16 times (a completely arbitrary number and probably too small if it was not just for a blog post). for (i in 1:16){ prop = proposePoints(simpleRNN_opt.state) y = simpleRNN_fit(prop$prop.points$unit_count, prop$prop.points$cell_type, prop$prop.points$opt_type) updateSMBO(simpleRNN_opt.state, x = prop$prop.points,
y = y)
}
Let’s have a look at which combination would be evaluated next.
proposePoints(simpleRNN_opt.state)
## $prop.points ## unit_count cell_type opt_type ## 34 11 gru nadam ## ##$propose.time
## [1] 0.219
##
## $prop.type ## [1] "infill_ei" ## ##$crit.vals
##               [,1]
## [1,] -2.135247e-07
##
## $crit.components ## se mean ## 1 0.0003345841 0.005409518 ## ##$errors.model
## [1] NA
##
## attr(,"class")
## [1] "Proposal" "list"
We could easily go on and evaluate this combination or others and add them to the OptState, but for now, we conclude the hyperparameter training.
finalizeSMBO(simpleRNN_opt.state)
## Recommended parameters:
## Objective: y = 0.004
##
## Optimization path
## 14 + 16 entries in total, displaying last 10 (or less):
##    unit_count cell_type opt_type           y dob eol error.message
## 21          1      lstm    nadam 0.004612412  21  NA          <NA>
## 22          1      lstm    nadam 0.004866440  22  NA          <NA>
## 24          1      lstm    nadam 0.004694078  24  NA          <NA>
## 25          6      lstm     adam 0.004983207  25  NA          <NA>
## 26          6      lstm     adam 0.004760775  26  NA          <NA>
## 28          6      lstm     adam 0.005319725  28  NA          <NA>
## 29         10       gru    nadam 0.004878603  29  NA          <NA>
## 30          6      lstm     adam 0.005167965  30  NA          <NA>
##    exec.time            ei error.model train.time prop.type propose.time
## 21        NA -1.045612e-06        <NA>          0    manual           NA
## 22        NA -3.552289e-06        <NA>          0    manual           NA
## 23        NA -2.116553e-06        <NA>          0    manual           NA
## 24        NA -7.931060e-07        <NA>          0    manual           NA
## 25        NA -1.027754e-06        <NA>          0    manual           NA
## 26        NA -2.580085e-06        <NA>          0    manual           NA
## 27        NA -6.329430e-07        <NA>          0    manual           NA
## 28        NA -2.318679e-06        <NA>          0    manual           NA
## 29        NA -2.471307e-07        <NA>          0    manual           NA
## 30        NA -2.435425e-07        <NA>          0    manual           NA
##              se        mean
## 21 0.0003982259 0.005415668
## 22 0.0003481275 0.005127366
## 23 0.0006486102 0.005972417
## 24 0.0002483238 0.005037904
## 25 0.0005510765 0.005845696
## 26 0.0005269609 0.005614215
## 27 0.0004980626 0.005772202
## 28 0.0004598631 0.005461888
## 29 0.0001859679 0.004944392
## 30 0.0003434029 0.005423978
The resulting recommended configuration is …
simpleRNN_opt.state$opt.result$mbo.result$x ##$unit_count
## [1] 10
##
## $cell_type ## [1] "gru" ## ##$opt_type
## [1] "nadam"
… resulting in an estimated MAE of
simpleRNN_opt.state$opt.result$mbo.result$y ## [1] 0.004455248 Hm, that is slightly better than the naive benchmark of 0.00455. Hopefully, more complex architectures fare better. Stacked RNN One contender is the stacked RNN architecture. Instead of using just one layer of cells, it stacks several layers over one another. Again, we need to wrap the model building into a function. As arguments, we use the number of layers, the number of units, and the shape (are further layers getting smaller, greater, or stay the same as the previous layers). We will not re-evaluate cell type and optimization algorithm and use the optimal selection in the single layer model. Doing so is a pragmatic decision, but not necessarily the best one. Since all parameters interact with one another, other settings might be preferable for this architecture. So if you have time - check it out. Note: All but the top layer are set to return_sequences = TRUE. Otherwise, they would return only the state of the last cell which would not give consecutive layers much information to train on. stackedRNN_fit <- function(layer_count, unit_count, shape){ ## check input parameter Check <- ArgumentCheck::newArgCheck() if (layer_count < 2) ArgumentCheck::addError( msg = "'layer_count' must be >= 2", argcheck = Check ) if (unit_count < 1) ArgumentCheck::addError( msg = "'unit_count' must be >= 2", argcheck = Check ) if (shape < -2 | shape > 2) ArgumentCheck::addError( msg = "'shape' must be within [-2;2]", argcheck = Check ) ArgumentCheck::finishArgCheck(Check) ## set-up unit_base <- 8 uc <- unit_count*unit_base ## define callbacks callbacks_list <- list( callback_early_stopping( monitor = "val_loss", patience = 5 ), callback_model_checkpoint( filepath = "02_stackedRNN.h5", monitor = "val_loss", save_best_only = TRUE ) ) ## build model model_stackedRNN <- keras_model_sequential() %>% layer_gru(units = uc, input_shape = list(NULL, dim(data)[[-1]]), return_sequences = TRUE) growth <- 1/(shape*2) growth[shape == 0] <- 0 for (i in 2:layer_count){ if (i < layer_count){ model_stackedRNN %<>% layer_gru(units = uc, return_sequences = TRUE) } else { model_stackedRNN %<>% layer_gru(units = uc) } uc <- uc + uc * growth uc <- c(uc, unit_base) %>% max() } model_stackedRNN %<>% layer_dense(units = 2) ## compile model keras::compile( model_stackedRNN, optimizer = optimizer_nadam(), loss = "mse", metrics = c("mse", "mae") ) ## fit model model_stackedRNN %>% fit_generator( train_gen, callbacks = callbacks_list, steps_per_epoch = 450, epochs = 2000, validation_data = val_gen, validation_steps = val_steps ) ## load model model_stackedRNN <- load_model_hdf5("02_stackedRNN.h5") eval_results <- evaluate_generator(model_stackedRNN, val_gen, val_steps) evaluate_generator( model_stackedRNN,val_gen, val_steps)$mean_absolute_error %>%
as.numeric()
}
Again, we define the parameter space, …
stackedRNN_ps = makeParamSet(
makeIntegerParam("layer_count", lower = 2, upper = 5),
makeIntegerParam("unit_count", lower = 1, upper = 10),
makeIntegerParam("shape", lower = -2, upper = 2)
)
… choose a number of starting combinations, …
stackedRNN_des = generateDesign(n = 15, par.set = stackedRNN_ps)
… and evaluate them.
for (i in 1:dim(stackedRNN_des)[1]){
stackedRNN_des$y[i] <- stackedRNN_fit(stackedRNN_des$layer_count[i],
stackedRNN_des$unit_count[i], stackedRNN_des$shape[i])
}
Let’s look at the starting values.
stackedRNN_des
##    layer_count unit_count shape           y
## 1            3          7     2 0.005768042
## 2            2          3     2 0.006272834
## 3            2          2     1 0.004910406
## 4            4          8     0 0.006819058
## 5            2          1     0 0.006527344
## 6            4          7    -1 0.006985018
## 7            4          9     1 0.006050816
## 8            2          6     1 0.005260754
## 9            3          5    -2 0.005264357
## 10           5         10    -1 0.005853208
## 11           5          1    -2 0.005304723
## 12           5         10     0 0.006696024
## 13           3          3    -1 0.005411024
## 14           3          4    -2 0.005591315
## 15           5          5     2 0.005284705
Next, we create the OptState
stackedRNN_ctrl = makeMBOControl()
stackedRNN_ctrl = setMBOControlInfill(stackedRNN_ctrl, crit = crit.ei)
stackedRNN_opt.state = initSMBO(
par.set = stackedRNN_ps,
design = stackedRNN_des,
control = stackedRNN_ctrl,
minimize = TRUE,
noisy = TRUE)
… and let it train for a while.
for (i in 1:15){
prop = proposePoints(stackedRNN_opt.state)
y = stackedRNN_fit(prop$prop.points$layer_count,
prop$prop.points$unit_count,
prop$prop.points$shape)
x = prop$prop.points, y = y) } The next configuration to be evaluated would be … proposePoints(stackedRNN_opt.state) ##$prop.points
##   layer_count unit_count shape
## 9           4          4     1
##
## $propose.time ## [1] 0.242 ## ##$prop.type
## [1] "infill_ei"
##
## $crit.vals ## [,1] ## [1,] -0.002232555 ## ##$crit.components
##            se        mean
## 1 0.002886951 0.002986154
##
## $errors.model ## [1] NA ## ## attr(,"class") ## [1] "Proposal" "list" … but we finish hyperparameter training for now. finalizeSMBO(stackedRNN_opt.state) ## Recommended parameters: ## layer_count=2; unit_count=1; shape=1 ## Objective: y = 0.005 ## ## Optimization path ## 15 + 15 entries in total, displaying last 10 (or less): ## layer_count unit_count shape y dob eol error.message ## 21 2 2 1 0.005336026 21 NA <NA> ## 22 2 1 1 0.004736808 22 NA <NA> ## 23 2 1 1 0.005203352 23 NA <NA> ## 24 2 1 1 0.005320675 24 NA <NA> ## 25 2 1 1 0.005275674 25 NA <NA> ## 26 3 3 0 0.005190356 26 NA <NA> ## 27 3 10 0 0.005690224 27 NA <NA> ## 28 5 10 2 0.022936167 28 NA <NA> ## 29 5 5 0 0.005759698 29 NA <NA> ## 30 2 6 2 0.005913317 30 NA <NA> ## exec.time ei error.model train.time prop.type propose.time ## 21 NA -1.005071e-05 <NA> 0 manual NA ## 22 NA -9.477561e-06 <NA> 0 manual NA ## 23 NA -1.081139e-05 <NA> 0 manual NA ## 24 NA -8.118479e-06 <NA> 0 manual NA ## 25 NA -6.718298e-06 <NA> 0 manual NA ## 26 NA -5.815763e-06 <NA> 0 manual NA ## 27 NA -1.293808e-05 <NA> 0 manual NA ## 28 NA -6.261678e-06 <NA> 0 manual NA ## 29 NA -1.479915e-03 <NA> 0 manual NA ## 30 NA -1.072348e-03 <NA> 0 manual NA ## se mean ## 21 0.0001833585 0.005012616 ## 22 0.0002095543 0.005063587 ## 23 0.0002212725 0.005017067 ## 24 0.0002191364 0.005042574 ## 25 0.0002273631 0.005077032 ## 26 0.0006086373 0.005927065 ## 27 0.0004542407 0.005423839 ## 28 0.0004028942 0.005448340 ## 29 0.0037524021 0.004771086 ## 30 0.0018982689 0.004173024 Now, the optimal configuration guess is … stackedRNN_opt.state$opt.result$mbo.result$x
## $layer_count ## [1] 2 ## ##$unit_count
## [1] 1
##
## $shape ## [1] 1 … and is estimated to yield a MAE of stackedRNN_opt.state$opt.result$mbo.result$y
## [1] 0.004736808
That is a bummer. Even after some hyperparameter training stacked RNNs perform worse on the problem than our naive benchmark.

1d-conv RNN

Let’s try one more architecture. This time we start with a 1d-convolutional layer before we move to recurrent layers. Again, we need to wrap model building into a function to be able to optimize the hyperparameters. Specific to the 1d conv layer are the number of filters and the kernel size.
convRNN_fit <- function(filters, kernel_size, rnn_count, rnn_units){

## check input parameter
Check <- ArgumentCheck::newArgCheck()

if (filters < 1)
msg = "'filters' must be >= 1",
argcheck = Check
)

if (kernel_size < 2)
msg = "'kernel_size' must be >= 2",
argcheck = Check
)

if (rnn_count < 1)
msg = "'rnn_count' must be >= 1",
argcheck = Check
)

if (rnn_units < 1)
msg = "'rnn_units' must be >= 1",
argcheck = Check
)

ArgumentCheck::finishArgCheck(Check)

## set-up

unit_base <- 8

uc <- rnn_units*unit_base

fc <- filters*unit_base

## define callbacks
callbacks_list <- list(
callback_early_stopping(
monitor = "val_loss",
patience = 5
),
callback_model_checkpoint(
filepath = "03_1dconvRNN.h5",
monitor = "val_loss",
save_best_only = TRUE
)
)
## build model
model_convRNN <- keras_model_sequential() %>%
layer_conv_1d(filters = fc,
kernel_size = kernel_size,
activation = "relu",
input_shape = list(NULL, dim(data)[[-1]]))

for (i in 1:rnn_count){
if (i < rnn_count){
model_convRNN %<>%
layer_gru(units = uc,
return_sequences = TRUE)
uc <- uc + uc * 0.5
} else {
model_convRNN %<>%
layer_lstm(units = uc)
}

}

model_convRNN %<>% layer_dense(units = 2)

## compile model
keras::compile(
model_convRNN,
loss = "mse",
metrics = c("mse", "mae")
)

## fit model
model_convRNN %>% fit_generator(
train_gen,
callbacks = callbacks_list,
steps_per_epoch = 450,
epochs = 2000,
validation_data = val_gen,
validation_steps = val_steps
)

eval_results <- evaluate_generator(model_convRNN,val_gen, val_steps)

evaluate_generator(model_convRNN,val_gen, val_steps)$mean_absolute_error %>% as.numeric() } Next, we define the hyperparameter space, … convRNN_ps = makeParamSet( makeIntegerParam("filters", lower = 1, upper = 8), #gets multiplied by 8 makeIntegerParam("kernel_size", lower =3, upper = 10), makeIntegerParam("rnn_count", lower = 1, upper = 2), makeIntegerParam("rnn_units", lower = 1, upper = 10) #gets multiplied by 8 ) … choose random starting combinations, … convRNN_des = generateDesign(n = 14, par.set = convRNN_ps) … and evaluate them. for (i in 1:dim(stackedRNN_des)[1]){ convRNN_des$y[i] <- convRNN_fit(convRNN_des$filters[i], convRNN_des$kernel_size[i],
convRNN_des$rnn_count[i], convRNN_des$rnn_units[i])
}
Let’s look at the starting values.
convRNN_des
##    filters kernel_size rnn_count rnn_units           y
## 1        5           6         2         3 0.006306844
## 2        3           6         2         8 0.004858218
## 3        4           9         2        10 0.006397791
## 4        8           9         1         1 0.013997144
## 5        3           4         1         9 0.005559878
## 6        2          10         2         8 0.005239270
## 7        5           3         2         5 0.005935301
## 8        6           7         1         6 0.006564416
## 9        2           4         1         6 0.005154384
## 10       7           4         1         1 0.006090563
## 11       7           9         1         4 0.006378257
## 12       8           7         2         7 0.009060436
## 13       1           5         1         3 0.004755455
## 14       4           8         2         4 0.008622950
Next, we create the OptState
convdRNN_ctrl = makeMBOControl()
convdRNN_ctrl = setMBOControlInfill(convdRNN_ctrl, crit = crit.ei)
convRNN_opt.state = initSMBO(
par.set = convRNN_ps,
design = convRNN_des,
control = convdRNN_ctrl,
minimize = TRUE,
noisy = TRUE)
… and let it train for a while.
for (i in 1:16){
prop = proposePoints(convRNN_opt.state)
y = convRNN_fit(prop$prop.points$filters,
prop$prop.points$kernel_size,
prop$prop.points$rnn_count,
prop$prop.points$rnn_units)
x = prop$prop.points, y = y) } The next configuration to be evaluated would be … proposePoints(convRNN_opt.state) ##$prop.points
##    filters kernel_size rnn_count rnn_units
## 37       1           4         1         5
##
## $propose.time ## [1] 0.283 ## ##$prop.type
## [1] "infill_ei"
##
## $crit.vals ## [,1] ## [1,] -0.0004018811 ## ##$crit.components
##            se        mean
## 1 0.001935427 0.005617492
##
## $errors.model ## [1] NA ## ## attr(,"class") ## [1] "Proposal" "list" … but we finish hyperparameter training for now. finalizeSMBO(convRNN_opt.state) ## Recommended parameters: ## filters=1; kernel_size=4; rnn_count=1; rnn_units=1 ## Objective: y = 0.005 ## ## Optimization path ## 14 + 16 entries in total, displaying last 10 (or less): ## filters kernel_size rnn_count rnn_units y dob eol ## 21 3 10 2 10 0.014055169 21 NA ## 22 4 5 2 2 0.006309814 22 NA ## 23 2 6 2 4 0.005546650 23 NA ## 24 3 5 2 5 0.006806981 24 NA ## 25 1 5 2 5 0.005864668 25 NA ## 26 1 4 1 1 0.004710464 26 NA ## 27 2 6 1 8 0.005526419 27 NA ## 28 1 3 1 3 0.005252020 28 NA ## 29 1 7 1 1 0.005396571 29 NA ## 30 2 8 2 6 0.005885977 30 NA ## error.message exec.time ei error.model train.time prop.type ## 21 <NA> NA -0.0002779504 <NA> 0 manual ## 22 <NA> NA -0.0002977433 <NA> 0 manual ## 23 <NA> NA -0.0002868495 <NA> 0 manual ## 24 <NA> NA -0.0006456876 <NA> 0 manual ## 25 <NA> NA -0.0005032000 <NA> 0 manual ## 26 <NA> NA -0.0005196650 <NA> 0 manual ## 27 <NA> NA -0.0004569890 <NA> 0 manual ## 28 <NA> NA -0.0004400800 <NA> 0 manual ## 29 <NA> NA -0.0004613463 <NA> 0 manual ## 30 <NA> NA -0.0003792393 <NA> 0 manual ## propose.time se mean ## 21 NA 0.001043168 0.005069259 ## 22 NA 0.002566205 0.006859780 ## 23 NA 0.002509819 0.006834781 ## 24 NA 0.001832746 0.004933277 ## 25 NA 0.001711494 0.005150923 ## 26 NA 0.001616271 0.005023404 ## 27 NA 0.001674741 0.005186315 ## 28 NA 0.001154470 0.004752034 ## 29 NA 0.001883278 0.005386079 ## 30 NA 0.001512320 0.005228841 Now, the optimal configuration guess is … convRNN_opt.state$opt.result$mbo.result$x
## $filters ## [1] 1 ## ##$kernel_size
## [1] 4
##
## $rnn_count ## [1] 1 ## ##$rnn_units
## [1] 1
… and is estimated to yield a MAE of
convRNN_opt.state$opt.result$mbo.result$y ## [1] 0.004710464 Hm … that is slightly better than the stacked RNN but still worse than our naive benchmark. Learn “optimal” models After hyperparameter optimization I am not overly confident that we will end up with great or even good models. Anyhow, let’s use the optimal hyperparameter configurations to train our models. We start by clearing keras’ cache. k_clear_session() single layer RNN Again, we start with the single layer RNN. But before we build the models we set two callbacks. The first indicates that we want to abort training when the validation loss does not decrease for 10 consecutive training epochs. The second saves the model every time a new minimum validation loss is found. callbacks_simple <- list( callback_early_stopping( monitor = "val_loss", patience = 10 ), callback_model_checkpoint( filepath = "01_simple.h5", monitor = "val_loss", save_best_only = TRUE ) ) Next, we define the architecture of the RNN. model_simple <- keras_model_sequential() %>% layer_gru(units = 80, input_shape = list(NULL, dim(data)[[-1]])) %>% layer_dense(units = 2)  summary() gives us an overview of the model. summary(model_simple) ## ___________________________________________________________________________ ## Layer (type) Output Shape Param # ## =========================================================================== ## gru_1 (GRU) (None, 80) 23760 ## ___________________________________________________________________________ ## dense_1 (Dense) (None, 2) 162 ## =========================================================================== ## Total params: 23,922 ## Trainable params: 23,922 ## Non-trainable params: 0 ## ___________________________________________________________________________ Then we need to compile the model, defining which optimization algorithm and loss to use and which metrics should be tracked. The compile model can then be trained on our data. model_simple %>% compile( optimizer = optimizer_nadam(), loss = "mse", metrics = c("mse", "mae") ) history_simple <- model_simple %>% fit_generator( train_gen, callbacks = callbacks_simple, steps_per_epoch = 450, epochs = 100, validation_data = val_gen, validation_steps = val_steps ) After training has finished we can visualize how the metrics have changed during training. plot(history_simple) The plot is a little bit hard to decipher. plot.keras_training_history() uses ggplot2 as the graphical backbone, so we can use ggplot commands to amend the plot. Let’s try it out by scaling the y-axis to log10. plot(history_simple) + scale_y_log10() This is a little bit more informative. From my point of view the training process seems quite normal. Let’s check how well the model performs in the end. model_simple <- load_model_hdf5("01_simple.h5") evaluate_generator(model_simple,val_gen, val_steps) ##$loss
## [1] 5.61021e-05
##
## $mean_squared_error ## [1] 5.61021e-05 ## ##$mean_absolute_error
## [1] 0.004650629
OK, that is worse than our naive benchmark of 0.004546 - nothing to be excited yet.

stacked RNN

Next, we train the “optimal” stacked RNN.
We use the same callbacks as before, but save the best model to another file.
callbacks_stacked <- list(
callback_early_stopping(
monitor = "val_loss",
patience = 10
),
callback_model_checkpoint(
filepath = "02_stacked.h5",
monitor = "val_loss",
save_best_only = TRUE
)
)
Then we build the model …
model_stacked <- keras_model_sequential() %>%
layer_gru(units = 8,
input_shape = list(NULL, dim(data)[[-1]]),
return_sequences = TRUE) %>%
layer_gru(units = 12) %>%
layer_dense(units = 2)    
… and check its summary().
summary(model_stacked)
## ___________________________________________________________________________
## Layer (type)                     Output Shape                  Param #
## ===========================================================================
## gru_2 (GRU)                      (None, None, 8)               648
## ___________________________________________________________________________
## gru_3 (GRU)                      (None, 12)                    756
## ___________________________________________________________________________
## dense_2 (Dense)                  (None, 2)                     26
## ===========================================================================
## Total params: 1,430
## Trainable params: 1,430
## Non-trainable params: 0
## ___________________________________________________________________________
Next, we can compile and train it.
model_stacked %>% compile(
loss = "mse",
metrics = c("mse", "mae")
)

history_stacked <- model_stacked %>% fit_generator(
train_gen,
callbacks = callbacks_stacked,
steps_per_epoch = 450,
epochs = 100,
validation_data = val_gen,
validation_steps = val_steps
)
Let’s have a look at the training process …
plot(history_stacked) +
scale_y_log10()

… and how the model performs in the end.
model_stacked <- load_model_hdf5("02_stacked.h5")

evaluate_generator(model_stacked,val_gen, val_steps)
## $loss ## [1] 5.549375e-05 ## ##$mean_squared_error
## [1] 5.549375e-05
##
## $mean_absolute_error ## [1] 0.004800438 Hm … no reason to be happy, either. conv RNN Let’s train the last model, the 1d convolutional + 2 layer RNN net. Again, we define the callbacks, … callbacks_conv <- list( callback_early_stopping( monitor = "val_loss", patience = 10 ), callback_model_checkpoint( filepath = "03_conv.h5", monitor = "val_loss", save_best_only = TRUE ) ) … build the model, … model_conv <- keras_model_sequential() %>% layer_conv_1d(filters = 8, kernel_size = 4, activation = "relu", input_shape = list(NULL, dim(data)[[-1]])) %>% layer_gru(units = 8) %>% layer_dense(units = 2)  … check its summary(), … summary(model_conv) ## ___________________________________________________________________________ ## Layer (type) Output Shape Param # ## =========================================================================== ## conv1d_1 (Conv1D) (None, None, 8) 584 ## ___________________________________________________________________________ ## gru_4 (GRU) (None, 8) 408 ## ___________________________________________________________________________ ## dense_3 (Dense) (None, 2) 18 ## =========================================================================== ## Total params: 1,010 ## Trainable params: 1,010 ## Non-trainable params: 0 ## ___________________________________________________________________________ … compile, and train it. model_conv %>% compile( optimizer = optimizer_nadam(), loss = "mse", metrics = c("mse", "mae") ) history_conv <- model_conv %>% fit_generator( train_gen, callbacks = callbacks_conv, steps_per_epoch = 450, epochs = 100, validation_data = val_gen, validation_steps = val_steps ) Let’s have a look at the training process … plot(history_conv) + scale_y_log10() … and how the model performs in the end. model_conv <- load_model_hdf5("03_conv.h5") evaluate_generator(model_conv,val_gen, val_steps) ##$loss
## [1] 5.487677e-05
##
## $mean_squared_error ## [1] 5.487677e-05 ## ##$mean_absolute_error
## [1] 0.004629169
Hm … it does not get better.

pool models

At the moment it seems like we have wasted our time so far. But there is still hope. Maybe none of the models is any good on its own, but we can combine them to an ensemble which might perform better.
To do so we need to combine the predictions of all three models. We use predict_generator() to get those …
pred_simple <- predict_generator(model_simple, val_gen_pred, val_steps)
pred_stacked <- predict_generator(model_stacked, val_gen_pred, val_steps)
pred_conv <- predict_generator(model_conv, val_gen_pred, val_steps)
… and merge them into array.
pred_all <- abind::abind(pred_simple, pred_stacked, pred_conv, along = 3)
Let’s check the size of the array as a quick control if everything worked.
dim(pred_all)
## [1] 13461     2     3
This seems right - we have 13461 measurement points, and 3 predictions of each of the 2 targets.

mean

The most common way to pool multiple predictions is to average them.
pred_mean <- apply(pred_all, c(1,2), mean)
After we have done so, we check the size of the array again.
dim(pred_mean)
## [1] 13461     2
The resulting array has only two dimensions, because the three predictions are combined into their average.
Next, we need the targets with which we will compare the predictions.
targets_list <- list()
for (step in 1:val_steps) {
c(samples, targets) %<-% val_gen()
targets_list[[step]] <- targets
}

targets <- do.call(rbind, targets_list)
dim(targets)
## [1] 13461     2
The target array and the prediction array have the same size - so probably everything has worked so far.
Now, we can evaluate the pooled prediction by calculating the mean absolute error.
mae <- mean(abs(pred_mean - targets))

mae
## [1] 0.00445963
Hm … at last the naive benchmark is beaten. However, only by a little. Maybe a weighted average that puts more emphasis on the better performing individual models would work better. If you try it out, please let us know. We move on to other pooling options.

max/min (liberal)

One of those is choosing the minimum or maximum. Since we have two targets we even can combine both approaches. Let’s start with the most liberal prediction, choosing the maximum high and minimum low.
pred_maxmin <- reshape2::melt(pred_all) %>%
group_by(Var1) %>%
summarize(
high = max(1),
low = min(2)
) %>%
ungroup() %>%
select(-Var1) %>%
as.matrix()
Again, we calculate the MAE.
mae <- mean(abs(pred_maxmin - targets))

mae
## [1] 0.005214722
Being too liberal does not seem to be the way to go in this case.

min/max (conservative)

Let’s try the opposite and be as conservative as possible with the minimum high and maximum low.
pred_minmax <- reshape2::melt(pred_all) %>%
group_by(Var1) %>%
summarize(
high = min(1),
low = max(2)
) %>%
ungroup() %>%
select(-Var1) %>%
as.matrix()
One more time, we calculate the MAE.
mae <- mean(abs(pred_minmax - targets))

mae
## [1] 0.004352053
That looks more like it. Let’s check how many percent this prediction is better than our naive benchmark.
((1 - mae/benchmark) * 100) %>%
round(2)
## [1] 4.26
4.26% - I am not overly thrilled, but it demonstrates that some prediction of short term crypto price development is possible. At the end of the post I will discuss some ideas to improve this result even further. But first we need to test it on truly new data.

More data

Preparations

First we need to check the time-frame that is already covered in our data.
(start_old <- dat_raw$timeopen %>% min()) ## [1] 1.509491e+12 (end_old <- dat_raw$timeopen %>% max())
## [1] 1.528989e+12
Cool - for those of us that cannot readily understand UNIX timestamps we convert it into human readable dates.
as.Date(as.POSIXct(start_old/1000, origin="1970-01-01"))
## [1] "2017-10-31"
as.Date(as.POSIXct(end_old/1000, origin="1970-01-01"))
## [1] "2018-06-14"
OK, our data includes the crypto bubble of December 2017 and its burst in early 2018. Accordingly, prices behaved quite differently in our training data than in time span that we will download in a moment (in the last months the trend was sideways to slightly downwards without any big jumps). Further, binance, although the #1 crypto-exchange now, is quite young and has grown considerable in the time-frame covered by the training data. I am intrigued how our models will fare with more current data.
First, we need a helper function to connect to the binance API.
get_chartdata_unix <- function(symbol,
interval = "1m",
start = NULL,
limit = 500){

raw <- httr::GET(
url = "https://api.binance.com",
path = "/api/v1/klines",
query = list (
symbol = symbol,
interval = interval,
limit = limit,
startTime = start
)) %>%
httr::content(as="text", encoding="UTF-8") %>%
jsonlite::fromJSON()

as_tibble(raw) %>%
mutate_all(as.numeric) %>%
setNames(c("timeopen", "open", "high", "low", "close", "volume", "timeclose", "quotedVolume", "trades",
}
Next, we check the time now as an upper margin.
(now <- as.numeric(Sys.time())*1000)
## [1] 1.539112e+12
For those of us (including me) that still cannot parse UNIX timestamps.
as.Date(as.POSIXct(now/1000, origin="1970-01-01"))
## [1] "2018-10-09"
And add one minute (60.000 milliseconds) to the last available point in our data as a lower margin.
start_new <- end_old + 60000
stamp <- start_new
binance_eth_new <- NULL

while(stamp < now){
binance_eth_new <- binance_eth_new %>%
## and slowly grow the data set
bind_rows(get_chartdata_unix("ETHBTC", start = stamp))

stamp <- stamp + 60000*500
## add a little sleep to avoid getting banned
Sys.sleep(0.5)
}

## write to disk
write_csv(binance_eth_new, "binance_eth_new.csv")
… and the new BTC-USDT data.
stamp <- start_new
binance_btcusdt_new <- NULL

while(stamp < now){
binance_btcusdt_new <- binance_btcusdt_new %>%
bind_rows(get_chartdata_unix("BTCUSDT", start = stamp))

stamp <- stamp + 60000*500
Sys.sleep(0.5)
}

write_csv(binance_btcusdt_new, "binance_btcusdt_new.csv")
Combine the new data sets …
dat_raw_new <- inner_join(binance_eth_new,
binance_btcusdt_new,
by=c("timeopen")) %>%
select(-contains("timeclose"))
… and perform all other preparations steps.
dat_new <- dat_raw_new %>%
mutate_at(vars(contains("trades")), function(x) ifelse(is.na(x), 0, x)) %>%
function(x) ifelse(x < 0.000001, (unique(sort(x))[2])/5, x) ) %>%
mutate_at(vars(contains("ume"), contains("Vol"), contains("trades"), ends_with("y")), log) %>%
select(-timeopen, -starts_with("ignore")) %>%
data.matrix() %>%
scale(center = mean, scale = std)
As a last preparation step we build the required generators.
lookback = 180
steps = 1
delay = 10
batch_size <- 640

test_gen <- generator(
dat_new,
lookback = lookback,
delay = delay,
min_index = 1,
max_index = NULL,
shuffle = FALSE,
steps = steps,
batch_size = batch_size
)

test_gen_pred = generator_predict(
dat_new,
lookback = lookback,
delay = delay,
min_index = 1,
max_index = NULL,
shuffle = FALSE,
steps = steps,
batch_size = batch_size
)

test_steps <- (nrow(dat_new) - 1 - lookback) / batch_size

Test on new data

Before we test our models we check how our naive approach performs to get a reasonable benchmark.
test_naive_method <- function() {
batch_maes <- c()
for (step in 1:test_steps) {
c(samples, targets) %<-% test_gen()
preds <- samples[,dim(samples)[[2]],2:3]
mae <- mean(abs(preds - targets))
batch_maes <- c(batch_maes, mae)
}
mean(batch_maes)
}
(benchmark_test <- test_naive_method())
## [1] 0.003478366

single layer RNN

evaluate_generator(model_simple, test_gen, test_steps)
## $loss ## [1] 0.003198003 ## ##$mean_squared_error
## [1] 0.003198003
##
## $mean_absolute_error ## [1] 0.03449668 Wow - that is catastrophic. The prediction quality is an order of magnitude word worse than the naive benchmark. stacked RNN Will the stacked model perform better? evaluate_generator(model_stacked, test_gen, test_steps) ##$loss
## [1] 0.001995372
##
## $mean_squared_error ## [1] 0.001995372 ## ##$mean_absolute_error
## [1] 0.02614605
Clearly, NO!

conv RNN

My hopes are low, but for the sake of completeness - let’s try the convolutional RNN.
evaluate_generator(model_conv, test_gen, test_steps)
## $loss ## [1] 0.0009481604 ## ##$mean_squared_error
## [1] 0.0009481604
##
## \$mean_absolute_error
## [1] 0.01946304
Nah - that is not good.

pooled

OK, all the individual models were useless. Maybe, their ensemble performs better.
Let’s prepare by getting the targets …
targets_list <- list()
for (step in 1:test_steps) {
c(samples, targets) %<-% test_gen()
targets_list[[step]] <- targets
}

targets_test <- do.call(rbind, targets_list)
… and the model predictions …
pred_simple_test <- predict_generator(model_simple, test_gen_pred, test_steps)
pred_stacked_test <- predict_generator(model_stacked, test_gen_pred, test_steps)
pred_conv_test <- predict_generator(model_conv, test_gen_pred, test_steps)
… and merge them into array.
pred_all_test <- abind::abind(pred_simple_test,
pred_stacked_test,
pred_conv_test,
along = 3)

mean

As a first pooling option we try the mean of the predictions.
pred_mean_test <- apply(pred_all_test, c(1,2), mean)
(mae_mean_test <- mean(abs(pred_mean_test - targets_test)))
## [1] 0.02560716
Hm … that did not help too much.

min/max (conservative)

As a last straw we will try the conservative pooling approach, which worked best on the validation set.
pred_minmax_test <- reshape2::melt(pred_all_test) %>%
group_by(Var1) %>%
summarize(
high = min(1),
low = max(2)
) %>%
ungroup() %>%
select(-Var1) %>%
as.matrix()
(mae_minmax_test <- mean(abs(pred_minmax_test - targets_test))) 
## [1] 0.02212562
Oh, my - let’s just consider our attempt of predicting ETH prices with RNNs as failed.

Conclusion

In a nutshell: Forecasting is hard and The Times They Are a-Changin’.
We were able to map some regularities within the training sample, which also seemed to reproduce reasonably in the adjacent time-frame of the validation sample. The crypto-space, however, still evolves rapidly and the last few months have not much in common with the period covered by the training sample. Further, we used data from binance, a very young company with lots of development during the last year, which might have impaired the comparability of training and test data even further.
On an abstract level, the lesson to learn is that systems (might) change and formerly good predictive models can become lousy. Hence, it is crucial to evaluate performance constantly and to update your models.

Some ways to improve fit (not exhaustive)

Our approach failed, but that does not mean that yours must too. Some things that might help you are:
• more data
• time has passed, and more data is available now - use it! (in my opinion, the most important shortcoming of our training data was that it did not cover a sideways market as we can observe in the test data. Hence our models were fully unprepared for this scenerio.)
• we only used a minimal set of features. You can add further useful ones (e.g., data from other exchanges, sentiment in tweets or Reddit, …)
• regularization
• Dropout: amend layer_gru() with dropout and recurrent_dropout. Note: adding dropout to the first recurrent layer does not pay off in my experience
• Max Pooling: add a layer_max_pooling_1d() after layer_conv_1d()
• more models
• the ensemble worked best, there is no reason that you cannot add further models, e.g., different types and architectures, models that use intervals greater than one minute to catch long-running trends, …
• train models to predict the error of your current model instead of the target

Closing Remarks

Although we failed, in the end, I hope you have seen how awesome the keras API for R is and how comfortable you can build even complex neural networks with it. Further, we peaked a little into different RNN architectures and hyperparameter training in general. Maybe, this can be a starting point for your own projects.
If you have any questions or comments, please post them in the comments section. If something is not working as outlined here, please check the package versions you are using. The system I have used was:
sessionInfo()
## R version 3.4.4 (2018-03-15)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
##
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
##  [3] LC_TIME=de_AT.UTF-8        LC_COLLATE=en_US.UTF-8
##  [5] LC_MONETARY=de_AT.UTF-8    LC_MESSAGES=en_US.UTF-8
##  [7] LC_PAPER=de_AT.UTF-8       LC_NAME=C
## [11] LC_MEASUREMENT=de_AT.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  base
##
## other attached packages:
##  [1] rgenoud_5.8-2.0   bindrcpp_0.2.2    mlrMBO_1.1.2
##  [4] smoof_1.5.1       checkmate_1.8.5   BBmisc_1.11
##  [7] mlr_2.13          ParamHelpers_1.11 keras_2.2.0
## [10] magrittr_1.5      forcats_0.3.0     stringr_1.3.1
## [16] tidyr_0.8.1       tibble_1.4.2      ggplot2_3.0.0
## [19] tidyverse_1.2.1   conflicted_0.1.0  rmarkdown_1.10
##
## loaded via a namespace (and not attached):
##  [1] httr_1.3.1         jsonlite_1.5       viridisLite_0.3.0
##  [4] splines_3.4.4      modelr_0.1.2       assertthat_0.2.0
##  [7] cellranger_1.1.0   yaml_2.2.0         pillar_1.3.0
## [10] backports_1.1.2    lattice_0.20-35    glue_1.3.0
## [13] reticulate_1.10    digest_0.6.16      RColorBrewer_1.1-2
## [16] rvest_0.3.2        colorspace_1.3-2   htmltools_0.3.6
## [19] Matrix_1.2-12      plyr_1.8.4         pkgconfig_2.0.2
## [22] lhs_0.16           broom_0.5.0        misc3d_0.8-4
## [25] haven_1.1.2        scales_1.0.0       parallelMap_1.3
## [28] DiceKriging_1.5.5  whisker_0.3-2      mco_1.0-15.1
## [31] withr_2.1.2        lazyeval_0.2.1     cli_1.0.0
## [34] survival_2.41-3    RJSONIO_1.3-0      crayon_1.3.4
## [40] nlme_3.1-131       xml2_1.2.0         tools_3.4.4
## [43] data.table_1.11.4  hms_0.4.2          plotly_4.8.0
## [46] munsell_0.5.0      compiler_3.4.4     rlang_0.2.2
## [49] plot3D_1.1.1       grid_3.4.4         rstudioapi_0.7
## [52] htmlwidgets_1.2    base64enc_0.1-3    gtable_0.2.0
## [55] R6_2.2.2           tfruns_1.4         lubridate_1.7.4
## [58] knitr_1.20         tensorflow_1.9     zeallot_0.1.0
## [61] bindr_0.1.1        fastmatch_1.1-0    rprojroot_1.3-2
## [64] stringi_1.2.4      parallel_3.4.4     Rcpp_0.12.18
## [67] tidyselect_0.2.4
Besides the R environment the version of keras and tensorflow might also be of interest. To retrieve this information we use the shell:
python3 -c 'import keras; print(keras.__version__)'
## Using TensorFlow backend.
## 2.2.2
and
python3 -c 'import tensorflow; print(tensorflow.__version__)'
## 1.11.0-rc0
Finally, I have to say that knitting rmarkdown with long-running keras model training did not work with RStudio for me (after a few hours it crashed every time). Instead, I have knitted this post directly in the shell with:
Rscript -e 'library(rmarkdown); rmarkdown::render("rmarkdown.Rmd", "html_document")'