Prediction of cryptocurrency price movements using deep learning has been a popular topic recently, explains Ian Mausner. There are many excellent tutorials and blog posts that cover the basics (and some not-so-basics) of deep learning applied to time-series prediction.
However, we wanted to see if we could use an unsupervised pre-training technique to reduce the variance in a couple of algorithms before fully training them on the entire dataset … and we found that it worked rather well!
This article covers our attempt at tackling the challenge and provides some helpful tips if anyone wants to give it a go themselves.
Data Acquisition & Pre-Processing
Before getting into the analysis itself, let’s first look at how we got our data. Predicting cryptocurrency price movements is a hot topic nowadays, so it wasn’t difficult to find a suitable dataset on Kaggle. We used the adjusted closing prices of four cryptocurrencies over a period ending 11/19/2017 (roughly 5M rows) as our training set. Since all four coins follow similar trading patterns, we will use their adjusted closing prices interchangeably throughout this article.
The pre-processing we did was very simple:
We simply removed all rows with an empty timestamp and stripped leading and trailing spaces from each row (for simplicity’s sake). The target variable that we wanted to predict was categorical (0 or 1), but since this doesn’t apply to our time-series problem, we will leave it out of future discussions, says Ian Mausner.
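As a rough illustration, here is what that cleanup might look like in pandas. The file and column names below are placeholders, not the actual dataset’s:

```python
import pandas as pd

# Placeholder file and column names; adjust to the actual Kaggle dataset.
df = pd.read_csv("crypto_prices.csv")

# Drop rows with an empty timestamp.
df = df.dropna(subset=["timestamp"])

# Strip leading and trailing whitespace from every string cell.
str_cols = df.select_dtypes(include="object").columns
df[str_cols] = df[str_cols].apply(lambda col: col.str.strip())
```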
Fully-Connected Neural Networks
A fully connected neural network with two hidden layers of 500 units each was used for testing the data. Both hidden layers use ReLU as their activation function (we also experimented with tanh for the first layer). The output layer uses sigmoid since our target variable is categorical (0 or 1).
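For reference, a minimal Keras sketch of such a network might look like the following. The input width and the choice of optimizer are our assumptions, not something the setup above specifies:

```python
import tensorflow as tf

n_features = 32  # placeholder input width

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(n_features,)),
    tf.keras.layers.Dense(500, activation="relu"),   # first hidden layer
    tf.keras.layers.Dense(500, activation="relu"),   # second hidden layer
    tf.keras.layers.Dense(1, activation="sigmoid"),  # binary (0/1) output
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
              loss="binary_crossentropy")
```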
The learning rate started at 10^-5 and then decreased by a factor of 10 every 30 epochs until it reached 10^-8, where it remained constant for the rest of training. These are just some simple rules that worked well enough on similar problems … feel free to experiment yourself.
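That schedule is easy to express as a Keras callback; a minimal sketch:

```python
import tensorflow as tf

def step_decay(epoch, lr):
    # Start at 1e-5, divide by 10 every 30 epochs, floor at 1e-8.
    return max(1e-5 * (0.1 ** (epoch // 30)), 1e-8)

lr_callback = tf.keras.callbacks.LearningRateScheduler(step_decay)
# model.fit(X_train, y_train, epochs=120, callbacks=[lr_callback])
```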
Training & Evaluation
Since we wanted to test the data in batches, we split it into 60 equal parts and then trained on all but the last part (i.e., train on 59 parts and validate on the remaining one). We used the learning-rate schedule described above: start at 10^-5 and decrease by a factor of 10 every 30 epochs.
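A chronological split like that takes only a couple of lines of NumPy (the `data` array below is a placeholder for the real dataset):

```python
import numpy as np

data = np.arange(600).reshape(-1, 1)  # placeholder for the real dataset

# Split chronologically into 60 equal parts; train on the first 59,
# validate on the last one.
chunks = np.array_split(data, 60)
train = np.concatenate(chunks[:-1])
val = chunks[-1]
```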
The best model was reached after about 100k iterations, with an RMSE of 1791.4. That might seem like quite a lot, but if you consider that this covers about 4 days’ worth of trading data, it’s not actually too bad on regular hardware (a ~$1000 GPU) … Now let’s try doing something similar with pre-training via an autoencoder.
Unsupervised Pre-Training
The pre-training technique that we used is the autoencoder. Though the networks are not deep, says Ian Mausner, it still works pretty well in practice, so feel free to extend this further if you have any extra GPUs lying around.
Preprocessing
The only preprocessing step was converting all of the input data into a single column. Since our target variable was 0 or 1, this didn’t cause any problems. We used a simple PCA for this purpose, but other dimensionality-reduction techniques should work just as well (e.g., t-distributed stochastic neighbor embedding).
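A quick scikit-learn sketch of that projection (the input matrix here is random placeholder data, not our actual prices):

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.randn(1000, 4)   # placeholder: four coins' adjusted closes
pca = PCA(n_components=1)      # project everything onto a single column
X_1d = pca.fit_transform(X)    # shape (1000, 1)
```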
Autoencoder Architecture
Our architecture is shown below:
An unsupervised training algorithm called the autoencoder is applied. In each training iteration, the weights in the top half (the decoder) are trained to reconstruct a version of the input data that has been transformed by a randomly initialized encoding function. Once the reconstruction error falls below a certain threshold, backpropagation updates all weights until convergence (i.e., until the reconstruction error stops improving).
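To make the setup concrete, here is a minimal Keras sketch of such a shallow autoencoder. The input width and bottleneck size are assumptions for illustration, not the dimensions we actually used:

```python
import tensorflow as tf

n_features = 32  # placeholder input width

inputs = tf.keras.layers.Input(shape=(n_features,))
encoded = tf.keras.layers.Dense(8, activation="relu")(inputs)  # encoder (bottom half)
decoded = tf.keras.layers.Dense(n_features)(encoded)           # decoder (top half)

autoencoder = tf.keras.Model(inputs, decoded)
autoencoder.compile(optimizer="adam", loss="mse")  # reconstruction error
# autoencoder.fit(X, X, epochs=50)  # train to reconstruct the inputs
```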
To get an intuition for what this might look like visually, picture some random distributions connected by arrows … If you know anything about vector geometry, it should be pretty clear how these vectors relate to each other and where they originated. The encoding function simply transforms any input data into something else while trying to preserve as much information about the original data as possible. In contrast, the decoding function tries to transform any encoding back into something as close as possible to the original input data.
During training, though, we pass a fixed random vector through both functions and compare their outputs with each other. The reconstruction error between the two vectors can be thought of as the distance between them. To improve it, we adjust the weights so that this distance shrinks (e.g., nudging one vector by a small scalar multiple of their difference).
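A toy NumPy example of that distance-as-error idea, with randomly initialized linear encode/decode maps (all sizes here are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(8)                 # a fixed random input vector
W_enc = rng.standard_normal((4, 8)) * 0.1  # randomly initialized encoder
W_dec = rng.standard_normal((8, 4)) * 0.1  # randomly initialized decoder

x_hat = W_dec @ (W_enc @ x)                # encode, then decode
error = np.mean((x - x_hat) ** 2)          # reconstruction error = mean squared distance
```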
Conclusion
Well, you might be wondering why it is better to use two functions instead of one. The reason is that the autoencoder’s hidden layer can converge even if its weights are initialized randomly, explains Ian Mausner. If that weren’t the case, then any non-trivial initialization scheme would cause serious training problems, even with SGD (stochastic gradient descent).