Using an LSTM for stock prediction
Disclaimer: This blog post is written as part of the Udacity Capstone project for Data Scientist Nanodegree. Accompanying code and README.md can be found here.
Introduction
Wouldn’t we all like to be able to predict the stock market? There’s lots of money involved in trading stocks each day; the value traded on NASDAQ in a single day regularly runs into the hundreds of billions of US dollars. Those who can best predict the direction of a given stock take home the biggest prizes.
PROJECT DEFINITION
Project Overview
In this project, we will investigate if a deep learning model, an LSTM to be precise, can help us predict the direction of a given stock. For this, we will have to find a dataset about stocks and pre-process this data. Furthermore, we will have to create a model and train it. Lastly, we will evaluate the model to see if it actually helps us.
General approach
The project can be roughly divided into three different parts:
1. LSTM
2. GUI
3. Overview
LSTM
For the first part, the LSTM, I used a KDnuggets article as a quick-start guide to LSTMs for stock prediction. After a few modifications to the data loading, it was ready for a first run.
I was positively surprised by the performance of a model this simple! There are only a few layers, I can easily train it on my laptop, and I haven’t enriched the data or done any feature engineering. The model seems able to correctly identify up and down trends.
After a few more test runs and some slight tweaking, the LSTM part was done.
GUI
Not too long ago I came across a post about a Python GUI library that I hadn’t used before called PySimpleGUI. Since I’ve never found Tkinter a very intuitive way to design a GUI in Python, I was eager to try it out.
The experiment was a success: it’s easy to get started with and intuitive to expand. For this project, I added a few dropdown boxes to select any of the available tickers and three text boxes to input the dates and number of epochs. The “Train!” button starts the training script with the given parameters, and “Exit”… exits.
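To give a feel for how little code this takes, below is a minimal sketch of such a window, simplified to a single ticker dropdown; names and layout are illustrative rather than the repo’s exact code.

import PySimpleGUI as sg

tickers = ['aapl.us', 'googl.us']  # illustrative; the real list comes from the dataset

layout = [
    [sg.Text('Ticker'), sg.Combo(tickers, key='ticker')],
    [sg.Text('Start date'), sg.InputText(key='start')],
    [sg.Text('End date'), sg.InputText(key='end')],
    [sg.Text('Epochs'), sg.InputText(key='epochs')],
    [sg.Button('Train!'), sg.Button('Exit')],
]

window = sg.Window('LSTM stock trainer', layout)
while True:
    event, values = window.read()
    if event in (sg.WIN_CLOSED, 'Exit'):
        break
    if event == 'Train!':
        print('would launch training with', values)  # hand off to the training script here
window.close()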
Trained models overview
I could have built a web app with a neat interface to show all kinds of metrics, but since this is just a hypothetical scenario I opted for something simpler: I created a Bootstrap HTML template and replaced parts of it with model data from a Python script. It’s not the cleanest or most robust solution, but it definitely does the job for now.
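As a sketch of the idea (the placeholder token and file names are hypothetical):

# build table rows from model data, then splice them into a Bootstrap template
rows_html = '<tr><td>model-1</td><td>0.66</td></tr>'  # would be generated from real model data

with open('template.html') as f:
    html = f.read()

html = html.replace('{{MODEL_ROWS}}', rows_html)  # hypothetical placeholder token

with open('overview.html', 'w') as f:
    f.write(html)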
Problem statement
Create an interface to train an LSTM to predict future stock values. The user must be able to select any date range, set of tickers, and number of epochs. Finally, there must be a way to easily compare trained models.
Metrics
Two metrics are used:
- absolute error
- percentage error
Absolute error is calculated as follows:
absolute_error = np.abs(y_hat - y)
Percentage error is calculated as follows:
percentage_error = np.abs(y_hat - y) / y
In both calculations, y is the ground truth.
These two metrics are also calculated for two baseline predictions to estimate the actual performance of the model.
The baselines are (a short sketch follows the list):
- Predict last known value for each next value
- Predict linearly from the last two known values for each next value
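A minimal sketch of both metrics evaluated against both baselines (all numbers are illustrative, not from the dataset):

import numpy as np

y_train = np.array([10.0, 10.5])  # last two known values
y = np.array([10.8, 11.0, 10.9])  # ground truth over the prediction horizon

# baseline 1: repeat the last known value
baseline_flat = np.full(len(y), y_train[-1])

# baseline 2: extrapolate linearly from the last two known values
step = y_train[-1] - y_train[-2]
baseline_linear = y_train[-1] + step * np.arange(1, len(y) + 1)

for name, y_hat in [('flat', baseline_flat), ('linear', baseline_linear)]:
    absolute_error = np.abs(y_hat - y)
    percentage_error = np.abs(y_hat - y) / y
    print(name, absolute_error.mean(), percentage_error.mean())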
Why?
These metrics were chosen because the absolute error conveys one thing very well: the difference from the ground truth. A dollar is a dollar, so it’s very useful to know just how far off you are.
On the other hand, you’re also interested in the difference relative to the size of your investment; this is where the mean absolute percentage error (MAPE) comes in. Since this metric is calculated as a percentage error from the ground truth, it’s very easy to see how far off you are relative to the value of the asset, assuming a single stock.
ANALYSIS
Data Exploration
The dataset can be found on Kaggle. It holds data from the start of each of 7195 different tickers until 2017-11-10, which also means that not all tickers have the same amount of data available. Of all tickers, 7163 have data; the remaining 32 listed tickers have no data points associated with them.
For each ticker and day, the following features are available (a loading sketch follows the list):
- Date, trading day
- Open, trading day’s opening value
- High, trading day’s maximum
- Low, trading day’s minimum
- Close, trading day’s closing value
- Volume, volume traded during the entire trading day
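As a quick illustration, loading a single ticker could look like the snippet below; the one-file-per-ticker layout and the path are assumptions based on the Kaggle dataset.

import pandas as pd

# each ticker ships as its own file, e.g. aapl.us.txt (path is illustrative)
df = pd.read_csv('Stocks/aapl.us.txt', parse_dates=['Date'])
print(df[['Date', 'Open', 'High', 'Low', 'Close', 'Volume']].head())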
The ticker with the most data points is ngl.us, with one data point for each trading day between 1962-01-02 and 2017-11-10, totalling 14059 data points.
The most common data length is 3201 data points, with 1690 tickers sharing this amount.
Missing values
There are no missing values in the dataset, besides the 32 empty tickers mentioned above. However, since the stock market is closed during the weekend, there are no values available for those days.
In the creation of the training and test sets, no distinction is made between these days: a Monday follows directly after the preceding Friday.
Stock splits and reverse stock splits
The data in the dataset has already been adjusted for dividends and splits, so there is no need to take these into account during pre-processing or training.
Data visualisation
[Figure: distribution of ticker lengths, i.e. the number of data points per ticker, across all tickers.]
Trading volume is an indicator of the liquidity of stocks on a given day. Below is the sum of trading volumes for each day starting 2016-01-01. Between March 2016 and July 2016 there are multiple zero-volume trading days in the set.
Trading volume and price difference
Is there a correlation between the trading volume and the price difference? The following plot shows the percentage difference from the previous close value, expressed as a ratio, against the relative trading volume, which is min-max scaled between 0 and 1.
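Roughly, such a plot can be produced as follows, assuming a single-ticker DataFrame df with Close and Volume columns (a sketch, not the exact plotting code used here):

import matplotlib.pyplot as plt

df['diff_ratio'] = df['Close'].pct_change()                     # difference vs previous close, as a ratio
vol = df['Volume']
df['vol_scaled'] = (vol - vol.min()) / (vol.max() - vol.min())  # min-max scale to [0, 1]

plt.scatter(df['vol_scaled'], df['diff_ratio'], s=2, alpha=0.3)
plt.xlabel('relative trading volume (min-max scaled)')
plt.ylabel('price difference ratio vs previous close')
plt.show()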
Distribution of price differences
The histogram below shows the distribution of price changes compared to the previous close. It is roughly a normal distribution with a mean close to 0, taking only regular trading days (a difference ratio of 0.2 or less) into account.
The mean of the difference ratio is 0.0002986971472252537, corresponding to an average daily price increase of roughly 0.03% over all tickers combined. It has a standard deviation of 0.027507475664955195.
Over an entire year, using the daily gain, one would expect a return of
(1 + 0.0002986971472252537) ** 252 = 1.0781649133392563
or about 7.8% (252 being the number of trading days in a year).
METHODOLOGY
Data preprocessing
The data is quite clean: there are no missing values or strange outliers.
To train on multiple tickers at once, a few steps are necessary during loading.
- Min-max scaling
- Encoding ticker variable
- Create sequences
- Concatenate and create NumPy array
Min-max scaling
For this step, I used Scikit-learn’s MinMaxScaler. For each ticker separately, a MinMaxScaler is created and fit_transform is called:
for ticker in tickers:
    # min-max scale per ticker
    scaler = MinMaxScaler()
    scalers[ticker] = scaler
    data = scaler.fit_transform(sub[input_cols])
The variable sub is the subset of the entire data holding only the data for the current ticker.
Min-max scaling is used because no assumptions can be made about the underlying distribution of the data; min-max scaling preserves the shape of the distribution.
Encoding ticker variable
The ticker variable is encoded using Pandas’ get_dummies function.
df = pd.get_dummies(df, columns=['ticker'])
Create sequences
To train an LSTM, training sequences must be prepared. For this project, sequences of length 60 were chosen. This is also where the data is divided into a training and a test set, using a user-specified fraction.
# create sets of X and y
for i in range(60, data.shape[0]):
    if i < train_split:
        X_train.append(data[i-60:i])
        y_train.append(data[i, target_cols_idx])
    else:
        X_test.append(data[i-60:i])
        y_test.append(data[i, target_cols_idx])
Concatenate and create NumPy array
As the final step before training, all data must be put into the right format. This is done using NumPy’s functions to create and reshape arrays:
X_train = np.array(X_train)
y_train = np.array(y_train)
X_train = np.reshape(X_train, (X_train.shape[0], X_train.shape[1], len(input_cols + tick_cols)))

X_test = np.array(X_test)
y_test = np.array(y_test)
X_test = np.reshape(X_test, (X_test.shape[0], X_test.shape[1], len(input_cols + tick_cols)))
Implementation
The code was first created in several Jupyter notebooks and then turned into several Python scripts.
The Python script can be called from the command line as follows:
usage: train.py [-h] [--tickers TICKERS] [--dates DATES] [--epochs EPOCHS]
                [--lstm LSTM] [--dropout DROPOUT]

Train a LSTM network on up to 5 tickers and a selected date range

optional arguments:
  -h, --help            show this help message and exit
  --tickers TICKERS, -t TICKERS
                        tickers to train on, separated by commas
  --dates DATES, -d DATES
                        start and end date for training data selection,
                        separated by a comma
  --epochs EPOCHS, -e EPOCHS
                        number of epochs to train for
  --lstm LSTM, -l LSTM  number of LSTM layers
  --dropout DROPOUT     dropout ratio
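For example, a hypothetical invocation might look like this (argument formats assumed from the help text above):

python train.py --tickers aapl.us,googl.us --dates 2015-01-01,2017-01-01 --epochs 120 --lstm 4 --dropout 0.2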
When using the GUI, you won’t need to specify these arguments manually; the GUI does it for you.
LSTM
The following snippet is used to create the LSTM and can be found in train.py:
regressor = Sequential()
regressor.add(LSTM(units=50, return_sequences=True, input_shape=(X_train.shape[1], X_train.shape[2])))
regressor.add(Dropout(dropout))
for i in range(n_lstm - 2):
    regressor.add(LSTM(units=50, return_sequences=True))
    regressor.add(Dropout(dropout))
regressor.add(LSTM(units=50))
regressor.add(Dropout(dropout))
regressor.add(Dense(units=1))
regressor.compile(optimizer='adam', loss='mean_squared_error')
regressor.fit(X_train, y_train, epochs=epochs, batch_size=32)
return regressor
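The snippet is the body of a function in train.py, hence the return statement, and assumes the usual Keras imports (or their tensorflow.keras equivalents):

from keras.models import Sequential
from keras.layers import LSTM, Dropout, Dense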
The above code creates at least two LSTM layers and puts a fully connected layer at the end. Depending on the chosen value of the n_lstm hyperparameter, more LSTM layers are created.
The selected optimizer is Adam; the selected loss is mean squared error.
Multiple architectures with multiple sets of hyperparameters have been tested.
Metrics
The following snippet is used to calculate the metrics for each ticker for each training run. Similar calculations are done for each of the baselines. Code can be found in evaluate_model() in train.py.
for ticker in ticker_data:
    # find first entry of ticker in test data
    for idx, seq in enumerate(X_test):
        m = seq[0, ticker['idx']]
        if m == 1:
            start = idx
            break
    y_pred = model.predict(X_test[start:start + n_days])
    sc = scalers[ticker['name']]
    y_pred = inverse_scale_col(sc, y_pred)
    y = inverse_scale_col(sc, y_test[start:start + n_days])
    base_step = y_train[-1] - y_train[-2]
    preds = {
        'y_pred': y_pred,
        'baseline_flat': np.array([y_train[-1] for _ in range(len(y))]),
        'baseline_last': np.array([y_train[-1] + base_step * (1 + x) for x in range(len(y))]),
    }
Furthermore, specific functions are available in train.py.
Data loading
def load_stocks(tickers, date_range=None):
    """loads data for multiple tickers and combines them into a DataFrame

    date range is inclusive
    """
Data preparation
def prepare_data(df, test_fraction=0.2):
    """Prepares the data for training and testing

    Inputs are:
        'Open', 'High', 'Low', 'Close', 'Volume'
        + one additional column for each ticker (one-hot)
    Target is Open column
    """
Scaling
def inverse_scale_col(scaler, arr):
    """Use a scaler to invert a single column, even if scaler expects a different shape"""
Save plots
def save_plots(model, X_test, y_test, ticker_data, scalers, target_dir, n_days=14):
    """Generate predictions on test set and create plots"""
Save logs
def save_log(tickers, date_range, start_time, metrics, epochs, target_dir):
    """Save a log about the training session"""
Hyperparameters used
The square brackets after each hyperparameter show the tested values.
- number of epochs [1, 5, 10, 15, 20, 30, 60, 120]
- number of LSTM layers [4, 5, 6]
- dropout ratio [0.15, 0.2, 0.25]
- tickers [many, many different tickers, too long a list to name here]
- number of tickers [1, 2, 3, 4, 5]
- date ranges [2015-01-01 to 2017-06-01, 2015-01-01 to 2017-01-01, 2014-01-01 to 2017-01-01, 2013-01-01 to 2017-01-01]
Top 5 results are shown below:
epochs  layers  drop  n_ticker  date_range  err        MAPE
120     4       0.2   5         2014-2017   0.6614554  0.0106266
120     4       0.2   5         2014-2017   0.6893753  0.0110751
120     4       0.2   5         2014-2017   0.7002434  0.0112497
60      4       0.2   5         2013-2017   0.7234054  0.0116218
120     5       0.2   5         2014-2017   0.7235355  0.0116239
Metrics are averaged over all tickers; rows are sorted by err.
Columns explained:
- epochs: number of epochs
- layers: number of LSTM layers in the model
- drop: dropout ratio
- n_ticker: number of tickers, NOTE: different tickers may be selected
- date_range: start and end year; each entry had a start and end date of January 1st of the specified year
- err: mean over tickers of the average absolute error
- MAPE: mean over tickers of the mean absolute percentage error
RESULTS
Model evaluation and validation
For each model, the error is calculated in terms of absolute error and percentage error, for each ticker, for each day, over 14 days. The overview shows only the absolute error per ticker, averaged over 1, 5 or 14 days.
From the error metrics, we can tell whether the model would actually help us. Try predicting some stock on your own to see what your error would be, then compare it to the performance of the model: is it better?
                aaba.us   aac.us    aame.us   aapl.us   googl.us
err             0.279148  0.374221  0.075230  0.303202  2.275476
MAPE            0.006654  0.020350  0.020446  0.002854  0.002829
err_base_flat   0.495000  0.494286  0.087314  0.466429  3.239286
MAPE_base_flat  0.011777  0.026910  0.023533  0.004390  0.004039
err_base_last   0.827143  0.776429  0.123129  0.767143  5.298571
MAPE_base_last  0.019693  0.042080  0.033082  0.007222  0.006605
The table above shows the average absolute error and the mean absolute percentage error for the optimal model with four LSTM layers (selected using these same metrics).
The model is able to beat both baselines for every ticker and for both metrics.
For reference, below is an example of a model trained with the exact same hyperparameters, except with one more LSTM layer (five LSTM layers in total):
                aaba.us   aac.us    aame.us   aapl.us   googl.us
err             0.357828  0.314343  0.093580  0.353446  2.633914
MAPE            0.008511  0.016956  0.025131  0.003324  0.003287
err_base_flat   0.495000  0.494286  0.087314  0.466429  3.239286
MAPE_base_flat  0.011777  0.026910  0.023533  0.004390  0.004039
err_base_last   0.827143  0.776429  0.123129  0.767143  5.298571
MAPE_base_last  0.019693  0.042080  0.033082  0.007222  0.006605
The error table above shows that this model is worse than the one with four LSTM layers on all tickers, with the exception of aac.us.
Validation and test set
A split for training and testing is made such that training data always precedes testing data chronologically. This is done to prevent a model from learning from the future.
All reported error measures are calculated on the performance of the model on the testing set.
Best model
From the final evaluation page, some trial and error, and a comparison of all the metrics, we can see that the best model is trained on five different tickers for 120 epochs. Training on fewer tickers or for fewer epochs hurts the performance of the model.
Justification
By looking at plots and metric values from many different runs and using a trial-and-error approach, we converged on a final model and set of training parameters.
Experiments were run with many different hyperparameter sets over the following values:
- number of epochs
- number of LSTM layers
- dropout ratio
- tickers and number of tickers
- date ranges
The final parameters were:
Tickers
['googl.us', 'aapl.us', 'aaba.us', 'aac.us', 'aame.us']
Date range
'start_date': '2015-01-01',
'end_date': '2017-01-01',
Epochs
120
CONCLUSION
Summary
First I acquired the data, then performed EDA and preprocessing to get it ready for machine learning. Next, I created an LSTM model, trained it on the processed data, and added metrics to evaluate its performance. When this was all working well, I started on the GUI: I chose PySimpleGUI to build a Python interface for selecting the data to train on and choosing the hyperparameters. With everything in place, I ran many experiments to find out which settings work best.
Furthermore, I made an HTML overview generator to easily see which models have been trained and to compare their performance. Finally, I wrote this blog post.
Reflection
It was fun working on the code for this project, and I got to try out a new library I had been looking forward to: PySimpleGUI. Preprocessing the data and creating a deep-learning model came to life when looking at the predictions and watching the model beat the baselines. And here we are, finally putting it all together using GitHub and Medium.
Improvement
Many more experiments could be run with different date ranges, model architectures, numbers of epochs and selections of tickers to find the optimal combination. A grid search could be used to check many more of these parameter settings.
The overview can be improved to be dynamic and interactive, so users could look into whichever interests them most.
Another fun thing to add would be a simulation that pretends to make trades based on the predictions: would the model make you any money?
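As a toy sketch of what that could look like, long-only and ignoring transaction costs (everything here is illustrative):

import numpy as np

def simulate_trades(predicted, actual, cash=1000.0):
    """Buy when the model predicts tomorrow's price above today's, sell otherwise."""
    shares = 0.0
    for price, pred_next in zip(actual[:-1], predicted[1:]):
        if pred_next > price and shares == 0.0:    # predicted rise: go long
            shares, cash = cash / price, 0.0
        elif pred_next <= price and shares > 0.0:  # predicted fall: exit the position
            cash, shares = shares * price, 0.0
    return cash + shares * actual[-1]  # final portfolio value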
Conclusion
This was definitely a fun project to work on. There are lots of ways to improve the model, the GUI or the overview. Which of those has priority will largely depend on the users and the business case. We will just have to see what’s next!
DELIVERABLES
This blog, this GitHub repo.