Using an LSTM for stock prediction

Sijmen van der Willik
12 min read · Oct 10, 2020


Disclaimer: This blog post is written as part of the Udacity Capstone project for Data Scientist Nanodegree. Accompanying code and README.md can be found here.

Introduction

Wouldn’t we all like to be able to predict the stock market? There is a lot of money involved in trading stocks each day; a day’s trading on NASDAQ regularly exceeds two hundred billion US dollars. Those who can best predict the direction of a given stock will take home the biggest prizes.

PROJECT DEFINITION

Project Overview

In this project, we will investigate if a deep learning model, an LSTM to be precise, can help us predict the direction of a given stock. For this, we will have to find a dataset about stocks and pre-process this data. Furthermore, we will have to create a model and train it. Lastly, we will evaluate the model to see if it actually helps us.

General approach

The project can be roughly divided into three different parts:
1. LSTM
2. GUI
3. Overview

LSTM

For the first part, the LSTM, I used a KDnuggets article as a quick-start guide to LSTMs for stock prediction. After a few modifications to the data loading, it was ready for a first run.

Predicted versus actual stock price; time is in days after the last known data point.

I was positively surprised by the performance of a model this simple! There are only a few layers, I can easily train it on my laptop, and I haven’t enriched the data or done any feature engineering. The model seems able to correctly identify up and down trends.

After a few more test runs and some slight tweaking, the LSTM part was done.

GUI

Not too long ago I came across a post about a Python GUI library I hadn’t used before, called PySimpleGUI. Since I have never found Tkinter a very intuitive way to design a GUI in Python, I was eager to try it out.

The experiment was a success: it’s easy to get started with and intuitive to expand. For this case, I added a few dropdown boxes to select any of the available tickers and three text boxes to input the dates and the number of epochs. The “Train!” button starts the training script with the given parameters, and “Exit”… exits.

The final result of the Python GUI using PySimpleGUI.
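A minimal sketch of what such a layout looks like in PySimpleGUI; the widget keys and default values here are illustrative, not the project’s exact code:

import PySimpleGUI as sg

# illustrative layout: a ticker dropdown, text inputs for dates and epochs
layout = [
    [sg.Text('Ticker'), sg.Combo(['googl.us', 'aapl.us', 'aaba.us'], key='-TICKER-')],
    [sg.Text('Start date'), sg.Input('2015-01-01', key='-START-')],
    [sg.Text('End date'), sg.Input('2017-01-01', key='-END-')],
    [sg.Text('Epochs'), sg.Input('120', key='-EPOCHS-')],
    [sg.Button('Train!'), sg.Button('Exit')],
]

window = sg.Window('LSTM trainer', layout)
while True:
    event, values = window.read()
    if event in (sg.WIN_CLOSED, 'Exit'):
        break
    if event == 'Train!':
        # hand the selected parameters to the training script
        print(values['-TICKER-'], values['-START-'], values['-END-'], values['-EPOCHS-'])
window.close()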

Trained models overview

I could build a web app with a neat interface to show all kinds of metrics, but since this is just a hypothetical scenario, I opted for something simpler: I created a Bootstrap HTML template and replaced parts of it with model data from a Python script. It’s not the cleanest or most robust solution, but it definitely does the job for now.

Bootstrap model overview

Problem statement

Create an interface to train an LSTM to predict future stock values. The user must be able to select any date range, set of tickers, and number of epochs. Finally, there must be a way to easily compare trained models.

Metrics

Two metrics are used:
  • absolute error
  • percentage error

Absolute error is calculated as follows:

absolute_error = np.abs(y_hat - y)

Percentage error is calculated as follows:

percentage_error = np.abs(y_hat - y) / y

In both calculations, y is the ground truth.

These two metrics are also calculated for two baseline predictions to estimate the actual performance of the model.

The baselines are:

  1. Predict the last known value for each next value
  2. Predict linearly from the last two known values for each next value

Example plot of predictions by an underfitted LSTM model, the two baseline metrics, and the ground truth.
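As a sketch, both baselines can be generated from the last known training values, assuming y_train holds the unscaled series and n_days is the prediction horizon:

import numpy as np

n_days = 14  # prediction horizon (assumed)

# baseline 1: repeat the last known value
baseline_flat = np.full(n_days, y_train[-1])

# baseline 2: extrapolate linearly from the last two known values
step = y_train[-1] - y_train[-2]
baseline_last = y_train[-1] + step * np.arange(1, n_days + 1)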

Why?

These metrics were chosen because the absolute error conveys one thing very well: the difference from the ground truth. A dollar is a dollar, so it’s useful to know exactly how far off you are.

On the other hand, you are also interested in the error relative to the size of your investment; this is where MAPE comes in. Since it is calculated as a percentage of the ground truth, it makes it easy to see how far off you are relative to the asset itself, assuming a single stock.

ANALYSIS

Data Exploration

The dataset can be found on Kaggle. It holds data for 7195 different tickers, from each ticker’s first listing until 2017-11-10, which means not all tickers have the same amount of data available. Of all tickers, 7163 have data; the remaining 32 listed tickers have no data points associated with them.

For each ticker and day, the following features are available:

  1. Date, trading day
  2. Open, trading day’s opening value
  3. High, trading day’s maximum
  4. Low, trading day’s minimum
  5. Close, trading day’s closing value
  6. Volume, total volume traded during the trading day

The ticker with the most data points is ngl.us, with one data point for every trading day between 1962-01-02 and 2017-11-10, totalling 14059 data points.

The most common data length is 3201 data points, with 1690 tickers sharing this amount.
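Counts like these can be reproduced with a simple groupby; the DataFrame and column name here are assumptions about how the combined data is stored:

# assumes a combined DataFrame with one row per ticker per day and a 'ticker' column
lengths = df.groupby('ticker').size()

print(lengths.idxmax(), lengths.max())   # longest ticker and its number of data points
print(lengths.value_counts().idxmax())   # most common ticker length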

Missing values

There are no missing values in the dataset, besides the 32 empty tickers mentioned above. However, since the stock market is closed during weekends, there are no values for those days.

When creating the training and test sets, no distinction is made between these days: each Monday follows directly after the preceding Friday.

Stock splits and reverse stock splits

The data in the dataset has already been adjusted for dividends and splits, so there is no need to take these into account during pre-processing or training.

Data visualisation

Distribution of ticker lengths (the number of data points per ticker); the x-axis shows the ticker length, the y-axis the number of occurrences.

Trading volume is an indicator of the liquidity of stocks on a given day. Below is the sum of trading volumes for each day starting 2016-01-01. Between March 2016 and July 2016 there are multiple zero-volume trading days in the set.

Total (sum) trading volume for each day starting 2016-01-01.

Trading volume and price difference

Is there a correlation between trading volume and price difference? The plot below shows the price difference relative to the previous close (as a ratio) against the relative trading volume, where the trading volume is min-max scaled between 0 and 1.

The y-axis shows the difference as a ratio: -1 means a complete loss, while 1 means a 100% gain compared to the previous close. The x-axis shows the scaled trading volume, where 0 is no trading volume and 1 is the maximum trading volume in the history of the given stock.

Distribution of price differences

The histogram below shows the distribution of price changes compared to the previous close. It shows an approximately normal distribution with a mean close to 0, taking only regular trading days with an absolute difference of 0.2 or less into account.

The y-axis shows the frequency; the x-axis shows the ratio of the Close price to the previous day’s Close.

The mean of the difference ratio is 0.0002986971472252537, corresponding to an average daily price increase of roughly 0.03% over all tickers combined. It has a standard deviation of 0.027507475664955195.

Over an entire year, using this average daily gain, one would expect a return of

(1 + 0.0002986971472252537) ** 252 = 1.0781649133392563

or roughly 7.8%, where 252 is the number of trading days in a year.
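A sketch of this calculation, assuming close holds a single ticker’s Close values as a pandas Series (the figures above were computed across all tickers combined):

daily_ratio = close.pct_change().dropna()  # day-over-day change as a ratio

mean_daily = daily_ratio.mean()
std_daily = daily_ratio.std()

# compound the mean daily gain over 252 trading days
yearly = (1 + mean_daily) ** 252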

METHODOLOGY

Data preprocessing

The data is quite clean: there are no missing values or strange outliers.

To train on multiple tickers at once, a few steps are necessary during loading.

  1. Min-max scaling
  2. Encoding ticker variable
  3. Create sequences
  4. Concatenate and create NumPy array

Min-max scaling

For this step, I used scikit-learn’s MinMaxScaler. For each ticker, a separate MinMaxScaler is created and fit_transform is called.

from sklearn.preprocessing import MinMaxScaler

scalers = {}
for ticker in tickers:
    # min-max scale each ticker separately
    sub = df[df['ticker'] == ticker]  # data for the current ticker only
    scaler = MinMaxScaler()
    scalers[ticker] = scaler
    data = scaler.fit_transform(sub[input_cols])

The variable sub is the subset of the entire data holding only the data for the current ticker.

Min-max scaling is used because no assumptions can be made about the underlying distribution of the data, and min-max scaling preserves the shape of the distribution.

Encoding ticker variable

The ticker variable is encoded using Pandas’ get_dummies function.

df = pd.get_dummies(df, columns=['ticker'])
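The one-hot columns created this way are what the later snippets refer to as tick_cols; a minimal way to collect them, assuming the default get_dummies naming based on the column argument above:

# columns produced by get_dummies are named 'ticker_<name>'
tick_cols = [c for c in df.columns if c.startswith('ticker_')]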

Create sequences

To train an LSTM, training sequences must be prepared. For this project, sequences of length 60 were chosen. This is also where the data is split into a training and a test set using a user-set fraction.

X_train, y_train, X_test, y_test = [], [], [], []

# create sets of X and y; each X is a sequence of the 60 preceding time steps
for i in range(60, data.shape[0]):
    if i < train_split:
        X_train.append(data[i-60:i])
        y_train.append(data[i, target_cols_idx])
    else:
        X_test.append(data[i-60:i])
        y_test.append(data[i, target_cols_idx])

Concatenate and create NumPy array

As the final step before training can start, all data must be put in the right format. This is done using NumPy’s functions to create and reshape arrays.

X_train = np.array(X_train)
y_train = np.array(y_train)
X_train = np.reshape(X_train, (X_train.shape[0], X_train.shape[1], len(input_cols+tick_cols)))
X_test = np.array(X_test)
y_test = np.array(y_test)
X_test = np.reshape(X_test, (X_test.shape[0], X_test.shape[1], len(input_cols+tick_cols)))

Implementation

The code was first created in several Jupyter notebooks and then turned into several Python scripts.

The Python script can be called from the command line as follows:

usage: train.py [-h] [--tickers TICKERS] [--dates DATES] [--epochs EPOCHS]
                [--lstm LSTM] [--dropout DROPOUT]

Train an LSTM network on up to 5 tickers and a selected date range

optional arguments:
  -h, --help            show this help message and exit
  --tickers TICKERS, -t TICKERS
                        tickers to train on, separated by commas
  --dates DATES, -d DATES
                        start and end date for training data selection,
                        separated by a comma
  --epochs EPOCHS, -e EPOCHS
                        number of epochs to train for
  --lstm LSTM, -l LSTM  number of LSTM layers
  --dropout DROPOUT     dropout ratio

When using the GUI, you won’t need to specify these arguments manually; the GUI does it for you.
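As an example, a run matching the final parameters reported below could look like this (the exact date format expected by --dates is an assumption):

python train.py --tickers googl.us,aapl.us,aaba.us,aac.us,aame.us --dates 2015-01-01,2017-01-01 --epochs 120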

LSTM

The following snippet is used to create the LSTM and can be found in train.py:

# imports assumed; train.py may import these from keras directly
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dropout, Dense

regressor = Sequential()

# first LSTM layer defines the input shape: (sequence length, number of features)
regressor.add(LSTM(units=50, return_sequences=True, input_shape=(X_train.shape[1], X_train.shape[2])))
regressor.add(Dropout(dropout))

# add intermediate LSTM layers depending on the n_lstm hyperparameter
for i in range(n_lstm - 2):
    regressor.add(LSTM(units=50, return_sequences=True))
    regressor.add(Dropout(dropout))

# final LSTM layer returns only the last output
regressor.add(LSTM(units=50))
regressor.add(Dropout(dropout))

regressor.add(Dense(units=1))

regressor.compile(optimizer='adam', loss='mean_squared_error')

regressor.fit(X_train, y_train, epochs=epochs, batch_size=32)

# in train.py this snippet sits inside a function that returns the model
return regressor

The above code creates at least two LSTM layers and puts a fully connected layer at the end. Depending on the chosen value of the n_lstm hyperparameter, more LSTM layers are added.

The selected optimizer is Adam; the selected loss is mean squared error.

Multiple architectures with multiple sets of hyperparameters have been tested.

Metrics

The following snippet is used to calculate the metrics for each ticker for each training run; similar calculations are done for each of the baselines. The code can be found in evaluate_model() in train.py.

for ticker in ticker_data:
    # find the first entry of this ticker in the test data
    for idx, seq in enumerate(X_test):
        m = seq[0, ticker['idx']]
        if m == 1:
            start = idx
            break
    y_pred = model.predict(X_test[start:start + n_days])

    sc = scalers[ticker['name']]
    y_pred = inverse_scale_col(sc, y_pred)
    y = inverse_scale_col(sc, y_test[start:start + n_days])

    base_step = y_train[-1] - y_train[-2]
    preds = {
        'y_pred': y_pred,
        'baseline_flat': np.array([y_train[-1] for _ in range(len(y))]),
        'baseline_last': np.array([y_train[-1] + base_step * (1 + x) for x in range(len(y))]),
    }
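From the preds dictionary, the two metrics from the Metrics section can then be computed per prediction; a minimal sketch (the exact dictionary shape used in train.py is an assumption):

metrics = {}
for name, y_hat in preds.items():
    metrics[name] = {
        'err': np.mean(np.abs(y_hat - y)),       # mean absolute error
        'MAPE': np.mean(np.abs(y_hat - y) / y),  # mean absolute percentage error
    }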

Furthermore, a number of helper functions are available in train.py.

Data loading

def load_stocks(tickers, date_range=None):
    """Loads data for multiple tickers and combines them into a DataFrame

    date range is inclusive
    """

Data preparation

def prepare_data(df, test_fraction=0.2):
    """Prepares the data for training and testing

    Inputs are:
        'Open', 'High', 'Low', 'Close', 'Volume'
        + one additional column for each ticker (one-hot)

    Target is the Open column
    """

Scaling

def inverse_scale_col(scaler, arr):
    """Use a scaler to invert a single column, even if the scaler expects a different shape"""

Save plots

def save_plots(model, X_test, y_test, ticker_data, scalers, target_dir, n_days=14):
    """Generate predictions on the test set and create plots"""

Save logs

def save_log(tickers, date_range, start_time, metrics, epochs, target_dir):
    """Save a log about the training session"""

Hyperparameters used

The square brackets after each hyperparameter show the tested values.

  • number of epochs [1, 5, 10, 15, 20, 30, 60, 120]
  • number of LSTM layers [4, 5, 6]
  • dropout ratio [0.15, 0.2, 0.25]
  • tickers [many, many, different tickers, too long a list to name here]
  • number of tickers [1, 2, 3, 4, 5]
  • date ranges [[2015-01-01 to 2017-06-01], [2015-01-01 to 2017-01-01], [2014-01-01 to 2017-01-01], [2013-01-01 to 2017-01-01]]

Top 5 results are shown below:

epochs  layers  drop  n_ticker  date_range  err        MAPE
120     4       0.2   5         2014-2017   0.6614554  0.0106266
120     4       0.2   5         2014-2017   0.6893753  0.0110751
120     4       0.2   5         2014-2017   0.7002434  0.0112497
60      4       0.2   5         2013-2017   0.7234054  0.0116218
120     5       0.2   5         2014-2017   0.7235355  0.0116239

Metrics are averaged over all tickers; rows are sorted by err.

Columns explained:

  • epochs: number of epochs
  • layers: number of LSTM layers in the model
  • drop: dropout ratio
  • n_ticker: number of tickers (note: different runs may select different tickers)
  • date_range: start and end year; each entry had a start and end date of January first of the specified year
  • err: absolute error, averaged over tickers
  • MAPE: mean absolute percentage error, averaged over tickers

RESULTS

Model evaluation and validation

For each model, the error is calculated in terms of absolute error and percentage error for each ticker for each day for 14 days. The overview shows only the absolute error per ticker, and averages over 1, 5 or 14 days.

From the error metrics, we can determine whether the model would help us. Try to predict some stock on your own to see what your error would be, then compare it to the performance of the model: is it better?

Stock predictions and baseline for the best model:

                  aaba.us    aac.us     aame.us    aapl.us    googl.us
err               0.279148   0.374221   0.075230   0.303202   2.275476
MAPE              0.006654   0.020350   0.020446   0.002854   0.002829
err_base_flat     0.495000   0.494286   0.087314   0.466429   3.239286
MAPE_base_flat    0.011777   0.026910   0.023533   0.004390   0.004039
err_base_last     0.827143   0.776429   0.123129   0.767143   5.298571
MAPE_base_last    0.019693   0.042080   0.033082   0.007222   0.006605

The table above shows the average absolute error and the mean absolute percentage error for the optimal model with four LSTM layers (selected using the same metrics).

The model is able to beat both baselines for every ticker and for both metrics.

For reference, below is an example of a model trained with the exact same hyperparameters except for one additional LSTM layer (5 LSTM layers in total):

                  aaba.us    aac.us     aame.us    aapl.us    googl.us
err               0.357828   0.314343   0.093580   0.353446   2.633914
MAPE              0.008511   0.016956   0.025131   0.003324   0.003287
err_base_flat     0.495000   0.494286   0.087314   0.466429   3.239286
MAPE_base_flat    0.011777   0.026910   0.023533   0.004390   0.004039
err_base_last     0.827143   0.776429   0.123129   0.767143   5.298571
MAPE_base_last    0.019693   0.042080   0.033082   0.007222   0.006605

The error table above shows that this model performs worse than the 4-layer model on all tickers, with the exception of aac.us.

Validation and test set

A split for training and testing is made such that training data always precedes testing data chronologically. This is done to prevent a model from learning from the future.

All reported error measures are calculated on the performance of the model on the testing set.

Best model

From the final evaluation page, some trial and error, and a comparison of all the metrics, we can see that the best model is trained on five different tickers for 120 epochs. Training on fewer tickers or for fewer epochs hurts the model’s performance.

Justification

By looking at plots and metric values across many different runs and using a trial-and-error approach, we converged on a final model and training parameters.

Experimentation has been done, trying out many different hyperparameter sets of the following values:

  • number of epochs
  • number of LSTM layers
  • dropout ratio
  • tickers and number of tickers
  • date ranges

The final parameters were:

Tickers

['googl.us', 'aapl.us', 'aaba.us', 'aac.us', 'aame.us']

Date range

'start_date': '2015-01-01',
'end_date': '2017-01-01',

Epochs

120

CONCLUSION

Summary

First I acquired the data, then did EDA and preprocessing to get it ready for machine learning. Next, I created an LSTM model, trained it on the processed data, and added metrics to evaluate its performance. Once this was all working well, I started work on the GUI, choosing PySimpleGUI to build a Python interface for selecting the training data and hyperparameters. With everything in place, I ran many experiments to find out which settings work best.

Furthermore, I made an HTML overview generator to easily see which models have been trained and to compare their performance. Finally, I wrote this blog post.

Reflection

It was fun working on the code for this project, and I was able to try out a new library I had been looking forward to: PySimpleGUI. Preprocessing the data and creating a deep-learning model came to life when looking at the predictions and watching the model beat the baselines. And here we are, finally putting it all together using GitHub and Medium.

Improvement

Many more experiments could be run with different date ranges, model architectures, numbers of epochs, and ticker selections to find the optimal combination. A grid search could be used to check many more of these parameter settings.

The overview can be improved to be dynamic and interactive, so users could drill into whatever interests them most.

Another fun thing to add would be a simulation that pretends to make trades based on the predictions: would the model make you any money?
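A naive version of such a simulation could look like this: be fully invested only on days the model predicts a gain, stay in cash otherwise. Everything here is illustrative and not part of the project code:

def simulate(prices, predictions, start_cash=1000.0):
    """Naive strategy: hold the stock only on days the model predicts a gain."""
    cash = start_cash
    for price, next_price, pred in zip(prices[:-1], prices[1:], predictions[1:]):
        if pred > price:                 # model predicts a rise for tomorrow
            cash *= next_price / price   # capture that day's actual return
    return cash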

Conclusion

This was definitely a fun project to work on. There are lots of ways to improve the model, the GUI or the overview. Which of those has priority will largely depend on the users and the business case. We will just have to see what’s next!

DELIVERABLES

This blog, this GitHub repo.
