Trading Strategies Using Machine Learning (Part 2)

It's only fair to share...Share on Facebook0Share on LinkedIn0Share on Google+0Tweet about this on Twitter0

Welcome to Part 2 of the Trading Strategies Using Machine Learning series. Up to this point, we have collected the data, compiled it into a single data frame, and visualized it using a correlation matrix. As mentioned previously (via e-mail blast), most stocks are correlated with one another. Knowing this, you can create trading strategies that depend on market trends.

The Full Code up until this point is here.

First, we’re going to make the following imports.

What Are We Predicting?

Many of these packages are important for our model, and I will explain them all in greater detail later. First, we need to establish what our problem is. We want to use machine learning to predict, but what are we trying to predict? Maybe we are trying to predict the percentage change in the stock over time; maybe we want to predict the stock price over time.

One thing that is important when working with machine learning algorithms is setting up our features and our labels.

Features: Features are attributes for a given model. They help predict a specific value that you are interested in. These can be a single column worth of data or multiple columns.

Labels: Labels are what you are trying to predict. A label is always a single column.

The entire idea of machine learning (at least, the supervised learning that we are using) is that the variables or attributes are going to help you predict or forecast a certain value in the future.

Now that we’ve grabbed some stock market data, we can use it to create new data through manipulation and begin the machine learning process for regression.

In our case, we are trying to predict price, so the price will be our label. More precisely, the label is the price some number of days in the future, since that is what we want to forecast. The current price then becomes one of our features, but it’s not the only one: we can also use the difference between the high and the low for that day, as well as the daily percentage change, both of which reflect volatility. Let’s go ahead and create some new features.
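
The two derived features described above can be sketched with pandas. This is only an illustration: the column names (Open, High, Low, Close, Volume), the new column names HL_PCT and PCT_change, and the sample prices are all assumptions, not values from the data set used in this series.

```python
import pandas as pd

# Hypothetical daily price data; column names and values are assumptions.
df = pd.DataFrame(
    {
        "Open": [100.0, 102.0, 101.0],
        "High": [103.0, 104.0, 102.0],
        "Low": [99.0, 100.0, 98.0],
        "Close": [102.0, 101.0, 100.0],
        "Volume": [1000, 1200, 900],
    }
)

# Intraday high-low range as a percentage of the close.
df["HL_PCT"] = (df["High"] - df["Low"]) / df["Close"] * 100.0

# Open-to-close move as a percentage of the open.
df["PCT_change"] = (df["Close"] - df["Open"]) / df["Open"] * 100.0

# Keep only the columns we intend to use as features.
features = df[["Close", "HL_PCT", "PCT_change", "Volume"]]
print(features)
```

Expressing both quantities as percentages keeps them comparable across days even when the price level changes.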

We define our forecast column, ‘forecast_col’, and then fill in NaN (Not a Number) values with -99999. There are many different ways of dealing with missing values, but we’re going to use this method for now, because many machine learning algorithms will simply treat this value as an outlier.
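
A minimal sketch of this step, assuming a pandas data frame with a Close column (the column name and the toy values are assumptions made for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame with a couple of missing values.
df = pd.DataFrame(
    {"Close": [10.0, np.nan, 12.0], "Volume": [100.0, 110.0, np.nan]}
)

forecast_col = "Close"  # assumed name of the column we want to forecast

# Replace every NaN with a sentinel value instead of dropping the row.
df = df.fillna(-99999)
print(df)
```

The alternative, df.dropna(), discards the affected rows entirely, which may or may not be acceptable depending on how much data is missing.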

Next, we define how far out we want to forecast. We’ll assume all current columns are our features, so we’ll add a new column for the label with a simple pandas operation.
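
One way to build such a label column is to shift the forecast column upward, so that each row is paired with the price some days later. The sketch below assumes a 7-row horizon (forecast_out = 7) and a column named Close; both are illustrative choices.

```python
import pandas as pd

# 20 rows of made-up closing prices: 100.0, 101.0, ..., 119.0
df = pd.DataFrame({"Close": [float(p) for p in range(100, 120)]})

forecast_col = "Close"
forecast_out = 7  # assumed horizon: how many rows ahead we want to predict

# Each row's label is the forecast column's value forecast_out rows later.
df["label"] = df[forecast_col].shift(-forecast_out)

print(df.head(3))
print(df.tail(forecast_out))  # the last forecast_out labels are NaN
```

The final forecast_out rows have no label because their future price is not known yet. Those are the rows we will eventually forecast.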

Training and Testing

Training data is labeled data used to fit your machine learning algorithm. The test set is data the model has never seen before; it is used to measure how robust the predictions are likely to be in the real world. It’s no different from a student who comes across a fresh batch of exam problems. Models also need to be challenged to evaluate their performance.

Every data set is unique in terms of its content. With a fair bit of domain knowledge, one should decide how to split their data set into training and test pairs. The ratio of the split is usually around 80:20 or 75:25 depending on how rigorously you want to test the performance of your models.
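
As an illustration of an 80:20 split, here is a sketch using scikit-learn’s train_test_split on synthetic arrays. The use of scikit-learn and the random_state value are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 samples, 2 features each.
X = np.arange(200, dtype=float).reshape(100, 2)
y = np.arange(100, dtype=float)

# Hold out 20% of the rows for testing; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

print(len(X_train), len(X_test))
```

Changing test_size to 0.25 gives the 75:25 split mentioned above.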

In machine learning, the convention in code is to define X (capital X) as the features and y (lowercase y) as the label.

Our features are everything in the data frame, EXCEPT the label column. This is the reason why we used the drop method that can be applied to data frames, which will return a new data frame without the labels. We define our y variable, which is our label, as simply the label column of the data frame.
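
In pandas terms, this can be sketched as follows. The column names and values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy frame: two feature columns plus the label column.
df = pd.DataFrame(
    {
        "Close": [100.0, 101.0, 102.0, 103.0],
        "HL_PCT": [1.5, 2.0, 1.0, 2.5],
        "label": [101.0, 102.0, 103.0, 104.0],
    }
)

# Features: every column except the label. drop returns a new frame.
X = np.array(df.drop(columns=["label"]))

# Label: just the label column.
y = np.array(df["label"])

print(X.shape, y.shape)
```

Because drop returns a new data frame, the original df still contains the label column afterwards.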

Next, we are going to scale our data with pre-processing. It is generally better for your machine learning features to be on a comparable scale, roughly in the range of -1 to 1. This can help improve the accuracy.
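
A sketch of two common ways to do this with scikit-learn’s preprocessing module (the choice of scikit-learn and the sample values are assumptions). Note that preprocessing.scale standardizes each column to zero mean and unit variance, which puts most values near the -1 to 1 range without strictly bounding them, while MinMaxScaler can enforce the range exactly.

```python
import numpy as np
from sklearn import preprocessing

# Two features on very different scales: a price and a volume.
X = np.array(
    [
        [100.0, 1_000_000.0],
        [102.0, 1_200_000.0],
        [104.0, 800_000.0],
    ]
)

# Standardize: each column ends up with mean 0 and standard deviation 1.
X_scaled = preprocessing.scale(X)

# Alternative: squeeze each column into exactly [-1, 1].
X_ranged = preprocessing.MinMaxScaler(feature_range=(-1, 1)).fit_transform(X)

print(X_scaled)
print(X_ranged)
```

Whichever method you choose, new data must be scaled the same way as the training data before it is passed to the model.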

Cross Validation

If the training data is also used to evaluate the performance of our model, the evaluation will not reveal over-fitting. Over-fitting occurs when your model fits too closely to the historical data (the training data, in our case), to the point where it fails to generalize to new information. Your training data metrics will then be misleading about model performance.

If you fail to keep any separate test data and use all your data to train, you will not know how well or badly your model performs on new unseen data. This is one of the major reasons why well trained ML models fail to make any meaningful predictions on live data.

To solve this issue, we can set aside a separate validation set from the data. Using a cross-validation module (such as the one in scikit-learn), you can train on the training data, evaluate the performance on the validation data, optimize the model until you are happy with the performance, and finally test the model on the test set to measure how well it generalizes.
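
The workflow above can be sketched with scikit-learn’s model_selection module, which replaced the older cross_validation module in recent releases. The 60/20/20 proportions, the synthetic data, and the use of 5-fold cross-validation are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic regression data: 100 samples, 2 features, a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([2.0, -1.0]) + rng.normal(scale=0.1, size=100)

# Step 1: hold out 20% as the final test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Step 2: split the remainder into training (60 rows) and validation (20 rows).
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0
)

# Alternative to a single validation set: 5-fold cross-validation on the
# non-test data, which rotates the validation fold through the data.
scores = cross_val_score(LinearRegression(), X_rest, y_rest, cv=5)

print(len(X_train), len(X_val), len(X_test))
print(scores)
```

The test set is touched only once, at the very end, after all tuning has been done against the validation data.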

The only issue is that once this is done, you can’t go back and optimize your model further. Going back to the student preparing for the exam, think of this like a student who spent the last couple of weeks using the review guide to study, only to find that the review guide was the entire exam. The algorithm will give you a more favorable result each time because it has already seen the data previously.

From there, we run the following code.
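
A minimal sketch of such a training and scoring step, assuming scikit-learn’s LinearRegression and synthetic data standing in for the stock data frame (the variable names clf and confidence are illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the scaled features and labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

clf = LinearRegression()
clf.fit(X_train, y_train)  # learn from the training set only

# For regressors, score returns the coefficient of determination (R^2)
# measured on the data passed in, here the held-out test set.
confidence = clf.score(X_test, y_test)
print(confidence)
```

For a regressor, the score is R^2, where 1.0 means a perfect fit on the test set and values near 0 mean the model does no better than predicting the mean.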

The confidence score that I received when I ran this test was 0.9841812159851179. You might receive a different result. It may sound pretty good, but it’s not as good as you may think. Stock prices tend to move in such a random fashion that it is still difficult to trade based on this result. Still, it’s better than some other algorithms out there. Let’s look at some other models that we could potentially use.

Model Selection

The choice of model will usually depend on the problem at hand. Sometimes you have labeled examples of the value you want to predict, and the model learns to map features to those known labels (supervised learning). Sometimes there are no labels, and you use the machine to discover unknown patterns in the data (unsupervised learning).

Maybe you’ll need a regression (predicting a numeric value, such as a price in the future); maybe you’ll need a classification (predicting a category, such as whether the price will go up or down).

Supervised vs Unsupervised

Regression vs Classification

Some commonly used supervised learning algorithms to get you started are:

Linear Regression

Logistic Regression

K Nearest Neighbors

Support Vector Machines (SVM)

We can try various machine learning algorithms to see how they test out against one another. Let’s try a few:
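
A comparison loop of this kind could look like the sketch below. It assumes scikit-learn and synthetic data, and it uses the regression variants of the algorithms listed above (KNeighborsRegressor and SVR). Logistic regression is left out because it is a classifier rather than a regressor.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR

# Synthetic, roughly linear data standing in for the stock features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Candidate models, keyed by a display name.
models = {
    "LinearRegression": LinearRegression(),
    "KNeighborsRegressor": KNeighborsRegressor(n_neighbors=5),
    "SVR (rbf kernel)": SVR(kernel="rbf"),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # R^2 on the test set
    print(name, scores[name])
```

Every model is trained and scored on the same split, so the scores are directly comparable.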

The loop gives us the following result:

As we can see, the linear model performed the best. This makes sense if, over the forecast horizon, Facebook’s future price is close to a linear function of its recent price and our other features.

Now let’s say that we have tested the data and we are happy with our result. Now the next step is to forecast.

Forecasting and Visualization

First we take the data, pre-process it, and then split it into a training set and a test set. Our X_lately variable contains the most recent features, the rows for which we do not yet have a label, and these are what we’re going to predict against. We will use the predict function on X_lately and print out our forecast.
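
The forecasting step can be sketched as follows, assuming scikit-learn, a 7-row horizon, and a made-up price series that rises in a straight line. The train/test split is omitted here for brevity.

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.linear_model import LinearRegression

forecast_out = 7  # assumed horizon in rows

# Made-up closing prices rising linearly from 100 to 150 over 60 rows.
df = pd.DataFrame({"Close": np.linspace(100.0, 150.0, 60)})
df["label"] = df["Close"].shift(-forecast_out)

# Scale all feature rows together, then separate the unlabeled tail.
X = preprocessing.scale(np.array(df.drop(columns=["label"])))
X_lately = X[-forecast_out:]  # most recent rows: features only, no label yet
X = X[:-forecast_out]         # rows that do have a label
y = np.array(df["label"].dropna())

clf = LinearRegression().fit(X, y)

# Predict the label for each of the most recent rows.
forecast_set = clf.predict(X_lately)
print(forecast_set)
```

Scaling happens before X_lately is sliced off so that the recent rows are transformed in exactly the same way as the rows used for training.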

So the data should look something like this:

So these are our forecasts. Now what? We’re basically done, but the least we can do is visualize our data.
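
One way to plot the forecasts next to the historical prices is sketched below with matplotlib. The dates, the historical prices, and the forecast values are all placeholders invented for the illustration, and the output file name forecast.png is likewise an assumption.

```python
import matplotlib

matplotlib.use("Agg")  # render to a file; no display needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Placeholder history: 30 daily closes (made-up values).
dates = pd.date_range("2018-01-01", periods=30, freq="D")
history = pd.Series(np.linspace(100.0, 110.0, 30), index=dates)

# Placeholder forecasts standing in for the output of clf.predict.
forecast_set = np.linspace(110.5, 113.5, 7)

# Give each forecast a date, starting the day after the last known close.
forecast_dates = pd.date_range(
    dates[-1] + pd.Timedelta(days=1), periods=len(forecast_set), freq="D"
)
forecast = pd.Series(forecast_set, index=forecast_dates)

fig, ax = plt.subplots()
ax.plot(history.index, history.values, label="Close")
ax.plot(forecast.index, forecast.values, label="Forecast")
ax.set_xlabel("Date")
ax.set_ylabel("Price")
ax.legend(loc="lower right")
fig.autofmt_xdate()  # tilt the date labels so they do not overlap
fig.savefig("forecast.png")
```

In an interactive session you would drop the matplotlib.use("Agg") line and call plt.show() instead of saving to a file.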

And there you have it: Apple’s forecast for the next 7 days. This just scratches the surface of what you can do using machine learning with Python.



Andre is a senior at Pace University, studying Finance and Economics. His topics of interests include Equity Valuation, Short Ideas, Macro, Monetary Economics, ETFs, Forex and Fixed Income. His sectors of choice involve Technology, Consumer Goods, and Financials. Andre has completed a number of internships at leading investment banks domestically and abroad.