Quantification#

import numpy as np
import pandas as pd
from datetime import datetime
import matplotlib.pyplot as plt

import sys
sys.path.append("../../")

# import our modules
from news import fetch_news_headline
from profit import calc_profit

Load Data#

stocks: pd.DataFrame = pd.read_csv("../../data/stocks.csv", index_col=0, parse_dates=True)
company = "TSLA"
stock = stocks.query(f"Company == '{company}'").drop(columns=["Company", "Sector"])
stock

	Open	High	Low	Close	Volume
Date
2017-11-02	20.008667	20.579332	19.508667	19.950666	296871000
2017-11-03	19.966667	20.416668	19.675333	20.406000	133410000
2017-11-06	20.466667	20.500000	19.934000	20.185333	97290000
2017-11-07	20.068001	20.433332	20.002001	20.403334	79414500
2017-11-08	20.366667	20.459333	20.086666	20.292667	70879500
...	...	...	...	...	...
2022-10-26	219.399994	230.600006	218.199997	224.639999	85012500
2022-10-27	229.770004	233.809998	222.850006	225.089996	61638800
2022-10-28	225.399994	228.860001	216.350006	228.520004	69152400
2022-10-31	226.190002	229.850006	221.940002	227.539993	61554300
2022-11-01	234.050003	237.399994	227.279999	227.820007	62566500

1258 rows × 5 columns

Combine Stocks and News#

Intuitively, the news information can only provide us with very vague information about the stock price. In fact, it is already good enough if today’s news can tell us whether the stock price will increase tomorrow.

# shifted percentage change: (tomorrow - today) / today
df = stock[["Close"]].pct_change().shift(-1).dropna()

# whether the stock price will increase compared with today's price
df = df > 0
df.rename(columns={"Close": "Will Go Up?"}, inplace=True)
df

	Will Go Up?
Date
2017-11-02	True
2017-11-03	False
2017-11-06	True
2017-11-07	False
2017-11-08	False
...	...
2022-10-25	True
2022-10-26	True
2022-10-27	True
2022-10-28	False
2022-10-31	True

1257 rows × 1 columns

Now, we attach the news headline on each date to the data frame:

DATABASE_FILEPATH = "../../data/news.db"

df["Headline"] = np.nan

for date in df.index:
    # correct date format
    date_str = datetime.strftime(date, "%Y-%m-%d")
    
    # fetch news headline from database
    headline = fetch_news_headline(DATABASE_FILEPATH, company, date_str)
    
    # attch news headline to data frame
    df.loc[date, "Headline"] = headline

df = df[["Headline", "Will Go Up?"]].dropna()
df

	Headline	Will Go Up?
Date
2017-11-02	Okay Elon Musk, what's gone wrong with the Tes...	True
2017-11-03	Tesla hits bumps in pursuit of mass market	False
2017-11-06	Tesla Investors May Be Losing Patience	True
2017-11-07	Bloomberg	False
2017-11-08	Elon Musk met with Erdogan to discuss Tesla's ...	False
...	...	...
2022-10-25	Update: Tesla Cuts Prices in China After Signs...	True
2022-10-26	Exclusive: Tesla faces U.S. criminal probe ove...	True
2022-10-27	From Tesla to SpaceX, what Elon Musk touches t...	True
2022-10-28	Wall Street asks if Musk can manage Twitter, T...	False
2022-10-31	Elon Musk has pulled more than 50 Tesla employ...	True

1068 rows × 2 columns

Split training and test data sets:

num_days_left_out = 90
train_df = df[:-num_days_left_out]
test_df = df[-num_days_left_out:]

Word Embedding#

Our goal is to predict whether the stock price will go up tomorrow based on today’s news. So, it can be treated as a classification problem, more precisely, a binary classification problem. That is, we want to find a mapping / decision rule:

\[ \text{News Headline (Text)} \mapsto \{\text{True}, \text{False}\} \]

But the headlines consist of text data. Hence, we need to first transform or encode them to numerical data. This procedure is called word embedding.

Usually, we need to train an additional model for word embedding, which is quite troublesome. To reduce our workload, fortunately, there is a pretrained model prodived by the module sentence_transformers that is ready to use.

from sentence_transformers import SentenceTransformer

Binary Classification Problem using SVM#

As mentioned, we want to solve a classification problem.

In terms of machine learning terminologies, each sample \((\mathbf{x}, y)\) consists of a vector \(\mathbf{x}\) representing 384 features (encoded news headline) and a binary label \(y \in \{\text{True}, \text{False}\}\).

The most popular choice is the Support Vector Machine (SVM).

from sklearn.svm import SVC

# prepare training data
X_train = embeddings
y_train = train_df["Will Go Up?"].to_numpy()

# train!
clf = SVC(C=1, kernel="rbf")
clf.fit(X_train, y_train)

SVC(C=1)

Prepare the test data follows exactly the same procedure as before:

X_test = headline_encoder.encode(test_df["Headline"].tolist())
y_test = test_df["Will Go Up?"].to_numpy()

Predict the whether price will go up or down:

y_pred = clf.predict(X_test)

ROC curve and AUC score:

from sklearn.metrics import roc_curve, roc_auc_score

fpr, tpr, thresholds = roc_curve(y_test, y_pred)
plt.plot(fpr, tpr, linewidth=2) 
plt.plot([0, 1], [0, 1], color="k", linestyle="--")
plt.grid()
plt.show()

roc_auc_score(y_test, y_pred)

0.5316077650572424

Unfortunately, the value of AUC is only slightly grater than 0.5, which means the classifier not very satisfactory.

Profit#

Decide the dates to buy and sell:

# Find buying and selling dates
buy_dates = []
sell_dates = []
is_holding_shares = False
for i, date in enumerate(test_df.index):
    will_go_up = y_pred[i]
    if is_holding_shares:
        # price will go down, sell it!
        if not will_go_up:
            sell_dates.append(date)
            is_holding_shares = False
    else:
        # price will go up, buy!
        if will_go_up:
            buy_dates.append(date)
            is_holding_shares = True

# convert to Pandas `DatetimeIndex`
buy_dates = pd.DatetimeIndex(buy_dates)
sell_dates = pd.DatetimeIndex(sell_dates)

buy_dates, sell_dates

(DatetimeIndex(['2022-06-27', '2022-07-05', '2022-07-07', '2022-07-12',
                '2022-07-14', '2022-07-19', '2022-07-26', '2022-08-01',
                '2022-08-05', '2022-08-11', '2022-09-01', '2022-09-12',
                '2022-09-14', '2022-09-20', '2022-09-26', '2022-09-29',
                '2022-10-07', '2022-10-11', '2022-10-25'],
               dtype='datetime64[ns]', freq=None),
 DatetimeIndex(['2022-07-01', '2022-07-06', '2022-07-11', '2022-07-13',
                '2022-07-15', '2022-07-21', '2022-07-27', '2022-08-04',
                '2022-08-09', '2022-08-23', '2022-09-08', '2022-09-13',
                '2022-09-19', '2022-09-22', '2022-09-27', '2022-10-03',
                '2022-10-10', '2022-10-19', '2022-10-27'],
               dtype='datetime64[ns]', freq=None))

How much can we earn if the strategy is entirely news-driven?

start_date = test_df.index[0]
profit_rate = calc_profit(stock, buy_dates, sell_dates, start_date)
print(f"Profit rate: {profit_rate:.2%}")

Profit rate: 3.25%

The profit rate is only 3.25%, which lower than both SMA and prediction-based (LSTM) strategies.

Conclusion#

As we can see, it is not adequate to predict whether the stock price will go up or down tomorrow only based on the news. More delicate techniques are required in future works!

Making Money in Stocks

Quantification

Contents

Quantification#

Load Data#

Combine Stocks and News#

Word Embedding#

Binary Classification Problem using SVM#

Profit#

Conclusion#