Quantification#

import numpy as np
import pandas as pd
from datetime import datetime
import matplotlib.pyplot as plt

import sys
sys.path.append("../../")

# import our modules
from news import fetch_news_headline
from profit import calc_profit

Load Data#

stocks: pd.DataFrame = pd.read_csv("../../data/stocks.csv", index_col=0, parse_dates=True)
company = "TSLA"
stock = stocks.query(f"Company == '{company}'").drop(columns=["Company", "Sector"])
stock
Open High Low Close Volume
Date
2017-11-02 20.008667 20.579332 19.508667 19.950666 296871000
2017-11-03 19.966667 20.416668 19.675333 20.406000 133410000
2017-11-06 20.466667 20.500000 19.934000 20.185333 97290000
2017-11-07 20.068001 20.433332 20.002001 20.403334 79414500
2017-11-08 20.366667 20.459333 20.086666 20.292667 70879500
... ... ... ... ... ...
2022-10-26 219.399994 230.600006 218.199997 224.639999 85012500
2022-10-27 229.770004 233.809998 222.850006 225.089996 61638800
2022-10-28 225.399994 228.860001 216.350006 228.520004 69152400
2022-10-31 226.190002 229.850006 221.940002 227.539993 61554300
2022-11-01 234.050003 237.399994 227.279999 227.820007 62566500

1258 rows × 5 columns

Combine Stocks and News#

Intuitively, news can only give us vague information about the stock price. In practice, it is already useful if today’s news can tell us whether the stock price will increase tomorrow.

# shifted percentage change: (tomorrow - today) / today
df = stock[["Close"]].pct_change().shift(-1).dropna()

# whether the stock price will increase compared with today's price
df = df > 0
df.rename(columns={"Close": "Will Go Up?"}, inplace=True)
df
Will Go Up?
Date
2017-11-02 True
2017-11-03 False
2017-11-06 True
2017-11-07 False
2017-11-08 False
... ...
2022-10-25 True
2022-10-26 True
2022-10-27 True
2022-10-28 False
2022-10-31 True

1257 rows × 1 columns
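To make the shifted label concrete, here is a minimal sketch on a toy series. The four prices and dates are made up for illustration; only the `pct_change().shift(-1)` pattern comes from the code above.

```python
import pandas as pd

# hypothetical closing prices for four consecutive trading days
close = pd.Series(
    [100.0, 102.0, 101.0, 105.0],
    index=pd.to_datetime(["2022-01-03", "2022-01-04", "2022-01-05", "2022-01-06"]),
    name="Close",
)

# pct_change() gives (today - yesterday) / yesterday;
# shift(-1) moves every value one row up, so row t holds (tomorrow - today) / today
next_day_return = close.pct_change().shift(-1).dropna()

# label: does the price rise on the next trading day?
will_go_up = next_day_return > 0
print(will_go_up.tolist())
```

The last day has no "tomorrow", so it is dropped. This is why the labelled frame above has one row fewer (1257) than the price table (1258).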

Now, we attach the news headline on each date to the data frame:

DATABASE_FILEPATH = "../../data/news.db"

df["Headline"] = np.nan

for date in df.index:
    # correct date format
    date_str = datetime.strftime(date, "%Y-%m-%d")
    
    # fetch news headline from database
    headline = fetch_news_headline(DATABASE_FILEPATH, company, date_str)
    
    # attach news headline to data frame
    df.loc[date, "Headline"] = headline

df = df[["Headline", "Will Go Up?"]].dropna()
df
Headline Will Go Up?
Date
2017-11-02 Okay Elon Musk, what's gone wrong with the Tes... True
2017-11-03 Tesla hits bumps in pursuit of mass market False
2017-11-06 Tesla Investors May Be Losing Patience True
2017-11-07 Bloomberg False
2017-11-08 Elon Musk met with Erdogan to discuss Tesla's ... False
... ... ...
2022-10-25 Update: Tesla Cuts Prices in China After Signs... True
2022-10-26 Exclusive: Tesla faces U.S. criminal probe ove... True
2022-10-27 From Tesla to SpaceX, what Elon Musk touches t... True
2022-10-28 Wall Street asks if Musk can manage Twitter, T... False
2022-10-31 Elon Musk has pulled more than 50 Tesla employ... True

1068 rows × 2 columns

Split the data into training and test sets, holding out the last 90 rows for testing:

num_days_left_out = 90
train_df = df[:-num_days_left_out]
test_df = df[-num_days_left_out:]
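Because the rows are ordered by date, slicing from the end gives a chronological split: the model is trained only on days that come before every test day. A small sketch with a made-up ten-row frame (the column name `x` and the dates are illustrative assumptions):

```python
import pandas as pd

# hypothetical frame indexed by ten consecutive business days
toy_df = pd.DataFrame(
    {"x": range(10)},
    index=pd.date_range("2022-01-03", periods=10, freq="B"),
)

num_left_out = 3
toy_train = toy_df[:-num_left_out]   # first 7 rows
toy_test = toy_df[-num_left_out:]    # last 3 rows

# every training date precedes every test date, so no future data leaks into training
print(len(toy_train), len(toy_test), toy_train.index.max() < toy_test.index.min())
```

Applied to the 1068 rows above with `num_days_left_out = 90`, the same slicing leaves 978 rows for training.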

Word Embedding#

Our goal is to predict whether the stock price will go up tomorrow based on today’s news. So, it can be treated as a classification problem, more precisely, a binary classification problem. That is, we want to find a mapping / decision rule:

\[ \text{News Headline (Text)} \mapsto \{\text{True}, \text{False}\} \]

But the headlines consist of text data. Hence, we need to first transform or encode them to numerical data. This procedure is called word embedding.

Usually, word embedding requires training an additional model, which is quite troublesome. Fortunately, the module sentence_transformers provides pretrained models that are ready to use, which reduces our workload.

from sentence_transformers import SentenceTransformer

See also

Check the documentation of sentence_transformers for more information.

Encode the news headlines with the model SentenceTransformer:

# load pretrained model
headline_encoder = SentenceTransformer("all-MiniLM-L6-v2")

# encode news headlines to NumPy arrays
embeddings = headline_encoder.encode(train_df["Headline"].tolist())

print(f"embeddings.shape: {embeddings.shape}")
embeddings
embeddings.shape: (978, 384)
array([[-0.00946767,  0.01169717,  0.07283923, ..., -0.07292849,
        -0.02329506,  0.11975066],
       [-0.01846967, -0.02873902,  0.03534317, ..., -0.13485262,
         0.01329796,  0.13287042],
       [ 0.02563943, -0.04956274,  0.04786101, ..., -0.07707552,
        -0.05291679,  0.0715071 ],
       ...,
       [-0.00674385,  0.0027471 ,  0.09396511, ..., -0.09142395,
         0.01973804,  0.08486167],
       [-0.07479677,  0.01877449,  0.07296403, ..., -0.08446505,
         0.02360164,  0.05853827],
       [ 0.01585246,  0.05186599,  0.07435998, ..., -0.07771911,
         0.01394665,  0.05791727]], dtype=float32)

As we can see, each headline is transformed into a vector of length 384.
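Once text is encoded as vectors, two pieces of text can be compared numerically. A common measure for embeddings of this kind is cosine similarity. The sketch below uses hand-made 3-dimensional vectors as stand-ins for real 384-dimensional embeddings; the helper name `cosine_similarity` and the vectors are illustrative assumptions, not output of the model.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors (1 = same direction, 0 = orthogonal)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# hypothetical stand-ins for headline embeddings
vec_a = np.array([1.0, 0.0, 1.0])
vec_b = np.array([2.0, 0.0, 2.0])  # same direction as vec_a, different length
vec_c = np.array([0.0, 1.0, 0.0])  # orthogonal to vec_a

print(cosine_similarity(vec_a, vec_b))
print(cosine_similarity(vec_a, vec_c))
```

The same function could be applied to any two rows of `embeddings`.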

Binary Classification Problem using SVM#

As mentioned, we want to solve a classification problem.

In machine learning terminology, each sample \((\mathbf{x}, y)\) consists of a vector \(\mathbf{x}\) representing 384 features (encoded news headline) and a binary label \(y \in \{\text{True}, \text{False}\}\).

A popular choice for this kind of problem is the Support Vector Machine (SVM).

from sklearn.svm import SVC

# prepare training data
X_train = embeddings
y_train = train_df["Will Go Up?"].to_numpy()

# train!
clf = SVC(C=1, kernel="rbf")
clf.fit(X_train, y_train)
SVC(C=1)
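For reference, a kernel SVM classifies a new point by the sign of a weighted sum of kernel evaluations against its support vectors. The formula below is the standard textbook form with the RBF kernel. It assumes the labels are encoded as \(y_i \in \{-1, +1\}\) (the code above passes booleans instead), and \(\alpha_i\), \(b\) and \(\gamma\) denote the learned weights, the bias and the kernel width:

\[ f(\mathbf{x}) = \operatorname{sign}\left( \sum_{i \in \text{SV}} \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}) + b \right), \qquad K(\mathbf{x}, \mathbf{x}') = \exp\left( -\gamma \lVert \mathbf{x} - \mathbf{x}' \rVert^2 \right) \]

The parameter `C=1` bounds the weights (\(0 \le \alpha_i \le C\)) and so controls how strongly misclassified training points are penalized.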

Preparing the test data follows exactly the same procedure as before:

X_test = headline_encoder.encode(test_df["Headline"].tolist())
y_test = test_df["Will Go Up?"].to_numpy()

Predict whether the price will go up or down:

y_pred = clf.predict(X_test)

ROC curve and AUC score:

from sklearn.metrics import roc_curve, roc_auc_score

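# note: y_pred holds hard class labels, so the curve has a single operating point;
# continuous scores such as clf.decision_function(X_test) would trace a full curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred)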
plt.plot(fpr, tpr, linewidth=2) 
plt.plot([0, 1], [0, 1], color="k", linestyle="--")
plt.grid()
plt.show()
[Figure: ROC curve of the classifier (solid line) with the diagonal reference line (dashed)]
roc_auc_score(y_test, y_pred)
0.5316077650572424

Unfortunately, the AUC is only slightly greater than 0.5, the value expected from random guessing, which means the classifier is not very satisfactory.
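The AUC can be read as the probability that a randomly chosen positive sample is scored higher than a randomly chosen negative one, with ties counted as one half. The sketch below computes it directly from that definition with NumPy. The helper `pairwise_auc`, the labels and the scores are made up for illustration. It also shows that feeding hard labels instead of continuous scores reduces the AUC to the balanced accuracy, so the two inputs generally give different values.

```python
import numpy as np

def pairwise_auc(y_true, scores) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    y_true = np.asarray(y_true, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true]
    neg = scores[~y_true]
    diff = pos[:, None] - neg[None, :]  # all positive-vs-negative score differences
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())

# hypothetical labels and classifier scores
toy_y = np.array([True, False, True, False, True, False])
toy_scores = np.array([0.9, 0.8, 0.7, 0.3, 0.6, 0.2])
toy_labels = toy_scores >= 0.5  # hard predictions at an arbitrary threshold

auc_from_scores = pairwise_auc(toy_y, toy_scores)
auc_from_labels = pairwise_auc(toy_y, toy_labels)

# balanced accuracy = (true positive rate + true negative rate) / 2
tpr = toy_labels[toy_y].mean()
tnr = (~toy_labels[~toy_y]).mean()
balanced_accuracy = (tpr + tnr) / 2

print(auc_from_scores, auc_from_labels, balanced_accuracy)
```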

Profit#

Decide the dates to buy and sell:

# Find buying and selling dates
buy_dates = []
sell_dates = []
is_holding_shares = False
for i, date in enumerate(test_df.index):
    will_go_up = y_pred[i]
    if is_holding_shares:
        # price will go down, sell it!
        if not will_go_up:
            sell_dates.append(date)
            is_holding_shares = False
    else:
        # price will go up, buy!
        if will_go_up:
            buy_dates.append(date)
            is_holding_shares = True

# convert to Pandas `DatetimeIndex`
buy_dates = pd.DatetimeIndex(buy_dates)
sell_dates = pd.DatetimeIndex(sell_dates)

buy_dates, sell_dates
(DatetimeIndex(['2022-06-27', '2022-07-05', '2022-07-07', '2022-07-12',
                '2022-07-14', '2022-07-19', '2022-07-26', '2022-08-01',
                '2022-08-05', '2022-08-11', '2022-09-01', '2022-09-12',
                '2022-09-14', '2022-09-20', '2022-09-26', '2022-09-29',
                '2022-10-07', '2022-10-11', '2022-10-25'],
               dtype='datetime64[ns]', freq=None),
 DatetimeIndex(['2022-07-01', '2022-07-06', '2022-07-11', '2022-07-13',
                '2022-07-15', '2022-07-21', '2022-07-27', '2022-08-04',
                '2022-08-09', '2022-08-23', '2022-09-08', '2022-09-13',
                '2022-09-19', '2022-09-22', '2022-09-27', '2022-10-03',
                '2022-10-10', '2022-10-19', '2022-10-27'],
               dtype='datetime64[ns]', freq=None))
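The trading rule above is a two-state machine: buy on the first "up" prediction while not holding, sell on the first "down" prediction while holding. A compact sketch of the same rule as a function, run on made-up predictions (the name `signals_to_trades` and the toy data are illustrative assumptions):

```python
def signals_to_trades(dates, predictions):
    """Turn daily up/down predictions into lists of buy and sell dates."""
    buys, sells = [], []
    holding = False
    for date, will_go_up in zip(dates, predictions):
        if will_go_up and not holding:
            buys.append(date)      # enter the position
            holding = True
        elif not will_go_up and holding:
            sells.append(date)     # exit the position
            holding = False
    return buys, sells

# hypothetical dates and predictions
toy_dates = ["d0", "d1", "d2", "d3", "d4", "d5"]
toy_pred = [True, True, False, False, True, False]

toy_buys, toy_sells = signals_to_trades(toy_dates, toy_pred)
print(toy_buys, toy_sells)
```

If the last prediction is "up" while shares are held, the final buy has no matching sell, so the two lists can differ in length by one.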

How much can we earn if the strategy is entirely news-driven?

start_date = test_df.index[0]
profit_rate = calc_profit(stock, buy_dates, sell_dates, start_date)
print(f"Profit rate: {profit_rate:.2%}")
Profit rate: 3.25%
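The implementation of `calc_profit` is not shown here. One plausible way to obtain such a rate is to compound the return of every buy/sell round trip. The sketch below does that under stated assumptions: trades execute at the closing price, there are no transaction costs, and an unmatched final buy is ignored. The helper `compound_profit_rate` and the prices are hypothetical and may differ from what `calc_profit` actually does.

```python
import pandas as pd

def compound_profit_rate(close: pd.Series, buy_dates, sell_dates) -> float:
    """Hypothetical helper: compound the close-to-close return of each round trip."""
    growth = 1.0
    # zip() stops at the shorter sequence, so a final buy without a sell is skipped
    for buy, sell in zip(buy_dates, sell_dates):
        growth *= close.loc[sell] / close.loc[buy]
    return growth - 1.0

# made-up closing prices on four business days
toy_close = pd.Series(
    [100.0, 110.0, 105.0, 100.8],
    index=pd.date_range("2022-01-03", periods=4, freq="B"),
)
toy_buy_dates = toy_close.index[[0, 2]]   # buy at 100.0 and at 105.0
toy_sell_dates = toy_close.index[[1, 3]]  # sell at 110.0 and at 100.8

toy_rate = compound_profit_rate(toy_close, toy_buy_dates, toy_sell_dates)
print(f"Profit rate: {toy_rate:.2%}")
```

Here the first trade gains 10% and the second loses 4%, which compounds to 1.10 × 0.96 − 1 = 5.6%.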

The profit rate is only 3.25%, which is lower than that of both the SMA and the prediction-based (LSTM) strategies.

Conclusion#

As we can see, the news alone is not adequate for predicting whether the stock price will go up or down tomorrow. More sophisticated techniques are needed in future work!