Last Updated on February 22, 2022 by Jay
In this article, I will walk through how I built a Bitcoin price prediction in Python using stock to flow model and linear regression. With this powerful model, we can predict the Bitcoin price at the next halving in 2024. I’m going to walk through the prediction. Feel free to follow along and try it yourself.
I will try to make this tutorial easy to follow while covering everything you need to know or understand, so bear with me if I miss any details. If you want me to clarify anything, leave a comment below, and I’ll try my best to answer your questions.
Note this is not financial advice. I’m only sharing for educational purposes.
This exercise was inspired by PlanB, who wrote a blog post about Bitcoin price. I wanted to see if I could replicate his result, so I’m documenting everything I’ve done in this post.
Credits to PlanB for sharing his knowledge & method for the prediction.
Dataset
We are going to use a dataset that’s downloaded directly from the Bitcoin blockchain. The data file is about 400MB in size. You can learn how to download it from this tutorial. Do not download the blockchain from a single source if you want authentic data as I suggested in the blog post. Therefore I also don’t be providing the original data here. However, I’m providing the final dataset used for the linear regression model to help you try the code and follow along easier.
Libraries
- sqlite3 – extract data from the SQLite database
- pandas – transform and data wrangling
- sklearn – a machine learning library, we’ll use the linear regression from here
- matplotlib – for visualizing the data points
Bitcoin Stock To Flow Model
Below is a summary of the stock to flow model:
- Scarcity can be quantified by SF (stock to flow). Precious metal like gold or silver can also be modelled using SF.
- SF = stock / flow. Stock = the size of the existing stockpiles or reserves. Flow = yearly production amount.
- SF simply means: How many years does it take to produce the same amount of current stockpile, if everything stays constant.
- The higher the SF ratio, the more expensive the asset. For reference, gold has an SF of 62 and $8.4 Trillion market cap; silver has an SF of 22 and $308 Billion market cap.
Okay, enough of theories! Sit tight, we are about to begin the journey!
Getting Blockchain Data
If you followed my previous tutorial, now you should have all the blockchain data stored in a local SQLite database, we’ll read the entire blockchain into Python. Note that for efficiency reasons, we could have selected just the columns required instead of everything. Limiting the data columns also helps if your computer is slow.
import pandas as pd
import sqlite3
conn = sqlite3.connect(r'C:\Users\jay\Desktop\PythonInOffice\bitcoin_price_prediction\bitcoin_blockchain.db')
df = pd.read_sql('select * from blockchain', conn)
df.shape
(723691, 30)
It shows that there are 723,691 blocks in total when I downloaded the Bitcoin blockchain data. It contains lots of information, but for our project, we just need the time
field to figure out when a block was created. The time format is a “Unix timestamp”, which we’ll convert into a human-readable format using the pd.to_datetime()
method, note the unit ='s'
argument informs the function that input data is Unix timestamp. We see that the first block was created on 2009-01-09, which sounds right!
>>> df['time'].head()
0 1231469665
1 1231469744
2 1231470173
3 1231470988
4 1231471428
>>> df['GMT'] = pd.to_datatime(df['time'].head(),unit='s')
>>> df['GMT'].head()
0 2009-01-09 02:54:25
1 2009-01-09 02:55:44
2 2009-01-09 03:02:53
3 2009-01-09 03:16:28
4 2009-01-09 03:23:48
Calculate Bitcoin Rewards & Total Stock Amount
To derive the stock to flow metric for Bitcoin, we need the following:
- Stock – total bitcoin available at any given time
- Flow – annual bitcoin “production” rate
Each block contains some rewards, aka the Bitcoin. The block reward started from 50 BTC per block back in 2009, then the reward amount halves every 210,000 blocks. We can also verify this on the block explorer website. The 210,000th block has 25 BTC as a reward, and the 420,000th block has 12.5 BTC reward, and so on and so forth. At the time of writing (Feb 2022), each block gives 6.25 BTC as a reward.
The reward amount is referred to as the “subsidy” in the blockchain data. The unit of this data is “satoshi”, named after the creator of Bitcoin. 1 satoshi = 0.00000001 BTC (10^-8).
df['reward'] = df['subsidy'] / 100000000
Next, we calculate the total stockpiles by using a cumulative sum function in pandas
. As you can see, at the time of writing, the total stockpile is around ~18.96m BTC.
df['total_supply'] = df['reward'].cumsum()
Calculate Annual Flow Amount
Next, we’ll calculate the annual flow amount, which is the annual production of BTC. The approach I took was to calculate the total number of reward/subsidy for each calendar year. I guess you can also try using the trailing 12-month reward amount. We can use a dictionary comprehension to store the annual reward amount inside a dictionary called reward
.
Read here to understand a list/dictionary comprehension in Python.
df['year'] = df['GMT'].dt.year
reward = {i: df.loc[df['year'] ==i, 'reward'].sum() for i in df['year'].unique()}
>>> reward
{2009: 1624450.0, 2010: 3395900.0, 2011: 2981300.0, 2012: 2612175.0, 2013: 1585800.0, 2014: 1471600.0, 2015: 1358000.0, 2016: 1045837.5, 2017: 699075.0, 2018: 681212.5, 2019: 677887.5, 2020: 453312.5, 2021: 329281.25, 2022: 44418.75}
>>> reward[2022]= reward[2021]
Then let’s put the annual flow amount back into the dataset.
for i in df['year'].unique():
df.loc[df['year'] ==i, 'annual_flow'] = reward[i]
Calculate A Running Bitcoin Stock To Flow Ratio
At the time of writing, it is Feb 17, 2022, so we are only 1.5 months into 2022. For this analysis, I’m going to include 2022 January data and ignore Feb data due to incomplete month. I’ll be updating this post when more data points become available in the future.
There’s no halving event in 2022, so I assume the production rate (reward) amount is the same as 2021. I set the 2022 annual flow amount = 2021 flow amount.
We finally have our precious Stock-To-Flow metric!!!
>>> df['sf_ratio'] = df['total_supply']/df['annual_flow']
Calculate Bitcoin Stock To Flow Ratio At Month-End
We have another problem now – there are over 700 thousand data points (blocks)!! I’m going to use just the monthly data points for the prediction. Feel free to use daily data, I think you’ll arrive at the same conclusion.
Since we have calculated the “running total” for stockpile, we just need to locate the very last block in a given month to find the monthly metrics.
We’ll clean up the dates a little bit by getting rid of the time components and leaving only dates. Then we use df.groupby
on the date column. However, we are not ‘aggregating’ anything per se. We are trying to find the final status for each day, by using the agg
function to find the max
on the total_supply
column. Because we have running totals, the largest amount for each day must be the last status for that given day.
>>> df['GMT_date'] = df['GMT'].dt.date
>>> df_daily = df.groupby('GMT_date', as_index=False).agg({'GMT_date':'first',
'sf_ratio':'max',
'total_supply':'max'})
Next, we need a way to get the date for each month-end. For example, 2019-01-31, 2019-02-28, 2019-03-31, etc…
The built-in library calendar
has a neat method monthrange()
that can show the weekday of the first day and the number of days in a month. The below result (4,31) means: Oct 1st is a Friday, thus 5th weekday -> index position of 4. Then there are 31 days in October 2021.
>>> import calendar
>>> calendar.monthrange(2021,10)
(4, 31)
Then we can use a loop to build out the month-end days between Jan 2009 to Jan 2022, which should give us 157 valid data points.
dates = []
for y in df['year'].unique():
for m in range(1,13):
dates.append(datetime.datetime(y,m,calendar.monthrange(y,m)[1]).date())
With month-end dates available, we can locate the Stock and Flow at each month-end… By now we should have the bitcoin stock to flow ratio, next we’ll obtain the bitcoin market cap value.
final_df = df.loc[df['GMT_date'].isin(dates)]
Historical Bitcoin Price And Market Cap
Moving on to get historical prices for Bitcoin. The source with the most complete data I could find is https://ca.investing.com/indices/investing.com-btc-usd-historical-data. The historical price data starts from July 2010. So we’ll have to do some manual adjustments for the missing data. Below is a quote from PlanB’s article:
1$ for 1309 BTC Oct 2009, first quote of $0.003 on BitcoinMarket Mar 2010, 2 pizza’s worth $41 for 10,000 BTC May 2010)
We add these 3 data points then interpolate the rest. That’s two very expensive Pizza by the way. At today’s price, the 2 pizzas are 400 million USD!!
The market capitalization is fairly simple to calculate, which is BTC price * BTC Stock amount at each point in time.
I have prepared the monthly data and uploaded it here, feel free to grab it if you are interested.
df_ = pd.read_excel('btc_price_sf_ratio.xlsx')
Bitcoin Price Prediction Model With Python
We’ll need matplotlib
for some simple visualization, and sklearn
for the linear regression model.
pip install matplotlib
pip install numpy
pip install sklearn
Let’s first visualize the data. We see that the data is quite clustered, but overall there’s an upward trend.
import matplotlib.pyplot as plt
X = df_['sf_ratio'].values.reshape(-1,1)
y = df_['market_cap']
plt.scatter(X,y)
Given that the market cap spans many orders of magnitude, let’s log the market cap and plot again.
Something interesting begins to show… Let’s also try to log the x-axis, i.e. the Stock-To-Flow values, and re-plot.
That was magical! Now we have a near-perfect straight line trending upward. This graph suggests that BTC market capitalization is likely to be correlated with the Stock-To-Flow ratio. The higher the SF Ratio, the higher the market cap.
Is this true? Let’s throw the data into a linear regression model to find out.
Building A Linear Regression Model
It’s easy to train a linear regression model using scikit-learn. In the code below we can chain the model creation and training together for efficiency: LinearRegression().fit(x,y).
Essentially a linear regression model tries to find a line that best fits all the data points, such that the line can represent a general pattern shown by the data. For example, the red line below can represent a trend demonstrated by all the blue data points.
NOTE – a linear model tends to assume that x and y variables have a linear relationship. This is an assumption we and the model are making, we don’t know if it’s true or not.
Because this is a straight line, we can represent it using the linear equation y = b + wx. The linear regression model solves the weights or coefficients that satisfy the above equation.
- b is called “intercept” as it’s the point the linear equation crosses the y-axis.
- w is called the coefficient, or simply weight. This is also the slope of the line.
- x is the input variables, in our case the Stock to flow ratios. Note this is a vector/array of values.
- y is the output variable, the market cap. Note this is a vector/array of values.
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import f_regression
import matplotlib.pyplot as plt
import numpy as np
#assign input and output variables to X and y
X = df_['sf_ratio'].values.reshape(-1,1)
y = df_['market_cap']
#transform the variables using log function
y_log = np.log(y)
x_log = np.log(X)
linreg = LinearRegression().fit(x_log ,y_log)
Linear Regression Model Evaluation
To evaluate how well a model fits given data, we can use a metric called “r-square”, or “coefficient of determination”. In regression, the r-square measures how well the model prediction matches the real data points. Generally, r-square = 1 means a perfect fit, and a close to 0 r-square means a bad model prediction/fit.
linreg.score(x_log ,y_log )
The 0.947 r-square suggests that the linear model we just fitted (or “trained”) is a good fit. Therefore it should in theory provides good predictions.
Let’s create a function to train the model, plot the result then output some metrics.
def btc_fit_plot(x_, y_):
linreg = LinearRegression().fit(x_,y_)
y_pred = linreg.predict(x_)
fig,ax = plt.subplots()
plt.scatter(x_,y_, marker = 'o')
plt.plot(x_, y_pred, color = 'red')
ax.set_title('BTC market cap vs S_F ratio')
ax.set_xlabel('S_F ratio')
ax.set_ylabel('BTC Market Cap')
[ax.spines[loc].set_visible(False) for loc in ['top', 'right']]
ax.tick_params(left= False,bottom= False)
def summary():
explained_variance=metrics.explained_variance_score(y_, y_pred)
mean_absolute_error=metrics.mean_absolute_error(y_, y_pred)
mse=metrics.mean_squared_error(y_, y_pred)
mean_squared_log_error=metrics.mean_squared_log_error(y_, y_pred)
median_absolute_error=metrics.median_absolute_error(y_, y_pred)
r2=metrics.r2_score(y_, y_pred)
f_stat, p_f_stats = f_regression(x_, y_)
print(f'Model coefficient: {linreg.coef_}')
print(f'Model intercept: {linreg.intercept_}')
print('explained_variance: ', round(explained_variance,4))
print('mean_squared_log_error: ', round(mean_squared_log_error,4))
print('r2: ', round(r2,4))
print(f'F-stats: {f_stat}')
print(f'prob(F-stats): {p_f_stats}')
print('MAE: ', round(mean_absolute_error,4))
print('MSE: ', round(mse,4))
print('RMSE: ', round(np.sqrt(mse),4))
summary()
return linreg
Additional Metrics
I also wanted to see other metrics such as the p-value. However, I didn’t find a way to calculate that with scikit-learn except coding the actual calculation myself. At the same time, there’s another excellent library statsmodels that’s capable of producing p-values with ease. So I opted for that just for the p-values.
import statsmodels.api as sm
x_log_1 = sm.add_constant(x_log)
model=sm.OLS(y_log,x_log_1)
results = model.fit()
results.summary()
As you can see above, this is pretty much the same results as produced by scikit-learn, however with additional metrics such as the p-values, etc.
The p-value
So why am I going through all these troubles just to get a p-value? In statistics, we try to find a relationship between the input (x) and output (y) variables. So in our case, we hypothesize some correlation between the Bitcoin S_F ratio (x) and market cap (y). At the same time, there’s a “Null hypothesis” which is the opposite of our hypothesis – meaning there’s no correlation between S_F ratio and market cap.
This is called hypothesis testing. A p-value is used to either support or reject the null hypothesis. Usually, we use p-value = 0.05 or 5% as a threshold to determine whether to reject or accept the null hypothesis.
- Small p-values means that we have more evidence to REJECT the null hypothesis. In other words, our hypothesis (there is correlation between S_F ratio and market cap) is accepted.
- Big p-values means that we have more evidence to ACCEPT the null hypothesis. Or there is no correlation betwen S_F ratio and market cap in our case.
As we can see from the statsmodels summary, the p-values are close to 0, which means we should accept our alternative hypothesis, i.e. there’s a significant relationship between Bitcoin S_F ratio and market cap.
Conclustion
If this has been confusing to you, don’t worry. Let me put this in plain English: Stock to flow Ratio is a driving factor for the Bitcoin market cap. The larger the Stock-To-Flow Ratio, the higher the Bitcoin market cap.
NOTE, I just wanted to share my findings. This is NOT financial advice!