Last Updated on July 14, 2022 by Jay
Sometimes we want to calculate the difference between rows in a pandas dataframe, instead of looping through the rows, we can use the dataframe.diff() method.
For Excel users, it’s tempting to use a loop to calculate the difference between rows, because that’s how it’s done in Excel. However, pandas provides a much simpler solution.
We’ll use the below sample dataframe for the illustration. Imagine we have the prices for two stocks: SPY and TSLA.
import pandas
df = pd.DataFrame({'SPY':[400,405,420,450,500],
'TSLA':[200,400,800,700,1000]},
index=pd.date_range("2022-04-20", "2022-04-25"))
df
SPY TSLA
2022-04-18 400 200
2022-04-19 405 400
2022-04-20 420 800
2022-04-21 450 700
2022-04-22 500 1000
Pandas diff() Syntax
DataFrame.diff(periods = 1,
axis = 0)
Calculate The Difference Between Rows In Pandas Dataframe
We can calculate the daily difference of the stocks without looping through the rows. The argument periods controls the periods to shift for calculating the differences between rows. The default value is 1.
The below example calculates daily differences in stock prices. The first row is NaN because there are no previous values to calculate. Starting from the second row, it basically takes values from the second row of the original dataframe, and subtracts the values from the first row of the original dataframe. E.g. 405-400 =5, and 400-200=200.
df.diff()
SPY TSLA
2022-04-18 NaN NaN
2022-04-19 5.0 200.0
2022-04-20 15.0 400.0
2022-04-21 30.0 -100.0
2022-04-22 50.0 300.0
For the same reasoning, we can calculate the difference between rows backward by setting the periods to a negative number. This is quite convenient as we don’t have to reverse the order of the data.
df.diff(periods = -1)
SPY TSLA
2022-04-18 -5.0 -200.0
2022-04-19 -15.0 -400.0
2022-04-20 -30.0 100.0
2022-04-21 -50.0 -300.0
2022-04-22 NaN NaN
We can also calculate the difference between non-consecutive rows by setting the periods to a number other than 1.
df.diff(periods = 2)
SPY TSLA
2022-04-18 NaN NaN
2022-04-19 NaN NaN
2022-04-20 20.0 600.0
2022-04-21 45.0 300.0
2022-04-22 80.0 200.0
To help visualize the above example, we can first shift() the columns down by 2 rows, and then perform a subtraction. The below code replicates the diff(periods=2) above.
df[['SPY_2','TSLA_2']] = df[['SPY','TSLA']].shift(2)
SPY TSLA SPY_2 TSLA_2
2022-04-18 400 200 NaN NaN
2022-04-19 405 400 NaN NaN
2022-04-20 420 800 400.0 200.0
2022-04-21 450 700 405.0 400.0
2022-04-22 500 1000 420.0 800.0
df['SPY_diff'] = df['SPY'] - df['SPY_2']
df['TSLA_diff'] = df['TSLA'] - df['TSLA_2']
SPY TSLA SPY_2 TSLA_2 SPY_diff TSLA_diff
2022-04-18 400 200 NaN NaN NaN NaN
2022-04-19 405 400 NaN NaN NaN NaN
2022-04-20 420 800 400.0 200.0 20.0 600.0
2022-04-21 450 700 405.0 400.0 45.0 300.0
2022-04-22 500 1000 420.0 800.0 80.0 200.0
Calculate The Difference Between Columns
It’s also possible to calculate the difference between columns of a dataframe by setting the axis argument to 1 (or ‘columns’). The axis argument in pandas usually has a default value of 0 (i.e. rows).
df.diff(periods =1, axis = 1)
SPY TSLA
2022-04-18 NaN -200
2022-04-19 NaN -5
2022-04-20 NaN 380
2022-04-21 NaN 250
2022-04-22 NaN 500