Last Updated on July 14, 2022 by Jay
Sometimes we want to calculate the difference between rows in a pandas dataframe, instead of looping through the rows, we can use the dataframe.diff() method.
For Excel users, it’s tempting to use a loop to calculate the difference between rows, because that’s how it’s done in Excel. However, pandas provides a much simpler solution.
We’ll use the below sample dataframe for the illustration. Imagine we have the prices for two stocks: SPY and TSLA.
import pandas
df = pd.DataFrame({'SPY':[400,405,420,450,500],
                   'TSLA':[200,400,800,700,1000]},
                          index=pd.date_range("2022-04-20", "2022-04-25"))
df
	        SPY	TSLA
2022-04-18	400	200
2022-04-19	405	400
2022-04-20	420	800
2022-04-21	450	700
2022-04-22	500	1000
Pandas diff() Syntax
DataFrame.diff(periods = 1,
               axis = 0)Calculate The Difference Between Rows In Pandas Dataframe
We can calculate the daily difference of the stocks without looping through the rows. The argument periods controls the periods to shift for calculating the differences between rows. The default value is 1.
The below example calculates daily differences in stock prices. The first row is NaN because there are no previous values to calculate. Starting from the second row, it basically takes values from the second row of the original dataframe, and subtracts the values from the first row of the original dataframe. E.g. 405-400 =5, and 400-200=200.
df.diff()
             SPY   TSLA
2022-04-18   NaN    NaN
2022-04-19   5.0  200.0
2022-04-20  15.0  400.0
2022-04-21  30.0 -100.0
2022-04-22  50.0  300.0For the same reasoning, we can calculate the difference between rows backward by setting the periods to a negative number. This is quite convenient as we don’t have to reverse the order of the data.
df.diff(periods = -1)
             SPY   TSLA
2022-04-18  -5.0 -200.0
2022-04-19 -15.0 -400.0
2022-04-20 -30.0  100.0
2022-04-21 -50.0 -300.0
2022-04-22   NaN    NaNWe can also calculate the difference between non-consecutive rows by setting the periods to a number other than 1.
df.diff(periods = 2)
             SPY   TSLA
2022-04-18   NaN    NaN
2022-04-19   NaN    NaN
2022-04-20  20.0  600.0
2022-04-21  45.0  300.0
2022-04-22  80.0  200.0To help visualize the above example, we can first shift() the columns down by 2 rows, and then perform a subtraction. The below code replicates the diff(periods=2) above.
df[['SPY_2','TSLA_2']] = df[['SPY','TSLA']].shift(2)
            SPY  TSLA  SPY_2  TSLA_2
2022-04-18  400   200    NaN     NaN
2022-04-19  405   400    NaN     NaN
2022-04-20  420   800  400.0   200.0
2022-04-21  450   700  405.0   400.0
2022-04-22  500  1000  420.0   800.0
df['SPY_diff'] = df['SPY'] - df['SPY_2']
df['TSLA_diff'] = df['TSLA'] - df['TSLA_2']
            SPY  TSLA  SPY_2  TSLA_2  SPY_diff  TSLA_diff
2022-04-18  400   200    NaN     NaN       NaN        NaN
2022-04-19  405   400    NaN     NaN       NaN        NaN
2022-04-20  420   800  400.0   200.0      20.0      600.0
2022-04-21  450   700  405.0   400.0      45.0      300.0
2022-04-22  500  1000  420.0   800.0      80.0      200.0
Calculate The Difference Between Columns
It’s also possible to calculate the difference between columns of a dataframe by setting the axis argument to 1 (or ‘columns’). The axis argument in pandas usually has a default value of 0 (i.e. rows).
df.diff(periods =1, axis = 1)
	        SPY	TSLA
2022-04-18	NaN	-200
2022-04-19	NaN	-5
2022-04-20	NaN	380
2022-04-21	NaN	250
2022-04-22	NaN	500