Last Updated on May 8, 2022 by Jay
This tutorial will show you how to find the closest value to a given input in pandas. Sometimes we try to filter a dataframe with a value, and the value doesn’t exist. Then we’ll receive an empty dataframe. Instead, we are interested in finding the closest value to that input in the dataframe.
Here’s a simple dataset we’ll use to demonstrate the technique. We have the SPY stock (imaginary) prices for 5 days.
import pandas as pd
df = pd.DataFrame({'SPY':[355,390,360,380,450]}, index=pd.date_range('2022-05-07', '2022-05-11'))
df
SPY
2022-05-07 355
2022-05-08 390
2022-05-09 360
2022-05-10 380
2022-05-11 450
Find The Closest Value To An Input In Pandas
Let’s say we want to find the row that’s closest to a price of $386.
By eyeballing the dataframe, we notice two values are close to the given input: 390 and 380. Obviously, the final answer is 390 since it’s closer to 386 than 380 is.
In this situation, we can’t use filters like greater than “>” or less than “<” since we don’t know whether the matching value is above or below the given input 386.
The Process
- Calculate the differences between each value and the input
- Make sure to take the absolute value on the differences to help with ranking, since there could be both positive and negative numbers
- Rank the result from step 2 above, the record with the lowest absolute difference is the record that’s closest to the input value
The code below shows results up to step 2:
find_it = 386
(df['SPY'] - find_it).abs()
2022-05-07 31
2022-05-08 4
2022-05-09 26
2022-05-10 6
2022-05-11 64
Next, what we could do (but not going to do) is use sort_values() on the dataframe, then find the first (lowest) entry. But there’s a better way to do this.
pandas argsort() method
Introducing the argsort() pandas method, which returns the integer indices that would sort the values. For example:
(df['SPY'] - find_it).abs().argsort()
2022-05-07 1
2022-05-08 3
2022-05-09 2
2022-05-10 0
2022-05-11 4
This might look a bit confusing at first, especially when you look at the ranking with the dates column. The following graph will help you understand:
- On the right-hand side, the original dataframe (or the absolute difference dataframe, since their index is the same) has a number index 0,1,2,3,4.
- On the left-hand side, ignore the index/date column. The argsort() returns the number index in an order
- If we apply this order to the original dataframe, as the below lines suggest, then we can sort the dataframe
- Value 4 (2022-05-08) row should go to the first place
- Value 6 (2022-05-10) row should go to the second place
- …
- And Value 64 (2022-05-11) row should go to the last place.
We can then use the .iloc[] property to re-organize the dataframe:
df.iloc[(df['SPY'] - find_it).abs().argsort(),:]
SPY
2022-05-08 390
2022-05-10 380
2022-05-09 360
2022-05-07 355
2022-05-11 450
If we just want to get the nearest value instead of the whole sorted dataframe:
df.iloc[(df['SPY'] - find_it).abs().argsort()[0],:]
SPY 390
Name: 2022-05-08 00:00:00, dtype: int64