Last Updated on July 14, 2022 by Jay
Sometimes we might want to interpolate and fill missing data as opposed to dropping them, and the pandas library offers a convenient way to do so.
One of the reasons that Python is a great language for doing data analysis is probably because of the pandas library, which makes data wrangling and analysis super easy.
Library
To install pandas, open up a command prompt window and type:
pip install pandas
Syntax
The full syntax is:
pandas.DataFrame.interpolate(method=’linear’, axis=0, limit=None, inplace=False, limit_direct=None, limit_area=None, downcast=None, **kwargs)
However, we won’t need to use all the arguments except for edge cases. Therefore, in most scenarios, the above syntax can be reduced to:
pandas.DataFrame.interpolate(method=’linear’, axis=0, inplace=False, limit_direct=None)
- method – the interpolation technique, can be one of (‘linear’, ‘time’, ‘index’, ‘values’, ‘pad’, ‘nearest’, ‘zero’, ‘slinear’, ‘quadratic’, ‘cubic’, ‘spline’, ‘barycentric’, ‘polynomial’, ‘korgh’, ‘piecewise_polynomial,’ pchip’, ‘akima’, ‘cubicspline’, ‘from_derivaties’)
- axis – 0 for rows and 1 for columns
- inplace – True: overwrite data inplace, False: do not overwrite data
- limit_direction – can be one of (‘forward’, ‘backward’, ‘both’)
Use Pandas to Interpolate missing data
We’ll use a simple dataset below to show how to use the dataframe.interpolate() method.
import numpy as np
import pandas as pd
df = pd.DataFrame({'col1':[1,3,np.nan,6,8,10, np.nan],
'col2':[np.nan, 0,1,2,3,4, np.nan]
'col3':[0,1,np.nan, np.nan,10,1,2]},
index=['2022-01-01', '2022-01-15', '2022-02-01', '2022-03-01', '2022-04-01', '2022-05-01', '2022-08-01'])
df.index = pd.to_datetime(df.index)
df
col1 col2 col3
2022-01-01 1.0 NaN 0.0
2022-01-15 3.0 0.0 1.0
2022-02-01 NaN 1.0 NaN
2022-03-01 6.0 2.0 NaN
2022-04-01 8.0 3.0 10.0
2022-05-01 10.0 NaN 1.0
2022-08-01 NaN NaN 2.0
Linear Interpolation
The ‘linear’ is probably the most commonly used interpolation method, which ignores the index and treats all data points as equally spaced.
Note that below we are using all default arguments – which is leaving all blank, and it worked pretty well!
df.interpolate()
col1 col2 col3
2022-01-01 1.0 NaN 0.0
2022-01-15 3.0 0.0 1.0
2022-02-01 4.5 1.0 4.0
2022-03-01 6.0 2.0 7.0
2022-04-01 8.0 3.0 10.0
2022-05-01 10.0 3.0 1.0
2022-08-01 10.0 3.0 2.0
The only data point it didn’t fill is the first one in col2. We can handle this by setting limit_direction=’both’, which will fill missing data both forward and backward.
df.interpolate(limit_direction='both')
col1 col2 col3
2022-01-01 1.0 0.0 0.0
2022-01-15 3.0 0.0 1.0
2022-02-01 4.5 1.0 4.0
2022-03-01 6.0 2.0 7.0
2022-04-01 8.0 3.0 10.0
2022-05-01 10.0 3.0 1.0
2022-08-01 10.0 3.0 2.0
Time Based Interpolation
Since we have a DateTime index, we can use the ‘time’ interpolation method which interpolates based on the length of time intervals.
df.interpolate(method='time',limit_direction='both')
col1 col2 col3
2022-01-01 1.000000 0.0 0.000000
2022-01-15 3.000000 0.0 1.000000
2022-02-01 4.133333 1.0 3.013158
2022-03-01 6.000000 2.0 6.328947
2022-04-01 8.000000 3.0 10.000000
2022-05-01 10.000000 3.0 1.000000
2022-08-01 10.000000 3.0 2.000000
Let’s take a look at the interpolated value in col1 to verify the results:
- Days between 1-15 to 2-01: 17 days
- Total interval: 17+28 = 45 days
- Interpolated value at 2022-02-01: 3 + (17/45) * (6-3) = 4.13333
Similarity With The fillna() Method
A few of the interpolation methods work the same as the fillna() method. For example the methods ‘zero’, ‘nearest’, ‘pad’, etc. Although these methods are interchangeable, it’s probably better to just use:
- fillna() if your goal is to just fill the missing values
- interpolate() if your goal is to interplate missing values