Use Pandas to Interpolate Missing Values

Sharing is caring!

Last Updated on July 14, 2022 by Jay

Sometimes we might want to interpolate and fill missing data as opposed to dropping them, and the pandas library offers a convenient way to do so.

One of the reasons that Python is a great language for doing data analysis is probably because of the pandas library, which makes data wrangling and analysis super easy.

Library

To install pandas, open up a command prompt window and type:

pip install pandas

Syntax

The full syntax is:

pandas.DataFrame.interpolate(method=’linear’, axis=0, limit=None, inplace=False, limit_direct=None, limit_area=None, downcast=None, **kwargs)

However, we won’t need to use all the arguments except for edge cases. Therefore, in most scenarios, the above syntax can be reduced to:

pandas.DataFrame.interpolate(method=’linear’, axis=0, inplace=False, limit_direct=None)

  • method – the interpolation technique, can be one of (‘linear’, ‘time’, ‘index’, ‘values’, ‘pad’, ‘nearest’, ‘zero’, ‘slinear’, ‘quadratic’, ‘cubic’, ‘spline’, ‘barycentric’, ‘polynomial’, ‘korgh’, ‘piecewise_polynomial,’ pchip’, ‘akima’, ‘cubicspline’, ‘from_derivaties’)
  • axis – 0 for rows and 1 for columns
  • inplace – True: overwrite data inplace, False: do not overwrite data
  • limit_direction – can be one of (‘forward’, ‘backward’, ‘both’)

Use Pandas to Interpolate missing data

We’ll use a simple dataset below to show how to use the dataframe.interpolate() method.

import numpy as np
import pandas as pd
df = pd.DataFrame({'col1':[1,3,np.nan,6,8,10, np.nan],
                   'col2':[np.nan, 0,1,2,3,4, np.nan]
                   'col3':[0,1,np.nan, np.nan,10,1,2]},
                 index=['2022-01-01', '2022-01-15', '2022-02-01', '2022-03-01', '2022-04-01', '2022-05-01', '2022-08-01'])
df.index = pd.to_datetime(df.index)

df
            col1  col2  col3
2022-01-01   1.0   NaN   0.0
2022-01-15   3.0   0.0   1.0
2022-02-01   NaN   1.0   NaN
2022-03-01   6.0   2.0   NaN
2022-04-01   8.0   3.0  10.0
2022-05-01  10.0   NaN   1.0
2022-08-01   NaN   NaN   2.0

Linear Interpolation

The ‘linear’ is probably the most commonly used interpolation method, which ignores the index and treats all data points as equally spaced.

Note that below we are using all default arguments – which is leaving all blank, and it worked pretty well!

df.interpolate()
            col1  col2  col3
2022-01-01   1.0   NaN   0.0
2022-01-15   3.0   0.0   1.0
2022-02-01   4.5   1.0   4.0
2022-03-01   6.0   2.0   7.0
2022-04-01   8.0   3.0  10.0
2022-05-01  10.0   3.0   1.0
2022-08-01  10.0   3.0   2.0

The only data point it didn’t fill is the first one in col2. We can handle this by setting limit_direction=’both’, which will fill missing data both forward and backward.

df.interpolate(limit_direction='both')
            col1  col2  col3
2022-01-01   1.0   0.0   0.0
2022-01-15   3.0   0.0   1.0
2022-02-01   4.5   1.0   4.0
2022-03-01   6.0   2.0   7.0
2022-04-01   8.0   3.0  10.0
2022-05-01  10.0   3.0   1.0
2022-08-01  10.0   3.0   2.0

Time Based Interpolation

Since we have a DateTime index, we can use the ‘time’ interpolation method which interpolates based on the length of time intervals.

df.interpolate(method='time',limit_direction='both')
                 col1  col2       col3
2022-01-01   1.000000   0.0   0.000000
2022-01-15   3.000000   0.0   1.000000
2022-02-01   4.133333   1.0   3.013158
2022-03-01   6.000000   2.0   6.328947
2022-04-01   8.000000   3.0  10.000000
2022-05-01  10.000000   3.0   1.000000
2022-08-01  10.000000   3.0   2.000000

Let’s take a look at the interpolated value in col1 to verify the results:

  • Days between 1-15 to 2-01: 17 days
  • Total interval: 17+28 = 45 days
  • Interpolated value at 2022-02-01: 3 + (17/45) * (6-3) = 4.13333

Similarity With The fillna() Method

A few of the interpolation methods work the same as the fillna() method. For example the methods ‘zero’, ‘nearest’, ‘pad’, etc. Although these methods are interchangeable, it’s probably better to just use:

  • fillna() if your goal is to just fill the missing values
  • interpolate() if your goal is to interplate missing values

Additional Resources

Pandas cheat sheet

Integrate Python with Excel – from zero to hero

Leave a Reply

Your email address will not be published. Required fields are marked *