# Create calculated columns in a dataframe

Sharing is caring!

Last Updated on July 14, 2022 by Jay

In Excel, we can create a calculated column by first write a formula (in a cell), then drag down the column. In PowerQuery, you can also add “Custom Column” and input a formula. In Python, we create calculated columns very much like the way in PQ – we create a column and the calculation will apply to the whole column instead of the going row by row like the “drag down” approach in Excel. To create a calculated column, we basically 1. create a column, and 2) assign a calculation to it.

This tutorial is part of the “Integrate Python with Excel” series, you can find the table of content here for easier navigation.

## Prepare a dataframe for demo

We’ll once again use the SP500 company list for this tutorial. Let’s load up the dataset from Wikipedia:

import pandas as pd

Below is what the table looks like:

Next, let’s do some data manipulations! We have the HQ location, Date first added (to SP500 list), and the Founded Yr information. We are going to create four additional calculated columns to show the following:

City, State, Number of years on SP500, Age of company (since founding).

## Key to creating calculated columns in pandas

Coming from an Excel and VBA background, we are tempted to loop through everything in a column. Meaning that we create a formula for one cell, then drag down. This is NOT how you do things in Python. The correct way to do calculation is similar to PowerQuery, where we perform operations on the entire column instead of loop through each row. One takeaway from this tutorial is that we never loop through a column in pandas, we perform operations on the entire column instead. This is known as “vectorized” operation. PowerQuery appears to do this, but just not as efficient as Python.

## String operation in a pandas dataframe

Let’s look at the following example to split the city and state names from the HQ location column. df['Headquarters Location'] is a pandas Series, which is kind of like a column in Excel or PowerQuery. df['Headquarters Location'].str is the string values inside the column, which means we can use string methods directly on it. By doing operations this way, we are not looping through rows one by one.

<class 'pandas.core.series.Series'>
<class 'pandas.core.strings.StringMethods'>

## Datetime operation on a pandas dataframe

Next, let’s calculate the number of years the company has been on the SP500 list. We can use the column “Date first added” to derive this. First we need to know what type of data is stored in that column. We can find out by checking the first item in the column.

'1976-08-09'

So clearly, the column contains string data. It also appears that some companies have two dates stored in that column (e..g AT&T). For the demo purpose, we’ll use the first date and simply ignore the 2nd date.

Let’s convert the column into a datetime object, which is the standard data type for date and time in Python. Remember that we should never loop each row to perform a calculation. pandas actually provides a convenient way to convert string values into datetime data type.

Once the column is in datetime data type, calculating time duration becomes easy. We are going to import datetime library to handle date and time.

import datetime
today = datetime.datetime.today()
days = today - df['Date first added']
df['Yrs on SP500'] = days.dt.days / 365

A lot is happening here so I’ll explain line by line.

days is a pandas series that contains the number of days from “Date first added” to today. If you check the type, it says timedelta:

>>> days
0     16087 days 22:15:20.643337
1     20601 days 22:15:20.643337
2      2794 days 22:15:20.643337
3       817 days 22:15:20.643337
4      3338 days 22:15:20.643337
...
500    8359 days 22:15:20.643337
501     246 days 22:15:20.643337
502    6958 days 22:15:20.643337
503    7004 days 22:15:20.643337
504    2622 days 22:15:20.643337
Name: Date first added, Length: 505, dtype: timedelta64[ns]

Timedelta is a subclass of datetime. Similar to the .str that we just saw, pandas also has a .dt to return a columne of datetime objects. Therefore, days.dt.days simply returns the integer value of days from the timedelta object. Then divide the numbers by 365, we get a column for number of years.

## Handle NAN or Null values in a dataframe

When a cell is empty, pandas will automatically assign an NAN value to it. We need to take care of these values first because in most cases, pandas doesn’t know how to handle them. We can use .fillna() method to replace the NAN values with anything we want. For demonstration purpose, I’m just replacing NAN values with a string value of “0”.

>>> df.loc[df['Symbol'] =='XRAY']

Symbol         Security SEC filings  ... Founded
141   XRAY  Dentsply Sirona     reports  ...     NaN

>>> df['Founded'].fillna("0",inplace=True)

>>> df.loc[df['Symbol'] =='XRAY']

Symbol         Security SEC filings  ... Founded
141   XRAY  Dentsply Sirona     reports  ...       0

## Data type conversion & simple arithmetic operation on a pandas dataframe

Finally, we’ll use the Founded column to calculate the age of the companies. Similarly, some companies have two dates, we’ll keep only the first one.

df['Founded Yr'] = df['Founded'].str.split('(', expand = True)[0]
df['Founded Yr'] = df['Founded Yr'].str.split('/', expand = True)[0]

Without checking the data type, I know this newly created column contains string data, because the .split() method will return a string. So we’ll have to convert the string into numbers first. Since we are in year 2020, we’ll just use that to estimate the age of company by subtracting each “Founded Yr” from 2020.

df['Founded Yr'] = df['Founded Yr'].astype(int)
df['Age of company'] = 2020 - df['Founded Yr']