Last Updated on July 14, 2022 by Jay
In Excel, we can create a calculated column by first write a formula (in a cell), then drag down the column. In PowerQuery, you can also add “Custom Column” and input a formula. In Python, we create calculated columns very much like the way in PQ – we create a column and the calculation will apply to the whole column instead of the going row by row like the “drag down” approach in Excel. To create a calculated column, we basically 1. create a column, and 2) assign a calculation to it.
This tutorial is part of the “Integrate Python with Excel” series, you can find the table of content here for easier navigation.
Prepare a dataframe for demo
We’ll once again use the SP500 company list for this tutorial. Let’s load up the dataset from Wikipedia:
import pandas as pd df = pd.read_html('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies')
Below is what the table looks like:
Next, let’s do some data manipulations! We have the HQ location, Date first added (to SP500 list), and the Founded Yr information. We are going to create four additional calculated columns to show the following:
City, State, Number of years on SP500, Age of company (since founding).
Key to creating calculated columns in pandas
Coming from an Excel and VBA background, we are tempted to loop through everything in a column. Meaning that we create a formula for one cell, then drag down. This is NOT how you do things in Python. The correct way to do calculation is similar to PowerQuery, where we perform operations on the entire column instead of loop through each row. One takeaway from this tutorial is that we never loop through a column in pandas, we perform operations on the entire column instead. This is known as “vectorized” operation. PowerQuery appears to do this, but just not as efficient as Python.
String operation in a pandas dataframe
Let’s look at the following example to split the city and state names from the HQ location column.
df['Headquarters Location'] is a pandas Series, which is kind of like a column in Excel or PowerQuery.
df['Headquarters Location'].str is the string values inside the column, which means we can use string methods directly on it. By doing operations this way, we are not looping through rows one by one.
>>> type(df['Headquarters Location']) <class 'pandas.core.series.Series'> >>> type(df['Headquarters Location'].str) <class 'pandas.core.strings.StringMethods'> df['City'] = df['Headquarters Location'].str.split(',', expand=True) df['State'] = df['Headquarters Location'].str.split(',', expand=True)
Datetime operation on a pandas dataframe
Next, let’s calculate the number of years the company has been on the SP500 list. We can use the column “Date first added” to derive this. First we need to know what type of data is stored in that column. We can find out by checking the first item in the column.
df['Date first added'] '1976-08-09'
So clearly, the column contains string data. It also appears that some companies have two dates stored in that column (e..g AT&T). For the demo purpose, we’ll use the first date and simply ignore the 2nd date.
df['Date first added'] = df['Date first added'].str.split('(', expand=True)
Let’s convert the column into a
datetime object, which is the standard data type for date and time in Python. Remember that we should never loop each row to perform a calculation. pandas actually provides a convenient way to convert string values into
datetime data type.
df['Date first added'] = pd.to_datetime(df['Date first added'])
Once the column is in
datetime data type, calculating time duration becomes easy. We are going to import
datetime library to handle date and time.
import datetime today = datetime.datetime.today() days = today - df['Date first added'] df['Yrs on SP500'] = days.dt.days / 365
A lot is happening here so I’ll explain line by line.
days is a
pandas series that contains the number of days from “Date first added” to today. If you check the type, it says
>>> days 0 16087 days 22:15:20.643337 1 20601 days 22:15:20.643337 2 2794 days 22:15:20.643337 3 817 days 22:15:20.643337 4 3338 days 22:15:20.643337 ... 500 8359 days 22:15:20.643337 501 246 days 22:15:20.643337 502 6958 days 22:15:20.643337 503 7004 days 22:15:20.643337 504 2622 days 22:15:20.643337 Name: Date first added, Length: 505, dtype: timedelta64[ns]
Timedelta is a subclass of
datetime. Similar to the
.str that we just saw, pandas also has a
.dt to return a columne of
datetime objects. Therefore,
days.dt.days simply returns the integer value of days from the
timedelta object. Then divide the numbers by 365, we get a column for number of years.
Handle NAN or Null values in a dataframe
When a cell is empty, pandas will automatically assign an
NAN value to it. We need to take care of these values first because in most cases, pandas doesn’t know how to handle them. We can use
.fillna() method to replace the
NAN values with anything we want. For demonstration purpose, I’m just replacing
NAN values with a string value of “0”.
>>> df.loc[df['Symbol'] =='XRAY'] Symbol Security SEC filings ... Founded 141 XRAY Dentsply Sirona reports ... NaN >>> df['Founded'].fillna("0",inplace=True) >>> df.loc[df['Symbol'] =='XRAY'] Symbol Security SEC filings ... Founded 141 XRAY Dentsply Sirona reports ... 0
Data type conversion & simple arithmetic operation on a pandas dataframe
Finally, we’ll use the Founded column to calculate the age of the companies. Similarly, some companies have two dates, we’ll keep only the first one.
df['Founded Yr'] = df['Founded'].str.split('(', expand = True) df['Founded Yr'] = df['Founded Yr'].str.split('/', expand = True)
Without checking the data type, I know this newly created column contains string data, because the
.split() method will return a string. So we’ll have to convert the string into numbers first. Since we are in year 2020, we’ll just use that to estimate the age of company by subtracting each “Founded Yr” from 2020.
df['Founded Yr'] = df['Founded Yr'].astype(int) df['Age of company'] = 2020 - df['Founded Yr']