Last Updated on July 14, 2022 by Jay
How many times have you had to split a text in Excel? I bet too many to count. Most Excel users will have to do this at least a few times in their career. There are several ways to split text to columns in Excel, using formula, using “Text to Columns”, or using PowerQuery. All three ways are kind of inefficient and requires manual input. To save you from all the manual work, we’ll show how to split text to columns in a Python dataframe.
This tutorial is part of the “Integrate Python with Excel” series, you can find the table of content here for easier navigation.
Prepare a dataframe for demo
We’ll use a simple example for this tutorial, feel free to download the Excel file and follow along. The sample file contains two columns, a person’s name and their date of birth. Note that all birth dates are artificial and I just made up random dates for demonstration purpose.
Our task is the following:
- Split first and last names
- Split the birth dates into year, month and day
Let’s load up the data into Python. Here I purposely force the Date of Birth column to be a String, so I can show the slicing method. In reality pandas should automatically detect this column is likely a datetime and assign datetime object for it, which makes processing date data even easier. Anyways, let’s focus on how to split a text data.
import pandas as pd
df = pd.read_excel('split_text.xlsx', dtype={'Name':str,'Date of Birth':str})
Do not loop, use vectorized operation
For people coming from an Excel background and who also tends to use formula to solve this problem, our first reaction is: Okay I’m gonna create a formula probably with find() and left() or mid(), etc. Then drag down to apply it to all cells. While this is okay to do in Excel, it’s never the right thing to do in Python. The above action: 1. create a formula, and 2. drag down is called a “loop” for a programming language. When we use pandas to process data, we never loop. Instead, we use vectorized operation to achieve blazing fast speed.
The vectorized operation is kind of equivalent (on surface) to Excel’s “Text to Columns” button or PowerQuery’s “Split Column”, where we select a column and do something on that entire column. In Python, vectorized operation is the standard way to do anything with your data, because it’s hundreds times faster than looping. We’ll talk about why it’s so much faster in another post.
Once we load the Excel table into a pandas, the entire table becomes a pandas dataframe, and the column “Date of Birth” becomes a pandas series. Since we can’t loop, we’ll need a way to access the string elements inside that pandas series. This is where the .str comes into place. It basically gives access to the string elements inside a series, so we can perform regular String methods on a column.
Python String slicing
Let’s first handle the dates, since they look equally spaced out and should be easier. We can use Python String slicing to get the year, month and date. String is essentially like a tuple, and we can use the same list slicing techniques on a String. Take a look at the following example.
>>> today = '2020-08-30'
>>> today[0:4]
'2020'
>>> today[5:7]
'08'
>>> today[8:10]
'30'
To use this slicing method on a column of dataframe, we can do the following:
>>> df['Date of Birth']
0 1970-12-30 00:00:00
1 1960-01-01 00:00:00
2 1955-03-10 00:00:00
3 1930-08-17 00:00:00
4 1965-11-20 00:00:00
5 1930-01-10 00:00:00
Name: Date of Birth, dtype: object
>>> df['Date of Birth'].str[0:4]
0 1970
1 1960
2 1955
3 1930
4 1965
5 1930
Name: Date of Birth, dtype: object
>>> df['Date of Birth'].str[5:7]
0 12
1 01
2 03
3 08
4 11
5 01
Name: Date of Birth, dtype: object
>>> df['Date of Birth'].str[8:10]
0 30
1 01
2 10
3 17
4 20
5 10
String .split() method
The .split() method allows splitting a text into pieces based on given delimiters. Let’s look at an example:
>>> word = "hello, world"
>>> word.split(',')
['hello', ' world']
The above example splits a String into two words by using comma as the delimiter. Technically we can use an character as the delimiter. Note the return result is a list of two words (string).
So how to apply this on a dataframe column? You probably got it, we use .str! Let’s try it on the Name column to get first and last name.
>>> df['Name'].str.split(',')
0 [Musk, Elon]
1 [Cook, Tim]
2 [Gates, Bill]
3 [Buffet, Warren]
4 [Ma, Jack]
5 [Trump, Donald]
Name: Name, dtype: object
>>> type(df['Name'].str.split(','))
<class 'pandas.core.series.Series'>
The split was successful, but when we check the data type, it appears it’s a pandas series that contains a list of two words for each row. What we want is to split the text into two different columns (pandas series). It seems we have a problem, but don’t worry! The pandas str.split() method has an optional argument: expand. When set to True, it can return the split items into different columns!
>>> df['Name'].str.split(',', expand=True)
0 1
0 Musk Elon
1 Cook Tim
2 Gates Bill
3 Buffet Warren
4 Ma Jack
5 Trump Donald
>>> type(df['Name'].str.split(',', expand=True))
<class 'pandas.core.frame.DataFrame'>
As expected, since there are multiple columns (series), the return result is actually a dataframe. Now we can split text into different columns easily:
df['First Name'] = df['Name'].str.split(',', expand=True)[1]
df['Last Name'] = df['Name'].str.split(',', expand=True)[0]