Last Updated on January 10, 2022 by Jay
In this short tutorial, we’ll learn how to convert text or string data to numbers in pandas.
We’ll take look at two pandas built-in methods to convert string to numbers. As well as how to handle some of the special cases when these two methods alone don’t work.
Run the following code to create a sample dataframe. Every column contains text/string and we’ll convert them into numbers using different techniques. We use list comprehension to create several lists of strings, then put them into a dataframe.
import pandas as pd
import numpy as np
l1 = [f'10{i}' for i in range(10,30)]
l2 = [f'10{i}.{i}' for i in range(10,30)]
l3 = [f'1.0{i}.{i}' for i in range(10, 30)]
l4 = [f'1,0{i}' for i in range(10, 30)]
l5 = [f'$1,0{i}' for i in range(10, 30)]
l6 = [f'{i}%' for i in range(20)]
l7 = [f'{i}.{i}%' for i in range(20)]
l8 = [f'10{i}' if i%2 == 0 else 'asdf' for i in range(10,30)]
l9 = [f'$1,0{i}' if i%2 == 0 else 'asdf' for i in range(10,30)]
df = pd.DataFrame({'l1':l1,
'l2':l2,
'l3':l3,
'l4':l4,
'l5':l5,
'l6':l6,
'l7':l7,
'l8':l8,
'l9':l9,
})
Take a peek at the first 5 rows of the dataframe using the df.head()
method. Keep in mind that all the values in the dataframe are string data type.
>>> df.head()
l1 l2 l3 l4 l5 l6 l7 l8 l9
0 1010 1010.10 1.010.10 1,010 $1,010 0% 0.0% 1010 $1,010
1 1011 1011.11 1.011.11 1,011 $1,011 1% 1.1% asdf asdf
2 1012 1012.12 1.012.12 1,012 $1,012 2% 2.2% 1012 $1,012
3 1013 1013.13 1.013.13 1,013 $1,013 3% 3.3% asdf asdf
4 1014 1014.14 1.014.14 1,014 $1,014 4% 4.4% 1014 $1,014
The df.astype() method
This is probably the easiest way. We can take a column of strings then force the data type to be numbers (i.e. Integer or Float).
For the first column, since we know it’s supposed to be “integers” so we can put int
in the astype()
conversion method.
>>> df['l1'].astype(int).head()
0 1010
1 1011
2 1012
3 1013
4 1014
Name: l1, dtype: int32
However, the int
will not work if the data contain decimals. In this case, we need to pass float
into the method argument.
>>> df['l2'].astype(float).head()
0 1010.10
1 1011.11
2 1012.12
3 1013.13
4 1014.14
Name: l2, dtype: float64
This method looks easy to apply, but that’s pretty much all it can do – it doesn’t work for the rest of the columns. The reason is that other columns all contain some sort of special characters such as comma (,), dollar sign ($), percentage (%), and so on. Apparently, the .astype()
method cannot handle those special characters.
The pd.to_numeric() method
This method works similar to df.astype()
in a way that they don’t recognize the special characters such as the currency symbols ($) or the thousand separators (dot or comma).
However, this method is handy in some situations where we need to clean up data. For example, the data in column l8 is a mix of “text’d numbers (e.g. “1010”) and other real text (e.g. “asdf”). In the pd.to_numeric
method, when error = ' coerce'
, the code will run without raising errors but will return NaN for invalid numbers.
Then we can replace those NaN with other dummy values such as 0.
>>> pd.to_numeric(df['l8'], errors = 'coerce')
0 1010.0
1 NaN
2 1012.0
3 NaN
4 1014.0
5 NaN
.....
>>> pd.to_numeric(df['l8'], errors = 'coerce').replace(np.nan, 0)
0 1010.0
1 0.0
2 1012.0
3 0.0
4 1014.0
5 0.0
Data that contains special characters
For those columns that contain special characters such as the dollar sign, percentage sign, dot, or comma, we need to remove those characters first before converting the text into numbers.
We can use the df.str
to access an entire column of strings, then replace the special characters using the .str.replace()
method. For example:
df['l3'] = df['l3'].str.replace('.', '', n=1)
The n=1 argument above means that we are replacing only the first occurrence (from the start of the string) of the ‘.’. By default, n is set to -1, which will replace all occurrences.
DO NOT confuse the .str.replace()
with df.replace()
. The former operates only on strings; whereas the latter works on either strings or numbers.
Let’s also clean up the other columns:
df['l4'] = df['l4'].str.replace(',', '')
df['l5'] = df['l5'].str.replace(',', '').str.replace('$', '')
df['l6'] = df['l6'].str.replace('%', '')
df['l9'] = df['l9'].str.replace('$', '').str.replace(',', '')
After removing all the special characters, now we can use either df.astype() or pd.to_numeric() to convert text to numbers.
One comment