Convert Text/String To Number In Pandas

Sharing is caring!

Last Updated on January 10, 2022 by Jay

In this short tutorial, we’ll learn how to convert text or string data to numbers in pandas.

We’ll take look at two pandas built-in methods to convert string to numbers. As well as how to handle some of the special cases when these two methods alone don’t work.

Run the following code to create a sample dataframe. Every column contains text/string and we’ll convert them into numbers using different techniques. We use list comprehension to create several lists of strings, then put them into a dataframe.

import pandas as pd
import numpy as np

l1 = [f'10{i}' for i in range(10,30)]
l2 = [f'10{i}.{i}' for i in range(10,30)]
l3 = [f'1.0{i}.{i}' for i in range(10, 30)]
l4 = [f'1,0{i}' for i in range(10, 30)]
l5 = [f'$1,0{i}' for i in range(10, 30)]
l6 = [f'{i}%' for i in range(20)]
l7 = [f'{i}.{i}%' for i in range(20)]
l8 = [f'10{i}' if i%2 == 0 else 'asdf' for i in range(10,30)]
l9 = [f'$1,0{i}' if i%2 == 0 else 'asdf' for i in range(10,30)]

df = pd.DataFrame({'l1':l1,
                   'l2':l2,
                   'l3':l3,
                   'l4':l4,
                   'l5':l5,
                   'l6':l6,
                   'l7':l7,
                   'l8':l8,
                   'l9':l9,
    })

Take a peek at the first 5 rows of the dataframe using the df.head() method. Keep in mind that all the values in the dataframe are string data type.

>>> df.head()
     l1       l2        l3     l4      l5  l6    l7    l8      l9
0  1010  1010.10  1.010.10  1,010  $1,010  0%  0.0%  1010  $1,010
1  1011  1011.11  1.011.11  1,011  $1,011  1%  1.1%  asdf    asdf
2  1012  1012.12  1.012.12  1,012  $1,012  2%  2.2%  1012  $1,012
3  1013  1013.13  1.013.13  1,013  $1,013  3%  3.3%  asdf    asdf
4  1014  1014.14  1.014.14  1,014  $1,014  4%  4.4%  1014  $1,014

The df.astype() method

This is probably the easiest way. We can take a column of strings then force the data type to be numbers (i.e. Integer or Float).

For the first column, since we know it’s supposed to be “integers” so we can put int in the astype() conversion method.

>>> df['l1'].astype(int).head()
0    1010
1    1011
2    1012
3    1013
4    1014
Name: l1, dtype: int32

However, the int will not work if the data contain decimals. In this case, we need to pass float into the method argument.

>>> df['l2'].astype(float).head()
0    1010.10
1    1011.11
2    1012.12
3    1013.13
4    1014.14
Name: l2, dtype: float64

This method looks easy to apply, but that’s pretty much all it can do – it doesn’t work for the rest of the columns. The reason is that other columns all contain some sort of special characters such as comma (,), dollar sign ($), percentage (%), and so on. Apparently, the .astype() method cannot handle those special characters.

The pd.to_numeric() method

This method works similar to df.astype() in a way that they don’t recognize the special characters such as the currency symbols ($) or the thousand separators (dot or comma).

However, this method is handy in some situations where we need to clean up data. For example, the data in column l8 is a mix of “text’d numbers (e.g. “1010”) and other real text (e.g. “asdf”). In the pd.to_numeric method, when error = ' coerce', the code will run without raising errors but will return NaN for invalid numbers.

Then we can replace those NaN with other dummy values such as 0.

>>> pd.to_numeric(df['l8'], errors = 'coerce')
0     1010.0
1        NaN
2     1012.0
3        NaN
4     1014.0
5        NaN
.....

>>> pd.to_numeric(df['l8'], errors = 'coerce').replace(np.nan, 0)
0     1010.0
1        0.0
2     1012.0
3        0.0
4     1014.0
5        0.0

Data that contains special characters

For those columns that contain special characters such as the dollar sign, percentage sign, dot, or comma, we need to remove those characters first before converting the text into numbers.

We can use the df.str to access an entire column of strings, then replace the special characters using the .str.replace() method. For example:

df['l3'] = df['l3'].str.replace('.', '', n=1)

The n=1 argument above means that we are replacing only the first occurrence (from the start of the string) of the ‘.’. By default, n is set to -1, which will replace all occurrences.

DO NOT confuse the .str.replace() with df.replace(). The former operates only on strings; whereas the latter works on either strings or numbers.

Let’s also clean up the other columns:

df['l4'] = df['l4'].str.replace(',', '')
df['l5'] = df['l5'].str.replace(',', '').str.replace('$', '')
df['l6'] = df['l6'].str.replace('%', '')
df['l9'] = df['l9'].str.replace('$', '').str.replace(',', '')

After removing all the special characters, now we can use either df.astype() or pd.to_numeric() to convert text to numbers.

One comment

Leave a Reply

Your email address will not be published. Required fields are marked *