Last Updated on July 14, 2022 by Jay
We have seen many different ways to load data into Python using pandas, such as .read_csv() or .read_excel(). Those methods work like “Open File” in Excel, but we often need to “Create New File” too! So today let’s go through how to create an empty pandas dataframe (i.e. like a blank Excel sheet).
This tutorial is part of the “Integrate Python with Excel” series; you can find the table of contents here for easier navigation.
General syntax
There are many ways to create a dataframe in pandas; I will talk about the few that I use most often and find most intuitive. All of them start from the same syntax: pd.DataFrame(). There are a few notable arguments we can pass into the parentheses:
data: quite literally, this is the data you want to place inside the dataframe.
index: the row labels (the index).
columns: the column names.
The data argument is quite versatile and can take many different forms: int, string, boolean, list, tuple, dictionary, etc.
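Since the goal is to start from a blank sheet, here is a minimal sketch of how the three arguments fit together, including a completely empty dataframe. The column names, row labels and values below are made up purely for illustration.

```python
import pandas as pd

# No arguments at all: a completely empty dataframe (0 rows, 0 columns).
empty = pd.DataFrame()
print(empty.shape)

# Only column names: an empty "sheet" with headers but no rows yet.
# 'name' and 'score' are made-up column names for illustration.
headers_only = pd.DataFrame(columns=['name', 'score'])
print(headers_only.shape)

# All three arguments together: data, row labels (index) and column names.
full = pd.DataFrame(
    data=[['Ann', 90], ['Bob', 85]],
    index=['row1', 'row2'],
    columns=['name', 'score'],
)
print(full)
```

The headers-only version is probably the closest thing to a blank Excel sheet with a header row already typed in.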
Create an n x m size dataframe
Let’s create a 10-row by 5-column dataframe filled with the value 1. Here we specify data = 1, 10 rows (index), and 5 columns.
>>> import pandas as pd
>>> pd.DataFrame(data = 1, index=range(10), columns = range(5))
   0  1  2  3  4
0  1  1  1  1  1
1  1  1  1  1  1
2  1  1  1  1  1
3  1  1  1  1  1
4  1  1  1  1  1
5  1  1  1  1  1
6  1  1  1  1  1
7  1  1  1  1  1
8  1  1  1  1  1
9  1  1  1  1  1
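The same pattern works for any size and any fill value. Below is a small sketch that wraps it in a function; make_filled is a hypothetical helper name I made up for this example, not something that ships with pandas.

```python
import pandas as pd

def make_filled(n_rows, n_cols, value=0):
    """Hypothetical helper: build an n_rows x n_cols dataframe filled with one value."""
    return pd.DataFrame(data=value, index=range(n_rows), columns=range(n_cols))

# The 10 x 5 dataframe of ones from the example above.
ones = make_filled(10, 5, value=1)
print(ones.shape)

# A smaller 3 x 2 dataframe filled with the default value 0.
zeros = make_filled(3, 2)
print(zeros)
```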
Create a dataframe from lists
Creating a dataframe from lists can be confusing at first, but once you get the hang of it, it becomes intuitive. Let’s look at the following example. We have two lists, a and b, and we create a list of lists, [a,b]. Pay attention to what it looks like on the output line.
a = [1,2,3,4,5]
b = ['v','x','x','y','z']
>>> [a,b]
[[1, 2, 3, 4, 5],
['v', 'x', 'x', 'y', 'z']]
Now let’s create a dataframe from the list of lists [a,b]. It literally just puts the above structure into a dataframe: each inner list becomes a row. Since we didn’t specify the index and columns arguments, by default they are set to integer values starting from 0. Remember that Python uses zero-based indexing?
>>> pd.DataFrame([a,b])
   0  1  2  3  4
0  1  2  3  4  5
1  v  x  x  y  z
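If the default 0, 1, 2... labels are not what you want, the index and columns arguments work here too. This is just a sketch; the row and column labels are made up for illustration.

```python
import pandas as pd

a = [1, 2, 3, 4, 5]
b = ['v', 'x', 'x', 'y', 'z']

# Each inner list becomes one row, so we need 2 row labels and 5 column labels.
# The labels below are made up for illustration.
df_rows = pd.DataFrame(
    [a, b],
    index=['a', 'b'],
    columns=['c1', 'c2', 'c3', 'c4', 'c5'],
)
print(df_rows)
print(df_rows.shape)
```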
The above is actually quite intuitive if you compare [a,b] with the new dataframe. However, what if your intention was to create 2 columns, with the first column containing the values in a and the second column containing the values in b? You can still use lists, but this time you have to zip() them. Let’s see what zip does.
>>> zip(a,b)
<zip object at 0x000001C933619AC0>
Okay, but what is a zip object anyway? It’s actually an iterator, which is just an object that you can iterate (loop) through. Generally speaking, if you want to see what’s inside an iterator, simply loop through it and print out the elements like this.
>>> for i in zip(a,b):
...     print(i)
(1, 'v')
(2, 'x')
(3, 'x')
(4, 'y')
(5, 'z')
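One thing worth knowing about iterators: they can only be walked through once. Here is a small sketch that shows this, using list() as a shortcut for the loop above. The variable names pairs, first_pass and second_pass are just for this example.

```python
a = [1, 2, 3, 4, 5]
b = ['v', 'x', 'x', 'y', 'z']

pairs = zip(a, b)

# list() walks through the iterator and collects every element.
first_pass = list(pairs)
print(first_pass)

# The iterator is now used up, so a second pass finds nothing left.
second_pass = list(pairs)
print(second_pass)
```

So if you want to reuse the pairs, either call zip(a,b) again or keep the list around.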
Remember what the list of lists [a,b] looked like? Each list was a row. The zip object instead pairs up the elements, so if you create a dataframe from this iterator, you will get two columns of data:
>>> pd.DataFrame(zip(a,b))
   0  1
0  1  v
1  2  x
2  3  x
3  4  y
4  5  z
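Another way to get the same layout is to build the 2-row dataframe from [a,b] and then transpose it with .T. This is a sketch of an alternative, not a replacement for the zip approach; the variable names are mine.

```python
import pandas as pd

a = [1, 2, 3, 4, 5]
b = ['v', 'x', 'x', 'y', 'z']

from_zip = pd.DataFrame(zip(a, b))

# .T swaps the rows and columns of the 2-row dataframe built from [a, b].
from_transpose = pd.DataFrame([a, b]).T

print(from_transpose.shape)

# Compare the raw values of the two dataframes row by row.
same_values = from_zip.values.tolist() == from_transpose.values.tolist()
print(same_values)

# The column types may differ between the two, so it is worth checking them.
print(from_zip.dtypes)
print(from_transpose.dtypes)
```

The values match, but the zip version lets pandas infer an integer type for the first column, which is one reason I prefer it.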
Create a dataframe from dictionary
My favorite method to create a dataframe is from a dictionary, because personally I feel this one has the best readability. When we feed pd.DataFrame() a dictionary, the keys automatically become the column names. Let’s start by constructing a dictionary of lists.
>>> {'a':a,'b':b}
{'a': [1, 2, 3, 4, 5],
'b': ['v', 'x', 'x', 'y', 'z']}
So we have two items inside this dictionary: the first item’s key is ‘a’, and the second item’s key is ‘b’. Let’s create a dataframe from the above dictionary.
>>> pd.DataFrame({'a': a,
...               'b': b})
   a  b
0  1  v
1  2  x
2  3  x
3  4  y
4  5  z
The above method is equivalent to the following but more readable.
>>> pd.DataFrame(zip(a,b), columns = ['a','b'])
   a  b
0  1  v
1  2  x
2  3  x
3  4  y
4  5  z
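If you want to convince yourself that the two approaches really give the same result, here is a quick check. It is only a sketch; from_dict and from_zip are names I picked for this example.

```python
import pandas as pd

a = [1, 2, 3, 4, 5]
b = ['v', 'x', 'x', 'y', 'z']

from_dict = pd.DataFrame({'a': a, 'b': b})
from_zip = pd.DataFrame(zip(a, b), columns=['a', 'b'])

# .equals() checks that both dataframes hold the same values,
# with matching labels and column types.
print(from_dict.equals(from_zip))
```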
Conclusion
Remember that a dataframe is super flexible: once you create it, you can adjust its size to fit your needs. We can freely insert rows or columns into the dataframe, or remove them again. Here we add a column and a row to our previous 10 x 5 dataframe example.
df = pd.DataFrame(data = 1, index=range(10), columns = range(5))
df['6th col'] = 6
df.loc[10,:] = 10
>>> df
       0     1     2     3     4  6th col
0    1.0   1.0   1.0   1.0   1.0      6.0
1    1.0   1.0   1.0   1.0   1.0      6.0
2    1.0   1.0   1.0   1.0   1.0      6.0
3    1.0   1.0   1.0   1.0   1.0      6.0
4    1.0   1.0   1.0   1.0   1.0      6.0
5    1.0   1.0   1.0   1.0   1.0      6.0
6    1.0   1.0   1.0   1.0   1.0      6.0
7    1.0   1.0   1.0   1.0   1.0      6.0
8    1.0   1.0   1.0   1.0   1.0      6.0
9    1.0   1.0   1.0   1.0   1.0      6.0
10  10.0  10.0  10.0  10.0  10.0     10.0
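Going the other way, here is a minimal sketch of shrinking the dataframe back down with .drop(). The variable name smaller is just for this example.

```python
import pandas as pd

# Rebuild the enlarged 11 x 6 dataframe from the example above.
df = pd.DataFrame(data=1, index=range(10), columns=range(5))
df['6th col'] = 6
df.loc[10, :] = 10

# drop() returns a new dataframe; the original df is left unchanged.
smaller = df.drop(columns=['6th col']).drop(index=[10])

print(smaller.shape)
print(df.shape)
```

Because .drop() returns a new dataframe, you need to assign the result to a variable to keep it.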
This is probably obvious, but I still want to point it out: once we create a dataframe (to be more specific, a pd.DataFrame object), we can access all the wonderful methods that pandas has to offer! For example, we can sort the dataframe rows by index in decreasing order:
>>> df.sort_index(ascending=False)
       0     1     2     3     4  6th col
10  10.0  10.0  10.0  10.0  10.0     10.0
9    1.0   1.0   1.0   1.0   1.0      6.0
8    1.0   1.0   1.0   1.0   1.0      6.0
7    1.0   1.0   1.0   1.0   1.0      6.0
6    1.0   1.0   1.0   1.0   1.0      6.0
5    1.0   1.0   1.0   1.0   1.0      6.0
4    1.0   1.0   1.0   1.0   1.0      6.0
3    1.0   1.0   1.0   1.0   1.0      6.0
2    1.0   1.0   1.0   1.0   1.0      6.0
1    1.0   1.0   1.0   1.0   1.0      6.0
0    1.0   1.0   1.0   1.0   1.0      6.0
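Sorting by the index is only one option. As a sketch of another commonly used method, .sort_values() sorts the rows by the values in a column instead. I am reusing the small a/b dataframe from the dictionary section here.

```python
import pandas as pd

df_ab = pd.DataFrame({'a': [1, 2, 3, 4, 5],
                      'b': ['v', 'x', 'x', 'y', 'z']})

# Sort the rows by the values in column 'a', largest first.
by_a = df_ab.sort_values('a', ascending=False)
print(by_a)
```

Notice that the original index labels travel with their rows, so the index now reads 4, 3, 2, 1, 0.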