Last Updated on December 5, 2021 by Jay
In this tutorial, we’ll learn exploratory data analysis (EDA) with a Python library called pandasgui. Note that this library is not part of pandas; it’s a standalone library that we need to install separately.
Several alternative tools for EDA in Python:
- Pandas Profiling
- Pandas Gui
pip install pandasgui
Depending on your computer’s settings, you might encounter an error during the installation like I did. If you get this error:
Microsoft Visual C++ 14.0 or greater is required. Get it with “Microsoft C++ Build Tools”….
then don’t worry: simply go to the site given at the end of the error message.
Download the Build Tools and install them with the default selections.
Once the C++ Build Tools installation finishes, close and re-open the Command Prompt, then run pip install pandasgui again. This time it should work.
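If we want to double-check that the installation really worked before moving on, a small script like the one below can help. This is just a sketch using the standard library; the helper name is_installed is my own and not part of pandasgui.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if the current Python interpreter can find the package."""
    return importlib.util.find_spec(package_name) is not None


# Prints True once pip has installed pandasgui into this environment.
print("pandasgui installed:", is_installed("pandasgui"))
```

If this prints False right after a successful pip install, pip and Python are probably pointing at different environments.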
We are going to use the pokemon dataset for this tutorial.
from pandasgui.datasets import pokemon
When we import the pokemon dataset for the first time, it will be downloaded and saved to our local folder.
To start pandasgui, simply call show(). We can load multiple datasets at once by passing several dataframes to the show() function.
from pandasgui import show
show(pokemon)  # open the GUI with the pokemon dataframe loaded
# show(df1, df2, df3)  # if we want to load multiple datasets at once
Run the code and we’ll see a graphical user interface (GUI) pop up. Our dataset is now loaded into the program, and we can switch between datasets by clicking on their names in the left-hand panel.
To load new datasets, we can drag and drop a file into the program interface, or click on Edit -> Import.
The DataFrame tab provides a snapshot of the dataset itself, which is pretty nice considering some IDEs don’t print all columns on the screen. Here we can see clearly what types of data we have.
The Statistics tab contains a high-level summary of the data, for example count, number of unique values, mean, min/max, etc. This tab provides information similar to what the pandas_profiling and sweetviz libraries report.
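For comparison, here is roughly how we could compute the same kind of summary in plain pandas. This is only a sketch: the few rows below are made-up sample values standing in for the real pokemon dataset, and this is not necessarily how pandasgui calculates its Statistics tab.

```python
import pandas as pd

# Made-up sample rows standing in for the pokemon dataset (illustrative values).
df = pd.DataFrame({
    "Name": ["Pikachu", "Charizard", "Snorlax", "Pidgey"],
    "Type 1": ["Electric", "Fire", "Normal", "Normal"],
    "HP": [35, 78, 160, 40],
    "Attack": [55, 84, 110, 45],
})

# Mean, min and max only make sense for the numeric columns.
numeric = df[["HP", "Attack"]]

# One row per column, similar in spirit to the Statistics tab.
stats = pd.DataFrame({
    "count": df.count(),
    "unique": df.nunique(),
    "mean": numeric.mean(),
    "min": numeric.min(),
    "max": numeric.max(),
})
print(stats)
```

Text columns such as Name get NaN for mean, min and max here, because those were computed on the numeric columns only.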
Pandasgui unique features
pandasgui differentiates itself by giving us more flexibility to play around with the dataset. For example, in the “Grapher” tab, we can plot different charts using variables of our choice. And in the “Reshaper” tab, we can manipulate and reshape the data.
We can create filters easily: start by typing a column name, for example HP, then select the item from the dropdown menu. Then we can set a condition, for example HP > 100. The data on the DataFrame tab changes, and now we only see those pokemon with HP greater than 100.
Keep in mind that if we apply filters to a dataset, all the views and operations we do from now on will apply to the filtered data as opposed to the original dataset. If we click on the “Statistics” tab now, it shows the stats just for those pokemon with HP > 100. The same goes for the Grapher and Reshaper tabs. To view or operate on the original full dataset, remember to switch off any filters.
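To make the "filtered view" idea concrete, here is the pandas equivalent of the HP > 100 filter. It is a sketch on made-up sample rows, and I’m not claiming this is exactly what pandasgui runs internally.

```python
import pandas as pd

# Made-up sample rows standing in for the pokemon dataset (illustrative values).
df = pd.DataFrame({
    "Name": ["Pikachu", "Charizard", "Snorlax", "Pidgey"],
    "HP": [35, 78, 160, 40],
    "Attack": [55, 84, 110, 45],
})

# Keep only the rows that satisfy the condition; df itself is left untouched.
filtered = df.query("HP > 100")

# Any statistic computed from here on describes the filtered rows only.
print(filtered)
print("rows before filter:", len(df), "| rows after filter:", len(filtered))
```

Just like in the GUI, the original dataframe still holds all rows; only the filtered copy is smaller.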
This is one of my favorite features in pandasgui. We can modify data values in the DataFrame tab by selecting a cell and typing a new value. Everything we modify there is stored and reflected automatically in the underlying dataframe. How cool is that! Here I changed Pikachu’s Attack to 100k. If we then head to the Statistics tab, we’ll see that the max Attack in this dataset is now also 100k.
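Editing a cell in the GUI is comparable to assigning a value with .loc in pandas. The sketch below uses made-up sample rows to show why the max Attack changes after the edit.

```python
import pandas as pd

# Made-up sample rows standing in for the pokemon dataset (illustrative values).
df = pd.DataFrame({
    "Name": ["Pikachu", "Charizard", "Snorlax"],
    "Attack": [55, 84, 110],
})

# Overwrite a single cell: the Attack value on Pikachu's row.
df.loc[df["Name"] == "Pikachu", "Attack"] = 100_000

# The summary statistic now reflects the edited value.
max_attack = df["Attack"].max()
print(max_attack)
```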
In case you haven’t noticed, the charting is done by plotly.
I’m going to check out the Word Cloud chart, which is a good way to learn about pokemon for a newbie like myself. I drop “Type 2” in the value box, and I can tell that a lot of the pokemon have a “flying” attribute. Not sure if this is true (not an expert myself), so feel free to leave a comment if you can confirm that!
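A word cloud is essentially a picture of how often each value appears, so one way to check what it suggests is to count the values directly. Here is a sketch with made-up sample rows; running the same value_counts() call on the real pokemon dataframe would tell us whether “Flying” really is the most common Type 2.

```python
import pandas as pd

# Made-up sample rows (illustrative only; they say nothing about the real dataset).
df = pd.DataFrame({
    "Name": ["Pikachu", "Charizard", "Pidgey", "Bulbasaur"],
    "Type 2": [None, "Flying", "Flying", "Poison"],
})

# Count how many times each Type 2 value appears; missing values are skipped.
type2_counts = df["Type 2"].value_counts()
print(type2_counts)
```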
The reshaper is pretty cool. We can use it to create pivot tables, or merge datasets, etc.
For example, we can make a pivot table using Type 1 as the index and count as the aggregate function, then hit Finish to generate the pivot table. The final results will show up in the DataFrame tab. Also, note that when we create a pivot table, a new standalone dataframe appears in the left-hand panel. Always pay attention to this panel to make sure that we have the right dataframe selected before doing operations.
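The same pivot can be written in pandas with pivot_table. This is a sketch on made-up sample rows; counting the HP column is my own choice for the example, since the GUI lets us pick the values ourselves.

```python
import pandas as pd

# Made-up sample rows standing in for the pokemon dataset (illustrative values).
df = pd.DataFrame({
    "Name": ["Pikachu", "Charizard", "Snorlax", "Pidgey"],
    "Type 1": ["Electric", "Fire", "Normal", "Normal"],
    "HP": [35, 78, 160, 40],
})

# Type 1 as the index, count as the aggregate function.
pivot = df.pivot_table(index="Type 1", values="HP", aggfunc="count")
print(pivot)
```

The result is a new dataframe, separate from df, which matches what we see in the GUI when a new entry shows up in the left-hand panel.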
- Pandasgui provides a lot of flexibility to users to interact with the data by plotting and reshaping the dataset.
- Work on multiple dataframes simultaneously, e.g. merging or concatenating datasets.
- Unlike pandas_profiling and sweetviz, pandasgui doesn’t provide a set of predefined analyses for each variable (data column).
- pandasgui also doesn’t provide the correlation coefficient matrix.
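If we do need the correlation matrix, pandas can produce it in one line, and the result is an ordinary dataframe. Here is a sketch on made-up sample rows.

```python
import pandas as pd

# Made-up sample rows standing in for the pokemon dataset (illustrative values).
df = pd.DataFrame({
    "Name": ["Pikachu", "Charizard", "Snorlax", "Pidgey"],
    "HP": [35, 78, 160, 40],
    "Attack": [55, 84, 110, 45],
})

# Pearson correlation coefficients between the numeric columns.
corr = df[["HP", "Attack"]].corr()
print(corr)
```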