Last Updated on October 4, 2021 by Jay
We’ll talk about data visualization & exploration with a Python library called plotly. There are many plotting libraries within the Python ecosystem. Some popular names are matplotlib, seaborn, ggplot, etc. When it comes to functionality, plotly is one of my favorites. The first few seconds of the video below will show you what plotly is capable of. Can you believe the following chart is created with just 2 lines of Python code? Keep reading and I’ll show you how.
Why should we be interested in visualization?
Because the human visual system is a pattern seeker of enormous power and subtlety. The eye and the visual cortex of the brain form a massively parallel processor that provides the highest-bandwidth channel into human cognitive centers. At higher levels of processing, perception and cognition are closely interrelated, which is the reason why the words “understanding” and “seeing” are synonymous. (Colin Ware, Information Visualization: Perception for Design)
What is plotly?
If you are not interested in the background and want to jump directly into coding & data visualization, skip this section.
Setting Up The Coding Environment
If you frequent my blog, you’ll notice that I use the (vanilla) Python IDLE in most of my articles. However, that’s not true when I’m working, and I’ll cover why in another post.
For this tutorial, we’ll use an IDE called Jupyter Notebook. If you need help installing Jupyter Notebook, check out this post. We’ll skip the setup here to keep this tutorial short.
It’s also recommended to use a virtual environment with plotly. Although using a virtual environment is a best practice, you still don’t have to if you find it too overwhelming (happened to me when I first started learning Python). Check out this article to help you get started with the virtual environment.
Install & Import
As always, pip install the libraries if you haven’t already. When installing multiple libraries at once, we can type them all after “pip install”, separated by a single space character.
pip install plotly ipywidgets
At a very high level, the plotly Python library consists of three tools. We can use them either for data visualization or exploration, or even making a beautiful dashboard.
- plotly express – this is what we’ll be using here
- plotly graph objects
- dash

|Name|Good for…|Learning Curve (max 5)|
|---|---|---|
|plotly express|quick charting, data exploration|⭐|
|plotly graph objects|full customization, data “story-telling”|⭐⭐|
|dash|interactive chart, web application|⭐⭐⭐⭐⭐|
By convention (as well as out of laziness), we import plotly express and name it px:
import plotly.express as px
We’ll be using the “gapminder” dataset. Gapminder is an organization co-founded by Swedish scientist Hans Rosling, who popularized the power of data visualization through this TED talk. Their website has some pretty cool datasets if you want to check them out.
Conveniently, we can get the dataset directly from plotly express. And yep, inside a nicely organized pandas dataframe! We are going to use the Python plotly library to visualize and explore the dataset.
df = px.data.gapminder()
It looks like this in Jupyter Notebook, which makes data exploration and plotting jobs very convenient.
In case the column headers are not clear, here’s what they mean:
- lifeExp – average life expectancy for a given country in a given year
- pop – population of a country
- gdpPercap – GDP per capita; the higher this number, the richer the country
- iso_xxx – country codes
The Basic Plotly Data Attribute
I try to keep this tutorial beginner-friendly and easy to follow. However, understanding the fundamentals will help make the journey a lot easier going forward.
For now, let’s just remember there are two key data attributes required to define a plotly chart.
- data – text, numbers, coordinates, etc
- layout – how you want the chart to look; we define the layout by “describing” what we want to see on the chart
Data Exploration & Visualization
To see which continents and countries the dataset covers, we can show just the unique values by using the .unique() method. You can test it on the ‘country’ column.
Let’s start with something simple – draw a line chart to show the life expectancy over time for three countries: Canada, Mexico, United States.
px.line(df.loc[df['country'].isin(['Canada','Mexico', 'United States'])], x='year', y='lifeExp', color='country')
The first argument is a dataframe (with some filters to show just the three countries). The “x”, “y”, and “color” arguments then map dataframe columns onto the chart: we have to tell plotly which columns to place on the x and y axes, and how to distinguish the colors (by country in this case). Below is what we get, and if you hover the mouse over the lines, the “tooltip” box will show up with relevant information.
Boxplot And Violin Plot
If you come from a Statistics background like I do, you might appreciate the boxplot shown below.
px.box(df.loc[df['year'] == 2007], y='lifeExp', color='continent', hover_name = 'country')
A boxplot elegantly shows max, min, 25%, 50%, 75% percentiles and the outliers in one chart.
In case you are wondering, the outlier (blue) in Asia is Afghanistan, and the Americas (purple) outlier is Haiti. Having the hover_name='country' argument allows us to see that information on the chart.
And you can combine a violin plot with a boxplot if you want to go a step further…
Another way to look at distribution is by using a histogram. The chart below shows that higher life expectancy is associated with higher wealth. I guess it makes sense, right?
In the chart below, we use the argument facet_col = 'continent' to split the chart by the 5 continents in the dataset. The histfunc = 'avg' argument controls how the y-axis values are aggregated; we can use one of 'sum', 'count', 'avg', 'max', or 'min'.
For scatter plot fans, simply drawing everything on the chart doesn’t really tell us anything – just a bunch of dots…
However, things get interesting as we play around with the arguments. Let’s add the following:
- size = 'pop' – use different sizes for the dots based on each country’s population
- size_max = 100 – set the dot (bubble) size for the country with the largest pop (China) to 100, and scale the other countries accordingly
- log_x = True – use a log scale for the x-axis, which pulls the larger x-axis values closer together, so the left-hand side of the chart becomes less crowded and we can see more clearly
- range_y = [20, 90] – fix the y-axis range from age 20 to 90
Let’s re-draw the plot. Do you see a pattern forming, as if the “bubbles” are trending towards the upper-right corner? Although it’s a bit hard to see, the largest bubble is in Asia (blue); that’s probably China because we set the size to be based on population.
So far our chart is in 2 dimensions – GDP Per Capita vs Life Expectancy. Let’s add a third dimension: time!
Not sure about you, but I wow’d the first time I saw this animated chart. Check the video embedded on the top of the article if you haven’t yet!
This single visualization shows how 142 countries evolved over a half-century time frame: as the countries get richer (higher gdpPercap), people tend to live longer lives.
“A picture is worth a thousand words”. Now I can see this is true.
How many lines of code?
This Python plotly library makes data visualization and exploration a breeze. Technically, it takes just 2 lines of code to achieve the above animated plot; if you don’t believe me, try it yourself. The code might wrap onto more lines on a small or medium-sized screen, but if your screen is wide enough, you’ll see that the below is indeed just 2 lines of Python code.
import plotly.express as px
px.scatter(px.data.gapminder(), x='gdpPercap', y='lifeExp', color='continent', hover_name='country', size='pop', log_x=True, size_max=100, range_y=[20,90], animation_frame='year')