Last Updated on July 14, 2022 by Jay
This tutorial will show you how to make a wordcloud in Python. A wordcloud is a type of visualization for text data. The image below is a wordcloud: some words are bigger and bolder while others are smaller. Usually, the more often a word appears in the data, the bigger it appears in the visualization.
In the following wordcloud, the top three keywords are: “vehicle”, “energy” and “year”. Let’s make it now.
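To get a feel for the "more frequent means bigger" idea before installing anything, here is a minimal sketch that counts words with the standard library. The sample sentence and the tiny stop-word list are made up for illustration, and the wordcloud library's own text processing is more involved than this.

```python
import re
from collections import Counter

# Hypothetical sample text, not the report used in this tutorial.
sample_text = "Energy storage and vehicle energy use: every vehicle needs energy."

# Lowercase the text and pull out alphabetic words.
words = re.findall(r"[a-z]+", sample_text.lower())

# Drop a few very common words (a tiny, hand-picked stop list).
stop_words = {"and", "every", "the", "a", "of"}
counts = Counter(w for w in words if w not in stop_words)

# The most frequent words are the ones a wordcloud would draw largest.
top_words = counts.most_common(2)
print(top_words)  # [('energy', 3), ('vehicle', 2)]
```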
Install the following libraries using pip:
pip install wordcloud numpy matplotlib pillow
Wordcloud in Python
The text data is an excerpt from Tesla’s 2021 impact report that describes the company’s goals. For your convenience, I saved a copy of the text and the source code for this tutorial in this GitHub repository: https://github.com/pythoninoffice/blog_example_code/blob/main/wordcloud.ipynb
from wordcloud import WordCloud
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image

text_data = '......'  # see link to the source code
The wordcloud library is quite easy to use. It literally creates a wordcloud visualization in one line of Python code. (Not counting the code to show it)
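Note that plt.axis('off') in the code below hides the axes; this is optional and only there for a cleaner look.
Also note that plt.imshow() is what draws the wordcloud onto the figure; plt.show() on its own does not draw it. In a Jupyter notebook plt.imshow() is enough, while in a plain script you still need to call plt.show() afterwards to open the window.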
wc = WordCloud().generate(text_data)
plt.axis('off')
plt.imshow(wc)
The color and position of each word are randomized each time we run WordCloud().generate(). Below are a few examples:
To spice up the wordcloud, we can organize the words into any shape instead of just a rectangle.
I suggest using a black and white image for the best result; that way, the image needs no extra processing. I found an image of the Apple logo, but you are free to use whatever image you want.
We’ll use the Pillow library to read the image into Python. To a computer, an image is just a matrix of integers ranging from 0 to 255. The numpy library conveniently converts a Pillow image object into an np.array object. In the output below, each triple such as [255, 255, 255] holds the red, green and blue (RGB) values of one pixel: [0, 0, 0] represents black, and [255, 255, 255] represents white.
# img_url must point to a local image file (or a file-like object); Image.open does not download URLs
img_mask = np.array(Image.open(img_url))
img_mask

array([[[255, 255, 255],
        [255, 255, 255],
        [255, 255, 255],
        ...,
        [255, 255, 255],
        [255, 255, 255],
        [255, 255, 255]],

       ...,

       [[255, 255, 255],
        [255, 255, 255],
        [255, 255, 255],
        ...,
        [255, 255, 255],
        [255, 255, 255],
        [255, 255, 255]]], dtype=uint8)
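To see the same structure on a scale small enough to read, here is a minimal sketch that builds a 2x2 image in memory rather than loading a file. The names tiny and tiny_array are made up for this example.

```python
import numpy as np
from PIL import Image

# Build a tiny 2x2 RGB image in memory: white everywhere...
tiny = Image.new("RGB", (2, 2), color=(255, 255, 255))
# ...then paint the top-left pixel black.
tiny.putpixel((0, 0), (0, 0, 0))

# Convert the Pillow image to a numpy array, as done with the real image.
tiny_array = np.array(tiny)
print(tiny_array.shape)  # (height, width, channels) -> (2, 2, 3)
print(tiny_array[0, 0])  # the black pixel -> [0 0 0]
print(tiny_array[1, 1])  # a white pixel -> [255 255 255]
```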
In the image above, the apple shape is black and the background is white, which is exactly what we want. The white area is the “mask”. The wordcloud library will not draw anything in the white (masked) area; instead, it arranges the words inside the apple logo shape.
# When a mask is given, the wordcloud takes the mask's shape; width and height are ignored
wc = WordCloud(width=1600, height=1600, mask=img_mask, background_color='white').generate(text_data)
plt.figure(figsize=[10, 10])
plt.axis("off")
plt.imshow(wc)
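If your image is not pure black and white (say, a light grey background), one possible approach is to threshold it first so that every light pixel becomes exactly 255. The helper to_binary_mask, its threshold of 128, and the small demo image below are assumptions made for illustration; they are not part of the wordcloud library.

```python
import numpy as np
from PIL import Image

def to_binary_mask(image, threshold=128):
    """Return a uint8 array: 255 (masked out) for light pixels, 0 for dark ones."""
    gray = np.array(image.convert("L"))  # grayscale, values 0-255
    return np.where(gray >= threshold, 255, 0).astype(np.uint8)

# Hypothetical demo image: light grey background, dark grey square in the middle.
demo = Image.new("RGB", (4, 4), color=(230, 230, 230))
for x in (1, 2):
    for y in (1, 2):
        demo.putpixel((x, y), (40, 40, 40))

demo_mask = to_binary_mask(demo)
print(demo_mask)  # 255 around the border, 0 in the 2x2 centre
```

The resulting array could then be passed as mask= in place of img_mask. You may need to adjust the threshold for your own image.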
We can also add a borderline (contour) around the shape if it isn’t obvious enough. Simply pass the contour_width and contour_color arguments to the WordCloud() constructor:
wc = WordCloud(width=1600, height=1600, mask=img_mask, background_color='white',
               contour_width=1, contour_color='red').generate(text_data)
plt.figure(figsize=[10, 10])
plt.axis("off")
plt.imshow(wc)