Last Updated on July 14, 2022 by Jay
“I thought this blog talks about integrate Python with Excel stuff, why getting data from web???” Don’t be surprised when you see the title of this post. Nowadays, people are connected to the Internet anytime and anywhere. The Internet is probably the largest public database out there, learning how to get data from the Internet is essential. That’s why I want to talk about how to get table data from web page using Python and the pandas library. Also if you are already using Excel PowerQuery, this is equivalent to the “Get Data From Web”, but 100x more powerful.
This article is part of the “Integrate Python with Excel” series, you can find the table of content here for easier navigation through the series.
Get data from a website (web scraping)
HTML is the language behind every website. When we visit a website, what happens under the hood is like the following: 1. We type in an address (URL) into the browser’s address bar, the browser sends a request to the target website’s server. 2. Server receives the request and sends back the HTML code that composes the webpage. 3. Our browser receives the HTML code, run it on the fly and create a webpage for us to view.
So now you see, we humans see the beautiful web pages, but the machines only see code. And that is a good thing because code is easier to digest programmatically! Web scraping basically means that, instead of using a browser, we can use Python to send request to a website server, receive the HTML code, then extract the data we want.
We won’t cover too much HTML since this is not a web design tutorial, but I want to introduce the essentials so we have a basic understanding of how web sites & web scraping work. HTML elements, or “HTML tags” are certain keywords wrapped around with <>
. For example, the following HTML code is the title of this web page. Hover your mouse over this tab and you should see the same title on the browser. Notice that most of HTML elements require an opening tag (e,g, <title>
), and a corresponding closing tag (e.g. </title>
)
<title>Get table data from web page using Python pandas (equivalent to PowerQuery get data from web) - Python In Office</title>
Similarly, the following code will draw a table on your browser, you can try by copy & paste it into a notepad, then save it as a “table-example.html” file. You should be able to open it inside a browser. Some quick notes:
<table>...</table>
draws the table<tr>...</tr>
draws a row inside the table<th>...</th>
indicates table header<td>...</td>
indicates table data
<html>
<table>
<tr>
<th>User Name</th>
<th>Country</th>
<th>City</th>
<th>Gender</th>
<th>Age</th>
</tr>
<tr>
<td>Forrest Gump</td>
<td>USA</td>
<td>New York</td>
<td>M</td>
<td>50</td>
</tr>
<tr>
<td>Mary Jane</td>
<td>CANADA</td>
<td>Toronto</td>
<td>F</td>
<td>30</td>
</tr>
</table>
</html>
Requirements for using pandas for web scraping
Now you understand the basic building blocks of a website and how to interpret HTML (well, at least the table part!). The reason I only covered HTML table is because, most of the time when we try to get data from a website, it’s in a table format. And pandas is the perfect tool for getting the table format data from a website!
So the only requirement to use pandas to get data from a website is that the data has to be store inside a table, or, in HTML terms, within the <table>...</table>
tags. pandas will be able to extract the table, headers and data rows using those HTML tags we covered just now.
If you try to use pandas to “extract data” from a webpage that doesn’t contain any table (<table>...</table>
tags), you won’t be able to get any data. For those data not stored in a table, we need other ways to scrape the website.
Web Scraping in action
Our previous examples were mostly small tables with a few data points, let’s use something slightly bigger and more data to play with.
We are going to get the S&P 500 company names and stock symbols from wikipeida: https://en.wikipedia.org/wiki/List_of_S%26P_500_companies
>>> import pandas as pd
>>> df = pd.read_html('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies')
>>> type(df)
<class 'list'>
>>> len(df)
2
Our df
above actually is a list, that’s interesting…and seems there are 2 items in that list. Let’s see what data pandas grabbed for us…
>>> df[0]
Symbol Security ... CIK Founded
0 MMM 3M Company ... 66740 1902
1 ABT Abbott Laboratories ... 1800 1888
2 ABBV AbbVie Inc. ... 1551152 2013 (1888)
3 ABMD ABIOMED Inc ... 815094 1981
4 ACN Accenture plc ... 1467373 1989
.. ... ... ... ... ...
500 YUM Yum! Brands Inc ... 1041061 1997
501 ZBRA Zebra Technologies ... 877212 1969
502 ZBH Zimmer Biomet Holdings ... 1136869 1927
503 ZION Zions Bancorp ... 109380 1873
504 ZTS Zoetis ... 1555280 1952
[505 rows x 9 columns]
>>> df[1]
Date ... Reason
Date ... Reason
0 June 22, 2020 ... Market capitalization change.[6]
1 June 22, 2020 ... Market capitalization change.[6]
2 June 22, 2020 ... Market capitalization change.[6]
3 May 22, 2020 ... Market capitalization change.[7]
4 May 12, 2020 ... Market capitalization change.[8]
.. ... ... ...
244 December 5, 2000 ... Market Cap changes.
245 December 5, 2000 ... Market Cap changes.
246 December 5, 2000 ... Market Cap changes.
247 July 27, 2000 ... Market Cap change.[202]
248 December 7, 1999 ... Market Cap change.[203]
[249 rows x 6 columns]
It appears the first dataframe df[0] contains the S&P 500 list, and the second dataframe df[1] is another table on that page. Also notice that at the end of the first dataframe, it says [505 rows x 9 columns]. So I always thought there are only 500 companies in S&P 500, but not really!
Always check what’s returned from the pd.read_html(),
a webpage might contain multiple tables so you’ll get a list of dataframe instead of a single dataframe!
That’s it for today. Going forward I’ll be using this dataframe of 505 companies to demonstrate other tricks in pandas.
One comment