Get table data from web page using Python pandas

Sharing is caring!

Last Updated on July 14, 2022 by Jay

“I thought this blog talks about integrate Python with Excel stuff, why getting data from web???” Don’t be surprised when you see the title of this post. Nowadays, people are connected to the Internet anytime and anywhere. The Internet is probably the largest public database out there, learning how to get data from the Internet is essential. That’s why I want to talk about how to get table data from web page using Python and the pandas library. Also if you are already using Excel PowerQuery, this is equivalent to the “Get Data From Web”, but 100x more powerful.

This article is part of the “Integrate Python with Excel” series, you can find the table of content here for easier navigation through the series.

Get data from a website (web scraping)

HTML is the language behind every website. When we visit a website, what happens under the hood is like the following: 1. We type in an address (URL) into the browser’s address bar, the browser sends a request to the target website’s server. 2. Server receives the request and sends back the HTML code that composes the webpage. 3. Our browser receives the HTML code, run it on the fly and create a webpage for us to view.

So now you see, we humans see the beautiful web pages, but the machines only see code. And that is a good thing because code is easier to digest programmatically! Web scraping basically means that, instead of using a browser, we can use Python to send request to a website server, receive the HTML code, then extract the data we want.

We won’t cover too much HTML since this is not a web design tutorial, but I want to introduce the essentials so we have a basic understanding of how web sites & web scraping work. HTML elements, or “HTML tags” are certain keywords wrapped around with <>. For example, the following HTML code is the title of this web page. Hover your mouse over this tab and you should see the same title on the browser. Notice that most of HTML elements require an opening tag (e,g, <title>), and a corresponding closing tag (e.g. </title>)

<title>Get table data from web page using Python pandas (equivalent to PowerQuery get data from web) - Python In Office</title>
An example of web page title

Similarly, the following code will draw a table on your browser, you can try by copy & paste it into a notepad, then save it as a “table-example.html” file. You should be able to open it inside a browser. Some quick notes:

  • <table>...</table> draws the table
  • <tr>...</tr> draws a row inside the table
  • <th>...</th> indicates table header
  • <td>...</td> indicates table data
<html>
	<table>
		<tr>
			<th>User Name</th>
			<th>Country</th>
			<th>City</th>
			<th>Gender</th>
			<th>Age</th>
		</tr>

		<tr>
			<td>Forrest Gump</td>
			<td>USA</td>
			<td>New York</td>
			<td>M</td>
			<td>50</td>
		</tr>

		<tr>
			<td>Mary Jane</td>
			<td>CANADA</td>
			<td>Toronto</td>
			<td>F</td>
			<td>30</td>
		</tr>
	</table>
</html>

Requirements for using pandas for web scraping

Now you understand the basic building blocks of a website and how to interpret HTML (well, at least the table part!). The reason I only covered HTML table is because, most of the time when we try to get data from a website, it’s in a table format. And pandas is the perfect tool for getting the table format data from a website!

So the only requirement to use pandas to get data from a website is that the data has to be store inside a table, or, in HTML terms, within the <table>...</table> tags. pandas will be able to extract the table, headers and data rows using those HTML tags we covered just now.

If you try to use pandas to “extract data” from a webpage that doesn’t contain any table (<table>...</table> tags), you won’t be able to get any data. For those data not stored in a table, we need other ways to scrape the website.

Web Scraping in action

Our previous examples were mostly small tables with a few data points, let’s use something slightly bigger and more data to play with.

We are going to get the S&P 500 company names and stock symbols from wikipeida: https://en.wikipedia.org/wiki/List_of_S%26P_500_companies

>>> import pandas as pd
>>> df = pd.read_html('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies')
>>> type(df)
<class 'list'>
>>> len(df)
2

Our df above actually is a list, that’s interesting…and seems there are 2 items in that list. Let’s see what data pandas grabbed for us…

>>> df[0]
    Symbol                Security  ...      CIK      Founded
0      MMM              3M Company  ...    66740         1902
1      ABT     Abbott Laboratories  ...     1800         1888
2     ABBV             AbbVie Inc.  ...  1551152  2013 (1888)
3     ABMD             ABIOMED Inc  ...   815094         1981
4      ACN           Accenture plc  ...  1467373         1989
..     ...                     ...  ...      ...          ...
500    YUM         Yum! Brands Inc  ...  1041061         1997
501   ZBRA      Zebra Technologies  ...   877212         1969
502    ZBH  Zimmer Biomet Holdings  ...  1136869         1927
503   ZION           Zions Bancorp  ...   109380         1873
504    ZTS                  Zoetis  ...  1555280         1952

[505 rows x 9 columns]

>>> df[1]
                 Date  ...                            Reason
                 Date  ...                            Reason
0       June 22, 2020  ...  Market capitalization change.[6]
1       June 22, 2020  ...  Market capitalization change.[6]
2       June 22, 2020  ...  Market capitalization change.[6]
3        May 22, 2020  ...  Market capitalization change.[7]
4        May 12, 2020  ...  Market capitalization change.[8]
..                ...  ...                               ...
244  December 5, 2000  ...               Market Cap changes.
245  December 5, 2000  ...               Market Cap changes.
246  December 5, 2000  ...               Market Cap changes.
247     July 27, 2000  ...           Market Cap change.[202]
248  December 7, 1999  ...           Market Cap change.[203]

[249 rows x 6 columns]

It appears the first dataframe df[0] contains the S&P 500 list, and the second dataframe df[1] is another table on that page. Also notice that at the end of the first dataframe, it says [505 rows x 9 columns]. So I always thought there are only 500 companies in S&P 500, but not really!

Always check what’s returned from the pd.read_html(), a webpage might contain multiple tables so you’ll get a list of dataframe instead of a single dataframe!

That’s it for today. Going forward I’ll be using this dataframe of 505 companies to demonstrate other tricks in pandas.

One comment

Leave a Reply

Your email address will not be published. Required fields are marked *