Extract/Scrape financial data from websites using Excel & Python

Today I want to show you how to integrate Excel and Python so that we can call Python functions within Excel spreadsheets. This kind of integration is powerful because it gives us the best of both worlds – Excel’s simplicity and the power of Python! We will demonstrate this by building a stock tracker that extracts/scrapes financial data from websites using Excel & Python.

By the end of this tutorial, you will learn how to:

  1. Extract/scrape data from any website
  2. Call Python functions within a spreadsheet, using user-defined formulas in Excel

Part 1 – Web Scraping with Python

There are many ways to get financial data from the Internet; the easiest is through an API, but we’ll leave that for another tutorial. Today we’ll scrape stock data from the Yahoo Finance website using BeautifulSoup and requests. Once you learn this technique, you’ll be able to scrape data from almost any website.

A word of caution about scraping websites: be mindful of the target website’s bandwidth limitations, and don’t flood it with, say, thousands of requests per second. That would amount to a denial-of-service (DoS) attack, which is a malicious act.
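As a sketch of what “polite” scraping can look like, requests can be spaced out with a small delay between calls. The helper below (fetch_politely and its delay value are my own illustration, not part of the tutorial’s code) takes any fetch function and inserts a pause between consecutive requests:

```python
import time

def fetch_politely(urls, fetch, delay=1.0):
    """Call fetch(url) for each URL, pausing `delay` seconds between requests."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay)  # leave a gap so we never hammer the server
        results.append(fetch(url))
    return results
```

In practice, fetch would be something like `lambda u: requests.get(u)`, and a delay of a second or more is a reasonable default for a hobby scraper.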

Now back to scraping, I’m using Chrome for this tutorial, but you can use any internet browser.

Let’s find Apple’s stock information on Yahoo Finance. Here is the URL: https://finance.yahoo.com/quote/AAPL, which looks like this:

Apple stock information on Yahoo Finance

First, we want to get its price: $262.47. Select this number on Yahoo Finance’s website, right-click, then choose “Inspect”. This will bring up the Chrome developer tools, which reveal the underlying HTML code of the page we are viewing. A little bit of HTML knowledge helps a lot here, because all the data we are trying to find is in HTML – we just need to know where to look.

HTML code in the Chrome DevTool

What we are interested in is a div tag with a name that is as unique as possible, since a unique value helps narrow down the choices. I’ve settled on the <div id="quote-header-info"…> line, but feel free to try other tags. The key is that we need a tag (HTML code block) that includes the data we are trying to extract, and we can see that the price is within our selected div tag. If you want to try other tags, the one highlighted in yellow should also work: <div id="Lead-3-QuoteHeader-Proxy">.

Installing Python libraries

We need a few Python libraries for this tutorial. If you don’t have them already, use the following command line input to install them (lxml is the parser BeautifulSoup will use below).

pip install requests
pip install bs4
pip install lxml
pip install xlwings

Getting the underlying HTML document

We can use the requests library to get the entire HTML document of the page with 2 lines of code:

import requests
r = requests.get('https://finance.yahoo.com/quote/AAPL')

The requests library allows us to easily send HTTP requests to any server. The get() method returns a Response object; a status code of 200 means OK, which indicates that we made a request to the server and successfully received data back.

Response object 200

The Response object has a .content attribute, which holds the raw bytes of the response body (use .text for the decoded string). In this case, it’s the HTML code of the underlying website – Yahoo Finance. This is a huge chunk of text and we really don’t want to print it on the screen – it could hang your Python IDE. There’s no practical way to pick data out of such a large blob by hand, so we need some help…

Text length of the Response object

Pulling data from the HTML document

Since we care only about the information we are trying to scrape – namely stock price, volume, etc. – we can use BeautifulSoup, a Python library for pulling data out of HTML files.

import bs4 as bs
soup = bs.BeautifulSoup(r.content,'lxml',from_encoding='utf-8')
price = soup.find_all('div',attrs={'id':'quote-header-info'})

The soup.find_all() method returns all the HTML code blocks that match the arguments inside the parentheses. In our case, there’s only one: the code block <div id="quote-header-info"...>. Thanks to the unique tag value we picked earlier! Note this is not the only solution, so feel free to try other div tags. The key to remember is that you want a code block that includes the price.
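If you want to experiment with find_all() offline, the same pattern works on a simplified, made-up snippet of the page (the real Yahoo Finance markup is far larger than this stand-in):

```python
import bs4 as bs

# A made-up, simplified stand-in for the Yahoo Finance markup -- not the real page.
html = '''
<div id="quote-header-info">
  <span data-reactid="13">Apple Inc. (AAPL)</span>
  <span data-reactid="14">262.47</span>
</div>
'''

soup = bs.BeautifulSoup(html, 'html.parser')
blocks = soup.find_all('div', attrs={'id': 'quote-header-info'})
print(len(blocks))  # -> 1, because the id is unique on the page
print(blocks[0].find(attrs={'data-reactid': '14'}).get_text())  # -> 262.47
```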

HTML scraped in Python

The above screenshot shows the entire div block with id="quote-header-info". The price is within this block (green box). The price object is a list-like object containing one item, so we can access the actual div block using price[0], since Python indexing starts at 0. We also want to extract just the price from this jumbled block of text. Note that the <span> tag containing the price has an attribute data-reactid="14"; we’ll take advantage of that.

price = soup.find_all('div',attrs={'id':'quote-header-info'})[0].find(attrs={'data-reactid':'14'})
Div that contains just the stock price

With a little assistance from the helper method .get_text(), we have just extracted the current Apple stock price. Note that this value is a string.

price = soup.find_all('div',attrs={'id':'quote-header-info'})[0].find(attrs={'data-reactid':'14'}).get_text()
Stock price as a string
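Since .get_text() hands back a string, it needs converting before you can do any arithmetic with it. A small helper like this (parse_price is my own name, not from the tutorial’s code) also copes with the thousands separators Yahoo Finance uses for large numbers:

```python
def parse_price(text):
    """Convert a scraped price string (which may contain thousands separators) to a float."""
    return float(text.replace(',', ''))

print(parse_price('262.47'))    # -> 262.47
print(parse_price('1,234.56'))  # -> 1234.56
```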

Let’s try to scrape a few other pieces of information from the same website. For a stock tracker, I’m also interested in Apple’s trading volume and the next earnings announcement date. The technique is the same:

  1. On the webpage, select the data of interest, right click -> Inspect.
  2. Look for the block of HTML code that contains the data of interest. Pick one with a unique id or class.
  3. Use BeautifulSoup‘s engine to find the element that contains the data of interest. Then extract the text value.
volume = soup.find('td',attrs={'data-test':'TD_VOLUME-value'}).get_text()
er_dt = soup.find('td',attrs={'data-test':'EARNINGS_DATE-value'}).get_text()
Volume and earning date for APPLE stock

Good job – the first part is done! Next, let’s look at how to bring the data into an Excel spreadsheet seamlessly with an Excel formula!

Putting it Together

I’m posting the full version of the code below, so feel free to grab it here or from GitHub. It is more complicated than the example we walked through, but the core concept is the same. Note that I place all the code inside a function named get_stock(), which returns a list of the data points we’d like to scrape. The return value uses a list comprehension, which is essentially a Pythonic way to write a for loop in one line – check out this tutorial if you want to learn more about it.

import xlwings as xw
import bs4 as bs
import requests


@xw.func
def get_stock(ticker, args=()):
    url_base = 'https://ca.finance.yahoo.com/quote/'
    src_base = requests.get(url_base+ticker).text
    src_profile = requests.get(url_base+ticker+r'/profile?p=' + ticker).text
    soup = bs.BeautifulSoup(src_base,'lxml')
    soup_profile = bs.BeautifulSoup(src_profile, 'lxml')
    

    try:
        div_info = soup.find('td',attrs={'data-test':'DIVIDEND_AND_YIELD-value'}).get_text()
        div_amt = div_info.split(' ')[0]
        div_pct = div_info.split(' ')[1].split('(')[1].split(')')[0]
    except AttributeError:
        div_info = soup.find('td',attrs={'data-test':'TD_YIELD-value'}).get_text()
        div_amt = 'N/A'
        div_pct = div_info

    try:
        s_eps = soup.find('td',attrs={'data-test':'EPS_RATIO-value'}).get_text()
    except AttributeError:
        s_eps = 'N/A'


    try:
        er_dt = soup.find('td',attrs={'data-test':'EARNINGS_DATE-value'}).get_text()
    except AttributeError:
        er_dt = 'N/A'

    try:
        ex_div_dt = soup.find('td',attrs={'data-test':'EX_DIVIDEND_DATE-value'}).get_text()
    except AttributeError:
        ex_div_dt = 'N/A'

    try:
        mkt_cap = soup.find('td', attrs={'data-test' : 'MARKET_CAP-value'}).get_text()
    except AttributeError:
        mkt_cap = soup.find('td', attrs={'data-test' : 'NET_ASSETS-value'}).get_text()

        
    s_attrs = {'Co. Name' : soup.find_all('h1')[0].get_text().split(' - ')[1],
               'Price' : soup.find_all('div',attrs={'id':'quote-header-info'})[0].find(attrs={'data-reactid':'14'}).get_text(),
               'Volume' : soup.find('td',attrs={'data-test':'TD_VOLUME-value'}).get_text(),
               'Trailing PE' : soup.find('td',attrs={'data-test':'PE_RATIO-value'}).get_text(),
               'Trailing EPS' : s_eps,
               'Earning Date' : er_dt,
               'Dividend Amount' : div_amt,
               'Dividend %' : div_pct,
               'Ex Dividend Date' : ex_div_dt,
               'Market Cap' : mkt_cap,
               'Sector' : soup_profile.find('span', attrs ={'data-reactid' : '21'}).get_text(),
               }

    return [s_attrs[attr] for attr in args if attr in s_attrs]
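The repeated try/except blocks in get_stock() could optionally be folded into one small helper. The sketch below (safe_text is my own name, and the HTML snippet is made up) shows the same ‘N/A’ fallback and the same list-comprehension return, runnable without a network connection:

```python
import bs4 as bs

def safe_text(soup, tag, attrs):
    """Return the matching element's text, or 'N/A' if the tag is missing from the page."""
    el = soup.find(tag, attrs=attrs)
    return el.get_text() if el is not None else 'N/A'

# Made-up snippet standing in for a downloaded quote page.
html = '<table><tr><td data-test="TD_VOLUME-value">23,456,789</td></tr></table>'
soup = bs.BeautifulSoup(html, 'html.parser')

s_attrs = {
    'Volume': safe_text(soup, 'td', {'data-test': 'TD_VOLUME-value'}),
    'Earning Date': safe_text(soup, 'td', {'data-test': 'EARNINGS_DATE-value'}),
}

# Same pattern as get_stock()'s return: keep only the requested fields, in order.
print([s_attrs[a] for a in ['Volume', 'Earning Date'] if a in s_attrs])
# -> ['23,456,789', 'N/A']
```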

Part 2 – Calling Python functions in Excel

Let me introduce another excellent tool – xlwings, a Python library that allows us to leverage the power of Python from within Excel. With it, you can automate Excel spreadsheets, write macros in Python, or write user-defined functions (UDFs).

Here, we focus only on how to create user-defined functions in Python and use them in Excel. Check out this tutorial if you need help with the xlwings setup, or if you are interested in automating Excel or writing macros in Python.

Setting up the Excel file

  1. In Excel, open File -> Options -> Trust Center -> Trust Center Settings -> Macro Settings, and check the box Trust access to the VBA project object model.
  2. Install the xlwings Excel add-in: download the xlwings.xlam file from https://github.com/xlwings/xlwings/releases
  3. Double-click the downloaded xlwings.xlam file to install it. In Excel, open Developer -> Add-ins and make sure the xlwings add-in is properly installed.
     xlwings-Excel-addin
  4. Open the VBA editor (Developer -> Visual Basic or press Alt + F11).
  5. Check the xlwings box under Tools -> References. Save and close the VBA editor.
     VBA-editor-xlwings
  6. Now you should see an xlwings tab appear in Excel.
  7. Save this Excel file on your computer. I’m naming my file “Tracker.xlsb”.

Setting up the Python script

The setup in Python is a lot easier than what we just did in Excel. Since we are creating user-defined functions (UDFs), we need a Python function that returns some data – which we wrote in Part 1 of the tutorial. Then follow the steps below:

  1. Inside the Python code, right above the Python function, add the decorator line @xw.func. This decorator allows you to call the Python function from Excel.
  2. Place this Python script in the same directory as the Excel file that we just saved.
  3. Name the Python script with the same name as the Excel file. The Excel file can be in either .xlsm or .xlsb format.
     xlwings-folder-setup

Now the setup is complete. The last step is to load the Python function into Excel, which we do by clicking Import Functions in the xlwings tab in Excel. Remember: every time we change the Python code, we need to re-import it here.

xlwings tab in Excel

It’s testing (and fun) time!

Our user defined function get_stock() can return multiple data points in Excel by using an array formula. If you don’t know how to enter an array formula, here’s what you need to do to create one, using the below screenshot as an example.

  1. Select a range in Excel (B4:L4)
  2. Enter the formula inside the formula bar (=get_stock(A4,$B$1:$L$1))
  3. Simultaneously press Ctrl + Shift + Enter

To recap, you now know how to:

  1. Scrape websites with Python using requests and BeautifulSoup
  2. Use xlwings to create user-defined functions (UDFs) in Python and call them within Excel

Enjoy your new stock tracker spreadsheet!