13.11. BeautifulSoup with Requests

BeautifulSoup makes it easy to extract the data you need from an HTML or XML page. You can download and install the BeautifulSoup library from:

https://pypi.python.org/pypi/beautifulsoup4

Information on installing BeautifulSoup with the Python Package Index tool pip is available at:

https://packaging.python.org/tutorials/installing-packages/

We will use the requests library to get a response object from a URL, create a BeautifulSoup object from the HTML in the response, then extract the href attributes from the anchor (a) tags. Anchor tags are also known as link tags.

This will find all of the ‘a’ tags and print the href for each of them.

The program reads the HTML page from “http://www.dr-chuck.com/page1.htm”, creates a BeautifulSoup object from the content of that page, and gets a list of the ‘a’ tags. It then loops through that list and prints the ‘href’ attribute of each tag, or ‘None’ if the tag has no ‘href’ attribute.
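The steps just described can be sketched roughly as follows. This is a minimal sketch, assuming the requests and beautifulsoup4 packages are installed; the function names extract_hrefs and print_hrefs are illustrative and do not come from the original program.

```python
import requests
from bs4 import BeautifulSoup


def extract_hrefs(html):
    """Return the href of every 'a' tag in html (None if a tag has no href)."""
    soup = BeautifulSoup(html, "html.parser")
    tags = soup.find_all("a")          # list of all anchor tags
    return [tag.get("href", None) for tag in tags]


def print_hrefs(url):
    """Download the page at url and print the href of each anchor tag."""
    resp = requests.get(url)           # response object for the URL
    for href in extract_hrefs(resp.text):
        print(href)
```

Calling print_hrefs("http://www.dr-chuck.com/page1.htm") would then print one line per link on that page. Keeping the parsing in its own function makes it easy to try on a small HTML string without a network connection.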

You can also use BeautifulSoup to pull out various parts of each tag:

This will find the first ‘a’ tag and print the information for it.
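As a hedged illustration, the sketch below pulls several parts out of the first ‘a’ tag. It parses a small made-up HTML string (an assumption, so that it runs without a network connection) rather than the page used in this section.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration; any page with an 'a' tag would do.
html = '<p>Go to <a href="page2.htm" id="next">the second page</a>.</p>'
soup = BeautifulSoup(html, "html.parser")

tag = soup.find("a")                       # first 'a' tag (None if there is none)
print("TAG:", tag)                         # the whole tag, including its text
print("URL:", tag.get("href", None))       # value of the href attribute
print("Contents:", tag.contents[0])        # first child, here the link text
print("Attrs:", tag.attrs)                 # dictionary of all attributes
```

Because find returns None when nothing matches, real code should check the result before using it.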

The html.parser parser is the HTML parser that is included in the Python 3 standard library. Information on other HTML parsers is available at:

http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser

13.11.1. How to Find Tags

Note

Use find to get the first tag that meets some criteria and find_all to get a list of all tags that meet some criteria.
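The difference between the two methods can be seen in a small sketch. The HTML string here is made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up HTML with two links, for illustration only.
html = '<ul><li><a href="a.htm">A</a></li><li><a href="b.htm">B</a></li></ul>'
soup = BeautifulSoup(html, "html.parser")

first = soup.find("a")           # a single tag: the first match
every = soup.find_all("a")       # a list of tags: all matches

print(first["href"])
print(len(every))

# When nothing matches, find returns None and find_all returns an empty list.
missing_one = soup.find("table")
missing_all = soup.find_all("table")
```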

You will typically first inspect a webpage to determine how to find what you are looking for in the page. You can do that with the developer tools in the Chrome browser. Click on the three dots at the top right of the browser window, then “More Tools”, and then “Developer Tools”. You can also just right-click on the part of the webpage you are interested in and then click on “Inspect”.

Inspecting part of a webpage in the Chrome browser.

You will see the HTML source for the thing you inspected.

Inspecting part of a webpage in the Chrome browser.

You can use this information to find a parent tag, such as the “div” tag that contains the “li” (list item) for each “a” tag in the mini navigation bar on the New York Times webpage. You can then use find to get the “div” tag with the “css-1d8a290” class and use find_all to get all the “a” tags inside that “div” tag.

Note

You must use class_ when looking for a tag with a particular class.

This will print the “href” for all the links in the mini nav header for the New York Times page.
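The approach can be sketched on a small made-up HTML string that imitates the structure described above. The sample markup and the helper name links_in_div are assumptions for illustration; the markup and class names on the live New York Times page may differ and can change at any time.

```python
from bs4 import BeautifulSoup

# Made-up HTML that imitates a div wrapping a list of navigation links.
SAMPLE_HTML = """
<div class="css-1d8a290">
  <ul>
    <li><a href="/section/world">World</a></li>
    <li><a href="/section/us">U.S.</a></li>
  </ul>
</div>
<div class="footer"><a href="/help">Help</a></div>
"""


def links_in_div(html, css_class):
    """Return the hrefs of all 'a' tags inside the first div with css_class."""
    soup = BeautifulSoup(html, "html.parser")
    # class is a reserved word in Python, so the keyword argument is class_
    div = soup.find("div", class_=css_class)
    if div is None:                      # no div with that class on the page
        return []
    return [a.get("href") for a in div.find_all("a")]


for href in links_in_div(SAMPLE_HTML, "css-1d8a290"):
    print(href)
```

To try this on a live page, pass the text of a requests response as the html argument and use the class name you found when inspecting the page. Searching inside the “div” rather than the whole page is what keeps the footer link out of the result.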
