13.13. BeautifulSoup with Requests
BeautifulSoup makes it easy to extract the data you need from an HTML or XML page.
We will use the requests library to get a response object from a URL, create a BeautifulSoup object from the HTML in the response, then print the first paragraph from the New York Times site.
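A minimal sketch of those steps follows. The helper names first_paragraph and first_paragraph_from_url, the 10-second timeout, and the sample HTML string are assumptions made for illustration; the sample string stands in for a downloaded page so the example runs without a network connection.

```python
import requests
from bs4 import BeautifulSoup


def first_paragraph(html):
    """Return the text of the first <p> tag in html, or None if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    para = soup.find("p")  # find() returns the first matching tag, or None
    if para is None:
        return None
    return para.get_text(strip=True)


def first_paragraph_from_url(url):
    """Download url with requests and return its first paragraph."""
    response = requests.get(url, timeout=10)  # timeout value is an assumption
    response.raise_for_status()  # stop on 4xx/5xx instead of parsing an error page
    return first_paragraph(response.text)


# Hypothetical sample page, used in place of a live download.
sample = "<html><body><h1>News</h1><p>Top story.</p><p>Second story.</p></body></html>"
print(first_paragraph(sample))
```

To try it on the live site, call first_paragraph_from_url("https://www.nytimes.com/") (an assumed address for the front page). The text returned depends on the site's markup at the time of the request.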
This will find and print the first paragraph from the New York Times site.
We can also print all of the URLs on a page.
Again, we will use the requests library to get a response object from a URL, create a BeautifulSoup object from the HTML in the response, get a list of all of the anchor (a) tags, then loop through the tags and extract the href attribute. Anchor tags are also known as link tags.
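One way to write such a program is sketched here. The function names extract_hrefs and print_links and the timeout value are assumptions for illustration. Keeping the parsing step in its own function means it can be tried on any HTML string, not only on a downloaded page.

```python
import requests
from bs4 import BeautifulSoup


def extract_hrefs(html):
    """Return the href of every <a> tag in html, with None for tags that lack one."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all() returns every matching tag in document order.
    # tag.get() returns None when the attribute is missing.
    return [tag.get("href") for tag in soup.find_all("a")]


def print_links(url):
    """Download url and print the href of each anchor tag on the page."""
    response = requests.get(url, timeout=10)  # timeout value is an assumption
    response.raise_for_status()
    for href in extract_hrefs(response.text):
        print(href)
```

Calling print_links("http://www.dr-chuck.com/page1.htm") runs it against a live page.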
The program reads the HTML page from “http://www.dr-chuck.com/page1.htm”, creates a BeautifulSoup object from the content of that HTML page, and gets a list of the ‘a’ tags. It then loops through the list and prints each tag’s ‘href’ attribute, or ‘None’ if the tag has no ‘href’ attribute.
You can also use BeautifulSoup to pull out various parts of each tag:
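The sketch below looks at a single tag. The sample markup, including the example.com address and the class attribute, is hypothetical and only stands in for a real page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used so the example runs without a network connection.
SAMPLE_HTML = (
    '<p>See the <a href="http://example.com/page2.htm" class="nav">Second Page</a>.</p>'
)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
tag = soup.find("a")  # first <a> tag, or None if the page has no links

if tag is not None:
    print("TAG:", tag)                # the whole element, as markup
    print("Name:", tag.name)          # the tag name, here "a"
    print("URL:", tag.get("href"))    # a single attribute, or None if absent
    print("Text:", tag.get_text())    # the text inside the tag
    print("Contents:", tag.contents)  # list of the tag's child nodes
    print("Attrs:", tag.attrs)        # dict of all attributes
```

Multi-valued attributes such as class come back as lists, so tag.attrs["class"] is ["nav"] here, not the string "nav".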
This will find the first ‘a’ tag and print the information for it.
The html.parser is the HTML parser that is included in the standard Python 3 library.
Information on other HTML parsers is available at:
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser
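As a hedged illustration of choosing a parser, the sketch below asks for a third-party parser by name and falls back to html.parser when that parser is not installed. The helper name make_soup and the choice of "lxml" as the preferred parser are assumptions. BeautifulSoup raises FeatureNotFound when it cannot find the requested parser.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html, preferred="lxml"):
    """Parse html with the preferred parser, falling back to html.parser."""
    try:
        return BeautifulSoup(html, preferred)
    except FeatureNotFound:
        # The preferred parser is not installed, so use the built-in one.
        return BeautifulSoup(html, "html.parser")


soup = make_soup("<p>Hello</p>")
print(soup.find("p").get_text())
```

Different parsers can build slightly different trees from the same invalid markup, so naming the parser explicitly keeps results consistent from one machine to another.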