13.13. Group Work on BeautifulSoup with Requests
This activity works best with a POGIL approach. In POGIL, students work in groups on activities and each member has an assigned role. For more information see https://cspogil.org/Home.
If you work in a group, have only one member of the group fill in the answers on this page. You will be able to share your answers with the group at the bottom of the page.
Students will know and be able to do the following.
Import the necessary libraries
Use requests to get the HTML from a URL
Create a soup object from the HTML
Use find_all to get data from a soup object
Use class_ to find data with a particular CSS class
Get the text of a tag using tagName.text
Get the value for an attribute from a tag
Put code in order.
Modify code to produce the correct output.
13.13.1. Getting a tag from a soup object
BeautifulSoup makes it easy to extract the data you need from an HTML or XML page. It creates a soup object that contains all the tags in the page. You can use find or find_all to find either the first tag of a given type or a list of all tags of that type.
We will use the requests library to get a response object from a URL, create a BeautifulSoup object from the HTML content in the response, use find to find the first paragraph tag, and then print the first paragraph tag.
These steps find and print the first paragraph tag from the Michigan Daily site. The page interprets the printed tag as HTML, so only the text of the tag is shown.
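The steps above can be sketched roughly as follows. This is a minimal illustration, not the page's original code: the helper name get_soup and the sample_html string are made up for this example, and the parsing steps run on the made-up HTML so they can be tried without a network connection.

```python
import requests
from bs4 import BeautifulSoup


def get_soup(url):
    """Hypothetical helper: download a page and parse it (needs network access)."""
    response = requests.get(url, timeout=10)  # response object from the URL
    response.raise_for_status()               # stop early on an HTTP error
    return BeautifulSoup(response.content, "html.parser")


# Made-up HTML standing in for a real page.
sample_html = "<html><body><h1>News</h1><p>First story</p><p>Second story</p></body></html>"
soup = BeautifulSoup(sample_html, "html.parser")

first_p = soup.find("p")  # first <p> tag, or None if there is no <p> tag
print(first_p)            # <p>First story</p>
```

Passing a site's URL to get_soup and then calling find("p") on the result would follow the same steps against a live page. In a plain Python console, print shows the tag with its markup rather than just its text.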
html.parser is the HTML parser that is included in the standard Python 3 library.
It is used to parse the HTML into a tree of tags.
Information on other HTML parsers is available at:
Put the following blocks in order to print the second paragraph from the Michigan Daily website. The solution uses the find_all method on the BeautifulSoup object to get a list of all of the paragraphs.
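As a hedged illustration of how find_all behaves, the sketch below runs it on a small made-up HTML string (sample_html is an assumption for this example, not content from the Michigan Daily site). The result can be indexed like a list, so index 1 is the second match.

```python
from bs4 import BeautifulSoup

# Made-up HTML with three paragraphs.
sample_html = "<p>First story</p><p>Second story</p><p>Third story</p>"
soup = BeautifulSoup(sample_html, "html.parser")

paragraphs = soup.find_all("p")  # list-like collection of every <p> tag
print(len(paragraphs))           # 3
print(paragraphs[1])             # <p>Second story</p>
```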
13.13.2. Getting text from a tag
Some tags, such as a paragraph tag or a span tag, contain text. You can use tagName.text to get that text.
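A short sketch of the difference between a tag and its text, using a made-up HTML string (sample_html and p_tag are names chosen for this example). The text of a tag includes the text of any tags nested inside it.

```python
from bs4 import BeautifulSoup

# Made-up HTML: a paragraph with a span nested inside it.
sample_html = "<p>Read the <span>latest</span> news</p>"
soup = BeautifulSoup(sample_html, "html.parser")

p_tag = soup.find("p")
print(p_tag)       # the whole tag, markup included
print(p_tag.text)  # Read the latest news
```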
You can also find a tag with a particular CSS class.
This will print the text for the site description paragraph.
When you specify a CSS class you must use class_ as the keyword argument. This is because class is already a keyword that is used to define a new class in Python.
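The class_ keyword can be tried on a small made-up page, as in the sketch below. The class names intro and site-description and the paragraph text are assumptions for illustration and are not taken from the real site.

```python
from bs4 import BeautifulSoup

# Made-up HTML with two paragraphs that differ only by CSS class.
sample_html = """
<p class="intro">Welcome</p>
<p class="site-description">A made-up description</p>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# Without class_, find returns the first <p> tag regardless of its class.
print(soup.find("p").text)  # Welcome

# With class_, find returns the first <p> tag that has that CSS class.
desc = soup.find("p", class_="site-description")
print(desc.text)            # A made-up description
```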
Put the following blocks in order to print the text for the span tag that is a child of an h3 tag with a class of css-1pjbq1w.