12.3. Extracting Data Using Regular Expressions
If we want to extract data from a string in Python we can use the
findall() method to extract all of the substrings which
match a regular expression. Let’s use the example of wanting to extract
anything that looks like an email address from any line regardless of
format. For example, we want to pull the email addresses from each of
the following lines:
    From firstname.lastname@example.org Sat Jan 5 09:14:16 2008
    Return-Path: <email@example.com>
    for <firstname.lastname@example.org>;
    Received: (from apache@localhost)
    Author: email@example.com
We don’t want to write code for each of the types of lines, splitting
and slicing differently for each line. The following program uses
findall() to find the lines with email addresses in them
and extract one or more addresses from each of those lines.
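As a minimal sketch of such a program, the snippet below runs findall() on a single sample string. The string s is made up for illustration and reuses the placeholder addresses from the lines above. The pattern is written as a raw string (r'...') so that Python passes the backslash in \S through to the regular expression engine unchanged.

```python
import re

# Hypothetical sample text for illustration. Note that "@2PM" has
# nothing but a space in front of the at-sign.
s = 'A message from firstname.lastname@example.org to email@example.com about meeting @2PM'

# \S+@\S+ : one or more non-whitespace characters, an at-sign,
# then one or more non-whitespace characters.
lst = re.findall(r'\S+@\S+', s)
print(lst)
```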
The findall() method searches the string in the second
argument and returns a list of all of the strings that look like email
addresses. We are using a two-character sequence that matches a
non-whitespace character (\S).
The output of the program is:
    ['firstname.lastname@example.org', 'email@example.com']
Translating the regular expression, we are looking for substrings that
have at least one non-whitespace character, followed by an at-sign,
followed by at least one more non-whitespace character. The
\S+ matches as many non-whitespace characters as possible.
The regular expression would match twice (firstname.lastname@example.org and email@example.com), but it would not match the string “@2PM” because there are no non-blank characters before the at-sign. We can use this regular expression in a program to read all the lines in a file and print out anything that looks like an email address as follows:
    From firstname.lastname@example.org Sat Jan 5 09:14:16 2008
    Return-Path:
    Received: from murder (mail.umich.edu [184.108.40.206])
        by frankenstein.mail.umich.edu (Cyrus v2.3.8) with LMTPA;
        Sat, 05 Jan 2008 09:14:16 -0500
    X-Sieve: CMU Sieve 2.3
    Received: from murder ([unix socket])
        by mail.umich.edu (Cyrus v2.2.12) with LMTPA;
        Sat, 05 Jan 2008 09:14:16 -0500
    Received: from holes.mr.itd.umich.edu (holes.mr.itd.umich.edu [220.127.116.11])
        by flawless.mail.umich.edu () with ESMTP id m05EEFR1013674;
        Sat, 5 Jan 2008 09:14:15 -0500
    Received: FROM paploo.uhi.ac.uk (app1.prod.collab.uhi.ac.uk [18.104.22.168])
        BY holes.mr.itd.umich.edu ID 477F90B0.2DB2F.12494 ;
        5 Jan 2008 09:14:10 -0500
    Received: from paploo.uhi.ac.uk (localhost [127.0.0.1])
        by paploo.uhi.ac.uk (Postfix) with ESMTP id 5F919BC2F2;
        Sat, 5 Jan 2008 14:10:05 +0000 (GMT)
    Message-ID: <200801051412.m05ECIaH010327@nakamura.uits.iupui.edu>
    Mime-Version: 1.0
    Content-Transfer-Encoding: 7bit
    Received: from prod.collab.uhi.ac.uk ([22.214.171.124])
        by paploo.uhi.ac.uk (JAMES SMTP Server 2.1.3) with SMTP ID 899
        for
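A minimal sketch of the line-by-line program described above follows. So that it runs on its own, it reads a few sample header lines from an in-memory io.StringIO object rather than from a file on disk. A file handle returned by open() can be looped over in the same way. The sample lines and the found list are additions for illustration only.

```python
import io
import re

# Stand-in for an open file: a few header lines held in memory.
# (Assumption for illustration; a handle from open() behaves the same in the loop.)
hand = io.StringIO(
    'From firstname.lastname@example.org Sat Jan 5 09:14:16 2008\n'
    'X-Sieve: CMU Sieve 2.3\n'
    'Received: (from apache@localhost)\n'
    'Author: email@example.com\n'
)

found = []  # kept only so the results can be inspected after the loop
for line in hand:
    line = line.rstrip()
    x = re.findall(r'\S+@\S+', line)
    # findall() returns a list; an empty list means nothing matched on this line.
    if len(x) > 0:
        print(x)
        found.append(x)
```

Note that the match on the "Received:" line keeps the closing parenthesis, because ")" is a non-whitespace character.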
This code searches for lines that have an at-sign (@) between characters.
We read each line and then extract all the substrings that match our
regular expression. Since
findall() returns a list, we
simply check if the number of elements in our returned list is more than
zero to print only lines where we found at least one substring that
looks like an email address.
Some of our email addresses have incorrect characters like “<” or “;” at the beginning or end. Let’s declare that we are only interested in the portion of the string that starts and ends with a letter or a number.
To do this, we use another feature of regular expressions. Square
brackets are used to indicate a set of multiple acceptable characters we
are willing to consider matching. In a sense, the \S is
asking to match the set of “non-whitespace characters”. Now we will be a
little more explicit in terms of the characters we will match.
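To make the square-bracket notation concrete, here is a small sketch. The sample text is invented for illustration and is not part of the mail data.

```python
import re

# Invented sample text for illustration.
text = 'Room B12, floor 3'

# [0-9] matches any single digit.
digits = re.findall(r'[0-9]', text)

# [a-zA-Z0-9]+ matches runs of letters and digits, so the comma
# and the spaces act as boundaries between matches.
words = re.findall(r'[a-zA-Z0-9]+', text)

print(digits)
print(words)
```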
Here is our new regular expression:
    [a-zA-Z0-9]\S*@\S*[a-zA-Z]
This is getting a little complicated and you can begin to see why
regular expressions are their own little language unto themselves.
Translating this regular expression, we are looking for substrings that
start with a single lowercase letter, uppercase letter,
or number “[a-zA-Z0-9]”, followed by zero or more non-blank characters
(\S*), followed by an at-sign, followed by zero or more
non-blank characters (\S*), followed by an uppercase or
lowercase letter. Note that we switched from + to * to indicate
zero or more non-blank characters since [a-zA-Z0-9] is already one
non-blank character. Remember that the * or + applies to the single
character (or bracketed set) immediately to the left of the plus or asterisk.
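The switch from + to * can be checked with a short sketch. It applies the expression described above, written out as [a-zA-Z0-9]\S*@\S*[a-zA-Z], to the shortest address-like string possible, and compares it with a variant that keeps the + signs. The variant and the test string are illustrations, not part of the original program.

```python
import re

# Shortest possible address-like string: one character on each side of the at-sign.
short = 'a@b'

# With *, the bracketed sets supply the required characters and \S* may match nothing.
with_star = re.findall(r'[a-zA-Z0-9]\S*@\S*[a-zA-Z]', short)

# With +, each side would need at least two characters, so 'a@b' is rejected.
with_plus = re.findall(r'[a-zA-Z0-9]\S+@\S+[a-zA-Z]', short)

print(with_star)
print(with_plus)
```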
If we use this expression in our program, our data is much cleaner:
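A sketch of the revised program is shown below, assuming the expression [a-zA-Z0-9]\S*@\S*[a-zA-Z] as described above. As an illustration it reads the five sample lines from the start of this section out of an in-memory io.StringIO object instead of a file; the found list is added only to make the results easy to inspect.

```python
import io
import re

# Stand-in for an open file, using the sample lines from the start of the section.
hand = io.StringIO(
    'From firstname.lastname@example.org Sat Jan 5 09:14:16 2008\n'
    'Return-Path: <email@example.com>\n'
    'for <firstname.lastname@example.org>;\n'
    'Received: (from apache@localhost)\n'
    'Author: email@example.com\n'
)

found = []  # kept only so the results can be inspected after the loop
for line in hand:
    line = line.rstrip()
    # Each match must start with a letter or digit and end with a letter.
    x = re.findall(r'[a-zA-Z0-9]\S*@\S*[a-zA-Z]', line)
    if len(x) > 0:
        print(x)
        found.append(x)
```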
This code searches for lines that have an at-sign (@) between letter or number characters.
Notice that on the email@example.com lines, our regular
expression eliminated two characters at the end of the string (“>;”).
This is because when we append
[a-zA-Z] to the end of our regular
expression, we are demanding that whatever string the regular expression
parser finds must end with a letter. So when it sees the “>” at the end of
“sakaiproject.org>;” it simply stops at the last “matching” letter it
found (i.e., the “g” was the last good match).
Also note that each line of output is a Python list whose single element is a string.
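The trimming effect can be seen by applying both expressions to the same line. This is a small sketch using one of the sample lines from the start of the section; the variable names loose and strict are chosen here for illustration.

```python
import re

line = 'for <firstname.lastname@example.org>;'

# \S+@\S+ accepts any non-whitespace characters, including "<", ">" and ";".
loose = re.findall(r'\S+@\S+', line)

# The bracketed sets force the match to start with a letter or digit
# and to end with a letter, so the surrounding punctuation is left out.
strict = re.findall(r'[a-zA-Z0-9]\S*@\S*[a-zA-Z]', line)

print(loose)
print(strict)
```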