Regular expressions in Python on Raspberry Pi
Regular expressions are a powerful tool to parse and validate text input
The file contents look disorganized and a bit overwhelming when viewed as a string, but if you download the HTML file and then open it in a web browser, you will discover that it is a very simple, interactive story.
'''Regular expressions in Python'''
import re
# Code to open the local file if you have downloaded it
with open("Good morning.html") as f:
file_text = f.read()
story_data = re.search('\
<tw-storydata.*?\
name="(?P<story_name>.*?)"\
.*</tw-storydata>', file_text)
print("Story name: '" + story_data.group("story_name") + "'")
While the regular expression above may look complicated, the goal is simple: we want to find and store the story name. Let's look at it step by step to see how it works:
- We start by searching for all of the story data. This is the text between the <tw-storydata> opening and closing HTML element
- To do this, we pass the re.search() method a string parameter that contains the regular expression to describe what we are seeking
- Start with a backslash character (\) to wrap to the next line to make the regular expression easier to read
- Add the <tw-storydata opening element that we are looking for
- The period character (.) indicates that we are looking for any character
- The asterisk character (*) modifies the search to look for any number of characters
- The question mark (?) modifies the search to be non-greedy, so the search will stop at the first instance of name=" encountered after the <tw-storydata opening element
- Another period + asterisk + question mark + closing angled bracket (.*?>) indicates a non-greedy search for a closing bracket
- The next step is to look for and capture the story name, so we search for name=" attribute then use the Python syntax for a named group syntax (?P<story_name>.*?)
- We round out the search with the closing element to validate that the structure conforms to our expectations.
Once the regular expression has run, we print out the story name by looking up the named group, which we called "story_name".
for passage in re.findall('\
<tw-passagedata.*?\
name="(.*?)"\
.*?>\
(.*?)\
</tw-passagedata>', file_text):
print(passage[0] + ": '" + passage[1] + "'")
The snippet of code finds and captures each passage within the story. Here's how it works:
- We loop through all the passages that are found match the pattern we define
- The <tw-passagedata> elements have a variety of attributes, but we only capture text within name="" by using capturing groups with the(.*?)
- We also capture the actual passage by doing a non-greedy search for the closing angle bracket .*?> then use another capturing group
- We then print the passage name and the actual passage text.
Ready to give it a try? Jump in and play with this example, complete with the code needed to import the HTML file from Google Drive.
Follow along with this video to help you see how to run this on a Raspberry Pi. Let us know if you have any questions by leaving a comment below.