Python Programming/RegEx
This lesson introduces Python regular expression processing.
Objectives and Skills
Objectives and skills for this lesson include:
- Standard Library
- Regular expression operations
Readings
Multimedia
Examples
The match() Method
The match() method looks for zero or more characters at the beginning of the given string that match the given regular expression and returns a match object if found, or None if there is no match.[1]
import re

string = "<p>HTML text.</p>"
# match() succeeds only if the pattern matches at the start of the string.
match = re.match("<p>.*</p>", string)
if match:
    print("start:", match.start(0))
    print("end:", match.end(0))
    print("group:", match.group(0))
Output:
start: 0
end: 17
group: <p>HTML text.</p>
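As a small sketch of the None case described above (the second sample string is made up for illustration), match() fails as soon as the pattern does not begin at position 0:

```python
import re

# Hypothetical sample strings, chosen only for illustration.
at_start = "<p>HTML text.</p>"
not_at_start = "Intro <p>HTML text.</p>"

pattern = r"<p>.*</p>"

first = re.match(pattern, at_start)       # pattern begins at position 0
second = re.match(pattern, not_at_start)  # pattern begins later, so no match

print("first:", first.group(0) if first else None)
print("second:", second.group(0) if second else None)
```

Testing the result with if before calling group() avoids an AttributeError when None is returned.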
The search() Method
The search() method scans for the first match of the given regular expression in the given string and returns a match object if found, or None if there is no match.[2]
import re

string = "<h1>Heading</h1><p>HTML text.</p>"
# search() scans the whole string, so the match may start after position 0.
match = re.search("<p>.*</p>", string)
if match:
    print("start:", match.start(0))
    print("end:", match.end(0))
    print("group:", match.group(0))
Output:
start: 16
end: 33
group: <p>HTML text.</p>
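To make the difference between the two methods concrete, the following sketch applies both to the same string. The variable names are illustrative only.

```python
import re

string = "<h1>Heading</h1><p>HTML text.</p>"
pattern = r"<p>.*</p>"

match_result = re.match(pattern, string)    # only tries position 0
search_result = re.search(pattern, string)  # scans forward for the first match

print("match():", match_result)
if search_result:
    # span() returns the (start, end) pair in one call.
    print("search() span:", search_result.span())
```

Here match() returns None because the string starts with <h1>, while search() finds the paragraph further along.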
Greedy vs. Non-greedy
The '*', '+', and '?' quantifiers are all greedy; they match as much text as possible. Adding ? after the quantifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched.[3]
import re

string = "<h1>Heading</h1><p>HTML text.</p>"
# Greedy: .* extends to the last > in the string.
match = re.search("<.*>", string)
if match:
    print("Greedy")
    print("start:", match.start(0))
    print("end:", match.end(0))
    print("group:", match.group(0))

# Non-greedy: .*? stops at the first > it reaches.
match = re.search("<.*?>", string)
if match:
    print("\nNon-greedy")
    print("start:", match.start(0))
    print("end:", match.end(0))
    print("group:", match.group(0))
Output:
Greedy
start: 0
end: 33
group: <h1>Heading</h1><p>HTML text.</p>

Non-greedy
start: 0
end: 4
group: <h1>
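The same behavior applies to the + quantifier. The sketch below also shows a common alternative, a negated character class, which is not part of the example above but gives the same result here without relying on non-greedy matching.

```python
import re

string = "<h1>Heading</h1><p>HTML text.</p>"

greedy = re.search(r"<.+>", string).group(0)       # runs to the last >
minimal = re.search(r"<.+?>", string).group(0)     # stops at the first >
negated = re.search(r"<[^>]+>", string).group(0)   # cannot cross a > at all

print("greedy:", greedy)
print("minimal:", minimal)
print("negated:", negated)
```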
The findall() Method
The findall() method matches all occurrences of the given regular expression in the string and returns a list of matching strings.[4]
import re
string = "<h1>Heading</h1><p>HTML text.</p>"
matches = re.findall("<.*?>", string)
print("matches:", matches)
Output:
matches: ['<h1>', '</h1>', '<p>', '</p>']
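One detail worth a short sketch: when the pattern contains a capturing group, findall() returns the group text rather than the whole match. The related finditer() function, which the example above does not use, yields match objects so that positions are available as well.

```python
import re

string = "<h1>Heading</h1><p>HTML text.</p>"

# One capturing group: findall() returns only what the group captured.
# Closing tags are skipped because "/" is not a word character.
names = re.findall(r"<(\w+)>", string)
print("names:", names)

# finditer() yields match objects, so start() can be read for each tag.
positions = [(m.group(0), m.start()) for m in re.finditer(r"<.*?>", string)]
print("positions:", positions)
```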
The sub() Method
The sub() method replaces every occurrence of a pattern with a string.[5]
import re
string = "<h1>Heading</h1><p>HTML text.</p>"
string = re.sub("<.*?>", "", string)
print("string:", string)
Output:
string: HeadingHTML text.
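As a further sketch, sub() also accepts a count argument to limit the number of replacements, and the replacement string may refer back to a captured group with \1. The h2 renaming here is a made-up task for illustration.

```python
import re

string = "<h1>Heading</h1><p>HTML text.</p>"

# count=1 replaces only the first occurrence of the pattern.
first_only = re.sub(r"<.*?>", "", string, count=1)
print("first_only:", first_only)

# Group 1 captures the optional "/", and \1 puts it back in the replacement.
renamed = re.sub(r"<(/?)h1>", r"<\1h2>", string)
print("renamed:", renamed)
```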
The split() Method
The split() method splits a string at each occurrence of the pattern and returns the pieces as a list.[6]
import re

string = "cat: Frisky, dog: Spot, fish: Bubbles"
# Raw strings (r"...") keep \w from being treated as a string escape.
keys = re.split(r": ?\w*,? ?", string)
values = re.split(r",? ?\w*: ?", string)
print("string:", string)
print("keys:", keys)
print("values:", values)
Output:
string: cat: Frisky, dog: Spot, fish: Bubbles
keys: ['cat', 'dog', 'fish', '']
values: ['', 'Frisky', 'Spot', 'Bubbles']
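The empty strings in the output above appear because the pattern matches at the very start or end of the string. One possible way to avoid them, sketched here as an alternative rather than as the lesson's method, is to split in two steps and use maxsplit to stop after the first colon.

```python
import re

string = "cat: Frisky, dog: Spot, fish: Bubbles"

# Step 1: split into "key: value" pairs at each comma.
pairs = re.split(r",\s*", string)

# Step 2: split each pair at the first colon only (maxsplit=1).
items = [re.split(r":\s*", pair, maxsplit=1) for pair in pairs]

print("pairs:", pairs)
print("items:", items)
```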
The compile() Method
The compile() method compiles a regular expression pattern into a regular expression object, which can be used for matching using its match() and search() methods. The expression’s behaviour can be modified by specifying a flags value.[7]
import re

string = "<p>Lines of<br>HTML text</p>"
# re.IGNORECASE lets the pattern match <br>, <BR>, <Br>, and so on.
regex = re.compile("<br>", re.IGNORECASE)
match = regex.search(string)
if match:
    print("start:", match.start(0))
    print("end:", match.end(0))
    print("group:", match.group(0))
Output:
start: 11
end: 15
group: <br>
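A compiled object is most useful when the same pattern is applied many times. The sketch below reuses one compiled pattern across several made-up strings; the sample list is an assumption for illustration.

```python
import re

# Compile once, then reuse the object for every line.
regex = re.compile(r"<br>", re.IGNORECASE)

# Hypothetical input lines.
lines = ["first<br>second", "FIRST<BR>SECOND", "no break here"]

results = [bool(regex.search(line)) for line in lines]
cleaned = [regex.sub(" ", line) for line in lines]

print("results:", results)
print("cleaned:", cleaned)
```

The compiled object offers the same operations as the module-level functions, such as search() and sub(), without passing the pattern each time.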
Match Groups
A match group matches whatever regular expression is inside the parentheses, which mark the start and end of the group; the contents of a group can be retrieved after a match has been performed.[8]
import re

string = "<p>HTML text.</p>"
match = re.match("<p>(.*)</p>", string)
if match:
    # Group 1 is the text captured by the first pair of parentheses.
    print("start:", match.start(1))
    print("end:", match.end(1))
    print("group:", match.group(1))

string = "'cat': 'Frisky', 'dog': 'Spot', 'fish': 'Bubbles'"
match = re.search("'cat': '(.*?)', 'dog': '(.*?)', 'fish': '(.*?)'", string)
if match:
    print("groups:", match.group(1), match.group(2), match.group(3))

# With two groups in the pattern, findall() returns (key, value) tuples.
lst = re.findall(r"'(.*?)': '(.*?)',?\s*", string)
for key, value in lst:
    print("%s: %s" % (key, value))
Output:
start: 3
end: 13
group: HTML text.
groups: Frisky Spot Bubbles
cat: Frisky
dog: Spot
fish: Bubbles
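Groups can also be given names with the (?P<name>...) syntax, which the example above does not use. The following sketch, with group names chosen only for illustration, retrieves groups by name, as a tuple, and as a dictionary.

```python
import re

string = "'cat': 'Frisky'"

# "key" and "value" are illustrative group names.
match = re.search(r"'(?P<key>.*?)': '(?P<value>.*?)'", string)
if match:
    print("by name:", match.group("key"), match.group("value"))
    print("groups():", match.groups())
    print("groupdict():", match.groupdict())
```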
Activities
Tutorials
- Complete one or more of the following tutorials:
- LearnPython
- TutorialsPoint
- RegexOne
Practice
- Create a Python program that asks the user to enter a line of comma-separated grade scores. Use RegEx methods to parse the line and add each item to a list. Display the list of entered scores sorted in descending order and then calculate and display the high, low, and average for the entered scores. Include try and except to handle input errors.
- Create a Python program that asks the user for a line of text that contains HTML tags, such as:
<p><strong>This is a bold paragraph.</strong></p>
Use RegEx methods to search for and remove all HTML tags from the text, saving each removed tag in a list. Print the untagged text and then display the list of removed tags sorted in alphabetical order with duplicate tags removed. Include error handling in case an HTML tag isn't entered correctly (an unmatched < or >). Use a user-defined function for the actual string processing, separate from input and output.
- Create a Python program that asks the user to enter a line of dictionary keys and values in the form Key-1: Value 1, Key-2: Value 2, Key-3: Value 3. You may assume that keys will never contain spaces, but may contain hyphens. Values may contain spaces, but a comma will always separate one key-value pair from the next key-value pair. Use RegEx functions to parse the string and build a dictionary of key-value pairs. Then display the dictionary sorted in alphabetical order by key. Include input validation and error handling in case a user accidentally enters the same key more than once.
Lesson Summary
RegEx Concepts
- A regular expression (abbreviated regex) is a sequence of characters that forms a search pattern, mainly for use in pattern matching with strings.[9]
- Each character in a regular expression is either understood to be a metacharacter with its special meaning, or a regular character with its literal meaning.[10]
- In regex, | indicates alternation: either the expression before it or the expression after it.[11]
- In regex, ? indicates there is zero or one of the preceding element.[12]
- In regex, * indicates there is zero or more of the preceding element.[13]
- In regex, + indicates there is one or more of the preceding element.[14]
- In regex, () is used to group elements.[15]
- In regex, . matches any single character.[16]
- In regex, [] matches any single character contained within the brackets.[17]
- In regex, [^] matches any single character not contained within the brackets.[18]
- In regex, ^ matches the start of the string.[19]
- In regex, $ matches the end of the string.[20]
- In regex, \w matches a word character (a letter, digit, or underscore).[21]
- In regex, \d matches a digit.[22]
- In regex, \s matches a whitespace character.[23]
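The metacharacters listed above can be tried out one at a time. The sketch below pairs each pattern with a made-up sample text and collects the first match that re.search() finds for it.

```python
import re

# (pattern, sample text) pairs; the sample texts are invented for illustration.
samples = [
    (r"cat|dog", "hot dog"),       # | alternation
    (r"colou?r", "color"),         # ? zero or one "u"
    (r"ab*", "abbbc"),             # * zero or more "b"
    (r"ab+", "ac abb"),            # + one or more "b"
    (r"[aeiou]", "rhythm and"),    # [] any one listed character
    (r"[^0-9]", "42x"),            # [^] any one character not listed
    (r"^\d+", "2024 report"),      # ^ start of string, \d digit
    (r"\w+$", "end of line"),      # \w word character, $ end of string
    (r"\s", "a b"),                # \s whitespace character
]

found = [re.search(pattern, text).group(0) for pattern, text in samples]
for (pattern, text), result in zip(samples, found):
    print(repr(pattern), "in", repr(text), "->", repr(result))
```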
Python RegEx
- The Python regular expression module is re, accessed using import re.[24]
- The match() method looks for zero or more characters at the beginning of the given string that match the given regular expression and returns a match object if found, or None if there is no match.[25]
- The search() method scans for the first match of the given regular expression in the given string and returns a match object if found, or None if there is no match.[26]
- The '*', '+', and '?' quantifiers are all greedy; they match as much text as possible. Adding ? after the quantifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched.[27]
- The findall() method matches all occurrences of the given regular expression in the string and returns a list of matching strings.[28]
- The sub() method replaces every occurrence of a pattern with a string.[29]
- The split() method splits a string at each occurrence of the pattern and returns the pieces as a list.[30]
- The compile() method compiles a regular expression pattern into a regular expression object, which can be used for matching using its match() and search() methods. The expression’s behaviour can be modified by specifying a flags value.[31]
- The compile() method flags include re.IGNORECASE for case-insensitive matching, re.MULTILINE so that ^ and $ also match at the start and end of each line, and re.DOTALL so that . also matches a newline.[32]
- A match group matches whatever regular expression is inside the parentheses, which mark the start and end of the group; the contents of a group can be retrieved after a match has been performed.[33]
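The three flags named in the summary can be compared side by side. The following is a minimal sketch using a made-up two-line string; the flags are passed directly to the module-level functions, which accept the same flags argument as compile().

```python
import re

text = "first line\nsecond line"

# Without flags, ^ matches only at the very start of the string.
no_flag = re.findall(r"^\w+", text)
# re.MULTILINE lets ^ match at the start of every line.
multiline = re.findall(r"^\w+", text, re.MULTILINE)

# Without flags, . does not match a newline, so this search fails.
no_dotall = re.search(r"first.*second", text)
# re.DOTALL lets . match the newline as well.
dotall = re.search(r"first.*second", text, re.DOTALL)

# re.IGNORECASE ignores letter case.
ignorecase = re.findall(r"LINE", text, re.IGNORECASE)

print(no_flag, multiline)
print(no_dotall, repr(dotall.group(0)))
print(ignorecase)
```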
Key Terms
- brittle code
- Code that works when the input data is in a particular format but is prone to breakage if there is some deviation from the correct format. We call this “brittle code” because it is easily broken.[34]
- greedy matching
- The notion that the “+” and “*” characters in a regular expression expand outward to match the largest possible string.[35]
- grep
- A command available in most Unix systems that searches through text files looking for lines that match regular expressions. The command name comes from the ed editor command g/re/p: globally search for a regular expression and print the matching lines.[36]
- regular expression
- A language for expressing more complex search strings. A regular expression may contain special characters that indicate that a search only matches at the beginning or end of a line or many other similar capabilities.[37]
- wild card
- A special character that matches any character. In regular expressions the wild-card character is the period.[38]
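As an illustration of the term brittle code, the sketch below compares a pattern that assumes exactly one space after the colon with a more tolerant one. Both patterns and both input strings are made up for this example.

```python
import re

# Hypothetical inputs: the second one omits the space after the colon.
inputs = ["cat: Frisky", "cat:Frisky"]

brittle = re.compile(r"(\w+): (\w+)")        # requires exactly ": "
tolerant = re.compile(r"(\w+)\s*:\s*(\w+)")  # allows any spacing around ":"

brittle_results = [bool(brittle.search(text)) for text in inputs]
tolerant_results = [tolerant.search(text).groups() for text in inputs]

print("brittle:", brittle_results)
print("tolerant:", tolerant_results)
```

The brittle pattern works on the first input but silently fails on the second, which is the kind of breakage the definition describes.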
Review Questions
- A regular expression (abbreviated regex) is _____.
  A regular expression (abbreviated regex) is a sequence of characters that forms a search pattern, mainly for use in pattern matching with strings.
- Each character in a regular expression is either _____, or _____.
  Each character in a regular expression is either understood to be a metacharacter with its special meaning, or a regular character with its literal meaning.
- In regex, | indicates _____.
  In regex, | indicates alternation: either the expression before it or the expression after it.
- In regex, ? indicates _____.
  In regex, ? indicates there is zero or one of the preceding element.
- In regex, * indicates _____.
  In regex, * indicates there is zero or more of the preceding element.
- In regex, + indicates _____.
  In regex, + indicates there is one or more of the preceding element.
- In regex, () is used to _____.
  In regex, () is used to group elements.
- In regex, . matches _____.
  In regex, . matches any single character.
- In regex, [] matches _____.
  In regex, [] matches any single character contained within the brackets.
- In regex, [^] matches _____.
  In regex, [^] matches any single character not contained within the brackets.
- In regex, ^ matches _____.
  In regex, ^ matches the start of the string.
- In regex, $ matches _____.
  In regex, $ matches the end of the string.
- In regex, \w matches _____.
  In regex, \w matches a word character (a letter, digit, or underscore).
- In regex, \d matches _____.
  In regex, \d matches a digit.
- In regex, \s matches _____.
  In regex, \s matches a whitespace character.
- The match() method _____.
  The match() method looks for zero or more characters at the beginning of the given string that match the given regular expression and returns a match object if found, or None if there is no match.
- The search() method _____.
  The search() method scans for the first match of the given regular expression in the given string and returns a match object if found, or None if there is no match.
- The '*', '+', and '?' quantifiers are all _____; they match _____. Adding ? after the quantifier makes it _____.
  The '*', '+', and '?' quantifiers are all greedy; they match as much text as possible. Adding ? after the quantifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched.
- The findall() method _____.
  The findall() method matches all occurrences of the given regular expression in the string and returns a list of matching strings.
- The sub() method _____.
  The sub() method replaces every occurrence of a pattern with a string.
- The split() method _____.
  The split() method splits a string at each occurrence of the pattern and returns the pieces as a list.
- The compile() method _____.
  The compile() method compiles a regular expression pattern into a regular expression object, which can be used for matching using its match() and search() methods. The expression’s behaviour can be modified by specifying a flags value.
- The compile() method flags include _____.
  The compile() method flags include re.IGNORECASE for case-insensitive matching, re.MULTILINE so that ^ and $ also match at the start and end of each line, and re.DOTALL so that . also matches a newline.
- Match groups match _____.
  A match group matches whatever regular expression is inside the parentheses, which mark the start and end of the group; the contents of a group can be retrieved after a match has been performed.
Assessments
- Flashcards: Quizlet: Python Regular Expressions
- Quiz: Quizlet: Python Regular Expressions
See Also
References
1. Python.org: Regular expression operations
2. Python.org: Regular expression operations
3. Python.org: Regular expression operations
4. Python.org: Regular expression operations
5. Python.org: Regular expression operations
6. Python.org: Regular expression operations
7. Python.org: Regular expression operations
8. Python.org: Regular expression operations
9. Wikipedia: Regular expression
10. Wikipedia: Regular expression
11. Wikipedia: Regular expression
12. Wikipedia: Regular expression
13. Wikipedia: Regular expression
14. Wikipedia: Regular expression
15. Wikipedia: Regular expression
16. Wikipedia: Regular expression
17. Wikipedia: Regular expression
18. Wikipedia: Regular expression
19. Wikipedia: Regular expression
20. Wikipedia: Regular expression
21. Wikipedia: Regular expression
22. Wikipedia: Regular expression
23. Wikipedia: Regular expression
24. Python.org: Regular expression operations
25. Python.org: Regular expression operations
26. Python.org: Regular expression operations
27. Python.org: Regular expression operations
28. Python.org: Regular expression operations
29. Python.org: Regular expression operations
30. Python.org: Regular expression operations
31. Python.org: Regular expression operations
32. Python.org: Regular expression operations
33. Python.org: Regular expression operations
34. PythonLearn: Regular expressions
35. PythonLearn: Regular expressions
36. PythonLearn: Regular expressions
37. PythonLearn: Regular expressions
38. PythonLearn: Regular expressions