Python Concepts/Regular Expressions

Objective

 
  • What is a regular expression?
  • How to test a string for content that matches a regular expression?
  • How to retrieve content that matches a regular expression?
  • How to split a string at points in the string that match a given regular expression?
  • How to replace parts of a string that match a given regular expression?

Lesson


A regular expression is a string. Python's re (regular expression) methods scan a supplied string to determine whether it contains text matching the regular expression. If such text is found, the required action of the method may be to report it, to split the string at the matching text, or to replace the matching text.

A regular expression may be as simple as a few characters to be interpreted literally, e.g., 'abc'. A regular expression may also contain special characters that tell the regular expression method how to interpret the literal characters, e.g., the expression 'abc*' matches 'a' + 'b' + any number of 'c'.

>>> import re
>>> 
>>> re.search(r'abc*', '123abDEF')
<_sre.SRE_Match object; span=(3, 5), match='ab'>
>>> 
>>> re.search(r'abc*', '123abcDEF')
<_sre.SRE_Match object; span=(3, 6), match='abc'>
>>> 
>>> re.search(r'abc*', '123abcccccccDEF')
<_sre.SRE_Match object; span=(3, 12), match='abccccccc'>
>>> 
>>> re.search(r'abc*', '123acccccccDEF')
>>>
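
Besides reporting matching text, the re methods can also split a string at the matching text or replace it. A brief preview of what the later sections cover in detail (a minimal sketch):

>>> re.split(r'abc*', '123abcDEF')
['123', 'DEF']
>>> re.sub(r'abc*', 'xyz', '123abcDEF')
'123xyzDEF'
>>>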

Matching literal characters


A regular expression may be as simple as one character. Search for 'e' within the string 'jumped':

>>> import re
>>> re.search('e', 'jumped')
<_sre.SRE_Match object; span=(4, 5), match='e'>
>>> 
>>> 'jumped'[4:5] == 'e'
True
>>>

Search for 'e' within the string 'jumped over everything':

>>> s1 = 'jumped over everything'
>>> re.search('e', s1)
<_sre.SRE_Match object; span=(4, 5), match='e'> # 1st occurrence
>>> s1[4:5] == 'e'
True
>>> 
>>> s2 = s1[5:] ; s2
'd over everything'
>>> re.search('e', s2)
<_sre.SRE_Match object; span=(4, 5), match='e'> # 2nd occurrence
>>> s2[4:5] == 'e'
True
>>> 
>>> s3 = s2[5:] ; s3
'r everything'
>>> re.search('e', s3)
<_sre.SRE_Match object; span=(2, 3), match='e'> # 3rd occurrence
>>> s3[2:3] == 'e'
True
>>> 
>>> s4 = s3[3:] ; s4
'verything'
>>> re.search('e', s4)
<_sre.SRE_Match object; span=(1, 2), match='e'> # 4th occurrence
>>> s4[1:2] == 'e'
True
>>> 
>>> s5 = s4[2:] ; s5
'rything'
>>> re.search('e', s5)
>>>

Method re.findall(...) produces a list of all matches found:

>>> L1 = re.findall('e', s1) ; L1
['e', 'e', 'e', 'e']
>>>

Iterating over matches found

>>> print ('\n'.join([ str(p) for p in re.finditer('e', s1 ) ]))
<_sre.SRE_Match object; span=(4, 5), match='e'>
<_sre.SRE_Match object; span=(9, 10), match='e'>
<_sre.SRE_Match object; span=(12, 13), match='e'>
<_sre.SRE_Match object; span=(14, 15), match='e'>
>>>

Flag re.IGNORECASE makes the search case-insensitive; searching for 'R' also finds 'r':

>>> print ('\n'.join([ str(p) for p in re.finditer('R', s1, re.IGNORECASE ) ]))
<_sre.SRE_Match object; span=(10, 11), match='r'>
<_sre.SRE_Match object; span=(15, 16), match='r'>
>>>

Flag re.VERBOSE permits comments in the regular expression. Flags are combined with '|'. 're.IGNORECASE|re.VERBOSE' is read as 're.IGNORECASE or re.VERBOSE' (inclusive or).

>>> print ('\n'.join([ str(p) for p in re.finditer('R # looking for r or R', s1, re.IGNORECASE|re.VERBOSE ) ]))
<_sre.SRE_Match object; span=(10, 11), match='r'>
<_sre.SRE_Match object; span=(15, 16), match='r'>
>>> 
>>> print ('\n'.join([ str(p) for p in re.finditer('v # looking for v or V', s1.upper(), re.I|re.X ) ]))
<_sre.SRE_Match object; span=(8, 9), match='V'>
<_sre.SRE_Match object; span=(13, 14), match='V'>
>>>

Matching groups of characters


Regular expressions can become complicated and unintelligible quickly. It may help to name the more common expressions. By naming expressions you can specify exactly what you want.

To match 'ee':

>>> pattern = 'e' * 2 ; pattern
'ee'
>>> print ('\n'.join([ str(p) for p in re.finditer(pattern, 'Beets are sweet.') ]))
<_sre.SRE_Match object; span=(1, 3), match='ee'>
<_sre.SRE_Match object; span=(12, 14), match='ee'>
>>>

The special characters '{m,n}' cause the resulting RE to match from m to n repetitions of the preceding RE. Some common repetitions are named below:

>>> any = r'{0,}' # Match any number of the preceding RE.
>>> one_or_more = r'{1,}'  # Match one or more of the preceding RE.
>>> zero_or_one = r'{0,1}'  # Match zero or one of the preceding RE.
>>> 
>>> 'e' + any
'e{0,}' # Match any number of 'e'.
>>> 'e' + one_or_more
'e{1,}' # Match one or more of 'e'.
>>> 'e' + zero_or_one 
'e{0,1}' # Match zero or one of 'e'.
>>>

To match one or more of 'e':

>>> pattern = 'e' + one_or_more ; pattern
'e{1,}'
>>> print ('\n'.join([ str(p) for p in re.finditer(pattern, 'Beets are sweet.' ) ]))
<_sre.SRE_Match object; span=(1, 3), match='ee'>
<_sre.SRE_Match object; span=(8, 9), match='e'>
<_sre.SRE_Match object; span=(12, 14), match='ee'>
>>>

Matching members of a set


The string 'abc' means match 'abc' exactly. If 'abc' are members of a set (within brackets '[]'), the expression '[abc]' means 'a' or 'b' or 'c'.

Alpha-numeric

>>> pattern = 'abcdefghijklmnopqrstuvwxyz';len(pattern)
26
>>> 
>>> lower = r'[' + pattern + r']' ; lower
'[abcdefghijklmnopqrstuvwxyz]'
>>> upper = r'[' + pattern.upper() + r']' ; upper
'[ABCDEFGHIJKLMNOPQRSTUVWXYZ]'
>>> alpha = r'[' + pattern + pattern.upper() + r']' ; alpha
'[abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ]'
>>> 
>>> numeric = r'[0123456789]' ; numeric
'[0123456789]'
>>> 
>>> alpha_numeric = alpha[:-1] + numeric[1:] ; alpha_numeric 
'[abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789]'
>>> word = r'[_' + alpha_numeric[1:] ; word
'[_abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789]'
>>>
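
The same sets can be built from the constants in Python's string module, which avoids typing the alphabet by hand (a sketch; the comparisons show the strings are identical to those above):

>>> import string
>>> lower == r'[' + string.ascii_lowercase + r']'
True
>>> alpha == r'[' + string.ascii_letters + r']'
True
>>> numeric == r'[' + string.digits + r']'
True
>>> word == r'[_' + string.ascii_letters + string.digits + r']'
True
>>>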

Find all groups of alpha characters:

>>> pattern = alpha + one_or_more ; pattern
'[abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ]{1,}'
>>> print ('\n'.join([ str(p) for p in re.finditer(pattern, '1,2,3 are numeric.' ) ]))
<_sre.SRE_Match object; span=(6, 9), match='are'>
<_sre.SRE_Match object; span=(10, 17), match='numeric'>
>>>

Find all groups of numeric characters:

>>> pattern = numeric + one_or_more ; pattern
'[0123456789]{1,}'
>>> print ('\n'.join([ str(p) for p in re.finditer(pattern, '1,2,3 are numeric.' ) ]))
<_sre.SRE_Match object; span=(0, 1), match='1'>
<_sre.SRE_Match object; span=(2, 3), match='2'>
<_sre.SRE_Match object; span=(4, 5), match='3'>
>>>

Find all words in the string that contain the letters 'ee':

>>> pattern = alpha + any + 'ee' + alpha + any  ; pattern
'[abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ]{0,}ee[abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ]{0,}'
>>> print ('\n'.join([ str(p) for p in re.finditer(pattern, 'Beets are sweet.' ) ]))
<_sre.SRE_Match object; span=(0, 5), match='Beets'>
<_sre.SRE_Match object; span=(10, 15), match='sweet'>
>>>

Find all words in the string that contain at least 5 letters:

>>> pattern = alpha*5 + alpha + any ; pattern
'[abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ][abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ][abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ][abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ][abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ][abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ]{0,}'
>>> print ('\n'.join([ str(p) for p in re.finditer(pattern, '1,2,3 are numeric.' ) ]))
<_sre.SRE_Match object; span=(10, 17), match='numeric'>
>>>

It's OK to be lazy. The important thing is to define the pattern accurately and then let the re method make sense of it. However, with a little practice you will probably write the above search as:

>>> pattern = alpha + r'{5,}' ; pattern
'[abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ]{5,}'
>>> print ('\n'.join([ str(p) for p in re.finditer(pattern, '1,2,3 are numeric.' ) ]))
<_sre.SRE_Match object; span=(10, 17), match='numeric'>
>>>

Non alpha-numeric


The caret '^' at the beginning of a set negates all the members of the set. '[^abc]' means any character that is not ('a' or 'b' or 'c').

>>> non_lower = r'[^' + lower[1:] ; non_lower
'[^abcdefghijklmnopqrstuvwxyz]'
>>> non_upper = r'[^' + upper[1:] ; non_upper
'[^ABCDEFGHIJKLMNOPQRSTUVWXYZ]'
>>> 
>>> non_alpha = r'[^' + alpha[1:] ; non_alpha
'[^abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ]'
>>> 
>>> non_numeric = r'[^' + numeric[1:] ; non_numeric
'[^0123456789]'
>>> 
>>> non_alpha_numeric = r'[^' + alpha_numeric[1:] ; non_alpha_numeric 
'[^abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789]'
>>> 
>>> non_word = r'[^' + word[1:] ; non_word
'[^abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_]'
>>>

Find all groups of non numeric characters:

>>> pattern = non_numeric + one_or_more ; pattern
'[^0123456789]{1,}'
>>> print ('\n'.join([ str(p) for p in re.finditer(pattern, '1,2,3 are numeric.' ) ]))
<_sre.SRE_Match object; span=(1, 2), match=','>
<_sre.SRE_Match object; span=(3, 4), match=','>
<_sre.SRE_Match object; span=(5, 18), match=' are numeric.'>
>>>

Find all groups of non alpha characters:

>>> pattern = non_alpha + one_or_more ; pattern
'[^abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ]{1,}'
>>> print ('\n'.join([ str(p) for p in re.finditer(pattern, '1,2,3 are numeric.' ) ]))
<_sre.SRE_Match object; span=(0, 6), match='1,2,3 '>
<_sre.SRE_Match object; span=(9, 10), match=' '>
<_sre.SRE_Match object; span=(17, 18), match='.'>
>>>

White space

>>> white = '[ \t\n\r\f\v]' ; white
'[ \t\n\r\x0c\x0b]'
>>> pattern = white + one_or_more ; pattern
'[ \t\n\r\x0c\x0b]{1,}'
>>> print ('\n'.join([ str(p) for p in re.finditer(pattern, '1,2,3 are numeric.' ) ]))
<_sre.SRE_Match object; span=(5, 6), match=' '>
<_sre.SRE_Match object; span=(9, 10), match=' '>
>>>

Non white space

>>> non_white = r'[^' + white[1:] ; non_white
'[^ \t\n\r\x0c\x0b]'
>>>

Find all blocks of non white space:

>>> pattern = non_white + one_or_more ; pattern
'[^ \t\n\r\x0c\x0b]{1,}'
>>> 
>>> print ('\n'.join([ str(p) for p in re.finditer(pattern, '1,2,3 are numeric.' ) ]))
<_sre.SRE_Match object; span=(0, 5), match='1,2,3'>
<_sre.SRE_Match object; span=(6, 9), match='are'>
<_sre.SRE_Match object; span=(10, 18), match='numeric.'>
>>>

Find all blocks of non white space that contain at least 4 characters:

>>> pattern = non_white*3 + non_white + one_or_more ; pattern
'[^ \t\n\r\x0c\x0b][^ \t\n\r\x0c\x0b][^ \t\n\r\x0c\x0b][^ \t\n\r\x0c\x0b]{1,}'
>>> 
>>> print ('\n'.join([ str(p) for p in re.finditer(pattern, '1,2,3 are numeric.' ) ]))
<_sre.SRE_Match object; span=(0, 5), match='1,2,3'>
<_sre.SRE_Match object; span=(10, 18), match='numeric.'>
>>>

Matching white space


White space here is any one of '\n', '\t', '\v', '\f', ' ' (the carriage return '\r' is left out of this example).

The regular expression that means 'any white character' is '[\n\t\v\f ]'. It may help to name the most common regular expressions:

>>> new_line = '''
... '''
>>> white = '[' + new_line + '\t\v\f ]' ; white
'[\n\t\x0b\x0c ]'
>>>

Some special characters that tell the methods how to interpret the other characters in the regular expression are:

>>> any = r'*' # any number of
>>> one_or_more = r'+' # one or more of
>>> zero_or_one = r'?' # zero or one of
>>> 
>>> white + any # any number of white characters
'[\n\t\x0b\x0c ]*'
>>> white + one_or_more # one or more white characters
'[\n\t\x0b\x0c ]+'
>>> white + zero_or_one # zero or one white characters
'[\n\t\x0b\x0c ]?'
>>>

Searching for white space:

>>> s1 = '\v\n \t abcd          EFG \v\t   \n\n  234  \f\f\n' # 4 blocks of white space.
>>> 
>>> re.search(white + one_or_more, s1)
<_sre.SRE_Match object; span=(0, 5), match='\x0b\n \t '> # 1st block
>>> 
>>> re.search(white + one_or_more, s1[5:])
<_sre.SRE_Match object; span=(4, 14), match='          '> # 2nd block.
>>> 
>>> re.search(white + one_or_more, s1[5:][14:])
<_sre.SRE_Match object; span=(3, 13), match=' \x0b\t   \n\n  '> # 3rd block.
>>> 
>>> re.search(white + one_or_more, s1[5:][14:][13:])
<_sre.SRE_Match object; span=(3, 8), match='  \x0c\x0c\n'> # 4th block
>>> 
>>> re.search(white + one_or_more, s1[5:][14:][13:][8:])
>>> # no more.
>>> 5+14+13+8 == len(s1)
True
>>> L1 = re.findall(white + one_or_more, s1) ; L1
['\x0b\n \t ', '          ', ' \x0b\t   \n\n  ', '  \x0c\x0c\n'] # 4 blocks of white space.
>>>

Iterating over matches found:

>>> for p in re.finditer(white + one_or_more, s1 ) :
...     print (p)
... 
<_sre.SRE_Match object; span=(0, 5), match='\x0b\n \t '>
<_sre.SRE_Match object; span=(9, 19), match='          '>
<_sre.SRE_Match object; span=(22, 32), match=' \x0b\t   \n\n  '>
<_sre.SRE_Match object; span=(35, 40), match='  \x0c\x0c\n'>
>>>

Anchoring the pattern:

>>> beginning = r'^' # Anchor pattern at beginning of string.
>>> end = r'$' # Anchor pattern at end of string.
>>> 
>>> beginning + white + one_or_more # 1 or more white characters at beginning of string.
'^[\n\t\x0b\x0c ]+'
>>> 
>>> white + one_or_more + end # 1 or more white characters at end of string.
'[\n\t\x0b\x0c ]+$'
>>>

Searching for white space at extremities of string:

>>> L2 = re.findall(white + one_or_more + end, s1) ; L2
['  \x0c\x0c\n']
>>> L2[0] == L1[-1]
True
>>> L3 = re.findall(beginning + white + one_or_more, s1) ; L3
['\x0b\n \t ']
>>> L3[0] == L1[0]
True
>>>

Splitting on white space

>>> s1 = '  \n \t  \n   line 1a\n  line 1b\n\n\t  \n  line 2a\n    line 2b   \n  \t\t\n'
>>> print (s1)
  
 	  
   line 1a
  line 1b

	  
  line 2a
    line 2b   
  		

>>>

Remove white space from beginning of s1, but preserve white space at beginning of line 1a:

>>> pattern = beginning + white + any + new_line ; pattern
'^[\n\t\x0b\x0c ]*\n'
>>> re.split(pattern, s1)
['', '   line 1a\n  line 1b\n\n\t  \n  line 2a\n    line 2b   \n  \t\t\n']
>>> s2 = re.split(pattern, s1)[1] ; s2
'   line 1a\n  line 1b\n\n\t  \n  line 2a\n    line 2b   \n  \t\t\n'

Remove white space from end of s2, but preserve white space at end of line 2b:

>>> pattern = new_line + white + any + end ; pattern
'\n[\n\t\x0b\x0c ]*$'
>>> re.split(pattern, s2)
['   line 1a\n  line 1b\n\n\t  \n  line 2a\n    line 2b   ', '']
>>> s3 = re.split(pattern, s2)[0] ; s3
'   line 1a\n  line 1b\n\n\t  \n  line 2a\n    line 2b   '

Split s3 into paragraphs:

>>> pattern = new_line + white + any + new_line ; pattern
'\n[\n\t\x0b\x0c ]*\n'
>>> re.split(pattern, s3)
['   line 1a\n  line 1b', '  line 2a\n    line 2b   ']
>>> paragraphs = re.split(pattern, s3) ; paragraphs
['   line 1a\n  line 1b', '  line 2a\n    line 2b   ']

Produce s4, equivalent to s1 without extraneous white space:

>>> s4 = '\n\n'.join(paragraphs) + new_line ; s4
'   line 1a\n  line 1b\n\n  line 2a\n    line 2b   \n'
>>> print (s4,end='')
   line 1a
  line 1b

  line 2a
    line 2b   
>>>
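
The same normalization can be done in one step with re.sub(), which replaces every run of blank lines inside s3 with a single empty line (a sketch reusing the names defined above):

>>> s5 = re.sub(new_line + white + any + new_line, '\n\n', s3) + new_line ; s5
'   line 1a\n  line 1b\n\n  line 2a\n    line 2b   \n'
>>> s5 == s4
True
>>>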

Special characters


Special characters are sometimes called metacharacters:

. ^ $ * + ? { } [ ] \ | ( )

Special characters '[]'


Brackets contain members of a class:

>>> alpha
'[abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ]' # Any character found in the English alphabet.
>>>

Special characters '{}'


Braces indicate a repetition count or range:

e{17} # Match exactly 'e' * 17

[0123456789]{3,} # Match 3 or more numeric characters.

[abc]{3,5} # Match 3 or 4 or 5 of ('a' or 'b' or 'c')

p{,3} # Match 0 or 1 or 2 or 3 of 'p'.
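
A quick demonstration of the brace notation (a sketch):

>>> re.findall(r'e{2}', 'Beekeeper')
['ee', 'ee']
>>> re.findall(r'[0123456789]{3,}', '12 345 6789')
['345', '6789']
>>>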

Special characters '()'


Parentheses define a group: the method matches whatever regular expression is inside the parentheses. The contents of the group can be matched later in the expression with the \number special sequence, or can be retrieved after the method terminates with the match object's methods match.groups() or match.group().

Matching contents of group


The \number special sequence matches the contents of the group of the same number. Only the first 99 groups can be referenced this way; if the first digit of number is 0, or number is three octal digits long, the sequence is interpreted as the character with that octal value rather than as a group reference.

Groups are numbered starting from 1.

>>> [ p[0] for p in re.finditer(r'(\w+) \1', '123 234 234 234 345 456') ]
['23 23', '234 234', '45 45']
>>> 
>>> [ p[0] for p in re.finditer(r'(\w+)\s+(\w+) \1', '123 234 234 234 345 456') ]
['23 234 23', '4 345 4']
>>> 
>>> [ p[0] for p in re.finditer(r'(\w+)\s+(\w+) \2', '123 234 234 234 345 456') ]
['123 234 234']
>>>

Retrieving contents of group

>>> print (pattern1)
                                              
[5432]{3} # 3 of ('5' or '4' or '3' or '2')                  
\ {1,}    # 1 or more spaces                                 
[6789]{1,}# 1 or more of ('6' or '7' or '8' or '9')          

>>> 
>>> m = re.search(pattern1, '        2345      9876    ', re.VERBOSE) ; m
<_sre.SRE_Match object; span=(9, 22), match='345      9876'>
>>> m.lastindex
>>> m.group(0)
'345      9876'
>>> 
>>> m.groups()
()
>>> 
>>> print (pattern2)
                                              
([5432]{3}) # 3 of ('5' or '4' or '3' or '2'). Note the '()' around group '[5432]{3}'.
\ {1,}      # 1 or more spaces                               
([6789]{1,})# 1 or more of ('6' or '7' or '8' or '9'). Note the '()' around group '[6789]{1,}'.

>>> 
>>> m = re.search(pattern2, '        2345      9876    ', re.VERBOSE) ; m
<_sre.SRE_Match object; span=(9, 22), match='345      9876'>
>>> m.lastindex
2
>>> m.group(0)
'345      9876'
>>> m.group(1)
'345'
>>> m.group(2)
'9876'
>>> 
>>> m[0] ; m[1] ; m[2]
'345      9876'
'345'
'9876'
>>> 
>>> m.groups()
('345', '9876')
>>>
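
Patterns pattern1 and pattern2 above are shown only through print(). pattern2, for example, could be built as a raw triple-quoted string, since re.VERBOSE ignores the layout and the comments (a sketch that reproduces the printed pattern):

>>> pattern2 = r'''
... ([5432]{3}) # 3 of ('5' or '4' or '3' or '2'). Note the '()' around group '[5432]{3}'.
... \ {1,}      # 1 or more spaces
... ([6789]{1,})# 1 or more of ('6' or '7' or '8' or '9'). Note the '()' around group '[6789]{1,}'.
... '''
>>>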

Named groups


In the following regular expression the syntax (?P<name>...) identifies a named group.

>>> m = re.match(r"(?P<adjective>\w+) (?P<noun>\w+)", "big flag") ; m
<_sre.SRE_Match object; span=(0, 8), match='big flag'>
>>>
>>> m.groups()
('big', 'flag')
>>> 
>>> m.lastindex
2
>>> m.group(0,1,2)
('big flag', 'big', 'flag')
>>> 
>>> m[0] ; m[1] ; m[2]
'big flag'
'big'
'flag'
>>> m['adjective'] ; m['noun'] 
'big'
'flag'
>>> m.groupdict()
{'adjective': 'big', 'noun': 'flag'}

An attempt to use optional parameter re.VERBOSE produces strange results:

>>> m = re.match(r"(?P<adjective>\w+      ) (?P<noun>\w+)", "big flag", re.VERBOSE) ; m
<_sre.SRE_Match object; span=(0, 3), match='big'>
>>> m.groups()
('bi', 'g')
>>> m.groupdict()
{'adjective': 'bi', 'noun': 'g'}
>>>

This happens because re.VERBOSE discards unescaped white space within the pattern: the space between the two groups disappears, and '\w+\w+' splits 'big' greedily into 'bi' and 'g'. Optional parameter re.VERBOSE works well provided that white space which matters is written as '\s' (or escaped):

>>> m = re.match(r"  (?P<adjective>\w+) \s (?P<noun>\w+)  ", "big flag", re.VERBOSE) ; m
<_sre.SRE_Match object; span=(0, 8), match='big flag'>
>>>
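
A named group can also be referenced later in the same expression with the syntax (?P=name), the named form of the \number backreference seen earlier (a sketch):

>>> [ p[0] for p in re.finditer(r'(?P<word>\w+) (?P=word)', 'the the cat sat sat') ]
['the the', 'sat sat']
>>>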

Special characters '*', '+', '?'


Special character '*' means 'any number of'. The following are equivalent:

p*    p{0,}                      # Any number of 'p'.
[0123456789]* [0123456789]{0,} # Any number of numeric.

Special character '+' means '1 or more of'. The following are equivalent:

p+    p{1,}                      # 1 or more of 'p'.
[0123456789]+ [0123456789]{1,} # 1 or more of numeric.

Special character '?' means '0 or 1 of'. The following are equivalent:

p?    p{0,1}                      # 0 or 1 of 'p'.
[0123456789]? [0123456789]{0,1} # 0 or 1 of numeric.

Special characters '^', '$'


Special character '^' anchors the search at the beginning of the string.

>>> m = re.search(r'234', '        2345 9876    ') ; m
<_sre.SRE_Match object; span=(8, 11), match='234'>
>>> m = re.search(r'^234', '        2345 9876    ') ; m
>>> # No match. '234' not at beginning of string.
>>> m = re.search(r'^\ {1,}234', '        2345 9876    ') ; m # '\ {1,}' 1 or more spaces allowed at beginning of string.
<_sre.SRE_Match object; span=(0, 11), match='        234'>
>>> m = re.search(r'^\ +234', '        2345 9876    ') ; m # Same as above.
<_sre.SRE_Match object; span=(0, 11), match='        234'>
>>>

Special character '$' anchors the search at the end of the string.

>>> m = re.search(r'876', '        2345 9876    ') ; m
<_sre.SRE_Match object; span=(14, 17), match='876'>
>>> m = re.search(r'876$', '        2345 9876    ') ; m
>>> # No match. '876' not at end of string.
>>> m = re.search(r'876\ +$', '        2345 9876    ') ; m # '\ {1,}' 1 or more spaces allowed at end of string.
<_sre.SRE_Match object; span=(14, 21), match='876    '>
>>>

When both '^$' are used, the regular expression must match the whole string.

>>> m = re.search(r'2345 9876', '        2345 9876    ') ; m
<_sre.SRE_Match object; span=(8, 17), match='2345 9876'>
>>> m = re.search(r'^2345 9876$', '        2345 9876    ') ; m
>>> # No match.
>>> m = re.search(r'^\ *2345 9876\ *$', '        2345 9876    ') ; m # Regular expression permits white space at beginning and end of string.
<_sre.SRE_Match object; span=(0, 21), match='        2345 9876    '>
>>>
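
Method re.fullmatch() succeeds only when the regular expression matches the whole string, so it can stand in for the pair of anchors (a sketch):

>>> re.fullmatch(r'\ *2345 9876\ *', '        2345 9876    ')
<_sre.SRE_Match object; span=(0, 21), match='        2345 9876    '>
>>> re.fullmatch(r'2345 9876', '        2345 9876    ')
>>> # No match without the surrounding '\ *'.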

Special character '^'


When the caret is the first character in a set, it negates the whole set.

[0123456789] # Any numeric character.
[^0123456789] # Any non-numeric character.

Special character '.'


In the default mode, this matches any character except a newline. It is equivalent to:

>>> not_new_line = r'[^' + '\n' + r']' ; not_new_line 
'[^\n]'
>>>

Display all lines in the string s1:

>>> s1 = '  \n \t  \n   line 1a\n  line 1b\n\n\t  \n  line 2a\n    line 2b   \n  \t\t\n'
>>> 
>>> pattern = not_new_line + one_or_more ; pattern
'[^\n]{1,}'
>>> 
>>> print ('\n'.join([ str(p) for p in re.finditer(pattern, s1 ) ]))
<_sre.SRE_Match object; span=(0, 2), match='  '>
<_sre.SRE_Match object; span=(3, 7), match=' \t  '>
<_sre.SRE_Match object; span=(8, 18), match='   line 1a'>
<_sre.SRE_Match object; span=(19, 28), match='  line 1b'>
<_sre.SRE_Match object; span=(30, 33), match='\t  '>
<_sre.SRE_Match object; span=(34, 43), match='  line 2a'>
<_sre.SRE_Match object; span=(44, 58), match='    line 2b   '>
<_sre.SRE_Match object; span=(59, 63), match='  \t\t'>
>>> 
>>> print ('\n'.join([ str(p.span()) for p in re.finditer(pattern, s1 ) ]))
(0, 2)
(3, 7)
(8, 18)
(19, 28)
(30, 33)
(34, 43)
(44, 58)
(59, 63)
>>> 
>>> print ('\n'.join([ p.group() for p in re.finditer(pattern, s1 ) ]))
  
 	  
   line 1a
  line 1b
	  
  line 2a
    line 2b   
  		
>>>
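
To make '.' match a newline as well, add flag re.DOTALL (or its short form re.S); a minimal sketch:

>>> re.findall(r'.+', 'line 1\nline 2')
['line 1', 'line 2']
>>> re.findall(r'.+', 'line 1\nline 2', re.DOTALL)
['line 1\nline 2']
>>>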

Escaped special characters


\s and \S


Special character \S means any non white space character. Special character \s means any white space character. The following match \s:

>>> s1 = ''.join([chr(p) for p in range(256)])
>>> print ('\n'.join([ str(p) for p in re.finditer(r'\s+', s1 ) ]))
<_sre.SRE_Match object; span=(9, 14), match='\t\n\x0b\x0c\r'>
<_sre.SRE_Match object; span=(28, 33), match='\x1c\x1d\x1e\x1f '>
<_sre.SRE_Match object; span=(133, 134), match='\x85'>
<_sre.SRE_Match object; span=(160, 161), match='\xa0'>
>>>

\d and \D


Special character \D means any non numeric character. Special character \d means any numeric character. The following match \d:

>>> s1 = ''.join([chr(p) for p in range(256)])
>>> 
>>> print ('\n'.join([ str(p) for p in re.finditer(r'\d+', s1 ) ]))
<_sre.SRE_Match object; span=(48, 58), match='0123456789'>
>>>

\w and \W


Special character \W means any non word character. Special character \w means any word character: a letter, a digit or the underscore (and, by default for str patterns, international letters and digits as well). The following 134 characters match \w:

>>> s1 = ''.join([chr(p) for p in range(256)])
>>> print ('\n'.join([ str(p) for p in re.finditer(r'\w+', s1 ) ]))
<_sre.SRE_Match object; span=(48, 58), match='0123456789'>
<_sre.SRE_Match object; span=(65, 91), match='ABCDEFGHIJKLMNOPQRSTUVWXYZ'>
<_sre.SRE_Match object; span=(95, 96), match='_'>
<_sre.SRE_Match object; span=(97, 123), match='abcdefghijklmnopqrstuvwxyz'>
<_sre.SRE_Match object; span=(170, 171), match='ª'>
<_sre.SRE_Match object; span=(178, 180), match='²³'>
<_sre.SRE_Match object; span=(181, 182), match='µ'>
<_sre.SRE_Match object; span=(185, 187), match='¹º'>
<_sre.SRE_Match object; span=(188, 191), match='¼½¾'>
<_sre.SRE_Match object; span=(192, 215), match='ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ'>
<_sre.SRE_Match object; span=(216, 247), match='ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö'>
<_sre.SRE_Match object; span=(248, 256), match='øùúûüýþÿ'>
>>>

Some words in English carry an accent: 'fiancée', 'café', 'naïve'. Special character '\w' matches all letters in these words.

>>> [p for p in ('fiancée', 'café', 'naïve') if re.search(r'^\w+$', p) ]
['fiancée', 'café', 'naïve']
>>>

To limit special character '\w' to ASCII characters:

>>> print ('\n'.join([ str(p) for p in re.finditer(r'\w+', s1, re.ASCII ) ]))
<_sre.SRE_Match object; span=(48, 58), match='0123456789'>
<_sre.SRE_Match object; span=(65, 91), match='ABCDEFGHIJKLMNOPQRSTUVWXYZ'>
<_sre.SRE_Match object; span=(95, 96), match='_'>
<_sre.SRE_Match object; span=(97, 123), match='abcdefghijklmnopqrstuvwxyz'>
>>>

To produce words instead of match objects:

>>> [ p[0] for p in re.finditer(r'\w+', s1, re.ASCII ) ]
['0123456789', 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', '_', 'abcdefghijklmnopqrstuvwxyz']
>>>

International characters


The methods work with international characters:

>>> pattern = white + any + 'στο' + white + one_or_more  ; pattern
'[ \t\n\r\x0c\x0b]{0,}στο[ \t\n\r\x0c\x0b]{1,}'
>>> print ('\n'.join([ str(p) for p in re.finditer(pattern, 'Καλώς ήρθατε στο Βικιεπιστήμιο' ) ]))
<_sre.SRE_Match object; span=(12, 17), match=' στο '>
>>>

Find all words that contain the letter 'α' (Greek alpha):

>>> pattern = non_white + any + 'α' + non_white + any  ; pattern
'[^ \t\n\r\x0c\x0b]{0,}α[^ \t\n\r\x0c\x0b]{0,}'
>>> print ('\n'.join([ str(p) for p in re.finditer(pattern, 'Καλώς ήρθατε στο Βικιεπιστήμιο' ) ]))
<_sre.SRE_Match object; span=(0, 5), match='Καλώς'>
<_sre.SRE_Match object; span=(6, 12), match='ήρθατε'>
>>>

List all the words in the string:

>>> print ('\n'.join([ str(p) for p in re.finditer(r'\w+', 'Καλώς ήρθατε στο Βικιεπιστήμιο' ) ]))
<_sre.SRE_Match object; span=(0, 5), match='Καλώς'>
<_sre.SRE_Match object; span=(6, 12), match='ήρθατε'>
<_sre.SRE_Match object; span=(13, 16), match='στο'>
<_sre.SRE_Match object; span=(17, 30), match='Βικιεπιστήμιο'>
>>>

The special character '\w' matches any word character in both English and Greek.

Matching '^', '$', '*', '+', '?' literally


Within a set (within brackets '[]') special characters lose their special significance. To search for a '$' literally search for r'[$]':

>>> pattern = r'[$]' + one_or_more ; pattern 
'[$]+' # One or more of '$' literally.
>>> re.findall (pattern, r'abc***123**XYZ??q**$$$$$?????' )
['$$$$$']
>>> 
>>> pattern = r'[*]' + one_or_more + r'[$]' + any; pattern
'[*]+[$]*' # One or more of '*' and any number of '$'.
>>> re.findall (pattern, r'abc***123**XYZ??q**$$$$$?????' )
['***', '**', '**$$$$$']
>>>

Characters listed individually within brackets '[]':

>>> pattern = r'[2aX?*$]' ; pattern
'[2aX?*$]' # '2' or 'a' or 'X' or '?' or '*' or '$'.
>>> re.findall (pattern, r'abc***123**XYZ??q**$$$$$?????' )
['a', '*', '*', '*', '2', '*', '*', 'X', '?', '?', '*', '*', '$', '$', '$', '$', '$', '?', '?', '?', '?', '?']
>>> 
>>> pattern = r'[2aX?*$]' + one_or_more ; pattern
'[2aX?*$]+' # One or more of '2' or 'a' or 'X' or '?' or '*' or '$'.
>>> re.findall (pattern, r'abc***123**XYZ??q**$$$$$?????' )
['a', '***', '2', '**X', '??', '**$$$$$?????']
>>>

The caret '^' has a special meaning when it is the first character in a set: match all characters not in the set. To match a literal caret, escape it:

>>> pattern = r'[\^]' + one_or_more ; pattern
'[\\^]+'
>>> re.findall (pattern, r'abc***123**XYZ??q**$$$$$??^^???' )
['^^']
>>>

or put it after first place in the set:

>>> pattern = r'[$?^]' + one_or_more ; pattern
'[$?^]+'
>>> re.findall (pattern, r'abc***123**XYZ??q**$$$$$??^^???' )
['??', '$$$$$??^^???']
>>>

Characters that may have a special meaning within a set include '^', ']', '-' and '\'. For consistent results every time, escape them with re.escape():

>>> pattern = r'123^]?*\ '[:-1] ; pattern
'123^]?*\\' # Backslash at end.
>>> 
>>> pattern = r'[' + re.escape(pattern) + r']' ; pattern
'[123\\^\\]\\?\\*\\\\]'
>>> re.findall (pattern, r'abc***123**XYZ??q**$$$$$??^^??? [[[ ]]] ((( ))) {{{ }}}} ||| \\\ ' )
['*', '*', '*', '1', '2', '3', '*', '*', '?', '?', '*', '*', '?', '?', '^', '^', '?', '?', '?', ']', ']', ']', '\\', '\\', '\\']
>>> 
>>> pattern = pattern + one_or_more ; pattern
'[123\\^\\]\\?\\*\\\\]+'
>>> re.findall (pattern, r'abc***123**XYZ??q**$$$$$??^^??? [[[ ]]] ((( ))) {{{ }}}} ||| \\\ ' )
['***123**', '??', '**', '??^^???', ']]]', '\\\\\\']
>>> 
>>> pattern = r'^3]?*}{)(\ '[:-1] ; pattern
'^3]?*}{)(\\'
>>> pattern = r'[' + re.escape(pattern) + r']' + one_or_more ; pattern
'[\\^3\\]\\?\\*\\}\\{\\)\\(\\\\]+'
>>> re.findall (pattern, r'abc***123**XYZ??q**$$$$$??^^??? [[[ ]]] ((( ))) {{{ }}}} ||| \\\ ' )
['***', '3**', '??', '**', '??^^???', ']]]', '(((', ')))', '{{{', '}}}}', '\\\\\\']
>>> 
>>> pattern = r""" '" """[1:3] + r'^3]?}\ '[:-1]  ; pattern
'\'"^3]?}\\'
>>> pattern = r'[' + re.escape(pattern) + r']' + one_or_more ; pattern
'[\\\'\\"\\^3\\]\\?\\}\\\\]+' # One or more of "'" or '"' or '^' or '3' or ']' or '?' or '}' or backslash.
>>> re.findall (pattern, r"""abc***123**XYZ??q**$$$$$??^^??? [[[ ]]] ((( )))'''' {{{ }}}} ||| \\\ """ )
['3', '??', '??^^???', ']]]', "''''", '}}}}', '\\\\\\']
>>> 
>>> pattern = r""" '" """[1:3] + r'^3]?}\ '[:-1]  ; pattern # Carefully define the pattern.
'\'"^3]?}\\'
>>> pattern = r'[^' + re.escape(pattern) + r']' + one_or_more ; pattern # Build the regular expression.
'[^\\\'\\"\\^3\\]\\?\\}\\\\]+' # One or more of any character that is not ("'" or '"' or '^' or '3' or ']' or '?' or '}' or backslash).
>>> re.findall (pattern, r"""abc***123**XYZ??q**$$$$$??^^??? [[[ ]]] ((( )))'''' {{{ }}}} ||| \\\ """ )
['abc***12', '**XYZ', 'q**$$$$$', ' [[[ ', ' ((( )))', ' {{{ ', ' ||| ', ' ']
>>>

You can see that regular expressions can become complicated and unintelligible quickly.


Pattern escaped:

>>> pattern = r""" '" """[1:3] + r'^3]?}\ '[:-1]  ; pattern
'\'"^3]?}\\'
>>> L1 = list(pattern) ; L1
["'", '"', '^', '3', ']', '?', '}', '\\'] # Each member of L1 is one character.
>>> 
>>> pattern_escaped = re.escape(pattern) ; pattern_escaped 
'\\\'\\"\\^3\\]\\?\\}\\\\'
>>> r'''\'\"\^3\]\?\}\\''' == pattern_escaped == r"\'" + r'\"' + r'\^' + '3' + r'\]' + r'\?' + r'\}' + r'\\'
True # All characters in pattern except A-Za-z0-9_ have been escaped.
>>>

Advanced Regular Expressions


Matching dates


A date has format 7/4/1776 or July 4, 1776. Liberal use of white space is acceptable, as is a month name abbreviated to 3 or more letters. The following are acceptable dates:

3 /9   / 1923
11/ 22/  1987
Aug23,2017
Septe 4  ,  2001

The ultimate regular expression will be pattern1 | pattern2.

pattern1 = r'''             
\b        # word boundary   
\d{1,2}   # 1 or 2 numeric  
\s*       # any white       
/                           
\s*       # any white       
\d{1,2}   # 1 or 2 numeric  
\s*       # any white       
/                           
\s*       # any white       
\d{4}     # 4 numeric       
\b        # word boundary   
'''

pattern2 = r'''                     
\b        # word boundary           
''' + upper + lower + r'''{2,} # upper + 2 or more lower               
\s*       # any white               
\d{1,2}   # 1 or 2 numeric          
\s*       # any white               
,                                   
\s*       # any white               
\d{4}     # 4 numeric               
\b        # word boundary           
'''

pattern = pattern1 + '|' + pattern2

print (pattern)
\b        # word boundary
\d{1,2}   # 1 or 2 numeric
\s*       # any white
/
\s*       # any white
\d{1,2}   # 1 or 2 numeric
\s*       # any white
/
\s*       # any white
\d{4}     # 4 numeric
\b        # word boundary
|
\b        # word boundary
[ABCDEFGHIJKLMNOPQRSTUVWXYZ][abcdefghijklmnopqrstuvwxyz]{2,} # upper + 2 or more lower
\s*       # any white
\d{1,2}   # 1 or 2 numeric
\s*       # any white
,
\s*       # any white
\d{4}     # 4 numeric
\b        # word boundary

The above verbose format is much more readable than:

r'''\b\d{1,2}\s*/\s*\d{1,2}\s*/\s*\d{4}\b|\b[ABCDEFGHIJKLMNOPQRSTUVWXYZ][abcdefghijklmnopqrstuvwxyz]{2,}\s*\d{1,2}\s*,\s*\d{4}\b'''
s3 = '''   7/4 / 1776   3/2/2001     12  / 19
 / 2007
  Jul4,1776  July 4 , 1776    xbcvgdf  ,,
 vnhgb   August13  ,2003...  Nove 22,  2007,,,February14,1776  '''

print ('\n\n', '\n'.join([ str(p.group()) for p in re.finditer(pattern, s3 , re.VERBOSE) ]), sep='')
7/4 / 1776
3/2/2001
12  / 19
 / 2007
Jul4,1776
July 4 , 1776
August13  ,2003
Nove 22,  2007
February14,1776

Unix dates


A "Unix date" has format

$ date
Wed Feb 14 08:24:24 CST 2018

In this section a regular expression to match a Unix date will accept

Wed Feb 14 08:24:24 CST 2018
Wednes Feb 14 08:24:24 CST 2018 # More than 3 letters in name of day.
Wed Febru 14 08:24:24 CST 2018 # More than 3 letters in name of month.
Wed Feb 14 8:24 : 24 CST 2018 # White space in hh:mm:ss.
wed FeB 14 8:24 : 24 cSt 2018 # Bad punctuation.

Build parts of the regular expression.

mo='''January February March April                              
May June July August September                                  
October November December                                       
'''

s1 = '|\n'.join([
    '|'.join([ month[:p] for p in range (len(month), 2, -1) ])
    for month in mo.title().split()
])

print (s1)
January|Januar|Janua|Janu|Jan|
February|Februar|Februa|Febru|Febr|Feb|
March|Marc|Mar|
April|Apri|Apr|
May|
June|Jun|
July|Jul|
August|Augus|Augu|Aug|
September|Septembe|Septemb|Septem|Septe|Sept|Sep|
October|Octobe|Octob|Octo|Oct|
November|Novembe|Novemb|Novem|Nove|Nov|
December|Decembe|Decemb|Decem|Dece|Dec
da='''Sunday Monday Tuesday                                      
Wednesday Thursday Friday                                        
Saturday                                                         
'''

s2 = '|\n'.join([
    '|'.join([ day[:p] for p in range (len(day), 2, -1) ])
    for day in da.title().split()
])

print (s2)
Sunday|Sunda|Sund|Sun|
Monday|Monda|Mond|Mon|
Tuesday|Tuesda|Tuesd|Tues|Tue|
Wednesday|Wednesda|Wednesd|Wednes|Wedne|Wedn|Wed|
Thursday|Thursda|Thursd|Thurs|Thur|Thu|
Friday|Frida|Frid|Fri|
Saturday|Saturda|Saturd|Satur|Satu|Sat

Build the regular expression.

reg2 = (
r'''\b # Word boundary.                                                 
(?P<day>                                                                
''' + s2 + r'''                                                         
)                                                                       
\s+                                                                     
(?P<month>                                                              
''' + s1 + r'''                                                         
)                                                                       
\s+                                                                     
(?P<date>  ([1-9])  |  ([12][0-9])  |  (3[01])  ) # 1 through 31        
\s+                                                                     
(?P<hours>   ((0{0,1}|1)[0-9])  |  (2[0-3])  ) # (0 or 00) through 23     
\s*\:\s*                                                                
(?P<minutes>  [0-5]{0,1}[0-9]  ) # (0 or 00) through 59                 
\s*\:\s*                                                                
(?P<seconds>  [0-5]{0,1}[0-9]  ) # (0 or 00) through 59                 
\s+                                                                     
(?P<time_zone>  [ECMP][SD]T  )                                          
\s+                                                                     
(?P<year>  (19[0-9][0-9])  |  (20[01][0-9])  ) # 1900 through 2019      
\b''' # Word boundary.                                                  
)

print (reg2)
\b # Word boundary.
(?P<day>
Sunday|Sunda|Sund|Sun|
Monday|Monda|Mond|Mon|
Tuesday|Tuesda|Tuesd|Tues|Tue|
Wednesday|Wednesda|Wednesd|Wednes|Wedne|Wedn|Wed|
Thursday|Thursda|Thursd|Thurs|Thur|Thu|
Friday|Frida|Frid|Fri|
Saturday|Saturda|Saturd|Satur|Satu|Sat
)
\s+
(?P<month>
January|Januar|Janua|Janu|Jan|
February|Februar|Februa|Febru|Febr|Feb|
March|Marc|Mar|
April|Apri|Apr|
May|
June|Jun|
July|Jul|
August|Augus|Augu|Aug|
September|Septembe|Septemb|Septem|Septe|Sept|Sep|
October|Octobe|Octob|Octo|Oct|
November|Novembe|Novemb|Novem|Nove|Nov|
December|Decembe|Decemb|Decem|Dece|Dec
)
\s+
(?P<date>  ([1-9])  |  ([12][0-9])  |  (3[01])  ) # 1 through 31
\s+
(?P<hours>   ((0{0,1}|1)[0-9])  |  (2[0-3])  ) # (0 or 00) through 23
\s*\:\s*
(?P<minutes>  [0-5]{0,1}[0-9]  ) # (0 or 00) through 59
\s*\:\s*
(?P<seconds>  [0-5]{0,1}[0-9]  ) # (0 or 00) through 59
\s+
(?P<time_zone>  [ECMP][SD]T  )
\s+
(?P<year>  (19[0-9][0-9])  |  (20[01][0-9])  ) # 1900 through 2019
\b

Regular expression reg2 contains 16 groups of which 8 are named groups. The named groups make the expression easier to comprehend without comments. reg2 is a relatively simple expression. Without named groups and appropriate formatting as above, regular expressions quickly become incomprehensible.

dates = ''' 
MON Februar 12 0:30 : 19 CST 2018 
Tue    Feb  33      00:30:19       CST      2018 # Invalid.
Wed    Feb     29   00:30:19       CST      1900 # Invalid.  
Thursda               feb             29                  00:30:19           CST            1944    
'''

List all valid dates in string dates above.

d1 = dict ((
    ('Jan', 31),    ('May', 31),    ('Sep', 30),
    ('Feb', 28),    ('Jun', 30),    ('Oct', 31),
    ('Mar', 31),    ('Jul', 31),    ('Nov', 30),
    ('Apr', 30),    ('Aug', 31),    ('Dec', 31),
))

A listcomp accepts free-format Python:

L1 = [
'\n'.join(( str(m), m[0], str(m.groupdict()) ))
    for m in re.finditer(reg2, dates, re.IGNORECASE|re.VERBOSE)

for date in ( int(m['date']) ,)  # Equivalent to assignment: date = int(m['date'])
for month in ( m['month'].title() ,)
for year in ( int(m['year']) ,)

for leap_year in (        # 'else' in a listcomp                                      
    (                     # equivalent to:                                            
        year % 4 == 0,    # if year % 100 == 0:                                       
        year % 400 == 0   #     leap_year = year % 400 == 0                           
    )[year % 100 == 0]    # else :                                                    
,)                        #     leap_year = year % 4 == 0                             

for max_date in (                         # if (month[:3] == 'Feb') and leap_year :   
    (                                     #     max_date = 29                         
        d1[month[:3]],                    # else :                                    
        29                                #     max_date = d1[month[:3]]              
    )[(month[:3] == 'Feb') and leap_year] #                                           
,)

if date <= max_date
]

print (
    '\n\n'.join(L1)
)
<_sre.SRE_Match object; span=(1, 34), match='MON Februar 12 0:30 : 19 CST 2018'>
MON Februar 12 0:30 : 19 CST 2018
{'day': 'MON', 'month': 'Februar', 'date': '12', 'hours': '0', 'minutes': '30', 'seconds': '19', 'time_zone': 'CST', 'year': '2018'}

<_sre.SRE_Match object; span=(155, 251), match='Thursda               feb             29         > # Output here is clipped.
Thursda               feb             29                  00:30:19           CST            1944   # Correct data here.
{'day': 'Thursda', 'month': 'feb', 'date': '29', 'hours': '00', 'minutes': '30', 'seconds': '19', 'time_zone': 'CST', 'year': '1944'}

To access the groupdict of a field that matches:

line = (L1[0].split('\n'))[2]
d2 = eval(line)
print ('d2 =', d2)
d2 = {'day': 'MON', 'month': 'Februar', 'date': '12', 'hours': '0', 'minutes': '30', 'seconds': '19', 'time_zone': 'CST', 'year': '2018'}
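
If only the group dictionaries are needed, they can be collected directly from the match objects, avoiding the round trip through str() and eval(). A sketch using the same reg2 and dates (the validity filtering done in L1 is omitted here):

dicts = [ m.groupdict()
          for m in re.finditer(reg2, dates, re.IGNORECASE|re.VERBOSE) ]
print ('d3 =', dicts[0])
d3 = {'day': 'MON', 'month': 'Februar', 'date': '12', 'hours': '0', 'minutes': '30', 'seconds': '19', 'time_zone': 'CST', 'year': '2018'}
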
A little philosophy

Because this example is contained within the page "Regular Expressions," there is much decision making contained within reg2. For an example of code in which there is less decision making in the regular expression and more decision making in listcomp L1 see an earlier version of Unix Dates.


The code in this section:

1) focuses on matching alpha-numeric patterns. Verification that February 29, 1944 was in fact a Thursday is outside the scope of this section.

2) does not consider the possibility of leap seconds. Saturday December 31 23:59:60 UTC 2016 was a legitimate time. It seems that accurate time (and how to display it) is a field of science unto itself and not yet standardized.

3) is not complete until properly tested. Testing the code could consume 3-10 times as much effort as writing it.

4) highlights that a listcomp is an ideal place for (almost) format-free Python code.

5) shows that, as a regular expression becomes more complicated, you may have to write Python code just to produce the regular expression.

Matching integers and floats


Integers


Examples of integers are: 123, +123, -123. Python's regular expressions scan strings, therefore int in this context means string representing int. Python's eval function tolerates some white space, therefore the following are examples of int: ' 123 ', ' +123', '-123 ', ' + 123 '.

Do not rely on Python's eval function to determine what a string represents:

>>> date = '12/3/4' ; eval(date) ; isinstance(eval(date), float)
1.0
True
>>>

Searching for integers:

>>> print (pattern)
                        
^         # anchor at beginning       
\s*       # any white                 
[+-]?     # 0 or 1 of ('+' or '-')    
\s*       # any white                 
\d+       # 1 or more numeric         
\s*       # any white                 
$         # anchor at end             

>>> re.search (pattern, '          123           ', re.VERBOSE)
<_sre.SRE_Match object; span=(0, 24), match='          123           '>
>>> re.search (pattern, '       -   123           ', re.VERBOSE)
<_sre.SRE_Match object; span=(0, 25), match='       -   123           '>
>>> re.search (pattern, '       -   1 23           ', re.VERBOSE)
>>> # No match.

Method str.strip() produces (almost) a clean int:

>>> '  +13   '.strip()
'+13'
>>> '  +     13   '.strip()
'+     13'
>>>

Method str.replace() hides errors:

>>> ' + 12 34   '.replace(' ', '') # Error in input.
'+1234'                            # Good output.
>>>

To produce a clean int:

>>> print (pattern)
                                               
^         # anchor at beginning                              
\s*       # any white                                        
([+-]?)   # 0 or 1 of ('+' or '-'). Notice the '()' around the '[+-]?'.         
\s*       # any white                                        
(\d+)     # 1 or more numeric. Notice the '()' around the '\d+'.              
\s*       # any white                                        
$         # anchor at end                                    

>>> re.search (pattern, '          123           ', re.VERBOSE)
<_sre.SRE_Match object; span=(0, 24), match='          123           '>
>>> m = re.search (pattern, '        -  123           ', re.VERBOSE) ; m
<_sre.SRE_Match object; span=(0, 25), match='        -  123           '>
>>> m.group()
'        -  123           '
>>> m.group(0)
'        -  123           '
>>> m.group(1,2)
('-', '123') # Values that match the expressions in '()' above.
>>> ''.join(m.group(1,2))
'-123'
>>>
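
The cleaned-up string converts directly to an int (a quick check):

>>> int(''.join(m.group(1,2)))
-123
>>>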

Floats


Examples of point floats: '3.', '.3', '3.3', ' - .3 ', ' + 4.4 '

Examples of exponent floats: ' 3e4 ', '3.E3', '.3e-3', '3.3E-3', ' - .3e+2 ', ' + 4.4E+11 '

An exponent float can contain an int as significand: '3e4' where '3' is the significand and '4' the exponent.

If a float is not an exponent float, it must be a point float. This means at least one '.' and at least one digit.

Matching a point float:
>>> print (pattern)
                                              
# for point float                                           
^         # anchor at beginning                             
\s*       # any white                                       
([+-]?)   # 0 or 1 of ('+' or '-'). Notice the '()' around '[+-]?'.        
\s*       # any white                                       
(\.\d+|\d+\.|\d+\.\d+)     # .3 or 3. or 3.3                
\s*       # any white                                       
$         # anchor at end                                   

>>>
>>> m = re.search (pattern, '          .123           ', re.VERBOSE) ; m
<_sre.SRE_Match object; span=(0, 25), match='          .123           '>
>>> m.group(1,2)
('', '.123')
>>> m = re.search (pattern, '      -    0.123           ', re.VERBOSE) ; m
<_sre.SRE_Match object; span=(0, 27), match='      -    0.123           '>
>>> m.group(1,2)
('-', '0.123')
>>>
Matching an exponent float:
>>> print (patternE)
                                        
# for exponent float                                   
^         # anchor at beginning                        
\s*       # any white                                  
([+-]?)   # 0 or 1 of ('+' or '-'). Notice the '()' around '[+-]?'.   
\s*       # any white                                  
(\.?\d+|\d+\.|\d+\.\d+)     # 3 or .3 or 3. or 3.3     
[eE]                                                   
([+-]?\d+) # exponent                                  
\s*       # any white                                  
$         # anchor at end                              

>>>
>>> m = re.search (patternE, '          . 123           ', re.VERBOSE) ; m
>>> # No match.
>>> m = re.search (patternE, '      -    0.123e+2           ', re.VERBOSE) ; m
<_sre.SRE_Match object; span=(0, 30), match='      -    0.123e+2           '>
>>> m.group(1,2,3)
('-', '0.123', '+2')
>>> 
>>> m = re.search (patternE, '      -    3.3E-12           ', re.VERBOSE) ; m
<_sre.SRE_Match object; span=(0, 29), match='      -    3.3E-12           '>
>>> m.group(1,2,3)
('-', '3.3', '-12')
>>> m.group(1) + m.group(2) + 'e' + m.group(3)
'-3.3e-12'
>>> 
>>> [ m.group(p) for p in range(1, m.lastindex+1) ]
['-', '3.3', '-12']
>>>
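
The reassembled string converts directly to a float (a quick check):

>>> float(m.group(1) + m.group(2) + 'e' + m.group(3))
-3.3e-12
>>>
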
Matching any float

This example shows how substrings that match may be retrieved quickly and accurately from named groups.

import re

reg_exp = r''' 
(?P<sign_of_float>[-+])? # Named group sign_of_float.            
\s* 
(?P<significand>\d*\.?\d*)      # Named group significand.
( 
    [eE] 
    (?P<sign_of_exponent>[-+])? # Named group sign_of_exponent.  
    (?P<exponent>\d+)           # Named group exponent.          
)? 
'''

The above reg_exp contains four named groups. As a match for float it is simple and correct but insufficient. Find all the floats in string s1:

s1 = '  + 5e2  5 +  5 -  5    .e3  . 5 e-2   5 . 5 .e 2 3.3  - 3.3E+1   '

s2 = ''' 
# Substring that matched reg_exp.  
# Same as m['sign_of_float']. 
# Same as m['significand']. 
# Group not named. 
# Same as m['sign_of_exponent']. 
# Same as m['exponent']. 
'''

L2 = [p.strip() for p in s2.split('\n') if re.search(r'\S', p)]

L1 = [ m for m in re.finditer(reg_exp, s1, re.VERBOSE)
          # Extra conditions for float:
          if (
              m['significand'] and m['exponent'], 
              len(m['significand']) >= 2
             )[ '.' in m['significand'] ]
     ]

for m in L1 :
    print (
''' 
########################## 
m = {}'''.format(m)
          )

    print ('\nInformation available in all groups:')
    for p in range(0, len(m.groups())+1) :
        if m[p] == None : s2 = 'm[{}] = None'.format( p )
        else : s2 = "m[{}] = '{}'".format( p, m[p] )
        s2 = (s2 + ' '*16)[:16] # Left justified in a string of 16 characters.   
        print (s2, L2[p])

exit (0)
##########################
m = <_sre.SRE_Match object; span=(2, 7), match='+ 5e2'>

Information available in all groups:
m[0] = '+ 5e2'   # Substring that matched reg_exp.
m[1] = '+'       # Same as m['sign_of_float'].
m[2] = '5'       # Same as m['significand'].
m[3] = 'e2'      # Group not named.
m[4] = None      # Same as m['sign_of_exponent'].
m[5] = '2'       # Same as m['exponent'].

##########################
m = <_sre.SRE_Match object; span=(49, 53), match=' 3.3'>

Information available in all groups:
m[0] = ' 3.3'    # Substring that matched reg_exp.
m[1] = None      # Same as m['sign_of_float'].
m[2] = '3.3'     # Same as m['significand'].
m[3] = None      # Group not named.
m[4] = None      # Same as m['sign_of_exponent'].
m[5] = None      # Same as m['exponent'].

##########################
m = <_sre.SRE_Match object; span=(55, 63), match='- 3.3E+1'>

Information available in all groups:
m[0] = '- 3.3E+1 # Substring that matched reg_exp.
m[1] = '-'       # Same as m['sign_of_float'].
m[2] = '3.3'     # Same as m['significand'].
m[3] = 'E+1'     # Group not named.
m[4] = '+'       # Same as m['sign_of_exponent'].
m[5] = '1'       # Same as m['exponent'].

Decoding a bytes object


L2 contains the contents of a bytes object presented in binary format:

L2 = (
['11001110', '10010010', '11001110', '10111001', '11001110', '10111010', '11001110'] +
['10111001', '00100000', '11101100', '10011100', '10000100', '11101101', '10000010'] +
['10100100', '11101011', '10110000', '10110000', '11101100', '10011011', '10000000'] +
['00100000', '01010111', '01101001', '01101011', '01101001'] )

Produce list L4 that contains the bits of L2 regrouped so that each entry is one complete character encoded in standard UTF-8.

L3 = []

for p in range (len(L2)-1,-1,-1) :
    if re.search(r'^0[01]{7}$', L2[p]) :
        L3 += [L2[p]]
        continue

    if re.search(r'^110[01]{5}$', L2[p]) :
        if p+1 >= len(L2) : exit (99)
        if re.search(r'^10[01]{6}$', L2[p+1]) :
            L3 += [L2[p] + L2[p+1]]
            continue
        exit (98)

    if re.search(r'^1110[01]{4}$', L2[p]) :
        if p+2 >= len(L2) : exit (97)
        if re.search(r'^10[01]{6}$', L2[p+1]) and re.search(r'^10[01]{6}$', L2[p+2]) :
            L3 += [L2[p] + L2[p+1] + L2[p+2]]
            continue
        exit (96)

    if re.search(r'^10[01]{6}$', L2[p]) :
        if p == 0 : exit (95)
        continue

    exit (94)

L4 = L3[::-1]

print (
'''
L4 = (
{} + # Greek
{} + # '\\x20' is a space.
{} + # Korean
{} + # '\\x20' is a space.
{} ) # English
'''.format(L4[0:4], L4[4:5], L4[5:9], L4[9:10], L4[10:])
)
L4 = (
['1100111010010010', '1100111010111001', '1100111010111010', '1100111010111001'] + # Greek
['00100000'] + # '\x20' is a space.
['111011001001110010000100', '111011011000001010100100', '111010111011000010110000', '111011001001101110000000'] + # Korean
['00100000'] + # '\x20' is a space.
['01010111', '01101001', '01101011', '01101001'] ) # English

Decode L4:

L5 = []

for p in range (0, len(L4)) :
    if (len(L4[p]) == 8) :
        m = re.search (r'^0[01]{7}$', L4[p])
        if not m : exit (89)
        I1 = int(L4[p], base=2) ; L5 += chr(I1)
        continue

    if (len(L4[p]) == 16) :
        m = re.search (r'^110([01]{5})10([01]{6})$', L4[p])
        if not m : exit (88)
        if m.lastindex != 2 : exit (87)
        I1 = int(m.group(1) + m.group(2), 2) ; L5 += chr(I1)
        continue

    if (len(L4[p]) == 24) :
        m = re.search (r'^  1110  ([01]{4})  10  ([01]{6})  10  ([01]{6})  $', L4[p], re.VERBOSE)
        if not m : exit (86)
        if m.lastindex != 3 : exit (85)
        I1 = int(m.group(1) + m.group(2) + m.group(3), 2) ; L5 += chr(I1)
        continue

    exit (84)

print ('L5 =', L5)

exit (0)
L5 = ['Β', 'ι', 'κ', 'ι', ' ', '위', '키', '배', '움', ' ', 'W', 'i', 'k', 'i']
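
As a cross-check, Python's own UTF-8 decoder produces the same characters directly from the bytes in L2 (a sketch):

b1 = bytes(int(p, base=2) for p in L2)
print (b1.decode('utf-8'))
Βικι 위키배움 Wiki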

Compiling regular expressions


If a regular expression is complicated or is to be used frequently, it can be compiled to produce a pattern object.

>>> print (pattern)
                                  
([+-]{1}   # 1 of ('+' or '-').
\s*        # any white
\d+)       # 1 or more numeric.
|
(\d+)      # 1 or more numeric.

>>>

The regular expression pattern represents an integer. Produce a pattern object called 'integer'.

>>> integer = re.compile(pattern, re.VERBOSE)

The compiled pattern called 'integer' has methods similar to re.search(), re.finditer() and re.split():

>>> s1 = '    123       -  456((     !!+++    2345 !! -2##'

Displaying all matches


Displaying all matches manually, one after the other.

>>> integer.search(s1)
<_sre.SRE_Match object; span=(4, 7), match='123'>
>>> integer.search(s1[7:])
<_sre.SRE_Match object; span=(7, 13), match='-  456'>
>>> integer.search(s1[7:][13:])
<_sre.SRE_Match object; span=(11, 20), match='+    2345'>
>>> integer.search(s1[7:][13:][20:])
<_sre.SRE_Match object; span=(4, 6), match='-2'>
>>> integer.search(s1[7:][13:][20:][6:])
>>>

The method integer.search(...) accepts optional positional parameters:

>>> m = integer.search(s1) ; m
<_sre.SRE_Match object; span=(4, 7), match='123'>
>>> m = integer.search(s1, 7) ; m
<_sre.SRE_Match object; span=(14, 20), match='-  456'>
>>> m = integer.search(s1, 20) ; m
<_sre.SRE_Match object; span=(31, 40), match='+    2345'>
>>> m = integer.search(s1, m.span()[1]) ; m
<_sre.SRE_Match object; span=(44, 46), match='-2'>
>>> m = integer.search(s1, m.span()[1]) ; m
>>>

Iterating through all matches.

>>> print ( '\n'.join([str(p) for p in integer.finditer(s1)]) )
<_sre.SRE_Match object; span=(4, 7), match='123'>
<_sre.SRE_Match object; span=(14, 20), match='-  456'>
<_sre.SRE_Match object; span=(31, 40), match='+    2345'>
<_sre.SRE_Match object; span=(44, 46), match='-2'>
>>>

or:

v = 0

while True :
    m = integer.search(s1, v)
    if not m : break
    print (m)
    v = m.span()[1]

Output is same as above.

Splitting the string


Splitting the string s1:

Preserving substrings that match

>>> s1
'    123       -  456((     !!+++    2345 !! -2##'
>>> 
>>> [m.groups() for m in integer.finditer(s1)]
[(None, '123'),    # Match came from right hand side of '|' in pattern above.
('-  456', None),  # Match came from left hand side of '|' in pattern above because it contains sign '-'.
('+    2345', None), ('-2', None)]
>>> 
>>> L1 = integer.split(s1) ; L1
['    ', None, '123', '       ', '-  456', None, '((     !!++', '+    2345', None, ' !! ', '-2', None, '##']
>>> L1 # Edited for clarity:
['    ', 
None, '123', # Same as m.groups()[0] above.
'       ', 
'-  456', None,  # Same as m.groups()[1] above.
'((     !!++', 
'+    2345', None,  # Same as m.groups()[2] above.
' !! ', 
'-2', None,  # Same as m.groups()[3] above.
'##']
>>> 
>>> L2 = [p for p in L1 if p != None] 
>>> print ('L2 =', L2)
L2 = ['    ', '123', '       ', '-  456', '((     !!++', '+    2345', ' !! ', '-2', '##']
>>> 
>>> s2 = ''.join(L2) ; s2
'    123       -  456((     !!+++    2345 !! -2##'
>>> s2 == s1
True
>>>

Without preserving substrings that match


In pattern_ below note that parentheses have been removed from the expressions r'[+-]{1}\s*\d+' and r'\d+'.

>>> print (pattern_)
                    
[+-]{1}   # 1 of ('+' or '-').    
\s*        # any white            
\d+       # 1 or more numeric.    
|                                 
\d+      # 1 or more numeric.     

>>> integer_ = re.compile(pattern_, re.VERBOSE)
>>> s1 = '    123       -  456((     !!+++    2345 !! -2##'
>>> L1 = integer_.split(s1) ; L1
['    ', '       ', '((     !!++', ' !! ', '##'] # L1 does not contain the substrings that match.
>>>

Replacing all substrings that match


Replacing all integers in string s1:

After splitting the string

>>> L2
['    ', '123', '       ', '-  456', '((     !!++', '+    2345', ' !! ', '-2', '##']
>>> L4 = ['INT_1', 'INT_2', 'INT_3', 'INT_4']
>>>

'123' is to be replaced by 'INT_1'.

'- 456' is to be replaced by 'INT_2'.

'+ 2345' is to be replaced by 'INT_3'.

'-2' is to be replaced by 'INT_4'.

>>> L5 = [ (L2[p], L4[(p-1)>>1])[p & 1] for p in range (len(L2)) ] ; L5
['    ', 'INT_1', '       ', 'INT_2', '((     !!++', 'INT_3', ' !! ', 'INT_4', '##']
>>> 
>>> s3 = ''.join(L5) ; s3
'    INT_1       INT_2((     !!++INT_3 !! INT_4##'
>>>

Without splitting the string

print ("s2 =", "'"+s2+"'",'\n')

L1 = [m for m in integer.finditer(s2)]
print ( '\n'.join(['4 matches found:'] + [str(p) for p in L1]),'\n' )

print ("L4 =", L4,'\n')

for p in range (3,-1,-1) :
    m = L1[p]
    repl = L4[p]
    start,end = m.span()
    s2 = s2[:start] + repl + s2[end:]
    print (
'''s2 = '{}' after replacing span {}
'''.format(s2, m.span()),
end=''
)
s2 = '    123       -  456((     !!+++    2345 !! -2##'

4 matches found:
<_sre.SRE_Match object; span=(4, 7), match='123'>
<_sre.SRE_Match object; span=(14, 20), match='-  456'>
<_sre.SRE_Match object; span=(31, 40), match='+    2345'>
<_sre.SRE_Match object; span=(44, 46), match='-2'>

L4 = ['INT_1', 'INT_2', 'INT_3', 'INT_4']

s2 = '    123       -  456((     !!+++    2345 !! INT_4##' after replacing span (44, 46)
s2 = '    123       -  456((     !!++INT_3 !! INT_4##' after replacing span (31, 40)
s2 = '    123       INT_2((     !!++INT_3 !! INT_4##' after replacing span (14, 20)
s2 = '    INT_1       INT_2((     !!++INT_3 !! INT_4##' after replacing span (4, 7)
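
One more way, without computing spans by hand: the sub() method of the compiled pattern accepts a replacement function, which is called once per match. A sketch, with itertools.count supplying the increasing suffix:

>>> from itertools import count
>>> counter = count(1)
>>> integer.sub(lambda m: 'INT_{}'.format(next(counter)), s1)
'    INT_1       INT_2((     !!++INT_3 !! INT_4##'
>>>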

Python: truly international


Python, emacs and the Wikiversity editor recognize an enormous range of international characters. Some of them look exactly like their English counterparts:

>>> ord('Ρ') # Greek rho
929
>>> ord('P') # English P.
80
>>> ord('H') # English
72
>>> ord('Н') # Cyrillic
1053
>>>

A few well chosen international characters can simplify the creation of a complicated regular expression. Let's revisit the matching of floats.

import re

Ẍ = ' [eE] [+-]{0,1} \d+ '     # eẌponent
Ӗ = ' \d+ Ẍ '                   # Ӗxponent float
Ṕ = ' (\d+\.\d*|\.\d+)  ( Ẍ ) ? ' # Ṕoint float

pattern = " \s  ([\+\-])?  \s*  (Ӗ|Ṕ)  \s "

print ('eẌponent =',Ẍ)
print ('Ӗxponent float =',Ӗ)
print ('Ṕoint float =',Ṕ)

print ()
print ('pattern =',pattern)
for c in 'ṔӖẌ' :
    pattern = re.sub(c, eval(c), pattern)
    print ('pattern =',pattern, '#', c, 'replaced.')
p1 = '^^^^ Ӗxponent float ^^^'
p2 = '^^^^^^^^^^^^^^^^ Ṕoint float ^^^^^^^^^^^^^^^'
print (' '*32, p1, '  ', p2)

print ()
s1 = '  + 5e2  5 +  5   1.e-4 -  5    .e3  . 5 e-2   5 . 5 .e 2 4.7  - 3.3E+1  '
print ('s1 =', "'"+s1+"'")

# Find all floats in string s1.                                                                                                                                                
print ()
for m in re.finditer(pattern, s1, re.VERBOSE) :
    print (str(m))
    print ('   ', m.groups())

eẌponent =  [eE] [+-]{0,1} \d+
Ӗxponent float =  \d+ Ẍ 
Ṕoint float =  (\d+\.\d*|\.\d+)  ( Ẍ ) ? 

pattern =  \s  ([\+\-])?  \s*  (Ӗ|Ṕ)  \s 
pattern =  \s  ([\+\-])?  \s*  (Ӗ| (\d+\.\d*|\.\d+)  ( Ẍ ) ? )  \s  # Ṕ replaced.
pattern =  \s  ([\+\-])?  \s*  ( \d+ Ẍ | (\d+\.\d*|\.\d+)  ( Ẍ ) ? )  \s  # Ӗ replaced.
pattern =  \s  ([\+\-])?  \s*  ( \d+  [eE] [+-]{0,1} \d+  | (\d+\.\d*|\.\d+)  (  [eE] [+-]{0,1} \d+  ) ? )  \s  # Ẍ replaced.
                                 ^^^^ Ӗxponent float ^^^    ^^^^^^^^^^^^^^^^ Ṕoint float ^^^^^^^^^^^^^^^

s1 = '  + 5e2  5 +  5   1.e-4 -  5    .e3  . 5 e-2   5 . 5 .e 2 4.7  - 3.3E+1  '

<_sre.SRE_Match object; span=(1, 8), match=' + 5e2 '>
    ('+', '5e2', None, None)
<_sre.SRE_Match object; span=(15, 24), match='   1.e-4 '>
    (None, '1.e-4', '1.', 'e-4')
<_sre.SRE_Match object; span=(57, 62), match=' 4.7 '>
    (None, '4.7', '4.7', None)
<_sre.SRE_Match object; span=(62, 72), match=' - 3.3E+1 '>
    ('-', '3.3E+1', '3.3', 'E+1')

Assignments

 

Simplify the pattern?


Under "Compiling regular expressions" above the expression for integer is:

>>> print (pattern)
                                  
([+-]{1}   # 1 of ('+' or '-').
\s*        # any white
\d+)       # 1 or more numeric.
|
(\d+)      # 1 or more numeric.

>>>

Why not simplify the expression and use:

>>> print (pattern)
                                  
([+-]{0,1}   # 0 or 1 of ('+' or '-').
\s*          # any white
\d+)         # 1 or more numeric.

>>>

Because this expression produces the following matches:

>>> print ( '\n'.join([str(p) for p in integer.finditer(s1)]) )
<_sre.SRE_Match object; span=(0, 7), match='    123'> # This match is not considered accurate.
<_sre.SRE_Match object; span=(14, 20), match='-  456'>
<_sre.SRE_Match object; span=(31, 40), match='+    2345'>
<_sre.SRE_Match object; span=(44, 46), match='-2'>
>>>

Matching a float?


The reference offers regular expression r'[-+]?(\d+(\.\d*)?|\.\d+)([eE][-+]?\d+)?' to match a float. Does this expression provide a good match? Yes, but it also matches int:

>>> reg = r'[-+]?(\d+(\.\d*)?|\.\d+)([eE][-+]?\d+)?' 
>>> re.search(reg, '4')
<_sre.SRE_Match object; span=(0, 1), match='4'>
>>>

Floats with extra zeroes


In the section "Floats" above there are several examples of regular expressions that match floats. However, they do not consider the possibility of extra leading and trailing zeroes.

>>> eval ( '   +  0003.4000e-00005   ')
3.4e-05
>>>

How would you rewrite the expressions to remove unnecessary zeroes?

Further Reading or Review


References


1. Python's documentation:

"6.2. re — Regular expression operations," "Regular Expression HOWTO," "Common Problems"


2. Python's methods:

"re.compile(pattern, flags=0)," "re.search(pattern, string, flags=0)," "re.split(pattern, string, maxsplit=0, flags=0)," "re.finditer(pattern, string, flags=0)," "re.escape(pattern),"

"regex.search(string[, pos[, endpos]])," "regex.finditer(string[, pos[, endpos]]),"

"match.group([group1, ...])," "match.groupdict(default=None)," "match.span([group])"


3. Python's built-in functions: