Python string regexp

Regular expressions in Python

Official documentation: https://docs.python.org/3/library/re.html

A regular expression is a pattern, built from a sequence of simpler rules, that describes which parts of a string should match. A first example:

[a-z][a-z] matches two consecutive lower case letters

import re
some_match = re.search('[a-z][a-z]', '1A_ux_FHX_yu')
some_match
<re.Match object; span=(3, 5), match='ux'>

We can retrieve the position of the match

some_match.span()
(3, 5)

We can retrieve the original string that was searched

some_match.string
'1A_ux_FHX_yu'

We can retrieve the group (the substring of the original string) that was matched

some_match.group()
'ux'

Examples

  • [A-Z] matches one upper case letter
  • [a-z] matches one lower case letter
  • [a-] and [-a] both match either the letter a or the character -
  • [A-Z][A-Z] matches two consecutive upper case letters
  • \d or [0-9]: matches a digit
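  • .: matches any single character except a newline (unless re.DOTALL is set)
  • \w: matches a single word character (a letter, a digit or an underscore); use \w+ for a whole word
  • \s: matches a single whitespace character (space, tab, newline, ...)
  • \b: matches the empty string, but only at the beginning or end of a word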
  • More generally, [] indicates a set of possible characters
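The shorthand classes above are easy to confuse, so here is a small sketch that runs each of them through re.findall. The sample strings are made up for illustration.

```python
import re

# \w matches ONE word character (letter, digit or underscore); add + for a whole word
single_chars = re.findall(r"\w", "a_1 !")
words = re.findall(r"\w+", "Order A-17 shipped")

# \s matches any whitespace character, not only the space
whitespace = re.findall(r"\s", "a b\tc\n")

# \b is zero-width: it keeps 'cat' from matching inside 'concat' or 'cats'
whole_word = re.findall(r"\bcat\b", "cat concat cats")

# a hyphen placed first or last in a set is a literal hyphen, not a range
a_or_hyphen = re.findall(r"[a-]", "a-b")

print(single_chars)  # ['a', '_', '1']
print(words)         # ['Order', 'A', '17', 'shipped']
print(whitespace)    # [' ', '\t', '\n']
print(whole_word)    # ['cat']
print(a_or_hyphen)   # ['a', '-']
```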

Beginning/end of line:

  • ^: matches the beginning of the string (and of each line when re.MULTILINE is set)
  • $: matches the end of the string (and of each line when re.MULTILINE is set)
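A short sketch of how the two anchors behave on a string containing a newline, with and without the re.MULTILINE flag. The two-line sample text is an assumption chosen for illustration.

```python
import re

text = "first line\nsecond line"

# by default ^ and $ only anchor at the start and end of the whole string
start_default = re.findall(r"^\w+", text)
end_default = re.findall(r"\w+$", text)

# with re.MULTILINE they also anchor at the start and end of every line
start_multiline = re.findall(r"^\w+", text, re.MULTILINE)
end_multiline = re.findall(r"\w+$", text, re.MULTILINE)

print(start_default)    # ['first']
print(end_default)      # ['line']
print(start_multiline)  # ['first', 'second']
print(end_multiline)    # ['line', 'line']
```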

Repetitions:

  • [A-Z][A-Z]\d matches two consecutive upper case letters followed by a digit
  • +: matches the preceding pattern one or more times
  • *: matches the preceding pattern zero or more times
  • ?: matches the preceding pattern zero or one time
  • \d+ matches one or more consecutive digits
  • \d{2,3} matches two or three consecutive digits
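The quantifiers listed above can be tried out as follows. This is only a sketch on made-up strings; note that quantifiers are greedy by default, so \d{2,3} takes three digits whenever three are available.

```python
import re

# two upper case letters followed by a digit
letters_digit = re.findall(r"[A-Z][A-Z]\d", "AB1 aB2 XYZ9")

# + : one or more digits
one_or_more = re.findall(r"\d+", "a1 b22 c333 d4444")

# {2,3} : two or three digits, taking three when possible
two_or_three = re.findall(r"\d{2,3}", "a1 b22 c333 d4444")

# * : the letter a followed by zero or more b
zero_or_more = re.findall(r"ab*", "a ab abbb")

# ? : the u is optional
optional = re.findall(r"colou?r", "color colour colouur")

print(letters_digit)  # ['AB1', 'YZ9']
print(one_or_more)    # ['1', '22', '333', '4444']
print(two_or_three)   # ['22', '333', '444']
print(zero_or_more)   # ['a', 'ab', 'abbb']
print(optional)       # ['color', 'colour']
```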

Logical or:

  • (ab|123) matches either ab or 123
  • ([a-z]+|\d*) matches either one or more lowercase letters, or zero or more digits

another_match = re.search(r'[A-Z][a-z]\d', 'something is going to match? Ab3 maybe.')
another_match
<re.Match object; span=(29, 32), match='Ab3'>
re.search(r'[A-Z][a-z]\d', 'something is going to match? Ab3 maybe.')
<re.Match object; span=(29, 32), match='Ab3'>
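The two alternation patterns from the "Logical or" list can be checked with a small sketch like the one below (the input strings are made up). It also shows a pitfall: because \d* is allowed to match zero digits, the second pattern always succeeds, sometimes with an empty match.

```python
import re

# the search scans left to right, so the earliest position that matches wins
first = re.search(r"(ab|123)", "xx123ab")

# \d* may match zero digits, so this alternation succeeds on ANY input,
# possibly with an empty match at position 0
digits = re.search(r"([a-z]+|\d*)", "42abc")
empty = re.search(r"([a-z]+|\d*)", "_abc")

print(first.group())   # 123
print(digits.group())  # 42
print(empty.span())    # (0, 0)
```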

Escaping

sentence = 'I paid $10'
sentence2 = 'I paid $1061.50'
# unescaped, $ is the end-of-string anchor, so nothing is found
re.search(r'$\d+', sentence) is None
True
# dollar sign needs to be escaped
re.search(r'\$\d+', sentence)
<re.Match object; span=(7, 10), match='$10'>
# dollar sign and the dot need to be escaped with a backslash
re.search(r'\$\d+(\.\d\d)?', sentence)
<re.Match object; span=(7, 10), match='$10'>
# the same pattern also captures the optional cents when they are present
re.search(r'\$\d+(\.\d\d)?', sentence2)
<re.Match object; span=(7, 15), match='$1061.50'>
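When the text to find is a plain literal rather than a pattern, the standard library function re.escape can add the backslashes for us. A minimal sketch, where the price and the receipt string are invented for illustration:

```python
import re

price = "$10.50"                  # literal text containing two metacharacters
receipt = "Total: $10.50 due"     # made-up sample input

# used as-is, the $ acts as an end-of-string anchor and the search fails
unescaped = re.search(price, receipt)

# re.escape backslash-escapes every character that is special in a pattern
escaped = re.search(re.escape(price), receipt)

print(unescaped)        # None
print(escaped.group())  # $10.50
```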

Capturing an IP address in a string

sentence = "my ip is given by 192.168.0.185 is that correct?"
expression = r"((?:\d{1,3}\.){3}\d{1,3})"  # raw string, so the backslashes reach the regex engine unchanged
matches = re.findall(expression, sentence)
matches
['192.168.0.185']
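The pattern above only checks the shape of the address, so it would also accept something like 999.1.1.1. One possible way to tighten this, sketched below, is to let the regular expression find candidates and let the standard ipaddress module validate them. The helper name find_ipv4 and the sample text are assumptions made for this example.

```python
import ipaddress
import re

# same shape as the pattern above, with \b added so we do not match inside longer tokens
IP_CANDIDATE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def find_ipv4(text):
    """Return the substrings of text that are valid IPv4 addresses."""
    found = []
    for candidate in IP_CANDIDATE.findall(text):
        try:
            ipaddress.IPv4Address(candidate)  # raises ValueError if an octet is out of range
        except ValueError:
            continue
        found.append(candidate)
    return found

addresses = find_ipv4("hosts: 192.168.0.185, 999.1.1.1 and 10.0.0.7")
print(addresses)  # ['192.168.0.185', '10.0.0.7']
```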

Capturing course numbers and credits

sentence = '16:332:509 Convex Optimization for Engineering Applications (3)'
expression = r"(\d\d:\d\d\d:\d\d\d)"
matches = re.findall(expression, sentence)
matches
['16:332:509']
re.findall(r'(\d+:\d+:\d+)', sentence)
['16:332:509']
re.findall(r'(\d{2}:\d{3}:\d{2,3})', sentence)
['16:332:509']

Capturing matches in named groups

sentence = '16:332:509 Convex Optimization for Engineering Applications (3)'
# a repeated capturing group only keeps its last repetition, hence the single '509' below
expression = r"(\d{2,3}:?){3}\s"
re.findall(expression, sentence)
['509']
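The ['509'] result comes from the fact that a capturing group placed under a repetition only remembers its last iteration. A sketch of one way around it: make the repeated group non-capturing with (?:...) and wrap the whole repetition in a single capturing group.

```python
import re

sentence = '16:332:509 Convex Optimization for Engineering Applications (3)'

# repeated capturing group: only the last repetition is kept
last_only = re.findall(r"(\d{2,3}:?){3}\s", sentence)

# non-capturing inner group, one outer group around the full repetition
full_code = re.findall(r"((?:\d{2,3}:?){3})\s", sentence)

# the individual parts can then be extracted from the captured code
parts = re.findall(r"\d{2,3}", full_code[0])

print(last_only)  # ['509']
print(full_code)  # ['16:332:509']
print(parts)      # ['16', '332', '509']
```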
sentence = '16:332:509 Convex Optimization for Engineering Applications (3)'
expression = r"(?P<school>\d{2}):(?P<level>\d{3}):(?P<course>\d{3})\s(?P<title>(\w\s?)+)\s\((?P<credits>\d+)\)"
m = re.search(expression, sentence)
m.groupdict()
{'school': '16',
 'level': '332',
 'course': '509',
 'title': 'Convex Optimization for Engineering Applications',
 'credits': '3'}
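Besides groupdict(), individual named groups can be read with m.group('name') or m['name']. The sketch below wraps the same expression in a small helper; the function name parse_course and the shape of the returned dictionary are assumptions made for this example.

```python
import re

COURSE = re.compile(
    r"(?P<school>\d{2}):(?P<level>\d{3}):(?P<course>\d{3})\s"
    r"(?P<title>(\w\s?)+)\s\((?P<credits>\d+)\)"
)

def parse_course(line):
    """Return the course fields found in line, or None when nothing matches."""
    m = COURSE.search(line)
    if m is None:  # always check: search returns None when there is no match
        return None
    return {
        "code": f"{m['school']}:{m['level']}:{m['course']}",
        "title": m.group("title"),
        "credits": int(m["credits"]),  # captured groups are strings, convert explicitly
    }

course = parse_course('16:332:509 Convex Optimization for Engineering Applications (3)')
print(course["code"])     # 16:332:509
print(course["credits"])  # 3
```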
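# re.DEBUG prints the parsed pattern tree followed by the compiled bytecode
# (the exact layout of this dump depends on the Python version)
compiled = re.compile(r"(?P<school>\d{2}):(?P<level>\d{3}):(?P<course>\d{3})\s(?P<title>(\w\s?)+)\s\((?P<credits>\d+)\)",
                      re.DEBUG)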
SUBPATTERN 1 0 0
  MAX_REPEAT 2 2
    IN
      CATEGORY CATEGORY_DIGIT
LITERAL 58
SUBPATTERN 2 0 0
  MAX_REPEAT 3 3
    IN
      CATEGORY CATEGORY_DIGIT
LITERAL 58
SUBPATTERN 3 0 0
  MAX_REPEAT 3 3
    IN
      CATEGORY CATEGORY_DIGIT
IN
  CATEGORY CATEGORY_SPACE
SUBPATTERN 4 0 0
  MAX_REPEAT 1 MAXREPEAT
    SUBPATTERN 5 0 0
      IN
        CATEGORY CATEGORY_WORD
      MAX_REPEAT 0 1
        IN
          CATEGORY CATEGORY_SPACE
IN
  CATEGORY CATEGORY_SPACE
LITERAL 40
SUBPATTERN 6 0 0
  MAX_REPEAT 1 MAXREPEAT
    IN
      CATEGORY CATEGORY_DIGIT
LITERAL 41

  0. INFO 4 0b0 16 MAXREPEAT (to 5)
  5: MARK 0
  7. REPEAT_ONE 9 2 2 (to 17)
 11.   IN 4 (to 16)
 13.     CATEGORY UNI_DIGIT
 15.     FAILURE
 16:   SUCCESS
 17: MARK 1
 19. LITERAL 0x3a (':')
 21. MARK 2
 23. REPEAT_ONE 9 3 3 (to 33)
 27.   IN 4 (to 32)
 29.     CATEGORY UNI_DIGIT
 31.     FAILURE
 32:   SUCCESS
 33: MARK 3
 35. LITERAL 0x3a (':')
 37. MARK 4
 39. REPEAT_ONE 9 3 3 (to 49)
 43.   IN 4 (to 48)
 45.     CATEGORY UNI_DIGIT
 47.     FAILURE
 48:   SUCCESS
 49: MARK 5
 51. IN 4 (to 56)
 53.   CATEGORY UNI_SPACE
 55.   FAILURE
 56: MARK 6
 58. REPEAT 22 1 MAXREPEAT (to 81)
 62.   MARK 8
 64.   IN 4 (to 69)
 66.     CATEGORY UNI_WORD
 68.     FAILURE
 69:   REPEAT_ONE 9 0 1 (to 79)
 73.     IN 4 (to 78)
 75.       CATEGORY UNI_SPACE
 77.       FAILURE
 78:     SUCCESS
 79:   MARK 9
 81: MAX_UNTIL
 82. MARK 7
 84. IN 4 (to 89)
 86.   CATEGORY UNI_SPACE
 88.   FAILURE
 89: LITERAL 0x28 ('(')
 91. MARK 10
 93. REPEAT_ONE 9 1 MAXREPEAT (to 103)
 97.   IN 4 (to 102)
 99.     CATEGORY UNI_DIGIT
101.     FAILURE
102:   SUCCESS
103: MARK 11
105. LITERAL 0x29 (')')
107. SUCCESS
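Apart from debugging, a compiled pattern is an object that can be reused for many searches. A minimal sketch without the re.DEBUG flag, using a shortened version of the pattern; the second catalog line is a made-up entry added only so that there is more than one match.

```python
import re

COURSE_CODE = re.compile(r"(?P<school>\d{2}):(?P<level>\d{3}):(?P<course>\d{3})")

catalog = (
    "16:332:509 Convex Optimization for Engineering Applications (3)\n"
    "16:332:999 Hypothetical Course (3)\n"  # made-up entry for illustration
)

# the compiled object exposes the same methods as the module: search, findall, finditer...
codes = [m.group() for m in COURSE_CODE.finditer(catalog)]

# groupindex maps each group name to its group number
group_numbers = dict(COURSE_CODE.groupindex)

print(codes)          # ['16:332:509', '16:332:999']
print(group_numbers)  # {'school': 1, 'level': 2, 'course': 3}
```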