Regex Essentials: Overview
- Regular expressions (regex) are a language for defining text search patterns.
- Python’s re module provides functions like search (find anywhere) and match (anchored at start).
- Patterns include literals, metacharacters (. ^ $ * + ? [] \), character classes (\d, \w, \s), and quantifiers (*, +, ?, {n,m}).
- Greedy quantifiers (*, +) match as much as possible; non-greedy (*?, +?) as little as possible.
Introduction to re.search() vs re.match()
- re.search(pattern, text) scans the entire string for the first occurrence.
- re.match(pattern, text) checks only at the beginning of the string.
- re.findall() and re.finditer() let you retrieve every occurrence of a pattern.
- Always use raw strings (r"...") to define regex patterns, so that Python string escapes do not interfere with the regex.
import re
line = "WARN: Disk usage at 91%"
pattern = r"WARN"
print(f"search '{pattern}':", bool(re.search(pattern, line)))
print(f"match '{pattern}':", bool(re.match(pattern, line)))
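To see why raw strings matter, here is a minimal sketch that writes the same pattern with and without the r prefix. The sample text and variable names are illustrative assumptions, not part of the example above.

```python
import re

# In a normal string "\b" is a single backspace character (\x08);
# in a raw string it stays as backslash + b, the regex word boundary.
plain = "\bcat\b"
raw = r"\bcat\b"

text = "the cat sat"
plain_hit = re.search(plain, text)  # looks for backspace, 'cat', backspace
raw_hit = re.search(raw, text)      # looks for the whole word 'cat'

print(len(plain), len(raw))         # 5 7
print(plain_hit)                    # None
print(raw_hit.group())              # cat
```

The non-raw pattern compiles without error, which is what makes this mistake easy to miss.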
Common Metacharacters
- . matches any character (except newline).
- ^ anchors at start of string.
- $ anchors at end of string (or just before a trailing newline).
- [] defines a set or range of characters, e.g. [A-Z].
- \ escapes metacharacters or introduces special sequences.
import re
test = "Error code: E1234. cxge"
print(f"Dot matches any character: {re.findall(r"c..e", test)}")
print(f"Start anchor (finds): {re.findall(r"^Error", test)}")
print(f"Start anchor (does not find): {re.findall(r"^E1234", test)}")
print(f"End anchor: {re.findall(r"cxge$", test)}")
print(f"Character set: {re.findall(r"[E0-9]+", test)}")
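The escaping rule for \ can be sketched as follows. The sample strings are made up for illustration; re.escape() is the standard-library helper for escaping arbitrary text.

```python
import re

sample = "E1234. E1234x"

# Unescaped dot: any character may follow "E1234"
any_char = re.findall(r"E1234.", sample)
# Escaped dot: only a literal "." may follow "E1234"
literal_dot = re.findall(r"E1234\.", sample)

# re.escape() backslash-escapes the metacharacters in arbitrary text,
# so "$" and "." are matched literally here
price_pattern = re.escape("$5.00")
price_hit = re.search(price_pattern, "Total: $5.00")

print(any_char)           # ['E1234.', 'E1234x']
print(literal_dot)        # ['E1234.']
print(price_hit.group())  # $5.00
```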
Special Sequences
- \d digit (0–9), \D non-digit.
- \w word character (letters, digits, underscore), \W non-word.
- \s whitespace, \S non-whitespace.
- \b word boundary (zero-width match).
import re
text = "The cat scattered 1024 catalogues."
print(f"Digits: {re.findall(r"\d+", text)}")
print(f"Word characters: {re.findall(r"\w+", text)}")
print(f"Whitespace: {re.findall(r"\s+", text)}")
print(f"Word boundary: {re.findall(r"\bcat\b", text)}")
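The upper-case forms (\D, \W, \S) are listed above but not exercised, so here is a small sketch using the same sample text. It also shows how much \b narrows a match; the variable names are illustrative.

```python
import re

text = "The cat scattered 1024 catalogues."

non_digits = re.findall(r"\D+", text)  # runs of anything except digits
non_space = re.findall(r"\S+", text)   # runs of anything except whitespace
non_word = re.findall(r"\W+", text)    # runs of anything except word characters

# "cat" occurs inside "scattered" and "catalogues" as well as on its own
no_boundary = re.findall(r"cat", text)
# A leading \b keeps only the occurrences that start a word
left_boundary = re.findall(r"\bcat", text)

print(non_digits)          # ['The cat scattered ', ' catalogues.']
print(non_space)           # ['The', 'cat', 'scattered', '1024', 'catalogues.']
print(non_word)            # [' ', ' ', ' ', ' ', '.']
print(len(no_boundary))    # 3
print(len(left_boundary))  # 2
```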
Quantifier Cheat-Sheet
| Quantifier | Meaning | Greedy? | Non-greedy form | Non-greedy meaning |
|---|---|---|---|---|
| ? | 0 or 1 of the preceding token | Yes | ?? | as few as possible (0 or 1) |
| * | 0 or more of the preceding token | Yes | *? | as few as possible (including zero) |
| + | 1 or more of the preceding token | Yes | +? | as few as possible (at least one) |
| {n} | exactly n of the preceding token | - | - | - |
| {n,} | n or more of the preceding token | Yes | {n,}? | n or more, but as few as possible |
| {n,m} | between n and m of the preceding token | Yes | {n,m}? | between n and m, but as few as possible |
import re
text = "aaaa"
print(re.findall(r"a?", text))
print(re.findall(r"a*", text))
print(re.findall(r"a+", text))
print(re.findall(r"a{2}", text))
print(re.findall(r"a{1,3}", text))
print(f"Non-greedy a*: {re.findall(r"a*?", text)}")
print(f"Non-greedy a+: {re.findall(r"a+?", text)}")
print(f"Non-greedy a{{1,3}}?: {re.findall(r"a{1,3}?", text)}")
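The {n,} row of the cheat-sheet is not covered by the snippet above, so the following sketch adds it. It also annotates the empty strings that appear for patterns able to match nothing. Variable names are illustrative.

```python
import re

text = "aaaa"

# {n,} is greedy: a single match takes all four characters
greedy_min_two = re.findall(r"a{2,}", text)
# {n,}? is lazy: each match stops as soon as two characters are taken
lazy_min_two = re.findall(r"a{2,}?", text)

# Patterns that can match zero characters also report an empty match
# at the end of the string
optional = re.findall(r"a?", text)
star = re.findall(r"a*", text)

print(greedy_min_two)  # ['aaaa']
print(lazy_min_two)    # ['aa', 'aa']
print(optional)        # ['a', 'a', 'a', 'a', '']
print(star)            # ['aaaa', '']
```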
Quantifiers & Greedy vs Non-Greedy
- * / + / {n,} are greedy: match as much as possible.
- Append ? (*? / +? / {n,}?) to make them non-greedy: match as little as possible.
- A greedy quantifier takes the longest run that still lets the rest of the pattern match; its non-greedy (or lazy) form takes the shortest such run.
import re
html = "<b>Bold</b>"
print(f"Greedy: {re.findall(r"<.*>", html)}")
print(f"Non-greedy: {re.findall(r"<.*?>", html)}")
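The same effect shows up outside HTML. As a rough sketch, assuming a made-up line of quoted attribute values, a greedy .* runs to the last quote while the lazy .*? stops at the nearest one:

```python
import re

# Hypothetical input with two quoted values
line = 'name="Ada" role="admin"'

# Greedy: .* extends to the LAST closing quote on the line
greedy = re.findall(r'"(.*)"', line)
# Lazy: .*? stops at the FIRST closing quote it can use
lazy = re.findall(r'"(.*?)"', line)

print(greedy)  # ['Ada" role="admin']
print(lazy)    # ['Ada', 'admin']
```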
Capturing Groups and Back-References
- Regex lets you check for patterns, but often you need to extract pieces of the match (e.g., IP vs port).
- Capturing groups, defined with (), let you isolate and retrieve substrings from a match.
- Named groups improve readability by giving meaningful labels instead of relying on group numbers.
- Non-capturing groups (?:…) let you apply grouping logic without cluttering captures.
- Back-references allow you to match the same text twice (or more) within one pattern.
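For the IP-versus-port case mentioned above, a minimal sketch might look like this. The log line, address, and port are hypothetical, and the pattern only checks the general shape of an IPv4 address rather than validating it.

```python
import re

# Hypothetical log line; the address and port are made up for illustration
line = "Connection from 10.0.0.5:8080 accepted"

# Group 1 captures the dotted address, group 2 the port after the colon
m = re.search(r"(\d+\.\d+\.\d+\.\d+):(\d+)", line)
ip, port = (m.group(1), m.group(2)) if m else (None, None)

print(ip, port)  # 10.0.0.5 8080
```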
Capturing Groups
- Parentheses () both group and capture the matched text inside them.
- Groups are numbered by their opening (, starting at 1; group 0 is the entire match.
- Use match.group(n) for a single group or match.groups() to get all captures as a tuple.
- Capturing is essential when you need to feed specific substrings into further processing.
import re
log_entry = "Ts=2023-10-27T12:00:00Z Level=ERROR User=admin Action=login_fail IP=10.0.0.5"
# Our goal:
# 1. Group 1: The log level
# 2. Group 2: The user name
# 3. Group 3: The IP address
pattern = r"Level=(\w+)\s+User=(\w+).*?\s+IP=([\d\.]+)"
match = re.search(pattern, log_entry)
if match:
    print(f"Full match: {match.group(0)}")
    print(f"Level: {match.group(1)}")
    print(f"User: {match.group(2)}")
    print(f"IP: {match.group(3)}")
    print(f"All groups: {match.groups()}")
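The rule that groups are numbered by their opening parenthesis matters most when groups are nested. The sketch below reuses the timestamp from the log entry; the nested pattern itself is an illustrative assumption.

```python
import re

stamp = "Ts=2023-10-27T12:00:00Z"

# The outer "(" opens first, so the whole date is group 1;
# year, month, and day open inside it and become groups 2, 3, and 4.
m = re.search(r"Ts=((\d{4})-(\d{2})-(\d{2}))", stamp)

whole = m.group(0)           # entire match, including "Ts="
all_groups = m.groups()      # groups 1..4 as a tuple
year_day = m.group(2, 4)     # several indices at once return a tuple

print(whole)       # Ts=2023-10-27
print(all_groups)  # ('2023-10-27', '2023', '10', '27')
print(year_day)    # ('2023', '27')
```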
Named Capturing Groups
- Syntax: (?P<name>pattern) assigns a label to a capturing group.
- Access by name: match.group('name') makes code self-documenting.
- match.groupdict() returns a dict of all named captures.
- You can still use numeric indices if needed, but names help avoid off-by-one errors.
import re
log_entry = "Ts=2023-10-27T12:00:00Z Level=ERROR User=admin Action=login_fail IP=10.0.0.5"
# Add labels to:
# 1. Group 1: The log level
# 2. Group 2: The user name
# 3. Group 3: The IP address
pattern = r"Level=(?P<level>\w+)\s+User=(?P<user>\w+).*?\s+IP=(?P<ip>[\d\.]+)"
match = re.search(pattern, log_entry)
if match:
    print(f"Full match: {match.group(0)}")
    print(f"Level: {match.group('level')}")
    print(f"User: {match.group('user')}")
    print(f"IP: {match.group('ip')}")
    print(f"All groups: {match.groups()}")
    print(f"Group dictionary: {match.groupdict()}")
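One way to use groupdict() is to pass the captures on as a ready-made record. The sketch below uses the same log entry and pattern; the record and same_value names are illustrative, and the match["name"] subscript form is the shorthand available on match objects since Python 3.6.

```python
import re

log_entry = "Ts=2023-10-27T12:00:00Z Level=ERROR User=admin Action=login_fail IP=10.0.0.5"
pattern = r"Level=(?P<level>\w+)\s+User=(?P<user>\w+).*?\s+IP=(?P<ip>[\d\.]+)"

match = re.search(pattern, log_entry)

# groupdict() maps each group name to its captured text
record = match.groupdict() if match else {}

# Named groups keep their numbers, so name and index give the same value
same_value = match["level"] == match.group(1)

print(record)      # {'level': 'ERROR', 'user': 'admin', 'ip': '10.0.0.5'}
print(same_value)  # True
```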
Non-Capturing Groups
- Use (?:pattern) when you need grouping for quantifiers or alternation without capturing.
- Keeps your capture numbers focused on what you actually want.
- Prevents unwanted None entries in match.groups() when using optional parts.
import re
log_line1 = "report.txt Status: OK"
log_line2 = "report Status: OK"
# Our goal:
# 1. Group 1: The stem of the filename, with .txt being an optional string
# 2. Group 2: The status code
pattern = r"^(.+?)(?:\.txt)?\s+Status:\s+(.+)$"
match_line1 = re.search(pattern, log_line1)
match_line2 = re.search(pattern, log_line2)
if match_line1: print(match_line1.groups())
if match_line2: print(match_line2.groups())
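To make the point about None entries concrete, this sketch runs the pattern above next to a variant whose optional part is a capturing group. The variant is an illustrative assumption added for comparison.

```python
import re

line = "report Status: OK"

# Optional part as a capturing group: it takes a slot even when absent
capturing = r"^(.+?)(\.txt)?\s+Status:\s+(.+)$"
# Optional part as a non-capturing group: no slot at all
non_capturing = r"^(.+?)(?:\.txt)?\s+Status:\s+(.+)$"

with_none = re.search(capturing, line).groups()
without_none = re.search(non_capturing, line).groups()

print(with_none)     # ('report', None, 'OK')
print(without_none)  # ('report', 'OK')
```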
Back-references
- Refer back to a previous capture using \1, \2, … or (?P=name) for named groups.
- Useful for matching repeated words or paired constructs (e.g., a closing tag that must repeat the opening tag's name).
- Can make patterns more complex but powerful for advanced text validation.
import re
text = "This this is a test test."
pattern_numbers = r"(?i)\b(\w+)\s+\1\b"
pattern_labels = r"(?i)\b(?P<word>\w+)\s+(?P=word)\b"
print(f"Doubled words: {re.findall(pattern_numbers, text)}")
print(f"Doubled words: {re.findall(pattern_labels, text)}")
html = "<b>Bold</b>"
pattern_tags = r"<(\w+)>(.*?)</\1>"
print(f"Tags: {re.findall(pattern_tags, html)}")
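Because the doubled-word pattern contains a group, re.findall() returns only the captured word, not the full repeated phrase. A small sketch, assuming you want the whole phrase, uses re.finditer() and group(0) instead:

```python
import re

text = "This this is a test test."
pattern = r"(?i)\b(\w+)\s+\1\b"

# findall returns the contents of group 1 only
captured_words = re.findall(pattern, text)
# finditer gives match objects, so group(0) recovers the full doubled phrase
doubled_phrases = [m.group(0) for m in re.finditer(pattern, text)]

print(captured_words)   # ['This', 'test']
print(doubled_phrases)  # ['This this', 'test test']
```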
Search, Split, and Substitute
- re.findall() and re.finditer() let you retrieve every occurrence of a pattern.
- re.split() handles complex delimiters beyond simple string splits.
- re.sub() performs powerful search-and-replace operations, including reuse of captured groups.
Finding All Matches
re.findall(pattern, string) returns a list of all non-overlapping matches:
- No groups → list of matched substrings.
- One group → list of that group's captured substrings.
- Several groups → list of tuples of captured substrings.
re.finditer(pattern, string) returns an iterator of match objects, giving access to .group(), positions, named groups, etc.; because it does not build the full list up front, it is more memory-efficient for large inputs.
import re
text = "Errors found: 404, 500, 403, 500. User IDs: user123, admin99."
config = "timeout=60 retries=3 workers=5"
# Find all numbers (error codes and the digits in user IDs):
print(f"Numbers found: {re.findall(r"\d+", text)}")
# findall with groups:
print(f"Key-value pairs: {re.findall(r"(\w+)=(\w+)", config)}")
# finditer
for match in re.finditer(r"(\w+)=(\w+)", config):
    print(f"Whole match: {match.group(0)}; key: {match.group(1)}; value: {match.group(2)} - at {match.start()}-{match.end()}")
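How the shape of the re.findall() result depends on the number of groups can be sketched with the same text. The three patterns are illustrative variations on one idea: lowercase letters followed by digits.

```python
import re

text = "Errors found: 404, 500, 403, 500. User IDs: user123, admin99."

# No groups: each item is the whole match
no_groups = re.findall(r"[a-z]+\d+", text)
# One group: each item is just that group's text
one_group = re.findall(r"[a-z]+(\d+)", text)
# Two groups: each item is a tuple with one entry per group
two_groups = re.findall(r"([a-z]+)(\d+)", text)

print(no_groups)   # ['user123', 'admin99']
print(one_group)   # ['123', '99']
print(two_groups)  # [('user', '123'), ('admin', '99')]
```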
Splitting Strings
- Use re.split(pattern, string) to break a string on a regex pattern, not just a fixed substring.
- Always use a raw string literal so backslashes reach the regex engine.
- Simple single-character delimiters: use a character class (never captured), e.g. r"\s*[,;]\s*".
- Complex delimiters (alternation or multi-character): group with non-capturing parentheses, e.g. r"\s*(?:foo|bar|baz)\s*", so they aren’t included in the result list.
- Including delimiters: wrap your delimiter in a capturing group, e.g. r"\s*([,;])\s*", to have the separators appear in the split output.
- Summary:
  - No parentheses or a non-capturing group → delimiters are removed.
  - Capturing group → delimiters appear in the split list.
import re
data = "item1 , item2; item3 ,item4 ;item5"
# 1. Split on comma and semi-colon
pattern1 = r"\s*[,;]\s*"
print(f"Character class split: {re.split(pattern1, data)}")
# 2. Capturing the delimiter
pattern2 = r"\s*([,;])\s*"
print(f"Capturing group split: {re.split(pattern2, data)}")
html = """
<b class='world'>Second paragraph.</b>
End.
"""
pattern3 = r"<.*?class='(?:hello|world)'.*?>|</[pb]>"
print(f"HTML non-capturing split: {re.split(pattern3, html)}")
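The word-delimiter pattern r"\s*(?:foo|bar|baz)\s*" from the list above can be tried directly. This sketch, with a made-up input string, compares it against the capturing variant:

```python
import re

# Hypothetical input using the words foo, bar, and baz as separators
data = "alpha foo beta bar gamma baz delta"

# Non-capturing alternation: the separator words are dropped
dropped = re.split(r"\s*(?:foo|bar|baz)\s*", data)
# Capturing alternation: the separator words are kept in the result
kept = re.split(r"\s*(foo|bar|baz)\s*", data)

print(dropped)  # ['alpha', 'beta', 'gamma', 'delta']
print(kept)     # ['alpha', 'foo', 'beta', 'bar', 'gamma', 'baz', 'delta']
```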
Substituting Text
- re.sub(pattern, replacement, string, count=0) replaces all matches, or only a limited number of them.
- count controls how many replacements to make (default 0 = all).
- Back-references (\1, \g<name>) let you reorder or reuse captured text in the replacement.
import re
text = "User IDs: user123, user456, user123457689. Contact admin789 for help."
# Basic substitution
redacted = re.sub(r"user\d+", "[REDACTED_USER]", text)
print(f"Result of redacting: {redacted}")
# Back-reference for reusing information
redacted_partially = re.sub(r"(u)ser\d+(\d{2})", r"\1[REDACTED_USER]\2", text)
print(f"Result of redacting: {redacted_partially}")
# Limited count of substitutions
redacted_only_two = re.sub(r"(u)ser\d+(\d{2})", r"\1[REDACTED_USER]\2", text, count=2)
print(f"Result of redacting: {redacted_only_two}")
# Named groups for substitution
date_text = "Start: 2023-10-27, End: 2024-01-15"
# Current format YYYY-MM-DD
# Target format DD/MM/YYYY
date_pattern_named = r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})"
replacement_format_named = r"\g<day>/\g<month>/\g<year>"
reformatted_date = re.sub(date_pattern_named, replacement_format_named, date_text)
print(f"Result of date transformation: {reformatted_date}")
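The \g<...> form also accepts group numbers, which helps when a back-reference is followed directly by a digit. The sketch below uses a made-up config string; the names scaled and ambiguous_ok are illustrative.

```python
import re

# Hypothetical input
text = "timeout=6 retries=3"

# \g<1>0 means "group 1, then a literal 0"
scaled = re.sub(r"(\d+)", r"\g<1>0", text)

# \10 is read as a reference to group 10, which this pattern does not have,
# so the replacement template is rejected
try:
    re.sub(r"(\d+)", r"\10", text)
    ambiguous_ok = True
except re.error:
    ambiguous_ok = False

print(scaled)        # timeout=60 retries=30
print(ambiguous_ok)  # False
```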