Regex Essentials: Overview

Regular expressions (regex) are a language for defining text search patterns.
Python’s re module provides functions like search (find anywhere) and match (anchored at start).
Patterns include literals, metacharacters (. ^ $ * + ? [] \), character classes (\d, \w, \s), and quantifiers (*, +, ?, {n,m}).
Greedy quantifiers (*, +) match as much as possible; non-greedy (*?, +?) as little as possible.

Introduction to `re.search()` vs `re.match()`

re.search(pattern, text) scans the entire string for the first occurrence.
re.match(pattern, text) checks only at the beginning of the string.
re.findall() and re.finditer() let you retrieve every occurrence of a pattern.
Always define regex patterns as raw strings (r"...") so that Python's string escapes do not interfere with the regex syntax.

import re

line = "WARN: Disk usage at 91%"
pattern = r"WARN"

print(f"search '{pattern}':", bool(re.search(pattern, line)))
print(f"match '{pattern}':", bool(re.match(pattern, line)))
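The raw-string advice above is easiest to see with \b: in an ordinary string literal, Python turns \b into a backspace character before the regex engine ever sees it. A minimal sketch (the sample sentence and variable names are illustrative, not part of the original material):

```python
import re

sentence = "The cat sat."  # illustrative sample text

# Raw string: the regex engine receives backslash + b, i.e. a word boundary.
raw_hits = re.findall(r"\bcat\b", sentence)

# Ordinary string: Python converts "\b" to a backspace (\x08) first,
# so the pattern looks for backspace + "cat" + backspace and finds nothing.
plain_hits = re.findall("\bcat\b", sentence)

print("raw string:", raw_hits)
print("ordinary string:", plain_hits)
```

The raw pattern finds the word, while the ordinary-string pattern silently matches nothing, which is why this kind of bug is easy to miss.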

Common Metacharacters

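. matches any character except a newline (unless re.DOTALL is set).
^ anchors at the start of the string (or of each line with re.MULTILINE).
$ anchors at the end of the string, or just before a trailing newline.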
[] defines a set or range of characters, e.g. [A-Z].
\ escapes metacharacters or introduces special sequences.

import re

test = "Error code: E1234. cxge"

print("Dot matches any character:", re.findall(r"c..e", test))  # ['code', 'cxge']
print("Start anchor (finds):", re.findall(r"^Error", test))  # ['Error']
print("Start anchor (does not find):", re.findall(r"^E1234", test))  # []
print("End anchor:", re.findall(r"cxge$", test))  # ['cxge']
print("Character set:", re.findall(r"[E0-9]+", test))  # ['E', 'E1234']
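The backslash entry in the list above (escaping a metacharacter) can be sketched the same way. The sample string and variable names here are hypothetical; re.escape() is the standard-library helper that adds the backslashes for you.

```python
import re

versions = "1x2 and 1.2"  # hypothetical sample text

# Unescaped dot: matches any character between 1 and 2.
loose = re.findall(r"1.2", versions)

# Escaped dot: matches only a literal full stop.
literal = re.findall(r"1\.2", versions)

# re.escape() builds the escaped pattern from plain text.
escaped_pattern = re.escape("1.2")
from_escape = re.findall(escaped_pattern, versions)

print("Unescaped dot:", loose)
print("Escaped dot:", literal)
print("re.escape pattern:", escaped_pattern, "->", from_escape)
```

re.escape() is mainly useful when the literal text comes from a variable or user input rather than being typed into the pattern by hand.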

Special Sequences

\d digit (0–9), \D non-digit.
\w word character (letters, digits, underscore), \W non-word.
\s whitespace, \S non-whitespace.
\b word boundary (zero-width match).

import re

text = "The cat scattered 1024 catalogues."

print("Digits:", re.findall(r"\d+", text))  # ['1024']
print("Word characters:", re.findall(r"\w+", text))  # ['The', 'cat', 'scattered', '1024', 'catalogues']
print("Whitespace:", re.findall(r"\s+", text))  # [' ', ' ', ' ', ' ']
print("Word boundary:", re.findall(r"\bcat\b", text))  # ['cat']
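The negated sequences \D, \W and \S are listed above but not exercised. A small sketch that reuses the same sample sentence (the variable names are illustrative) shows them next to a one-sided word boundary:

```python
import re

text = "The cat scattered 1024 catalogues."  # same sample sentence as above

non_digits = re.findall(r"\D+", text)      # runs of anything except digits
non_word = re.findall(r"\W+", text)        # spaces and punctuation
non_space = re.findall(r"\S+", text)       # whitespace-separated chunks
starts_with_cat = re.findall(r"\bcat\w*", text)  # boundary only on the left

print("Non-digits:", non_digits)
print("Non-word characters:", non_word)
print("Non-whitespace:", non_space)
print("Words starting with 'cat':", starts_with_cat)
```

Note that "scattered" is not picked up by the last pattern: its "cat" sits in the middle of a word, so there is no boundary in front of it.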

Quantifier Cheat-Sheet

Quantifier	Meaning	Greedy?	Non-greedy form	Non-greedy meaning
`?`	0 or 1 of the preceding token	Yes	`??`	as few as possible (0 or 1)
`*`	0 or more of the preceding token	Yes	`*?`	as few as possible (including zero)
`+`	1 or more of the preceding token	Yes	`+?`	as few as possible (at least one)
`{n}`	exactly n of the preceding token	-	-	-
`{n,}`	n or more of the preceding token	Yes	`{n,}?`	n or more, but as few as possible
`{n,m}`	between n and m of the preceding token	Yes	`{n,m}?`	between n and m, but as few as possible

import re

text = "aaaa"

print(re.findall(r"a?", text))
print(re.findall(r"a*", text))
print(re.findall(r"a+", text))
print(re.findall(r"a{2}", text))
print(re.findall(r"a{1,3}", text))

print("Non-greedy a*:", re.findall(r"a*?", text))
print("Non-greedy a+:", re.findall(r"a+?", text))
print("Non-greedy a{1,3}?:", re.findall(r"a{1,3}?", text))

Quantifiers & Greedy vs Non-Greedy

Quantifiers (?, *, +, {n,}, {n,m}) are greedy by default: each matches as much text as it can while still letting the overall pattern succeed.
Append ? (??, *?, +?, {n,}?, {n,m}?) to make them non-greedy (lazy): each then matches as little text as it can.

import re

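html = "<b>bold</b> and <i>italic</i>"  # illustrative sample input (assumed)

print("Greedy:", re.findall(r"<.*>", html))  # one match spanning the whole string
print("Non-greedy:", re.findall(r"<.*?>", html))  # ['<b>', '</b>', '<i>', '</i>']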

Capturing Groups and Back-References

Regex lets you check for patterns, but often you need to extract pieces of the match (e.g., IP vs port).
Capturing groups, defined with (), let you isolate and retrieve substrings from a match.
Named groups improve readability by giving meaningful labels instead of relying on group numbers.
Non-capturing groups (?:…) let you apply grouping logic without cluttering captures.
Back-references allow you to match the same text twice (or more) within one pattern.

Capturing Groups

Parentheses () both group and capture the matched text inside them.
Groups are numbered by their opening (, starting at 1; group 0 is the entire match.
Use match.group(n) for a single group or match.groups() to get all captures as a tuple.
Capturing is essential when you need to feed specific substrings into further processing.

import re

log_entry = "Ts=2023-10-27T12:00:00Z Level=ERROR User=admin Action=login_fail IP=10.0.0.5"

# Our goal:
# 1. Group 1: The log level
# 2. Group 2: The user name
# 3. Group 3: The IP address

pattern = r"Level=(\w+)\s+User=(\w+).*?\s+IP=([\d\.]+)"

match = re.search(pattern, log_entry)

if match:
    print(f"Full match: {match.group(0)}")
    print(f"Level: {match.group(1)}")
    print(f"User: {match.group(2)}")
    print(f"IP: {match.group(3)}")
    print(f"All groups: {match.groups()}")

Named Capturing Groups

Syntax: (?P<name>pattern) assigns a label to a capturing group.
Access by name: match.group('name') makes code self-documenting.
match.groupdict() returns a dict of all named captures.
You can still use numeric indices if needed, but names help avoid off-by-one errors.

import re

log_entry = "Ts=2023-10-27T12:00:00Z Level=ERROR User=admin Action=login_fail IP=10.0.0.5"

# Add labels to:
# 1. Group 1: The log level
# 2. Group 2: The user name
# 3. Group 3: The IP address

pattern = r"Level=(?P<level>\w+)\s+User=(?P<user>\w+).*?\s+IP=(?P<ip>[\d\.]+)"

match = re.search(pattern, log_entry)

if match:
    print(f"Full match: {match.group(0)}")
    print(f"Level: {match.group('level')}")
    print(f"User: {match.group('user')}")
    print(f"IP: {match.group('ip')}")
    print(f"All groups: {match.groups()}")
    print(f"Group dictionary: {match.groupdict()}")

Non-Capturing Groups

Use (?:pattern) when you need grouping for quantifiers or alternation without capturing.
Keeps your capture numbers focused on what you actually want.
Prevents unwanted None entries in match.groups() when using optional parts.

import re

log_line1 = "report.txt Status: OK"
log_line2 = "report Status: OK"

# Our goal:
# 1. Group 1: The stem of the filename, with .txt being an optional string
# 2. Group 2: The status code

pattern = r"^(.+?)(?:\.txt)?\s+Status:\s+(.+)$"

match_line1 = re.search(pattern, log_line1)
match_line2 = re.search(pattern, log_line2)

if match_line1: print(match_line1.groups())
if match_line2: print(match_line2.groups())
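The point about unwanted None entries can be checked by running the same line through a capturing and a non-capturing version of the optional part. This is a small sketch; the pattern and variable names are illustrative variations on the example above.

```python
import re

log_line2 = "report Status: OK"  # same sample line as above

# Optional part written as a capturing group: it gets its own slot.
capturing = r"^(.+?)(\.txt)?\s+Status:\s+(.+)$"
# Optional part written as a non-capturing group: no slot at all.
non_capturing = r"^(.+?)(?:\.txt)?\s+Status:\s+(.+)$"

with_capture = re.search(capturing, log_line2).groups()
without_capture = re.search(non_capturing, log_line2).groups()

print("Capturing optional group:", with_capture)
print("Non-capturing optional group:", without_capture)
```

When ".txt" is absent, the capturing version reports None in the middle of the tuple, so downstream code has to skip or test for it; the non-capturing version returns only the two values of interest.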

Back-references

Refer back to a previous capture using \1, \2, … or (?P=name) for named groups.
Useful for matching repeated words or paired constructs (e.g., an opening tag and its matching closing tag).
Can make patterns more complex but powerful for advanced text validation.

import re

text = "This this is a test test."
pattern_numbers = r"(?i)\b(\w+)\s+\1\b"
pattern_labels = r"(?i)\b(?P<word>\w+)\s+(?P=word)\b"

print(f"Doubled words: {re.findall(pattern_numbers, text)}")
print(f"Doubled words: {re.findall(pattern_labels, text)}")

html = "<b>Bold</b>"
pattern_tags = r"<(\w+)>(.*?)</\1>"

print(f"Tags: {re.findall(pattern_tags, html)}")

Search, Split, and Substitute

re.findall() and re.finditer() let you retrieve every occurrence of a pattern.
re.split() handles complex delimiters beyond simple string splits.
re.sub() performs powerful search-and-replace operations, including reuse of captured groups.

Finding All Matches

re.findall(pattern, string) returns a list of all non-overlapping matches:
- No groups → list of matched substrings.
- One group → list of that group's captured substrings.
- Several groups → list of tuples of captured substrings.
re.finditer(pattern, string) returns an iterator of match objects, giving access to .group(), positions, named groups, etc. Because it yields matches one at a time, it is more memory-efficient for large inputs.

import re

text = "Errors found: 404, 500, 403, 500. User IDs: user123, admin99."
config = "timeout=60 retries=3 workers=5"

# Find every run of digits (error codes and the digits inside the user IDs):
print("Numbers found:", re.findall(r"\d+", text))

# findall with groups:
print("Key-value pairs:", re.findall(r"(\w+)=(\w+)", config))

# finditer
for match in re.finditer(r"(\w+)=(\w+)", config):
    print(f"Whole match: {match.group(0)}; key: {match.group(1)}; value: {match.group(2)} - at {match.start()}-{match.end()}")
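How the shape of the re.findall() result depends on the number of groups is easy to overlook, so here is a short sketch on the same config string. The named groups key and value are illustrative choices, not names from the original.

```python
import re

config = "timeout=60 retries=3 workers=5"  # same sample string as above

no_groups = re.findall(r"\w+=\w+", config)       # whole matches
one_group = re.findall(r"\w+=(\w+)", config)     # only the captured values
two_groups = re.findall(r"(\w+)=(\w+)", config)  # tuples of captures

# finditer with named groups makes it easy to build a dict.
settings = {
    m.group("key"): m.group("value")
    for m in re.finditer(r"(?P<key>\w+)=(?P<value>\w+)", config)
}

print("No groups:", no_groups)
print("One group:", one_group)
print("Two groups:", two_groups)
print("Settings dict:", settings)
```

With exactly one group the result is a flat list of strings, not a list of one-element tuples.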

Splitting Strings

Use re.split(pattern, string) to break a string on a regex pattern, not just a fixed substring.
Always use a raw string literal so backslashes reach the regex engine.
Simple single-character delimiters: use a character class (never captured), e.g. r"\s*[,;]\s*".
Complex delimiters (alternation or multi-character): group with non-capturing parentheses, e.g. r"\s*(?:foo|bar|baz)\s*", so they aren’t included in the result list.
Including delimiters: wrap your delimiter in a capturing group, e.g. r"\s*([,;])\s*", to have the separators appear in the split output.
Summary:
- No parentheses or a non-capturing group → delimiters are removed.
- Capturing group → delimiters appear in the split list.

import re

data = "item1 , item2; item3 ,item4 ;item5"

# 1. Split on comma and semi-colon
pattern1 = r"\s*[,;]\s*"
print(f"Character class split: {re.split(pattern1, data)}")

# 2. Capturing the delimiter
pattern2 = r"\s*([,;])\s*"
print(f"Capturing group split: {re.split(pattern2, data)}")

html = """
<p class='hello'>First paragraph.</p>
<b class='world'>Second paragraph.</b>
End.
"""

pattern3 = r"<.*?class='(?:hello|world)'.*?>|</[pb]>"
print(f"HTML non-capturing split: {re.split(pattern3, html)}")

Substituting Text

re.sub(pattern, replacement, string, count=0) replaces all (or a limited number) of matches.
count controls how many replacements to make (default 0 = all).
Back-references (\1, \g<name>) let you reorder or reuse captured text in the replacement.

import re

text = "User IDs: user123, user456, user123457689. Contact admin789 for help."

# Basic substitution
redacted = re.sub(r"user\d+", "[REDACTED_USER]", text)
print(f"Result of redacting: {redacted}")

# Back-references keep the first letter (\1) and the last two digits (\2)
redacted_partially = re.sub(r"(u)ser\d+(\d{2})", r"\1[REDACTED_USER]\2", text)
print(f"Result of partial redacting: {redacted_partially}")

# Limited count of substitutions
redacted_only_two = re.sub(r"(u)ser\d+(\d{2})", r"\1[REDACTED_USER]\2", text, count=2)
print(f"Result of redacting the first two matches: {redacted_only_two}")

# Named groups for substitution
date_text = "Start: 2023-10-27, End: 2024-01-15"
# Current format YYYY-MM-DD
# Target format DD/MM/YYYY

date_pattern_named = r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})"
replacement_format_named = r"\g<day>/\g<month>/\g<year>"
reformatted_date = re.sub(date_pattern_named, replacement_format_named, date_text)

print(f"Result of date transformation: {reformatted_date}")
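One detail of back-references in replacements: when a numbered reference is followed directly by a digit, \1 becomes ambiguous, and the \g<1> form ends the reference explicitly. A minimal sketch (the sample string and variable names are hypothetical):

```python
import re

prices = "cost: 5, fee: 12"  # hypothetical sample text

# Append a literal 0 to every number. r"\10" would be read as a
# reference to group 10, so \g<1> is used to mark where the
# group number ends.
scaled = re.sub(r"(\d+)", r"\g<1>0", prices)
print(f"Result of appending a zero: {scaled}")

# The ambiguous form raises re.error, because the pattern has no group 10.
try:
    re.sub(r"(\d+)", r"\10", prices)
    ambiguous_failed = False
except re.error:
    ambiguous_failed = True
print(f"Ambiguous \\10 rejected: {ambiguous_failed}")
```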
