
Regex


Regex Essentials: Overview

  • Regular expressions (regex) are a language for defining text search patterns.
  • Python’s re module provides functions like search (find anywhere) and match (anchored at start).
  • Patterns include literals, metacharacters (. ^ $ * + ? { } [ ] \ | ( )), character classes (\d, \w, \s), and quantifiers (*, +, ?, {n,m}).
  • Greedy quantifiers (*, +) match as much as possible; non-greedy (*?, +?) as little as possible.

Introduction to re.search() vs re.match()

  • re.search(pattern, text) scans the entire string for the first occurrence.
  • re.match(pattern, text) checks only at the beginning of the string.
  • re.findall() and re.finditer() let you retrieve every occurrence of a pattern.
  • Prefer raw strings (r"...") for regex patterns, so that Python's own string escapes do not interfere with the regex escapes.
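import re

line = "WARN: Disk usage at 91%"

# "WARN" is at the start of the line, so search() and match() both succeed.
# "Disk" appears later in the line, so only search() finds it.
for pattern in (r"WARN", r"Disk"):
    print(f"search '{pattern}':", bool(re.search(pattern, line)))
    print(f"match '{pattern}':", bool(re.match(pattern, line)))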
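The raw-string advice can be checked with a small sketch. The sample text is made up for illustration. In a normal string literal, "\b" is Python's backspace character, so the word-boundary escape never reaches the regex engine.

```python
import re

text = "the cat sat"  # made-up sample text

# Without the r prefix, Python converts "\b" to a backspace character (\x08)
plain_pattern = "\bcat\b"
# With the r prefix, the backslashes reach the regex engine as word boundaries
raw_pattern = r"\bcat\b"

plain_result = re.findall(plain_pattern, text)
raw_result = re.findall(raw_pattern, text)

print("Without raw string:", plain_result)  # []
print("With raw string:", raw_result)       # ['cat']
```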

Common Metacharacters

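  • . matches any character except a newline (unless re.DOTALL is set).
  • ^ anchors at the start of the string (and at the start of each line with re.MULTILINE).
  • $ anchors at the end of the string or just before a trailing newline (and at the end of each line with re.MULTILINE).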
  • [] defines a set or range of characters, e.g. [A-Z].
  • \ escapes metacharacters or introduces special sequences.
import re

test = "Error code: E1234. cxge"

print("Dot matches any character:", re.findall(r"c..e", test))
print("Start anchor (finds):", re.findall(r"^Error", test))
print("Start anchor (does not find):", re.findall(r"^E1234", test))
print("End anchor:", re.findall(r"cxge$", test))
print("Character set:", re.findall(r"[E0-9]+", test))
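As a rough illustration of the escaping rule, the sketch below contrasts an unescaped dot with an escaped one. The sample text is invented; re.escape() is the standard-library helper that escapes a literal string for use inside a pattern.

```python
import re

text = "version 1.5 or 1x5"  # invented sample text

# Unescaped dot: matches any character between "1" and "5"
unescaped = re.findall(r"1.5", text)
# Escaped dot: matches only a literal "."
escaped = re.findall(r"1\.5", text)
# re.escape() adds the backslashes for you when the literal comes from data
auto_escaped = re.findall(re.escape("1.5"), text)

print("Unescaped:", unescaped)        # ['1.5', '1x5']
print("Escaped:", escaped)            # ['1.5']
print("re.escape():", auto_escaped)   # ['1.5']
```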

Special Sequences

  • \d digit (0–9, plus other Unicode decimal digits unless re.ASCII is set), \D non-digit.
  • \w word character (letters, digits, underscore), \W non-word.
  • \s whitespace, \S non-whitespace.
  • \b word boundary (zero-width match).
import re

text = "The cat scattered 1024 catalogues."

print("Digits:", re.findall(r"\d+", text))
print("Word characters:", re.findall(r"\w+", text))
print("Whitespace:", re.findall(r"\s+", text))
print("Word boundary:", re.findall(r"\bcat\b", text))
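The negated sequences (\D, \W, \S) and the effect of leaving out \b can be sketched the same way. The first sample string is an assumption chosen for illustration.

```python
import re

text = "id=42, name=Ann_B"  # assumed sample string

non_digits = re.findall(r"\D+", text)      # runs of anything except digits
non_word = re.findall(r"\W+", text)        # runs of punctuation and spaces
non_space = re.findall(r"\S+", text)       # whitespace-separated tokens

print("Non-digits:", non_digits)      # ['id=', ', name=Ann_B']
print("Non-word:", non_word)          # ['=', ', ', '=']
print("Non-whitespace:", non_space)   # ['id=42,', 'name=Ann_B']

# Without \b, "cat" is also found inside "scattered" and "catalogues"
sentence = "The cat scattered 1024 catalogues."
unbounded = re.findall(r"cat", sentence)
print("Without word boundary:", len(unbounded), "matches")  # 3 matches
```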

Quantifier Cheat-Sheet

Quantifier  Meaning                                 Greedy?  Non-greedy form  Non-greedy meaning
?           0 or 1 of the preceding token           Yes      ??               as few as possible (0 or 1)
*           0 or more of the preceding token        Yes      *?               as few as possible (including zero)
+           1 or more of the preceding token        Yes      +?               as few as possible (at least one)
{n}         exactly n of the preceding token        -        -                -
{n,}        n or more of the preceding token        Yes      {n,}?            n or more, but as few as possible
{n,m}       between n and m of the preceding token  Yes      {n,m}?           between n and m, but as few as possible
import re

text = "aaaa"

print("a?:", re.findall(r"a?", text))
print("a*:", re.findall(r"a*", text))
print("a+:", re.findall(r"a+", text))
print("a{2}:", re.findall(r"a{2}", text))
print("a{1,3}:", re.findall(r"a{1,3}", text))

print("Non-greedy a*?:", re.findall(r"a*?", text))
print("Non-greedy a+?:", re.findall(r"a+?", text))
print("Non-greedy a{1,3}?:", re.findall(r"a{1,3}?", text))
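The cheat-sheet rows are easier to read against slightly more realistic input. The following is a small sketch with invented sample strings, using \b so that each match has to cover a whole word or number.

```python
import re

words = "color colour colouur"      # invented sample
codes = "7 42 123 4567 89012"       # invented sample

# ? makes the "u" optional: zero or one occurrence
optional_u = re.findall(r"\bcolou?r\b", words)
# {2,4} accepts numbers with two to four digits
two_to_four = re.findall(r"\b\d{2,4}\b", codes)
# {3,} accepts numbers with three or more digits
three_plus = re.findall(r"\b\d{3,}\b", codes)

print("colou?r:", optional_u)     # ['color', 'colour']
print(r"\d{2,4}:", two_to_four)   # ['42', '123', '4567']
print(r"\d{3,}:", three_plus)     # ['123', '4567', '89012']
```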

Quantifiers & Greedy vs Non-Greedy

  • *, +, ? and {n,m} are greedy by default: they match as much as possible while still letting the rest of the pattern match.
  • Appending ? (*?, +?, ??, {n,m}?) makes them non-greedy (lazy): they match as little as possible.
import re

# Illustrative sample input: two tag pairs on one line
html = "<p>first</p><p>second</p>"

print("Greedy:", re.findall(r"<.*>", html))
print("Non-greedy:", re.findall(r"<.*?>", html))
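A common alternative to a lazy quantifier, not used in the notes above, is a negated character class that cannot run past the closing bracket. The sketch below compares the three forms on an invented one-line snippet.

```python
import re

html = "<b>bold</b> and <i>italic</i>"  # invented sample

greedy_tags = re.findall(r"<.*>", html)       # runs to the last ">"
lazy_tags = re.findall(r"<.*?>", html)        # stops at the first ">"
negated_tags = re.findall(r"<[^>]*>", html)   # cannot cross a ">" at all

print("Greedy:", greedy_tags)     # ['<b>bold</b> and <i>italic</i>']
print("Lazy:", lazy_tags)         # ['<b>', '</b>', '<i>', '</i>']
print("Negated class:", negated_tags)
```

For this input the lazy and negated-class patterns return the same list; the negated class simply states the stopping condition explicitly.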

Capturing Groups and Back-References

  • Regex lets you check for patterns, but often you need to extract pieces of the match (e.g., IP vs port).
  • Capturing groups, defined with (), let you isolate and retrieve substrings from a match.
  • Named groups improve readability by giving meaningful labels instead of relying on group numbers.
  • Non-capturing groups (?:…) let you apply grouping logic without cluttering captures.
  • Back-references allow you to match the same text twice (or more) within one pattern.
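To make the "IP vs port" case concrete, here is a minimal sketch. The log line and its format are hypothetical, and the pattern only checks the shape of the address, not whether each octet is in range.

```python
import re

line = "Connection from 192.168.1.20:8080 accepted"  # hypothetical log line

# Group 1 captures the IP, group 2 the port.
# (?:...) repeats the ".octet" part three times without creating extra captures.
pattern = r"(\d{1,3}(?:\.\d{1,3}){3}):(\d+)"

match = re.search(pattern, line)
ip, port = match.groups() if match else (None, None)

print("IP:", ip)      # 192.168.1.20
print("Port:", port)  # 8080
```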

Capturing Groups

  • Parentheses () both group and capture the matched text inside them.
  • Groups are numbered by their opening (, starting at 1; group 0 is the entire match.
  • Use match.group(n) for a single group or match.groups() to get all captures as a tuple.
  • Capturing is essential when you need to feed specific substrings into further processing.
import re

log_entry = "Ts=2023-10-27T12:00:00Z Level=ERROR User=admin Action=login_fail IP=10.0.0.5"

# Our goal:
# 1. Group 1: The log level
# 2. Group 2: The user name
# 3. Group 3: The IP address

pattern = r"Level=(\w+)\s+User=(\w+).*?\s+IP=([\d\.]+)"

match = re.search(pattern, log_entry)

if match:
    print(f"Full match: {match.group(0)}")
    print(f"Level: {match.group(1)}")
    print(f"User: {match.group(2)}")
    print(f"IP: {match.group(3)}")
    print(f"All groups: {match.groups()}")
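The numbering rule (groups are counted by their opening parenthesis) matters most when groups are nested. A short sketch, reusing the timestamp format from the log entry above:

```python
import re

timestamp = "2023-10-27T12:00:00Z"

# Group 1 opens first and wraps the whole date;
# groups 2, 3 and 4 open inside it, in left-to-right order.
pattern = r"((\d{4})-(\d{2})-(\d{2}))T"

match = re.search(pattern, timestamp)
groups = match.groups()

print("Group 0 (entire match):", match.group(0))  # 2023-10-27T
print("All groups:", groups)  # ('2023-10-27', '2023', '10', '27')
```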

Named Capturing Groups

  • Syntax: (?P<name>pattern) assigns a label to a capturing group.
  • Access by name: match.group('name') makes code self-documenting.
  • match.groupdict() returns a dict of all named captures.
  • You can still use numeric indices if needed, but names help avoid off-by-one errors.
import re

log_entry = "Ts=2023-10-27T12:00:00Z Level=ERROR User=admin Action=login_fail IP=10.0.0.5"

# Add labels to:
# 1. Group 1: The log level
# 2. Group 2: The user name
# 3. Group 3: The IP address

pattern = r"Level=(?P<level>\w+)\s+User=(?P<user>\w+).*?\s+IP=(?P<ip>[\d\.]+)"

match = re.search(pattern, log_entry)

if match:
    print(f"Full match: {match.group(0)}")
    print(f"Level: {match.group('level')}")
    print(f"User: {match.group('user')}")
    print(f"IP: {match.group('ip')}")
    print(f"All groups: {match.groups()}")
    print(f"Group dictionary: {match.groupdict()}")
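Named groups pair naturally with groupdict() when several lines have to be turned into records. The sketch below assumes a simplified two-line log in the same key=value style; re.compile() is used only so the pattern can be reused.

```python
import re

# Assumed sample: two simplified log lines
log = """Level=ERROR User=admin IP=10.0.0.5
Level=INFO User=guest IP=10.0.0.9"""

pattern = re.compile(
    r"Level=(?P<level>\w+)\s+User=(?P<user>\w+)\s+IP=(?P<ip>[\d.]+)"
)

# One dict per matching line, keyed by group name
records = [m.groupdict() for m in pattern.finditer(log)]

# Match objects also support subscript access by group name
first = pattern.search(log)
first_user = first["user"]

for record in records:
    print(record)
print("First user:", first_user)  # admin
```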

Non-Capturing Groups

  • Use (?:pattern) when you need grouping for quantifiers or alternation without capturing.
  • Keeps your capture numbers focused on what you actually want.
  • Prevents unwanted None entries in match.groups() when using optional parts.
import re

log_line1 = "report.txt Status: OK"
log_line2 = "report Status: OK"

# Our goal:
# 1. Group 1: The filename stem, without the optional .txt extension
# 2. Group 2: The status text

pattern = r"^(.+?)(?:\.txt)?\s+Status:\s+(.+)$"

match_line1 = re.search(pattern, log_line1)
match_line2 = re.search(pattern, log_line2)

if match_line1: print(match_line1.groups())
if match_line2: print(match_line2.groups())
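The point about unwanted None entries can be seen by writing the optional part once as a capturing group and once as a non-capturing group. This is a sketch on the extension-less line from the example above.

```python
import re

line = "report Status: OK"

capturing = r"^(.+?)(\.txt)?\s+Status:\s+(.+)$"
non_capturing = r"^(.+?)(?:\.txt)?\s+Status:\s+(.+)$"

# The optional capturing group did not take part in the match, so it yields None
with_capture = re.search(capturing, line).groups()
# The non-capturing version leaves only the two groups of interest
without_capture = re.search(non_capturing, line).groups()

print("Capturing:", with_capture)         # ('report', None, 'OK')
print("Non-capturing:", without_capture)  # ('report', 'OK')
```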

Back-references

  • Refer back to a previous capture using \1, \2, … or (?P=name) for named groups.
  • Useful for matching repeated words or paired constructs such as open/close tags (arbitrarily nested structures are beyond what a back-reference can check).
  • Can make patterns more complex but powerful for advanced text validation.
import re

text = "This this is a test test."
pattern_numbers = r"(?i)\b(\w+)\s+\1\b"
pattern_labels = r"(?i)\b(?P<word>\w+)\s+(?P=word)\b"

print(f"Doubled words: {re.findall(pattern_numbers, text)}")
print(f"Doubled words: {re.findall(pattern_labels, text)}")

html = "<b>Bold</b>"
pattern_tags = r"<(\w+)>(.*?)</\1>"

print(f"Tags: {re.findall(pattern_tags, html)}")
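Another typical use of a back-reference is requiring that a string closes with the same quote character it opened with. A minimal sketch on an invented snippet:

```python
import re

code = """name = "O'Brien"; nick = 'Bob'"""  # invented sample

# Group 1 captures the opening quote; \1 requires the same character to close it,
# so the apostrophe inside "O'Brien" does not end the match early.
pattern = r"""(['"])(.*?)\1"""

quoted = [m.group(2) for m in re.finditer(pattern, code)]
print("Quoted strings:", quoted)  # ["O'Brien", 'Bob']
```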

Search, Split, and Substitute

  • re.findall() and re.finditer() let you retrieve every occurrence of a pattern.
  • re.split() handles complex delimiters beyond simple string splits.
  • re.sub() performs powerful search-and-replace operations, including reuse of captured groups.

Finding All Matches

  • re.findall(pattern, string) returns a list of all non-overlapping matches:
    • No groups → list of matched substrings.
    • One group → list of the substrings captured by that group.
    • Several groups → list of tuples of captured substrings.
  • re.finditer(pattern, string) returns an iterator of match objects, giving access to .group(), positions, named groups, etc., and is more memory-efficient for large inputs.
import re

text = "Errors found: 404, 500, 403, 500. User IDs: user123, admin99."
config = "timeout=60 retries=3 workers=5"

# Find all numbers (error codes and the digits inside user IDs):
print("Numbers found:", re.findall(r"\d+", text))

# findall with groups:
print("Key-value pairs:", re.findall(r"(\w+)=(\w+)", config))

# finditer
for match in re.finditer(r"(\w+)=(\w+)", config):
    print(f"Whole match: {match.group(0)}; key: {match.group(1)}; value: {match.group(2)} - at {match.start()}-{match.end()}")
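How the return shape of re.findall() depends on the number of groups can be sketched on the same config string:

```python
import re

config = "timeout=60 retries=3 workers=5"

no_groups = re.findall(r"\w+=\w+", config)       # whole matches
one_group = re.findall(r"\w+=(\w+)", config)     # only the captured values
two_groups = re.findall(r"(\w+)=(\w+)", config)  # (key, value) tuples

print("No groups:", no_groups)    # ['timeout=60', 'retries=3', 'workers=5']
print("One group:", one_group)    # ['60', '3', '5']
print("Two groups:", two_groups)  # [('timeout', '60'), ('retries', '3'), ('workers', '5')]

# A list of (key, value) tuples converts directly to a dict
settings = dict(two_groups)
print("As dict:", settings)
```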

Splitting Strings

  • Use re.split(pattern, string) to break a string on a regex pattern, not just a fixed substring.
  • Always use a raw string literal so backslashes reach the regex engine.
  • Simple single-character delimiters: use a character class (never captured), e.g. r"\s*[,;]\s*".
  • Complex delimiters (alternation or multi-character): group with non-capturing parentheses, e.g. r"\s*(?:foo|bar|baz)\s*", so they aren’t included in the result list.
  • Including delimiters: wrap your delimiter in a capturing group, e.g. r"\s*([,;])\s*", to have the separators appear in the split output.
  • Summary:
    • No parentheses or a non-capturing group → delimiters are removed.
    • Capturing group → delimiters appear in the split list.
import re

data = "item1 , item2; item3 ,item4 ;item5"

# 1. Split on comma and semi-colon
pattern1 = r"\s*[,;]\s*"
print(f"Character class split: {re.split(pattern1, data)}")

# 2. Capturing the delimiter
pattern2 = r"\s*([,;])\s*"
print(f"Capturing group split: {re.split(pattern2, data)}")

# 3. Non-capturing group inside a more complex delimiter
html = """
<b class='world'>Second paragraph.</b>
End.
"""

pattern3 = r"<.*?class='(?:hello|world)'.*?>|</[pb]>"
print(f"HTML non-capturing split: {re.split(pattern3, html)}")
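The difference between capturing and non-capturing delimiters also applies to multi-character alternations. A short sketch with invented word delimiters:

```python
import re

text = "alpha and beta or gamma"  # invented sample

# Non-capturing alternation: the delimiters are dropped
without_delims = re.split(r"\s+(?:and|or)\s+", text)
# Capturing alternation: the delimiters appear in the result
with_delims = re.split(r"\s+(and|or)\s+", text)

print("Non-capturing:", without_delims)  # ['alpha', 'beta', 'gamma']
print("Capturing:", with_delims)         # ['alpha', 'and', 'beta', 'or', 'gamma']
```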

Substituting Text

  • re.sub(pattern, replacement, string, count=0) replaces all (or a limited number) of matches.
  • count controls how many replacements to make (default 0 = all).
  • Back-references (\1, \g<name>) let you reorder or reuse captured text in the replacement.
import re

text = "User IDs: user123, user456, user123457689. Contact admin789 for help."

# Basic substitution
redacted = re.sub(r"user\d+", "[REDACTED_USER]", text)
print(f"Result of redacting: {redacted}")

# Back-references reuse captured text: keep the leading "u" and the last two digits
redacted_partially = re.sub(r"(u)ser\d+(\d{2})", r"\1[REDACTED_USER]\2", text)
print(f"Result of partial redaction: {redacted_partially}")

# Limited count of substitutions
redacted_only_two = re.sub(r"(u)ser\d+(\d{2})", r"\1[REDACTED_USER]\2", text, count=2)
print(f"Result of redacting only the first two: {redacted_only_two}")

# Named groups for substitution
date_text = "Start: 2023-10-27, End: 2024-01-15"
# Current format YYYY-MM-DD
# Target format DD/MM/YYYY

date_pattern_named = r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})"
replacement_format_named = r"\g<day>/\g<month>/\g<year>"
reformatted_date = re.sub(date_pattern_named, replacement_format_named, date_text)

print(f"Result of date transformation: {reformatted_date}")
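The \g<...> form also accepts group numbers, which helps when a back-reference is followed directly by a digit. A small sketch with an invented input:

```python
import re

text = "item7 item8"  # invented sample

# Writing r"item\10" would be read as a reference to group 10, not group 1 plus "0".
# \g<1>0 states the intent explicitly: group 1, then a literal "0".
padded = re.sub(r"item(\d)", r"item\g<1>0", text)

print("Result:", padded)  # item70 item80
```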