
Regex

Regex Essentials: Overview

  • Regular expressions (regex) are a language for defining text search patterns.
  • Python’s re module provides functions like search (find anywhere) and match (anchored at start).
  • Patterns include literals, metacharacters (. ^ $ * + ? [] \), character classes (\d, \w, \s), and quantifiers (*, +, ?, {n,m}).
  • Greedy quantifiers (*, +) match as much as possible; non-greedy (*?, +?) as little as possible.

Introduction to re.search() vs re.match()

  • re.search(pattern, text) scans the entire string for the first occurrence.
  • re.match(pattern, text) checks only at the beginning of the string.
  • re.findall() and re.finditer() retrieve every occurrence of a pattern rather than only the first.
  • Always write regex patterns as raw strings (r"...") so that Python’s string escape processing does not interfere with the backslashes the regex engine needs to see.
import re

line = "WARN: Disk usage at 91%"
pattern = r"WARN"

print(f"search '{pattern}':", bool(re.search(pattern, line)))
print(f"match '{pattern}':", bool(re.match(pattern, line)))
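
Because "WARN" sits at the very start of that line, both calls above succeed. A minimal sketch of the difference, assuming the same sample line but a pattern ("Disk") that occurs mid-string; re.fullmatch() is included for comparison and is not part of the original notes:

```python
import re

line = "WARN: Disk usage at 91%"

# "Disk" occurs in the middle of the line, not at position 0.
found_anywhere = re.search(r"Disk", line)
found_at_start = re.match(r"Disk", line)

# fullmatch() succeeds only if the pattern covers the entire string.
whole_line = re.fullmatch(r"WARN: .*", line)

print("search:", bool(found_anywhere))   # True
print("match:", bool(found_at_start))    # False
print("fullmatch:", bool(whole_line))    # True
```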

Common Metacharacters

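  • . matches any character except a newline (with the re.DOTALL flag it matches newlines too).
  • ^ anchors at the start of the string (and at the start of each line with re.MULTILINE).
  • $ anchors at the end of the string or just before a trailing newline (and at the end of each line with re.MULTILINE).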
  • [] defines a set or range of characters, e.g. [A-Z].
  • \ escapes metacharacters or introduces special sequences.
import re

test = "Error code: E1234. cxge"

# Single quotes inside the f-strings keep this valid on Python versions before 3.12.
print(f"Dot matches any character: {re.findall(r'c..e', test)}")
print(f"Start anchor (finds): {re.findall(r'^Error', test)}")
print(f"Start anchor (does not find): {re.findall(r'^E1234', test)}")
print(f"End anchor: {re.findall(r'cxge$', test)}")
print(f"Character set: {re.findall(r'[E0-9]+', test)}")
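
The escaping bullet is not exercised by the example above. The following sketch contrasts an unescaped dot with an escaped one; the sample strings are made up for illustration, and re.escape() is a standard-library helper that the notes do not mention:

```python
import re

sample = "E1234. E1234x"

# Unescaped, "." is a wildcard, so the pattern also matches "E1234x".
wildcard_hits = re.findall(r"E1234.", sample)
# Escaped, "\." matches only a literal dot.
literal_hits = re.findall(r"E1234\.", sample)

# re.escape() escapes the metacharacters in a literal string for you.
escaped = re.escape("1.5+2")
escaped_hits = re.findall(escaped, "1.5+2 and 1x5+2")

print(wildcard_hits)  # ['E1234.', 'E1234x']
print(literal_hits)   # ['E1234.']
print(escaped_hits)   # ['1.5+2']
```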

Special Sequences

  • \d digit (0–9), \D non-digit.
  • \w word character (letters, digits, underscore), \W non-word.
  • \s whitespace, \S non-whitespace.
  • \b word boundary (zero-width match).
import re

text = "The cat scattered 1024 catalogues."

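# Backslashes are not allowed inside f-string expressions before Python 3.12,
# so the results are passed to print() as separate arguments.
print("Digits:", re.findall(r"\d+", text))
print("Word characters:", re.findall(r"\w+", text))
print("Whitespace:", re.findall(r"\s+", text))
print("Word boundary:", re.findall(r"\bcat\b", text))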
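
To see what \b actually changes, it can help to run the same search with and without it. This is a small sketch reusing the sample sentence above; the extra patterns are illustrative choices:

```python
import re

text = "The cat scattered 1024 catalogues."

# Without boundaries, "cat" is also found inside "scattered" and "catalogues".
no_boundary = re.findall(r"cat", text)
# With boundaries on both sides, only the standalone word matches.
whole_word = re.findall(r"\bcat\b", text)
# A boundary on the left only: words that start with "cat".
prefix_words = re.findall(r"\bcat\w*", text)
# \D is the complement of \d: runs of non-digit characters.
non_digits = re.findall(r"\D+", text)

print(no_boundary)   # ['cat', 'cat', 'cat']
print(whole_word)    # ['cat']
print(prefix_words)  # ['cat', 'catalogues']
print(non_digits)    # ['The cat scattered ', ' catalogues.']
```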

Quantifier Cheat-Sheet

Quantifier | Meaning                                | Greedy? | Non-greedy form | Non-greedy meaning
?          | 0 or 1 of the preceding token          | Yes     | ??              | as few as possible (0 or 1)
*          | 0 or more of the preceding token       | Yes     | *?              | as few as possible (including zero)
+          | 1 or more of the preceding token       | Yes     | +?              | as few as possible (at least one)
{n}        | exactly n of the preceding token       | -       | -               | -
{n,}       | n or more of the preceding token       | Yes     | {n,}?           | n or more, but as few as possible
{n,m}      | between n and m of the preceding token | Yes     | {n,m}?          | between n and m, but as few as possible
import re

text = "aaaa"

print(re.findall(r"a?", text))
print(re.findall(r"a*", text))
print(re.findall(r"a+", text))
print(re.findall(r"a{2}", text))
print(re.findall(r"a{1,3}", text))

# Labels are passed as plain strings so the code also runs on Python versions before 3.12.
print("Non-greedy a*:", re.findall(r"a*?", text))
print("Non-greedy a+:", re.findall(r"a+?", text))
print("Non-greedy a{1,3}?:", re.findall(r"a{1,3}?", text))

Quantifiers & Greedy vs Non-Greedy

  • ?, *, +, {n,}, and {n,m} are greedy by default: each matches as much text as it can while still letting the overall pattern succeed.
  • Appending ? (??, *?, +?, {n,}?, {n,m}?) makes the quantifier non-greedy (lazy): it matches as little text as it can.
import re

html = "<p>One</p><p>Two</p><></>"

# Single quotes inside the f-strings keep this valid on Python versions before 3.12.
print(f"Greedy: {re.findall(r'<.*>', html)}")
print(f"Non-greedy: {re.findall(r'<.*?>', html)}")

Capturing Groups and Back-References

  • Regex lets you check for patterns, but often you need to extract pieces of the match (e.g., IP vs port).
  • Capturing groups, defined with (), let you isolate and retrieve substrings from a match.
  • Named groups improve readability by giving meaningful labels instead of relying on group numbers.
  • Non-capturing groups (?:…) let you apply grouping logic without cluttering captures.
  • Back-references allow you to match the same text twice (or more) within one pattern.

Capturing Groups

  • Parentheses () both group and capture the matched text inside them.
  • Groups are numbered by their opening (, starting at 1; group 0 is the entire match.
  • Use match.group(n) for a single group or match.groups() to get all captures as a tuple.
  • Capturing is essential when you need to feed specific substrings into further processing.
import re

log_entry = "Ts=2023-10-27T12:00:00Z Level=ERROR User=admin Action=login_fail IP=10.0.0.5"

# Our goal:
# 1. Group 1: The log level
# 2. Group 2: The user name
# 3. Group 3: The IP address

pattern = r"Level=(\w+)\s+User=(\w+).*?\s+IP=([\d\.]+)"

match = re.search(pattern, log_entry)

if match:
    print(f"Full match: {match.group(0)}")
    print(f"Level: {match.group(1)}")
    print(f"User: {match.group(2)}")
    print(f"IP: {match.group(3)}")
    print(f"All groups: {match.groups()}")
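
The numbering rule (groups are counted by their opening parenthesis) matters most when groups are nested. A minimal sketch, assuming a timestamp fragment like the one in the log entry above; the nested pattern is an illustration, not part of the original notes:

```python
import re

stamp = "Ts=2023-10-27T12:00:00Z"

# Group 1 opens first and wraps the whole date;
# groups 2, 3 and 4 open inside it, in left-to-right order.
m = re.search(r"Ts=((\d{4})-(\d{2})-(\d{2}))", stamp)

if m:
    print(m.group(1))     # 2023-10-27
    print(m.group(2))     # 2023
    print(m.group(2, 3))  # ('2023', '10')  several groups in one call
    print(m.groups())     # ('2023-10-27', '2023', '10', '27')
```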

Named Capturing Groups

  • Syntax: (?P<name>pattern) assigns a label to a capturing group.
  • Access by name: match.group('name') makes code self-documenting.
  • match.groupdict() returns a dict of all named captures.
  • You can still use numeric indices if needed, but names help avoid off-by-one errors.
import re

log_entry = "Ts=2023-10-27T12:00:00Z Level=ERROR User=admin Action=login_fail IP=10.0.0.5"

# Add labels to:
# 1. Group 1: The log level
# 2. Group 2: The user name
# 3. Group 3: The IP address

pattern = r"Level=(?P<level>\w+)\s+User=(?P<user>\w+).*?\s+IP=(?P<ip>[\d\.]+)"

match = re.search(pattern, log_entry)

if match:
    print(f"Full match: {match.group(0)}")
    print(f"Level: {match.group('level')}")
    print(f"User: {match.group('user')}")
    print(f"IP: {match.group('ip')}")
    print(f"All groups: {match.groups()}")
    print(f"Group dictionary: {match.groupdict()}")
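
One place groupdict() pays off is when turning several log lines into records. The sketch below assumes the same line format as above; the second and third lines are invented sample data, and precompiling with re.compile() is a choice made here rather than something the notes prescribe:

```python
import re

LOG_PATTERN = re.compile(
    r"Level=(?P<level>\w+)\s+User=(?P<user>\w+).*?\s+IP=(?P<ip>[\d.]+)"
)

# Only the first line comes from the notes; the others are made-up samples.
log_lines = [
    "Ts=2023-10-27T12:00:00Z Level=ERROR User=admin Action=login_fail IP=10.0.0.5",
    "Ts=2023-10-27T12:00:05Z Level=INFO User=guest Action=login_ok IP=10.0.0.9",
    "malformed line without the expected fields",
]

records = []
for line in log_lines:
    match = LOG_PATTERN.search(line)
    if match:  # skip lines that do not fit the pattern
        records.append(match.groupdict())

for record in records:
    print(record["level"], record["user"], record["ip"])
```

Each record is an ordinary dict keyed by group name, so downstream code never has to know the group order.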

Non-Capturing Groups

  • Use (?:pattern) when you need grouping for quantifiers or alternation without capturing.
  • Keeps your capture numbers focused on what you actually want.
  • Prevents unwanted None entries in match.groups() when using optional parts.
import re

log_line1 = "report.txt Status: OK"
log_line2 = "report Status: OK"

# Our goal:
# 1. Group 1: The stem of the filename, with .txt being an optional string
# 2. Group 2: The status code

pattern = r"^(.+?)(?:\.txt)?\s+Status:\s+(.+)$"

match_line1 = re.search(pattern, log_line1)
match_line2 = re.search(pattern, log_line2)

if match_line1:
    print(match_line1.groups())
if match_line2:
    print(match_line2.groups())
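
The point about unwanted None entries is easier to see side by side. A small sketch, assuming the same "report Status: OK" line; the capturing variant of the pattern is added here only for contrast:

```python
import re

log_line = "report Status: OK"

capturing = r"^(.+?)(\.txt)?\s+Status:\s+(.+)$"
non_capturing = r"^(.+?)(?:\.txt)?\s+Status:\s+(.+)$"

# The optional capturing group did not take part in the match, so it yields None.
with_capture = re.search(capturing, log_line).groups()
# The non-capturing version reports only the two groups we care about.
without_capture = re.search(non_capturing, log_line).groups()

print(with_capture)     # ('report', None, 'OK')
print(without_capture)  # ('report', 'OK')
```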

Back-references

  • Refer back to a previous capture using \1, \2, … or (?P=name) for named groups.
  • Useful for matching repeated words or paired constructs such as an opening tag and its matching closing tag (back-references alone cannot handle arbitrarily nested structures).
  • Can make patterns more complex but powerful for advanced text validation.
import re

text = "This this is a test test."
pattern_numbers = r"(?i)\b(\w+)\s+\1\b"
pattern_labels = r"(?i)\b(?P<word>\w+)\s+(?P=word)\b"

print(f"Doubled words: {re.findall(pattern_numbers, text)}")
print(f"Doubled words: {re.findall(pattern_labels, text)}")

html = "<p>Paragraph</p> <b>Bold</b>"
pattern_tags = r"<(\w+)>(.*?)</\1>"

print(f"Tags: {re.findall(pattern_tags, html)}")
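
Another common use of a back-reference is requiring a closing quote to match the opening one. This sketch uses a named back-reference; the input string and the group names quote and value are hypothetical:

```python
import re

text = '''name="alice" role='admin' note="it's fine"'''

# (?P=quote) must repeat whichever quote character opened the value,
# so the apostrophe inside "it's fine" does not end the match early.
pattern = r"""(?P<quote>["'])(?P<value>.*?)(?P=quote)"""

values = [m.group("value") for m in re.finditer(pattern, text)]

print(values)  # ['alice', 'admin', "it's fine"]
```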

Search, Split, and Substitute

  • re.findall() and re.finditer() let you retrieve every occurrence of a pattern.
  • re.split() handles complex delimiters beyond simple string splits.
  • re.sub() performs powerful search-and-replace operations, including reuse of captured groups.

Finding All Matches

  • re.findall(pattern, string) returns a list of all non-overlapping matches:
    • No groups → list of matched substrings.
    • One group → list of the substrings captured by that group.
    • Several groups → list of tuples of captured substrings.
  • re.finditer(pattern, string) returns an iterator of match objects, giving access to .group(), positions, named groups, etc., and is more memory-efficient for large inputs.
import re

text = "Errors found: 404, 500, 403, 500. User IDs: user123, admin99."
config = "timeout=60 retries=3 workers=5"

# Find all numbers (the error codes and the digits inside the user IDs):
print("Numbers found:", re.findall(r"\d+", text))

# findall with groups:
print("Key-value pairs:", re.findall(r"(\w+)=(\w+)", config))

# finditer
for match in re.finditer(r"(\w+)=(\w+)", config):
    print(f"Whole match: {match.group(0)}; key: {match.group(1)}; value: {match.group(2)} - at {match.start()}-{match.end()}")
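
How the return shape of re.findall() depends on the number of groups can be sketched with the same config string. The variable names are illustrative:

```python
import re

config = "timeout=60 retries=3 workers=5"

no_groups = re.findall(r"\w+=\w+", config)       # whole matches
one_group = re.findall(r"\w+=(\w+)", config)     # only the captured values
two_groups = re.findall(r"(\w+)=(\w+)", config)  # (key, value) tuples

# A list of 2-tuples converts directly into a dict.
settings = dict(two_groups)

print(no_groups)   # ['timeout=60', 'retries=3', 'workers=5']
print(one_group)   # ['60', '3', '5']
print(two_groups)  # [('timeout', '60'), ('retries', '3'), ('workers', '5')]
print(settings)    # {'timeout': '60', 'retries': '3', 'workers': '5'}
```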

Splitting Strings

  • Use re.split(pattern, string) to break a string on a regex pattern, not just a fixed substring.
  • Always use a raw string literal so backslashes reach the regex engine.
  • Simple single-character delimiters: use a character class (never captured), e.g. r"\s*[,;]\s*".
  • Complex delimiters (alternation or multi-character): group with non-capturing parentheses, e.g. r"\s*(?:foo|bar|baz)\s*", so they aren’t included in the result list.
  • Including delimiters: wrap your delimiter in a capturing group, e.g. r"\s*([,;])\s*", to have the separators appear in the split output.
  • Summary:
    • No parentheses or a non-capturing group → delimiters are removed.
    • Capturing group → delimiters appear in the split list.
import re

data = "item1 , item2; item3 ,item4 ;item5"

# 1. Split on comma and semi-colon
pattern1 = r"\s*[,;]\s*"
print(f"Character class split: {re.split(pattern1, data)}")

# 2. Capturing the delimiter
pattern2 = r"\s*([,;])\s*"
print(f"Capturing group split: {re.split(pattern2, data)}")

html = """
<p class='hello'>First paragraph.</p>
<b class='world'>Second paragraph.</b>
End.
"""

pattern3 = r"<.*?class='(?:hello|world)'.*?>|</[pb]>"
print(f"HTML non-capturing split: {re.split(pattern3, html)}")
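
The word-delimiter pattern from the bullets, r"\s*(?:foo|bar|baz)\s*", is not exercised above. A minimal sketch comparing it with its capturing counterpart, using a made-up input string:

```python
import re

data = "alpha foo beta bar gamma baz delta"

# Non-capturing alternation: the delimiter words are dropped.
dropped = re.split(r"\s*(?:foo|bar|baz)\s*", data)
# Capturing alternation: each delimiter appears between the pieces.
kept = re.split(r"\s*(foo|bar|baz)\s*", data)

print(dropped)  # ['alpha', 'beta', 'gamma', 'delta']
print(kept)     # ['alpha', 'foo', 'beta', 'bar', 'gamma', 'baz', 'delta']
```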

Substituting Text

  • re.sub(pattern, replacement, string, count=0) replaces all (or a limited number) of matches.
  • count controls how many replacements to make (default 0 = all).
  • Back-references (\1, \g<name>) let you reorder or reuse captured text in the replacement.
import re

text = "User IDs: user123, user456, user123457689. Contact admin789 for help."

# Basic substitution
redacted = re.sub(r"user\d+", "[REDACTED_USER]", text)
print(f"Result of redacting: {redacted}")

# Back-reference for reusing information
redacted_partially = re.sub(r"(u)ser\d+(\d{2})", r"\1[REDACTED_USER]\2", text)
print(f"Result of redacting: {redacted_partially}")

# Limited count of substitutions
redacted_only_two = re.sub(r"(u)ser\d+(\d{2})", r"\1[REDACTED_USER]\2", text, count=2)
print(f"Result of redacting: {redacted_only_two}")

# Named groups for substitution
date_text = "Start: 2023-10-27, End: 2024-01-15"
# Current format YYYY-MM-DD
# Target format DD/MM/YYYY

date_pattern_named = r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})"
replacement_format_named = r"\g<day>/\g<month>/\g<year>"
reformatted_date = re.sub(date_pattern_named, replacement_format_named, date_text)

print(f"Result of date transformation: {reformatted_date}")
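
One pitfall with numeric back-references in a replacement: a digit placed right after \1 is read as part of the group number. The \g<1> form avoids this. A small sketch with a made-up input string:

```python
import re

sizes = "cpu=4 ram=8"

# Goal: append a literal "0" to each captured number.
# r"=\10" is read as a reference to group 10, which does not exist here.
try:
    re.sub(r"=(\d+)", r"=\10", sizes)
    ambiguous_ok = True
except re.error:
    ambiguous_ok = False

# \g<1> ends the group reference explicitly, so the "0" stays literal.
scaled = re.sub(r"=(\d+)", r"=\g<1>0", sizes)

print("ambiguous form accepted:", ambiguous_ok)  # False
print(scaled)                                    # cpu=40 ram=80
```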