Regular Expressions with the re Module
Regular expressions (regex) are a powerful tool for working with text. They let you define patterns to match, extract, and replace strings. Python’s re module provides these capabilities as part of the standard library.
Why Use Regular Expressions?
Suppose you need to find all email addresses in a block of text, validate that a phone number follows a specific format, or replace all URLs with links. Writing manual string parsing code for each of these tasks gets messy fast.
Regular expressions solve this by letting you describe patterns with a compact syntax. Instead of writing loops and conditionals, you write a pattern like \d{3}-\d{4} to match phone numbers, and the regex engine handles the rest.
Your First Regex
The simplest way to use regex in Python is with re.search():
import re
text = "My phone number is 555-1234."
match = re.search(r"\d{3}-\d{4}", text)
if match:
    print(f"Found: {match.group()}")
# output: Found: 555-1234
The r prefix creates a raw string, which avoids issues with backslash escape sequences. The \d{3} matches exactly three digits, and \d{4} matches exactly four digits.
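To see why the raw-string prefix matters, the sketch below writes the same pattern with and without it. The sample text and variable names are illustrative only.

```python
import re

text = "a word here"

# Without the r prefix, Python converts "\b" into a backspace character
# (\x08) before the regex engine ever sees the pattern.
plain = re.search("\bword\b", text)

# With the r prefix, the backslash reaches the regex engine, where \b
# means "word boundary".
raw = re.search(r"\bword\b", text)

print(plain)                  # None
print(raw.group())            # word
print(len("\n"), len(r"\n"))  # 1 2
```

The non-raw version fails silently, which is why using raw strings for every pattern is a reasonable habit.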
Common Patterns
Here are patterns you will use most often:
| Pattern | Matches |
|---|---|
| \d | Any digit (0-9) |
| \D | Any non-digit |
| \w | Word character (letters, digits, and underscore) |
| \W | Non-word character |
| \s | Whitespace (space, tab, newline) |
| \S | Non-whitespace |
| . | Any character except newline |
| ^ | Start of string |
| $ | End of string (or just before a final newline) |
You can combine these with quantifiers:
import re
# Match one or more digits
re.search(r"\d+", "123 abc") # Matches "123"
# Match zero or more digits
re.search(r"\d*", "abc") # Matches "" (empty)
# Match exactly 3 letters
re.search(r"[a-zA-Z]{3}", "abcdef") # Matches "abc"
# Optional character
re.search(r"colou?r", "color") # Matches "color"
re.search(r"colou?r", "colour") # Matches "colour"
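The anchors and the dot from the table can be tried out in the same way. This is a small sketch with made-up sample strings, not an exhaustive tour.

```python
import re

# ^ and $ anchor the pattern to the start and end of the string,
# so the whole string must consist of digits.
all_digits = re.search(r"^\d+$", "12345") is not None
mixed = re.search(r"^\d+$", "123 abc") is not None
print(all_digits, mixed)  # True False

# . matches any single character except a newline
dot_match = re.search(r"c.t", "a cat sat")
dot_newline = re.search(r"c.t", "c\nt")
print(dot_match.group())  # cat
print(dot_newline)        # None

# \S+ picks out runs of non-whitespace characters
tokens = re.findall(r"\S+", "  two  words ")
print(tokens)  # ['two', 'words']
```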
Finding All Matches
Use re.findall() when you need every match in a string:
import re
text = "There are 3 apples, 5 oranges, and 2 bananas."
numbers = re.findall(r"\d+", text)
print(numbers)
# output: ['3', '5', '2']
# Extract emails from text
emails = """Contact us at info@example.com or support@company.org"""
email_pattern = r"\w+@\w+\.\w+"
found_emails = re.findall(email_pattern, emails)
print(found_emails)
# output: ['info@example.com', 'support@company.org']
The findall() function returns a list of all non-overlapping matches.
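Two details of findall() are easy to miss: what "non-overlapping" means in practice, and how the return value changes once the pattern contains capture groups. The following sketch illustrates both with made-up inputs.

```python
import re

# Matches never overlap: scanning resumes where the previous match ended,
# so five "a" characters yield two pairs, not four.
pairs = re.findall(r"aa", "aaaaa")
print(pairs)  # ['aa', 'aa']

contacts = "info@example.com support@company.org"

# With one capture group, findall() returns the group, not the whole match.
users = re.findall(r"(\w+)@\w+\.\w+", contacts)
print(users)  # ['info', 'support']

# With several groups, it returns a list of tuples, one tuple per match.
parts = re.findall(r"(\w+)@(\w+\.\w+)", contacts)
print(parts)  # [('info', 'example.com'), ('support', 'company.org')]
```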
Match Objects
When you need more than just the matched text, use re.search() or re.match(). Both return a match object on success and None otherwise; re.match() only matches at the start of the string:
import re
text = "Price: $19.99"
match = re.search(r"\$(\d+\.\d{2})", text)
if match:
    print(f"Full match: {match.group(0)}") # $19.99
    print(f"Captured: {match.group(1)}") # 19.99
    print(f"Position: {match.span()}") # (7, 13)
The parentheses create capture groups. group(0) returns the entire match, while group(1) returns the first captured group.
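Numbered groups become hard to track as patterns grow. One option is to name them with the (?P<name>...) syntax. The sketch below reuses the price string; the group names dollars and cents are chosen here for illustration.

```python
import re

text = "Price: $19.99"
match = re.search(r"\$(?P<dollars>\d+)\.(?P<cents>\d{2})", text)

if match:
    print(match.group("dollars"))      # 19
    print(match.group("cents"))        # 99
    print(match.groupdict())           # {'dollars': '19', 'cents': '99'}
    print(match.groups())              # ('19', '99')
    print(match.start(), match.end())  # 7 13
```

Named groups remain accessible by number as well, so group(1) still returns the dollars part.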
Splitting and Replacing
The re.sub() function replaces matches with new text:
import re
text = "Hello    World   !"
# Replace multiple spaces with single space
cleaned = re.sub(r" +", " ", text)
print(cleaned)
# output: Hello World !
# Replace numbers with placeholder
text = "Call 555-1234 for help"
anonymized = re.sub(r"\d{3}-\d{4}", "XXX-XXXX", text)
print(anonymized)
# output: Call XXX-XXXX for help
# Use capture groups in replacement
text = "2026-03-13"
reformatted = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", text)
print(reformatted)
# output: 13/03/2026
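The replacement argument of re.sub() can also be a function, which is useful when the new text has to be computed from the match. A minimal sketch, with a hypothetical helper named double:

```python
import re

def double(match):
    """Return the matched number multiplied by two, as a string."""
    return str(int(match.group()) * 2)

text = "3 apples and 5 oranges"

# The function is called once per match and receives the match object.
doubled = re.sub(r"\d+", double, text)
print(doubled)  # 6 apples and 10 oranges

# The count argument limits how many replacements are made.
first_only = re.sub(r"\d+", "N", text, count=1)
print(first_only)  # N apples and 5 oranges
```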
The re.split() function splits on patterns:
import re
text = "apple, banana; cherry: date"
# Split on commas, semicolons, and colons
fruits = re.split(r"[,;:]", text)
print(fruits)
# output: ['apple', ' banana', ' cherry', ' date']
# Split and remove whitespace
fruits = [f.strip() for f in re.split(r"[,;:]", text)]
print(fruits)
# output: ['apple', 'banana', 'cherry', 'date']
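As an alternative to stripping each piece afterwards, the pattern itself can absorb the whitespace around each separator. This sketch also shows the maxsplit argument; the variable names are illustrative.

```python
import re

text = "apple, banana; cherry: date"

# \s* on both sides of the separator swallows the surrounding whitespace.
fruits = re.split(r"\s*[,;:]\s*", text)
print(fruits)  # ['apple', 'banana', 'cherry', 'date']

# maxsplit=1 stops after the first separator and leaves the rest intact.
first, rest = re.split(r"\s*[,;:]\s*", text, maxsplit=1)
print(first)  # apple
print(rest)   # banana; cherry: date
```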
Compiling Patterns
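If you use the same pattern repeatedly, compile it once with re.compile(). The re module already caches recently used patterns, so the speedup is usually modest, but a compiled pattern gives the regex a name and keeps it in one place: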
import re
# Compile once
phone_pattern = re.compile(r"\d{3}-\d{4}")
# Use the compiled pattern
numbers = ["555-1234", "555-5678", "123-4567"]
for number in numbers:
    match = phone_pattern.search(number)
    if match:
        print(f"Valid: {number}")
# output: Valid: 555-1234
# output: Valid: 555-5678
# output: Valid: 123-4567
Compiled patterns carry their flags with them and expose the matching functions as methods:
import re
pattern = re.compile(r"\b\w*py\w*\b", re.IGNORECASE)
# Find all words containing "py", ignoring case
matches = pattern.findall("Python py PYTHON")
print(matches)
# output: ['Python', 'py', 'PYTHON']
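Flags can do more than ignore case. The sketch below uses two other standard flags: re.VERBOSE, which allows whitespace and comments inside the pattern, and re.MULTILINE, which makes ^ match at the start of every line. The sample strings are made up.

```python
import re

# re.VERBOSE ignores whitespace in the pattern and treats # as a comment
# marker, so a pattern can be documented inline.
phone_pattern = re.compile(
    r"""
    \b(\d{3})   # three-digit prefix
    -           # literal hyphen
    (\d{4})\b   # four-digit line number
    """,
    re.VERBOSE,
)
phone_match = phone_pattern.search("Call 555-1234 for help")
print(phone_match.groups())  # ('555', '1234')

# Flags combine with the | operator.
error_pattern = re.compile(r"^error", re.IGNORECASE | re.MULTILINE)
errors = error_pattern.findall("Error: a\nok\nERROR: b")
print(errors)  # ['Error', 'ERROR']
```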
Practical Examples
Validating Email Addresses
import re
def is_valid_email(email):
    pattern = r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$"
    return re.match(pattern, email) is not None
print(is_valid_email("user@example.com")) # True
print(is_valid_email("invalid-email")) # False
print(is_valid_email("user@domain")) # False
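One caveat with ^...$ validation is that $ also matches just before a trailing newline, so "user@example.com\n" passes the check above. A stricter variant could use re.fullmatch(), as in this sketch; the name is_valid_email_strict is hypothetical, and the pattern is still a simplification of real email syntax.

```python
import re

# Same character rules as before, without the ^ and $ anchors:
# fullmatch() requires the entire string to match.
EMAIL_PATTERN = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")

def is_valid_email_strict(email):
    """Return True only if the whole string looks like an email address."""
    return EMAIL_PATTERN.fullmatch(email) is not None

# $ tolerates one trailing newline; fullmatch() does not.
anchored = re.match(r"^\d+$", "123\n") is not None
full = re.fullmatch(r"\d+", "123\n") is not None
print(anchored, full)  # True False

print(is_valid_email_strict("user@example.com"))    # True
print(is_valid_email_strict("user@example.com\n"))  # False
print(is_valid_email_strict("user@domain"))         # False
```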
Extracting Data from Log Files
import re
log_line = "2026-03-13 14:30:45 ERROR Connection failed from 192.168.1.100"
timestamp = re.search(r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})", log_line)
level = re.search(r"(ERROR|WARNING|INFO)", log_line)
ip = re.search(r"(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})", log_line)
print(f"Time: {timestamp.group(1)}")
print(f"Level: {level.group(1)}")
print(f"IP: {ip.group(1)}")
# output: Time: 2026-03-13 14:30:45
# output: Level: ERROR
# output: IP: 192.168.1.100
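The three separate searches above would raise AttributeError on a line that lacks one of the fields, because re.search() returns None. One possible alternative is a single pattern with named groups and an explicit None check. The function name parse_log_line and the assumed "timestamp level message" layout are illustrative, not a standard log format.

```python
import re

# Assumes each line looks like "<date> <time> <LEVEL> <message>".
LOG_PATTERN = re.compile(
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<level>ERROR|WARNING|INFO) "
    r"(?P<message>.*)"
)

def parse_log_line(line):
    """Return a dict of fields, or None if the line does not fit the layout."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    return match.groupdict()

record = parse_log_line(
    "2026-03-13 14:30:45 ERROR Connection failed from 192.168.1.100"
)
print(record["timestamp"])  # 2026-03-13 14:30:45
print(record["level"])      # ERROR
print(record["message"])    # Connection failed from 192.168.1.100

unparsed = parse_log_line("not a log line")
print(unparsed)  # None
```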
Removing HTML Tags
import re
html = "<p>Hello, <strong>World</strong>!</p>"
text = re.sub(r"<[^>]+>", "", html)
print(text)
# output: Hello, World!
Common Pitfalls
Greedy vs. Non-Greedy Matching
The * and + quantifiers are greedy: they match as much text as possible. Add a ? (as in *? or +?) to make them match as little as possible:
import re
html = "<div>content</div>"
# Greedy (captures too much)
re.search(r"<.+>", html).group()
# output: <div>content</div>
# Non-greedy (stops at the first ">")
re.search(r"<.+?>", html).group()
# output: <div>
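The difference is easier to see with findall(), which reports every match. The sketch below also shows a negated character class, a common alternative to non-greedy matching for this kind of delimiter-bounded text.

```python
import re

html = "<div>content</div>"

# Greedy: one match spanning from the first "<" to the last ">"
greedy_tags = re.findall(r"<.+>", html)
print(greedy_tags)  # ['<div>content</div>']

# Non-greedy: each match ends at the nearest ">"
lazy_tags = re.findall(r"<.+?>", html)
print(lazy_tags)  # ['<div>', '</div>']

# Negated class: "anything except >" gives the same result here
negated_tags = re.findall(r"<[^>]+>", html)
print(negated_tags)  # ['<div>', '</div>']
```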
Character Classes vs. Predefined Classes
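\d and [0-9] both match the ASCII digits 0-9, although on str patterns \d also matches other Unicode decimal digits unless you pass re.ASCII. A character class in square brackets matches only the characters and ranges you list: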
import re
# \d matches any digit
re.search(r"\d", "abc123") # Matches "1"
# [0-9] matches any digit
re.search(r"[0-9]", "abc123") # Matches "1"
# [abc] matches only a, b, or c
re.search(r"[abc]", "defabc") # Matches "a"
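The two forms are interchangeable for ASCII input, but they can differ on other text. This sketch uses the Arabic-Indic digit three (U+0663) as an arbitrary example of a non-ASCII decimal digit.

```python
import re

arabic_three = "\u0663"  # ARABIC-INDIC DIGIT THREE

# On str patterns, \d matches any Unicode decimal digit by default.
unicode_digit = re.search(r"\d", arabic_three) is not None
# [0-9] matches only the ten ASCII digits.
ascii_class = re.search(r"[0-9]", arabic_three) is not None
# re.ASCII restricts \d to ASCII digits as well.
ascii_flag = re.search(r"\d", arabic_three, re.ASCII) is not None
print(unicode_digit, ascii_class, ascii_flag)  # True False False

# A leading ^ inside the brackets negates the class.
non_digits = re.findall(r"[^0-9]+", "abc123def")
print(non_digits)  # ['abc', 'def']
```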
Getting Started
The re module is built into Python, so you can start using it immediately. Begin with simple patterns and gradually add complexity. Remember to use raw strings (r"...") to avoid escape sequence issues.
For complex patterns, build them piece by piece and test each part. Online regex testers can help you visualize what your pattern matches.
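One way to build a pattern piece by piece is to keep each part in its own variable, check it in isolation, and then combine the parts. The names YEAR, MONTH, DAY, and DATE below are illustrative, and the ranges are a simplification that does not check month lengths.

```python
import re

# Each piece is small enough to verify on its own.
YEAR = r"\d{4}"
MONTH = r"(?:0[1-9]|1[0-2])"         # 01 through 12
DAY = r"(?:0[1-9]|[12]\d|3[01])"     # 01 through 31

# Test the parts individually before combining them.
month_ok = re.fullmatch(MONTH, "12") is not None
month_bad = re.fullmatch(MONTH, "13") is not None
day_ok = re.fullmatch(DAY, "31") is not None
day_bad = re.fullmatch(DAY, "32") is not None
print(month_ok, month_bad, day_ok, day_bad)  # True False True False

# Combine the tested pieces into the full pattern.
DATE = re.compile(rf"({YEAR})-({MONTH})-({DAY})")
date_groups = DATE.fullmatch("2026-03-13").groups()
invalid_date = DATE.fullmatch("2026-13-01")
print(date_groups)   # ('2026', '03', '13')
print(invalid_date)  # None
```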
See Also
- re module — Full reference for the regex module
- collections-module — For working with matched groups as named tuples
- string-module — Alternative string methods for simple cases