Regular Expressions in Python

· 7 min read · Updated March 7, 2026 · intermediate

Regular expressions (regex) are a powerful tool for working with text. They let you define patterns that match specific strings, validate input, extract data, and perform replacements. If you’ve ever needed to check if an email address looks valid, pull all phone numbers from a block of text, or replace all instances of a date format, regex is often the right tool.

Python’s re module provides full support for regular expressions. In this tutorial, you’ll learn the basics of regex patterns, how to use the re module’s functions, and when to reach for regex versus other solutions.

What is Regex and When to Use It

A regular expression is a sequence of characters that defines a search pattern. You use this pattern to match strings, extract specific parts, or transform text. Think of it like a sophisticated version of “find and replace” with wildcards built in.

You should use regex when you need to:

  • Validate that input follows a specific format (emails, phone numbers, dates)
  • Extract structured data from unstructured text (IP addresses from logs, prices from product descriptions)
  • Search for complex patterns that simple string methods can’t handle
  • Perform find-and-replace with patterns rather than exact strings

For example, checking if a string contains “hello” is easy with the in operator. But checking if a string is a valid US phone number? That’s a job for regex.
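
As a rough sketch of that contrast, the snippet below checks a single assumed phone format, NNN-NNN-NNNN. Real US numbers come in more shapes than this, and the helper name looks_like_us_phone is made up for the example. The pattern syntax is explained in the sections that follow.

```python
import re

# Assumed format for this sketch: three digits, three digits, four digits,
# separated by hyphens (for example 555-123-4567).
PHONE_RE = re.compile(r"\d{3}-\d{3}-\d{4}")

def looks_like_us_phone(text):
    """Return True if the whole string has the NNN-NNN-NNNN shape."""
    return PHONE_RE.fullmatch(text) is not None

print("hello" in "hello world")             # True
print(looks_like_us_phone("555-123-4567"))  # True
print(looks_like_us_phone("555-1234"))      # False
```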

Basic Pattern Matching

Regular expressions combine literal characters with special characters called metacharacters. Let’s start with the simplest case: matching exact strings.

Literal Matching

When you want to match exact text, just write it. The pattern hello matches the string “hello” and nothing else:

import re

pattern = "hello"
text = "hello world"

match = re.search(pattern, text)
if match:
    print("Found:", match.group())  # Found: hello

This works, but it’s not much better than using the in operator. The real power comes from metacharacters.

Essential Metacharacters

These characters have special meaning in regex:

  • . matches any single character except newline
  • * matches zero or more of the preceding element
  • + matches one or more of the preceding element
  • ? matches zero or one of the preceding element
  • ^ matches the start of the string
  • $ matches the end of the string
  • [] defines a character class
  • {} specifies a repetition count, either exact ({3}) or a range ({2,4})
  • | alternation (OR)

Let’s see these in action:

# . matches any character
print(re.search(r"c.t", "cat"))     # matches "cat"
print(re.search(r"c.t", "cut"))     # matches "cut"
print(re.search(r"c.t", "chart"))   # None ("." stands for exactly one character)

# * matches zero or more
print(re.search(r"ab*c", "ac"))     # matches "ac" (b* allows zero b's)
print(re.search(r"ab*c", "abc"))    # matches "abc"
print(re.search(r"ab*c", "abbbbc")) # matches "abbbbc"

# + matches one or more
print(re.search(r"ab+c", "ac"))     # None (b+ requires at least one b)
print(re.search(r"ab+c", "abc"))    # matches "abc"

# ? matches zero or one
print(re.search(r"colou?r", "color"))  # matches "color"
print(re.search(r"colou?r", "colour")) # matches "colour"
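
The braces and the alternation bar from the list can be tried the same way. This is a small sketch with made-up sample strings:

```python
import re

# {n} repeats the preceding element exactly n times
print(re.search(r"\d{4}", "year 2024"))        # matches "2024"

# {m,n} repeats between m and n times (greedy: takes as many as it can)
print(re.search(r"a{2,3}", "caaaat"))          # matches "aaa"

# | matches either alternative
print(re.search(r"cat|dog", "hot dog stand"))  # matches "dog"

# Parentheses limit how far the alternation reaches
print(re.search(r"gr(a|e)y", "a grey sky"))    # matches "grey"
```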

Notice I used raw strings (r"pattern") for the regex patterns. This is important in Python because backslashes have special meaning in regular strings. Raw strings treat backslashes literally, so \d stays as two characters that regex interprets as “digit.”
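
One way to see the difference is with \b, which Python itself treats as an escape for the backspace character when the string is not raw. A minimal sketch:

```python
import re

# In a normal string, "\b" is the backspace character (one character).
print(len("\b"))    # 1
# In a raw string, r"\b" is a backslash followed by "b" (two characters).
print(len(r"\b"))   # 2

text = "a word here"
print(re.search(r"\bword\b", text))  # matches "word"
print(re.search("\bword\b", text))   # None: this looks for backspace characters
```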

Character Classes

Square brackets create a character class, matching any single character in the set:

# Match any vowel
print(re.search(r"[aeiou]", "cat"))   # matches "a"

# Match any digit
print(re.search(r"[0-9]", "room 42")) # matches "4"

# Match any lowercase letter
print(re.search(r"[a-z]", "Hello"))   # matches "e"

# Negate with ^ inside brackets
print(re.search(r"[^0-9]", "123abc")) # matches "a"

You can also use shorthand character classes:

  • \d matches any digit (0-9, plus other Unicode digits unless you pass the re.ASCII flag)
  • \D matches any non-digit
  • \w matches word characters (letters, digits, underscore)
  • \W matches non-word characters
  • \s matches whitespace (spaces, tabs, newlines)
  • \S matches non-whitespace
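
Here is a quick sketch of the shorthand classes on made-up sample text:

```python
import re

text = "Order 66 shipped to Unit_7 on 3 May"

# \d+ picks out runs of digits
print(re.findall(r"\d+", text))  # ['66', '7', '3']

# \w+ picks out runs of letters, digits, and underscores
print(re.findall(r"\w+", text))
# ['Order', '66', 'shipped', 'to', 'Unit_7', 'on', '3', 'May']

# \S+ picks out runs of anything that is not whitespace
print(re.findall(r"\S+", "a  b\tc\nd"))  # ['a', 'b', 'c', 'd']

# \D+ picks out runs of non-digits
print(re.findall(r"\D+", "12ab34"))      # ['ab']
```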

Anchors and Boundaries

The ^ and $ anchors match positions rather than characters:

# ^ matches start of string
print(re.search(r"^hello", "hello world"))  # matches
print(re.search(r"^hello", "say hello"))      # None

# $ matches end of string
print(re.search(r"world$", "hello world"))   # matches
print(re.search(r"world$", "world hello"))   # None

# \b matches word boundary
print(re.search(r"\bword\b", "a word here"))  # matches "word"
print(re.search(r"\bword\b", "swordfish"))    # None (no boundary)
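
By default, ^ and $ anchor to the string as a whole. If you want them to work line by line, the re.MULTILINE flag changes that. A small sketch with made-up text:

```python
import re

text = "first line\nsecond line"

# Default: ^ only matches at the very start of the string
print(re.findall(r"^\w+", text))                # ['first']

# With re.MULTILINE, ^ also matches right after each newline
print(re.findall(r"^\w+", text, re.MULTILINE))  # ['first', 'second']

# Likewise, $ matches before each newline as well as at the end
print(re.findall(r"\w+$", text, re.MULTILINE))  # ['line', 'line']
```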

The re Module Functions

Python’s re module provides several functions for working with regex. Here’s when to use each one.

compile()

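For patterns you use repeatedly, compile them once. The module-level functions already cache recently used patterns, so the speedup is usually small, but a compiled pattern gives the regex a name and keeps its flags in one place: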

# Compile once, use many times
phone_pattern = re.compile(r"\d{3}-\d{3}-\d{4}")

print(phone_pattern.search("Call me at 555-123-4567 tomorrow"))    # matches "555-123-4567"
print(phone_pattern.findall("Call 555-123-4567 or 555-987-6543"))  # ['555-123-4567', '555-987-6543']

Compiled patterns store flags and can be reused efficiently.
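
To illustrate the point about flags, here is a sketch that compiles a case-insensitive pattern once and reuses it. The name word_re is just for this example:

```python
import re

# The flag is attached at compile time and applies to every later call
word_re = re.compile(r"python", re.IGNORECASE)

print(word_re.findall("Python, PYTHON, and python"))  # ['Python', 'PYTHON', 'python']
print(word_re.pattern)                                # python
print(bool(word_re.flags & re.IGNORECASE))            # True
```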

match()

match() checks if the pattern matches at the beginning of the string:

result = re.match(r"\d+", "123abc")
print(result.group())  # 123

result = re.match(r"\d+", "abc123")
print(result)  # None - pattern doesn't start at beginning

This function only checks the start of the string. For most use cases, search() is more flexible.

search()

search() finds the first match anywhere in the string:

result = re.search(r"\d+", "abc123def456")
print(result.group())  # 123 - finds first match

This is probably the most commonly used function for simple pattern validation.
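
It can help to see match() and search() side by side on the same input. The sketch below also includes re.fullmatch(), a related function not covered above, which requires the entire string to match and is often a good fit for validation:

```python
import re

text = "123abc"

print(re.match(r"\d+", text))            # matches "123" (anchored at the start only)
print(re.search(r"[a-z]+", text))        # matches "abc" (found later in the string)
print(re.match(r"[a-z]+", text))         # None (the letters are not at the start)
print(re.fullmatch(r"\d+", text))        # None (the whole string must match)
print(re.fullmatch(r"\d+[a-z]+", text))  # matches "123abc"
```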

findall()

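findall() returns all non-overlapping matches as a list of strings (if the pattern contains capturing groups, it returns the captured groups instead of the whole matches):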

text = "My phone is 555-123-4567 and her phone is 555-987-6543"
numbers = re.findall(r"\d+", text)
print(numbers)  # ['555', '123', '4567', '555', '987', '6543']

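Note that \d+ returns each run of digits separately. To get whole phone numbers, match the full format with a pattern such as \d{3}-\d{3}-\d{4}. When you need every match of a pattern in one list, this is the function.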
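
What findall() hands back depends on whether the pattern has capturing groups (parentheses). A short sketch using the same sample text:

```python
import re

text = "My phone is 555-123-4567 and her phone is 555-987-6543"

# No groups: each item is the whole match
print(re.findall(r"\d{3}-\d{3}-\d{4}", text))
# ['555-123-4567', '555-987-6543']

# One group: each item is that group's text
print(re.findall(r"\d{3}-\d{3}-(\d{4})", text))
# ['4567', '6543']

# Several groups: each item is a tuple of the groups
print(re.findall(r"(\d{3})-(\d{3})-(\d{4})", text))
# [('555', '123', '4567'), ('555', '987', '6543')]
```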

finditer()

finditer() works like findall() but returns an iterator of match objects:

text = "My phone is 555-123-4567 and her phone is 555-987-6543"

for match in re.finditer(r"\d+", text):
    print(f"Found '{match.group()}' at position {match.start()}-{match.end()}")

Use this when you need more information about each match (its position, captured groups) but want to iterate lazily rather than building a full list.
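
As a sketch of that use case, the loop below collects each phone number together with its position and area code. The names phone_re and found are made up for this example:

```python
import re

text = "My phone is 555-123-4567 and her phone is 555-987-6543"
phone_re = re.compile(r"(\d{3})-(\d{3})-(\d{4})")

found = []
for m in phone_re.finditer(text):
    area_code = m.group(1)  # first capturing group
    found.append((m.group(), m.span(), area_code))

print(found)
# [('555-123-4567', (12, 24), '555'), ('555-987-6543', (42, 54), '555')]
```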

sub() for Replacement

The sub() function replaces matches with replacement text:

text = "The meeting is on 2024-01-15"

# Replace date format
result = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", text)
print(result)  # The meeting is on 15/01/2024

The replacement string can reference captured groups with \1, \2, etc.
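
Two related options are worth knowing: the count argument limits how many matches get replaced, and re.subn() also reports how many replacements were made. A sketch with a made-up sentence:

```python
import re

text = "Due 2024-01-15, shipped 2024-02-03"
date_re = r"(\d{4})-(\d{2})-(\d{2})"

# Replace every date
print(re.sub(date_re, r"\3/\2/\1", text))
# Due 15/01/2024, shipped 03/02/2024

# Replace only the first date
print(re.sub(date_re, r"\3/\2/\1", text, count=1))
# Due 15/01/2024, shipped 2024-02-03

# subn() returns the new string and the number of replacements
print(re.subn(date_re, "<date>", text))
# ('Due <date>, shipped <date>', 2)
```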

Groups and Capturing

Parentheses create capturing groups, letting you extract specific parts of a match:

# Extract username and domain from email
email = "user@example.com"
pattern = r"(\w+)@(\w+)\.(\w+)"
match = re.search(pattern, email)

if match:
    print(match.group(0))   # user@example.com (entire match)
    print(match.group(1))   # user (first group)
    print(match.group(2))   # example (second group)
    print(match.group(3))   # com (third group)

Named groups make this even clearer:

pattern = r"(?P<username>\w+)@(?P<domain>\w+)\.(?P<tld>\w+)"
match = re.search(pattern, "user@example.com")

print(match.group("username"))  # user
print(match.group("domain"))    # example
print(match.group("tld"))        # com

Groups are essential for extraction tasks where you need specific pieces of the matched text.
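
When you want all the pieces at once, match objects also offer groups() and groupdict(). A small sketch with a made-up input string:

```python
import re

pattern = r"(?P<username>\w+)@(?P<domain>\w+)\.(?P<tld>\w+)"
match = re.search(pattern, "Contact: user@example.com")

if match:
    print(match.groups())        # ('user', 'example', 'com')
    print(match.groupdict())     # {'username': 'user', 'domain': 'example', 'tld': 'com'}
    print(match.span("domain"))  # (14, 21)
```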

Practical Examples

Let’s put everything together with real-world examples.

Validating Email Addresses

Here’s a practical email validator:

def is_valid_email(email):
    pattern = r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$"
    return re.match(pattern, email) is not None

print(is_valid_email("user@example.com"))      # True
print(is_valid_email("invalid.email@"))        # False
print(is_valid_email("@example.com"))          # False
print(is_valid_email("user@sub.domain.com"))    # True

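This pattern checks for the common parts of an email: local part, @ symbol, domain, and TLD. It is a format check only: it still accepts some malformed addresses (such as user@a..b.com) and cannot tell you whether the mailbox exists.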

Extracting Data from Text

Pull structured data from unstructured text:

log_entry = "2024-01-15 14:32:05 ERROR Connection timeout for user_id=42"

# Extract timestamp
timestamp = re.search(r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})", log_entry)
print(timestamp.group(1))  # 2024-01-15 14:32:05

# Extract log level
level = re.search(r"\s(ERROR|WARN|INFO|DEBUG)\s", log_entry)
print(level.group(1))  # ERROR

# Extract user ID
user_id = re.search(r"user_id=(\d+)", log_entry)
print(user_id.group(1))  # 42
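
If every line has the same layout, you could also pull the fields out in one pass with named groups. This is a sketch that assumes the "timestamp level message" layout of the sample entry. LOG_RE and parse_log_line are names invented for the example:

```python
import re

# Assumed layout: "<date> <time> <LEVEL> <free-form message>"
LOG_RE = re.compile(
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<level>ERROR|WARN|INFO|DEBUG) "
    r"(?P<message>.*)"
)

def parse_log_line(line):
    """Return a dict of fields, or None if the line does not fit the layout."""
    m = LOG_RE.match(line)
    return m.groupdict() if m else None

entry = parse_log_line("2024-01-15 14:32:05 ERROR Connection timeout for user_id=42")
print(entry["timestamp"])  # 2024-01-15 14:32:05
print(entry["level"])      # ERROR
print(entry["message"])    # Connection timeout for user_id=42
print(parse_log_line("not a log line"))  # None
```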

Replacing Text with Callables

The replacement can be a function that transforms matched text:

import re

def camel_to_snake(name):
    # Insert underscore before uppercase letters, then lowercase
    return re.sub(r"[A-Z]", lambda m: "_" + m.group(0).lower(), name).lstrip("_")

print(camel_to_snake("userName"))     # user_name
print(camel_to_snake("getUserId"))    # get_user_id
print(camel_to_snake("HTTPServer"))    # h_t_t_p_server (every capital gets its own underscore)
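
A substitution on [A-Z] alone treats every capital as the start of a word, so a run of capitals such as HTTP is split letter by letter. One possible variant that keeps acronyms together is sketched below. It relies on lookahead and lookbehind assertions, which go beyond the basics in this tutorial, and the function name camel_to_snake_acronyms is made up for the example:

```python
import re

def camel_to_snake_acronyms(name):
    # A capital starts a new word when it follows a lowercase letter or digit,
    # or when it is followed by a lowercase letter and is not the first character.
    boundary = r"(?<=[a-z0-9])[A-Z]|(?<!^)[A-Z](?=[a-z])"
    # The callable prefixes each boundary capital with "_"; lower() runs last.
    return re.sub(boundary, lambda m: "_" + m.group(0), name).lower()

print(camel_to_snake_acronyms("userName"))           # user_name
print(camel_to_snake_acronyms("getUserId"))          # get_user_id
print(camel_to_snake_acronyms("HTTPServer"))         # http_server
print(camel_to_snake_acronyms("parseHTTPResponse"))  # parse_http_response
```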

When NOT to Use Regex

Regex is powerful, but it’s not always the right tool.

Don’t use regex for:

  • Parsing HTML or XML — use proper parsers like BeautifulSoup or lxml
  • Matching balanced or nested structures — regex can’t handle arbitrary nesting
  • Simple string operations that split(), replace(), or the in operator can handle
  • Complex business logic that would be clearer in regular Python code

Here’s an example of when NOT to use regex:

# Don't do this:
has_dot = re.search(r"\.", "example.com")

# When this works:
has_dot = "." in "example.com"

The second version is clearer, faster, and doesn’t require importing re.
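
The point about nested structures works the same way. Checking whether parentheses are balanced takes only a counter in plain Python. Here is a minimal sketch, with is_balanced as a made-up helper name:

```python
def is_balanced(text):
    """Return True if every "(" in text has a matching ")"."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a ")" appeared before its "("
                return False
    return depth == 0      # nothing left open

print(is_balanced("(a (b) (c (d)))"))  # True
print(is_balanced("(a (b)"))           # False
print(is_balanced(")("))               # False
```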

Conclusion

Regular expressions are an essential skill for any Python developer working with text. You now understand how to build patterns with metacharacters, use the re module’s functions for different tasks, capture groups for extraction, and apply regex to real problems like validation and data extraction.

Remember: compile patterns when reusing them, use search() for general matching, findall() when you need every match, and sub() for replacements. And reach for proper parsers when dealing with structured formats like HTML.

What’s Next

Now that you understand the basics, explore more advanced regex topics like lookahead and lookbehind assertions, non-capturing groups, and the verbose flag for readable patterns. These will help you tackle more complex text processing tasks.
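
As a small preview of those topics, here is a sketch with made-up sample strings:

```python
import re

# Lookahead: match digits only when "px" follows, without consuming "px"
print(re.findall(r"\d+(?=px)", "10px 20em 30px"))    # ['10', '30']

# Lookbehind: match digits only when "$" comes right before them
print(re.findall(r"(?<=\$)\d+", "cost $15, qty 3"))  # ['15']

# Non-capturing group: groups the alternation without creating a capture
print(re.findall(r"(?:Mr|Ms)\. (\w+)", "Mr. Smith and Ms. Jones"))  # ['Smith', 'Jones']

# Verbose flag: whitespace and comments inside the pattern are ignored
phone_re = re.compile(r"""
    \d{3}   # area code
    -
    \d{3}   # exchange
    -
    \d{4}   # line number
""", re.VERBOSE)
print(phone_re.search("Call 555-123-4567").group())  # 555-123-4567
```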