Python Type Hints - How to Work with Regular Expressions

2021-09-07 Searching again for a needle in a haystack.

Python’s re module lets us search both str and bytes strings with regular expressions (regexes). Our type checker can ensure we call re functions with the correct types, thanks to some parametrized classes.

(Unfortunately, it’s still up to us to check our regexes are correct!)

str versus bytes

The re module operates in two modes: Unicode, which is the default and operates on strs, and 8-bit, which operates on bytes objects. When we use a re function, the pattern and target string must have the same type, or the function will raise a TypeError.
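To make the runtime behaviour concrete, here is a minimal sketch of both the matching and the mismatched case. The sample strings are made up for illustration.

```python
import re

# Matching types work: both str, or both bytes.
assert re.search(r"groot", "I am groot") is not None
assert re.search(rb"groot", b"I am groot") is not None

# Mismatched types fail at runtime: a str pattern against a bytes target.
mismatch_raised = False
try:
    re.search(r"groot", b"I am groot")
except TypeError as exc:
    mismatch_raised = True
    print(f"TypeError: {exc}")
```

A type checker should flag the mismatched call before the code ever runs.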

Type checkers protect us against such mismatches. To that end, two types in the re module are parametrized with str or bytes: re.Match and re.Pattern.

Before Python 3.9, the re.Match and re.Pattern classes did not support parametrization themselves. On those versions, we need to use the equivalent typing.Match and typing.Pattern classes instead.
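If a codebase has to support both older and newer Pythons, one possible approach is a version-gated import, which type checkers generally understand. This is only a sketch: the first_match() helper and the sample text are hypothetical.

```python
import re
import sys
from typing import Optional

# Pick the parametrizable classes for the running Python version.
if sys.version_info >= (3, 9):
    from re import Match, Pattern
else:
    from typing import Match, Pattern


def first_match(pattern: Pattern[str], text: str) -> Optional[Match[str]]:
    # Hypothetical helper: return the first match, or None.
    return pattern.search(text)


found = first_match(re.compile(r"\d+"), "Vol. 2 came out in 2017")
if found:
    print(found[0])
```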

Let’s look at these two types in order.

Matches

Several functions in the re module can return match objects, such as re.search():

import re

found = re.search(
    r"MacGuffin",
    "The Guardians then retrieve the MacGuffin.",
)
reveal_type(found)
if found:
    reveal_type(found)

We’ve added some calls to reveal_type() to check the types. Running Mypy gives us:

$ mypy example.py
example.py:8: note: Revealed type is "Union[typing.Match[builtins.str*], None]"
example.py:10: note: Revealed type is "typing.Match[builtins.str*]"

The first revealed type shows that, since we used strs for the pattern and target string, the return type of re.search() is re.Match[str] | None. (Mypy uses the old, long-form spelling for union, and the old location typing.Match instead of re.Match.)

The second revealed type shows us that within the if block Mypy’s type narrowing can discard None, inferring that found must be a re.Match[str].
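The same narrowing works with an early return, which is handy inside functions. Here is a small sketch, where find_year() and its pattern are made-up examples:

```python
import re
from typing import Optional


def find_year(text: str) -> Optional[str]:
    found = re.search(r"\b\d{4}\b", text)
    if found is None:
        return None
    # Past this point, the type checker knows found is a Match[str],
    # so indexing it yields a str.
    return found[0]


year = find_year("Released in 2014.")
missing = find_year("No digits here.")
```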

For most regex use cases we don’t need to mention Match explicitly in our code. We can rest easy knowing that using its attributes and methods will return the correct string type. But if we write a function that accepts a match object, we should use re.Match in the type hint (or typing.Match on Python < 3.9):

import re


def display_match(match: re.Match[str]) -> None:
    print(f"Found {match[0]} at {match.start(0)}")


found = re.search(
    r"MacGuffin",
    "The Guardians then retrieve the MacGuffin.",
)
if found:
    display_match(found)

Noice.

Everything looks similar when using the 8-bit (bytes) mode:

import re


def process_match(match: re.Match[bytes]) -> None:
    ...


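# re.search() scans the whole input; re.match() only matches at the
# start, so it would return None for this data.
found = re.search(br"\x12", b"\x00\x00\x12\x34\x00\x00")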
if found:
    process_match(found)
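To see that a bytes match really hands back bytes, here is a sketch with a filled-in function body. It assumes Python 3.9+ (for re.Match[bytes] in the annotation), and hex_offset() is a hypothetical name.

```python
import re


def hex_offset(match: re.Match[bytes]) -> str:
    # Indexing a Match[bytes] gives bytes, so .hex() is available.
    matched: bytes = match[0]
    return f"{matched.hex()} at offset {match.start()}"


found = re.search(rb"\x12\x34", b"\x00\x00\x12\x34\x00\x00")
summary = hex_offset(found) if found else "not found"
print(summary)
```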

Patterns

Calling re.compile() returns a Pattern:

import re

groot_re = re.compile(r"\bgroot\b", re.IGNORECASE)
reveal_type(groot_re)

Running Mypy:

$ mypy example.py
example.py:4: note: Revealed type is "typing.Pattern[builtins.str*]"

We can see that since the input string was a str, the pattern is of type Pattern[str]. (Again, Mypy displays the old typing.Pattern instead of re.Pattern.)

The parametrization means that the pattern’s methods accept str inputs, rather than bytes. We can check this with another reveal_type() call:

reveal_type(groot_re.fullmatch)

Mypy shows us:

$ mypy example.py
example.py:4: note: Revealed type is "def (string: builtins.str*, pos: builtins.int =, endpos: builtins.int =) -> Union[typing.Match[builtins.str*], None]"

Mypy reports that the string argument must be a str, and the function returns Match[str] | None.
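As with matches, when we write a function that accepts a compiled pattern, we can use re.Pattern in the type hint. Here is a minimal sketch, assuming Python 3.9+; count_mentions() and the sample lines are hypothetical.

```python
import re


def count_mentions(pattern: re.Pattern[str], lines: list[str]) -> int:
    # Count the lines in which the pattern matches anywhere.
    return sum(1 for line in lines if pattern.search(line))


groot_re = re.compile(r"\bgroot\b", re.IGNORECASE)
lines = ["I am Groot.", "We are Groot.", "I am Rocket."]
mentions = count_mentions(groot_re, lines)
print(mentions)
```

Passing a pattern compiled from bytes here should be rejected by the type checker, since the parameter is declared as Pattern[str].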

Fin

I hope this post matched what you were looking for,

—Adam

