SystemCraft

Regular Expressions

Authors:  Frank Mayer

Regular expressions (regex) are patterns that find text. They help you search, validate, and transform strings.

Where you will use regex:

Regex is everywhere. Learning it pays off.

Basic text (no special characters)

A pattern matches itself. Example:
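As a minimal sketch using Python's built-in re module (the sample text is made up for illustration), a pattern without special characters finds exactly the characters it is written with:

```python
import re

# Hypothetical sample text, chosen only for illustration.
text = "The cat sat on the mat."

# A plain pattern matches the same characters wherever they occur.
match = re.search(r"cat", text)
found = match.group(0)
position = match.start()

# Matching is case-sensitive by default, so "Cat" is not found.
missing = re.search(r"Cat", text)

print(found, position, missing)
```

The search returns the first occurrence and its position, or None when nothing matches.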

^ and $ anchors (start and end of line)

Examples:

In multi-line mode (m flag), ^ and $ work per line.
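The difference between the default mode and the m flag can be sketched in Python, where the flag is spelled re.MULTILINE. The two-line sample string is an assumption made for this illustration:

```python
import re

# Hypothetical two-line input.
text = "hello world\nworld hello"

# Default mode: ^ matches only at the very start of the string,
# and $ only at the very end.
start_default = re.findall(r"^world", text)
end_default = re.findall(r"hello$", text)

# Multi-line mode: ^ and $ match at the start and end of every line.
start_per_line = re.findall(r"^world", text, re.MULTILINE)
end_per_line = re.findall(r"world$", text, re.MULTILINE)

print(start_default, end_default, start_per_line, end_per_line)
```

Here "world" starts the second line, so ^world only matches once multi-line mode is on.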

. any single character

Example:
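A small Python sketch of the dot, using a made-up word list. The dot stands for exactly one character, so it matches neither zero characters nor two:

```python
import re

# Hypothetical candidates to test against the pattern c.t
words = ["cat", "cot", "cut", "ct", "coat"]

# fullmatch requires the whole word to fit the pattern.
matches = [w for w in words if re.fullmatch(r"c.t", w)]

# By default the dot does not match a newline character.
newline_match = re.fullmatch(r"c.t", "c\nt")

print(matches, newline_match)
```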

[] set of characters

[-] character range

Use - for an inclusive range. You can use numbers or letters.

Example:

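A range should stay within one of these three character classes: digits, lowercase letters, or uppercase letters. These ranges are invalid: [g-9] and [a-Z]. In both, the start character comes after the end character in character-code order.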
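The ranges can be tried out in Python as a rough sketch. The sample text is hypothetical, and the error handling shows how Python's engine reacts to a reversed range (other engines may report it differently):

```python
import re

# Hypothetical sample text.
text = "a1 B2 c3"

digits = re.findall(r"[0-9]", text)
lowercase = re.findall(r"[a-z]", text)
# Several ranges can be combined inside one set.
letters = re.findall(r"[A-Za-z]", text)

# A range whose start sorts after its end is rejected at compile time.
try:
    re.compile(r"[a-Z]")
    range_is_valid = True
except re.error:
    range_is_valid = False

print(digits, lowercase, letters, range_is_valid)
```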

[^] negated set of characters

Example:
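A negated set matches any single character that is not listed. A short Python sketch with a made-up input string:

```python
import re

# Hypothetical sample text.
text = "a1-b2_c3"

# Every character that is not a digit.
non_digits = re.findall(r"[^0-9]", text)

# Remove everything that is not a lowercase letter.
letters_only = re.sub(r"[^a-z]", "", text)

print(non_digits, letters_only)
```

Note that ^ only negates when it is the first character inside the brackets.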

() grouping and | alternation (or)

Examples:
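The following Python sketch (with hypothetical inputs) shows alternation inside a group, and what changes when the group is left out:

```python
import re

# Alternation inside a group: "gr", then a or e, then "y".
words = ["gray", "grey", "griy", "gry"]
spellings = [w for w in words if re.fullmatch(r"gr(a|e)y", w)]

# Without a group, | splits the whole pattern: "cat" OR "dog food".
loose = re.fullmatch(r"cat|dog food", "cat food")

# With a group, only the animal varies and " food" is always required.
scoped = re.fullmatch(r"(cat|dog) food", "cat food")
animal = scoped.group(1)

print(spellings, loose, animal)
```

The group also captures what it matched, which is why group(1) returns the chosen alternative.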

{} counts (quantifiers)

Note: quantifiers apply to the token before them. Use () to scope:
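To illustrate the scoping note, here is a minimal Python sketch. The forms {n}, {n,} and {n,m} are the standard count syntax in Perl-style engines, and the test strings are invented for this example:

```python
import re

# {n} exact count, {n,} at least n, {n,m} between n and m.
exact = bool(re.fullmatch(r"a{3}", "aaa"))
at_least = bool(re.fullmatch(r"a{2,}", "aaaaa"))
between = bool(re.fullmatch(r"a{2,4}", "aaaaa"))

# The count applies only to the token directly before it: here, just "b".
single_token = bool(re.fullmatch(r"ab{2}", "abb"))
not_repeated = bool(re.fullmatch(r"ab{2}", "abab"))

# A group makes the count apply to the whole "ab" sequence.
grouped = bool(re.fullmatch(r"(ab){2}", "abab"))

print(exact, at_least, between, single_token, not_repeated, grouped)
```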

Escaping with \

Example:
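A short Python sketch of why escaping matters, using a made-up file name. An unescaped dot matches any character, while \. matches only a literal dot:

```python
import re

# Hypothetical sample text.
text = "file.txt filextxt"

# Unescaped: the dot matches "." and also "x".
unescaped = re.findall(r"file.txt", text)

# Escaped: only a literal dot matches.
escaped = re.findall(r"file\.txt", text)

# re.escape adds the backslashes for arbitrary literal text.
price_pattern = re.escape("$5.00")
price_match = re.fullmatch(price_pattern, "$5.00")

print(unescaped, escaped, price_match is not None)
```

re.escape is Python-specific, though many languages offer a similar helper.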

Optional ?

+ one or more

* zero or more
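The three quantifiers ?, + and * are easiest to compare side by side. A rough Python sketch with invented spellings of "colour":

```python
import re

# Hypothetical candidates with zero, one, and two "u" characters.
candidates = ["color", "colour", "colouur"]

# ? : the "u" may appear zero times or once.
optional = [w for w in candidates if re.fullmatch(r"colou?r", w)]

# + : the "u" must appear at least once.
one_or_more = [w for w in candidates if re.fullmatch(r"colou+r", w)]

# * : the "u" may appear any number of times, including zero.
zero_or_more = [w for w in candidates if re.fullmatch(r"colou*r", w)]

print(optional, one_or_more, zero_or_more)
```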

\d, \w, \s shorthands

Example:
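A minimal Python sketch of the three shorthands on a made-up order line: \d for digits, \w for word characters (letters, digits, underscore), and \s for whitespace:

```python
import re

# Hypothetical sample text.
text = "Order 42: 3 red_apples"

# Runs of digits.
numbers = re.findall(r"\d+", text)

# Runs of word characters; the colon and spaces act as separators.
words = re.findall(r"\w+", text)

# Split on runs of whitespace.
parts = re.split(r"\s+", text)

print(numbers, words, parts)
```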

Note: POSIX uses [[:digit:]], [[:alnum:]], [[:space:]] instead.

\D, \W, \S negated shorthands

Example:
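The uppercase shorthands match everything their lowercase versions do not. A short Python sketch with an invented phone line:

```python
import re

# Hypothetical sample text.
text = "Tel: 555-0100"

# \D: remove every non-digit, keeping only the digits.
digits_only = re.sub(r"\D", "", text)

# \W: every character that is not a letter, digit, or underscore.
non_word = re.findall(r"\W", text)

# \S+: runs of non-whitespace characters.
chunks = re.findall(r"\S+", text)

print(digits_only, non_word, chunks)
```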

Real-world practice

Put together what you just learned.

Try to write the pattern before opening the solution.

Match a US ZIP code (5 digits, optional -4)

Text examples:

Don’t match phone numbers:

Output:

Solution

^\d{5}(-\d{4})?$

5 digits, optional hyphen and 4 digits.
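The ZIP pattern can be checked against a few inputs in Python. The sample values below are assumptions made for this sketch, not the original exercise data:

```python
import re

ZIP = re.compile(r"^\d{5}(-\d{4})?$")

# Hypothetical inputs: two valid ZIP codes and several near misses.
samples = ["12345", "12345-6789", "1234", "123456", "12345-678", "555-0100"]

valid = [s for s in samples if ZIP.search(s)]
invalid = [s for s in samples if not ZIP.search(s)]

print(valid)
print(invalid)
```

The anchors matter here: without ^ and $, the pattern would also match the first five digits of "123456".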

European date

Match the European date format DD.MM.YYYY, with optional whitespace after each dot and optional leading zeros. Don’t match ISO-style dates like 1999-01-01 or other formats.

Output:

Solution

^\d{1,2}\.\s*\d{1,2}\.\s*\d{4}$

1-2 digits, dot, optional whitespace, 1-2 digits, dot, optional whitespace, 4 digits.
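A quick Python check of the date pattern, using sample strings invented for this sketch:

```python
import re

DATE = re.compile(r"^\d{1,2}\.\s*\d{1,2}\.\s*\d{4}$")

# Hypothetical inputs: dotted dates plus other formats that must fail.
samples = ["01.02.2024", "1.2.2024", "1. 2. 2024",
           "1999-01-01", "01/02/2024", "01.02.24"]

valid = [s for s in samples if DATE.search(s)]
invalid = [s for s in samples if not DATE.search(s)]

print(valid)
print(invalid)
```

The pattern checks the shape only, so an impossible date such as 99.99.9999 also matches. Range checks are better done after parsing.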

Simple email

Text examples:

Output:

Solution

^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$

User part, @, domain, dot, TLD with 2+ letters.
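The simple email pattern can be exercised in Python as follows. The addresses are made up for illustration and use reserved example domains:

```python
import re

EMAIL = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")

# Hypothetical inputs.
samples = [
    "alice@example.com",
    "bob.smith+news@mail.example.org",
    "no-at-sign.example.com",   # missing @
    "user@localhost",           # no dot and TLD after the domain
    "user@example.c",           # TLD shorter than 2 letters
]

valid = [s for s in samples if EMAIL.search(s)]
invalid = [s for s in samples if not EMAIL.search(s)]

print(valid)
print(invalid)
```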

In the real world, don’t use regex to validate an email address. A 99% solution (nearly perfect) is ^[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?$. It’s very complicated, yet there are still cases where it’s wrong. Just make sure the address contains an @ and a . and you’re fine.

Hex color #RGB or #RRGGBB

No alpha.

Text examples:

Invalid inputs:

Output:

Solution

^#([A-Fa-f0-9]{3}|[A-Fa-f0-9]{6})$

A leading #, then either 3 or 6 hex characters (0 to 9 and a to f, upper or lower case).
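A short Python sketch that runs the hex color pattern over some invented inputs, including an 8-digit value with alpha that should be rejected:

```python
import re

HEX_COLOR = re.compile(r"^#([A-Fa-f0-9]{3}|[A-Fa-f0-9]{6})$")

# Hypothetical inputs.
samples = ["#fff", "#FFAA00", "#1a2B3c",
           "#ffff", "#12345", "fff", "#ggg", "#FFAA0080"]

valid = [s for s in samples if HEX_COLOR.search(s)]
invalid = [s for s in samples if not HEX_COLOR.search(s)]

print(valid)
print(invalid)
```

The group around the alternation keeps # and the anchors outside, so both lengths share the same prefix and must span the whole string.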

POSIX vs Perl-style (PCRE) regex

Two big families:

Note: some features exist only in certain engines (for example, some browser versions). Perl/PCRE-style is the most portable default.

Tips on portability

Summary