Regular Expressions
Authors: Frank Mayer
Regular expressions (regex) are patterns that find text. They help you search, validate, and transform strings.
Where you will use regex:
- Editors: Visual Studio Code, Visual Studio, Zed, Vim, Emacs
- Languages: most modern languages include regex engines
- Terminal tools: sed, grep, awk, vim, emacs
- Databases: some, like PostgreSQL, support regex
Regex is everywhere. Learning it pays off.
Basic text (no special characters)
A pattern matches itself. Example:
- Pattern hello matches the text hello
- It does not match Hello (case sensitive unless you use the i flag)
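A minimal sketch of literal matching, assuming Python's standard re module as the engine (the text itself does not pick one). The sample strings are made up for illustration.

```python
import re

# A literal pattern matches the same characters in the text.
found = re.search(r"hello", "say hello")

# Matching is case sensitive by default, so "Hello" is not found.
missed = re.search(r"hello", "say Hello")

# The i flag is spelled re.IGNORECASE in Python.
relaxed = re.search(r"hello", "say Hello", re.IGNORECASE)

print(found is not None, missed is None, relaxed is not None)
```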
^ and $ anchors (start and end of line)
- ^ matches the start of a line
- $ matches the end of a line
Examples:
- ^hello matches hello only at the start
- world$ matches world only at the end
In multi-line mode (m flag), ^ and $ work per line.
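The effect of the m flag can be sketched as follows, again assuming Python's re module, where the flag is called re.MULTILINE. The two-line sample text is hypothetical.

```python
import re

text = "hello world\nworld hello"

# Without the m flag, ^ and $ anchor only at the edges of the whole text.
start_default = re.findall(r"^world", text)   # second line is not seen as a start
end_default = re.findall(r"world$", text)     # the text ends with "hello"

# With re.MULTILINE, ^ and $ also anchor at every line break.
start_per_line = re.findall(r"^world", text, re.MULTILINE)
end_per_line = re.findall(r"world$", text, re.MULTILINE)

print(start_default, end_default, start_per_line, end_per_line)
```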
. any single character
- . matches any one character except newline in many engines
- Some engines have the s flag (dotall) so . also matches newlines
Example:
- h.t matches hat, hot, h9t
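A short sketch of the dot and the dotall flag, assuming Python's re module (where the s flag is re.DOTALL). The candidate strings are illustrative.

```python
import re

candidates = ["hat", "hot", "h9t", "ht", "h\nt"]

# . needs exactly one character, and by default that character is not a newline.
plain = [c for c in candidates if re.fullmatch(r"h.t", c)]

# With re.DOTALL the dot also accepts the newline in "h\nt".
dotall = [c for c in candidates if re.fullmatch(r"h.t", c, re.DOTALL)]

print(plain)
print(dotall)
```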
[] set of characters
- [abc] matches one character that is a, b, or c
- Example: gr[ae]y matches gray or grey
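As a quick check of the set example, assuming Python's re module and a hypothetical word list:

```python
import re

words = ["gray", "grey", "groy", "graey"]

# [ae] stands for exactly one character, either a or e.
set_hits = [w for w in words if re.fullmatch(r"gr[ae]y", w)]

print(set_hits)
```

Note that graey is rejected: the set consumes a single character, not a sequence.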
[-] character range
Use - for an inclusive range.
You can use numbers or letters.
Example:
- [a-f] matches a through f
- [0-3] matches 0 through 3
- [A-Z] matches A through Z
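A range follows character-code order, so keep each range inside one of these three character classes: digits, lowercase letters, uppercase letters.
These are invalid because the start comes after the end: [g-9] and [a-Z]. A range like [A-z] is accepted but also matches the punctuation between Z and a.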
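How an engine reacts to a bad range can be sketched like this, assuming Python's re module, which raises re.error when a pattern does not compile. The helper name is_valid_pattern is hypothetical.

```python
import re

def is_valid_pattern(pattern: str) -> bool:
    """Return True if Python's re module can compile the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# A well-formed range matches every character between its endpoints.
in_range = re.findall(r"[a-f]", "abcxyz")

# Ranges whose start sorts after their end are rejected.
bad_digit_range = is_valid_pattern(r"[g-9]")
bad_case_range = is_valid_pattern(r"[a-Z]")

# [A-z] compiles, but it spans punctuation such as the underscore too.
wide_range_hits_underscore = re.fullmatch(r"[A-z]", "_") is not None

print(in_range, bad_digit_range, bad_case_range, wide_range_hits_underscore)
```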
[^] negated set of characters
- [^abc] matches one character that is not a, b, or c
- [^0-9] matches a non-digit
Example:
- gr[^ae]y matches grOy but not gray or grey
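A small sketch of negated sets, assuming Python's re module; the inputs are illustrative.

```python
import re

words = ["grOy", "gray", "grey", "gr9y"]

# [^ae] accepts any single character except a and e.
negated_hits = [w for w in words if re.fullmatch(r"gr[^ae]y", w)]

# [^0-9] picks out every character that is not a digit.
non_digits = re.findall(r"[^0-9]", "a1b2")

print(negated_hits, non_digits)
```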
() grouping and | alternation (or)
- () groups parts so you can apply operators to the group
- () also captures matched text for later use
- | means “or”
Examples:
- gr(e|a)y matches grey or gray
- dog|cat matches dog or cat
- (ha)+ matches ha, hahaha, etc.
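Grouping, capturing, and alternation can be tried together in a short sketch. It assumes Python's re module, where match.group(1) returns the text captured by the first pair of parentheses.

```python
import re

# The group limits the alternation to one letter and captures it.
grey = re.fullmatch(r"gr(e|a)y", "grey")
captured_letter = grey.group(1)

# Without parentheses, | splits the whole pattern into two alternatives.
pets = re.findall(r"dog|cat", "dog, cat, bird")

# A quantifier after a group repeats the whole group.
laugh = re.fullmatch(r"(ha)+", "hahaha")
broken_laugh = re.fullmatch(r"(ha)+", "hah")

print(captured_letter, pets, laugh.group(0), broken_laugh)
```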
{} counts (quantifiers)
- Exact: a{3} matches aaa
- Range: a{2,4} matches aa, aaa, or aaaa
- Minimum: a{2,} matches aa and more
- Maximum: some engines (Python, for example) support a{,3} (max only). JavaScript and older Perl/PCRE versions do not. Use a{0,3} instead for portability.
Note: quantifiers apply to the token before them. Use () to scope:
- (ab){3} repeats ab three times
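The counting rules above can be checked with a short sketch, assuming Python's re module. The helper full is a hypothetical convenience wrapper around re.fullmatch.

```python
import re

def full(pattern: str, text: str) -> bool:
    """True if the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

# Lengths of "a" runs (1 to 5) accepted by a{2,4}.
range_lengths = [n for n in range(1, 6) if full(r"a{2,4}", "a" * n)]

# The portable spelling of "at most three": lengths 0 to 4 tested.
max_lengths = [n for n in range(0, 5) if full(r"a{0,3}", "a" * n)]

# A quantifier binds to the token before it: b alone, or the group (ab).
token_scope = full(r"ab{3}", "abbb")
group_scope = full(r"(ab){3}", "ababab")
wrong_scope = full(r"ab{3}", "ababab")

print(range_lengths, max_lengths, token_scope, group_scope, wrong_scope)
```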
Escaping with \
- To match a special character literally, escape it
- Example: \. matches a literal dot
- Common specials: .^$|?*+()[]{}\
Example:
- price\$ matches the literal text price$. Add an unescaped $ (price\$$) to match it only at the end of a line.
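A sketch of escaping in practice, assuming Python's re module. re.escape is a standard-library helper that escapes special characters for you; the sample strings are made up.

```python
import re

# Unescaped, the dot is a wildcard, so "abc" matches a.c.
wildcard = re.fullmatch(r"a.c", "abc") is not None

# Escaped, the dot only matches a literal dot.
escaped_rejects = re.fullmatch(r"a\.c", "abc") is None
escaped_accepts = re.fullmatch(r"a\.c", "a.c") is not None

# \$ matches a dollar sign instead of acting as an anchor.
price = re.search(r"price\$", "total price$")

# re.escape builds a pattern that matches arbitrary text literally.
raw_text = "1+1=2?"
literal_pattern = re.escape(raw_text)
round_trip = re.fullmatch(literal_pattern, raw_text) is not None

print(wildcard, escaped_rejects, escaped_accepts, price.group(), round_trip)
```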
Optional ?
- ? means “zero or one”
- Example: colou?r matches color and colour
+ one or more
- + means “one or more”
- Example: \d+ matches 1, 23, 4567
* zero or more
- * means “zero or more”
- Example: go*gle matches ggle, gogle, google, gooogle, etc.
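The three shorthand quantifiers ?, + and * can be compared side by side. This is a sketch assuming Python's re module, with hypothetical word lists.

```python
import re

def accepted(pattern: str, words: list[str]) -> list[str]:
    """Return the words that match the pattern in full."""
    return [w for w in words if re.fullmatch(pattern, w)]

# ? allows the u to appear zero times or once, not twice.
optional_hits = accepted(r"colou?r", ["color", "colour", "colouur"])

# + needs at least one digit, so each run of digits becomes one match.
digit_runs = re.findall(r"\d+", "1 and 23 and 4567")

# * allows any number of o characters, including none.
star_hits = accepted(r"go*gle", ["ggle", "gogle", "google", "gooogle", "gugle"])

print(optional_hits, digit_runs, star_hits)
```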
\d, \w, \s shorthands
- \d digit (0–9)
- \w “word” character (letters, digits, underscore)
- \s whitespace (space, tab, newline, etc.)
Example:
- \w+ matches a word-like token
- \d{4}-\d{2}-\d{2} matches a simple date form
Note: POSIX uses [[:digit:]], [[:alnum:]], [[:space:]] instead.
\D, \W, \S negated shorthands
- \D non-digit
- \W non-word character
- \S non-whitespace
Example:
- \S+ matches a run of non-whitespace characters
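The shorthands and their negations can be seen at work on one sample sentence. This sketch assumes Python's re module; the sentence and the date in it are invented for illustration.

```python
import re

text = "Released v2 on 2024-03-15, see notes"

# \w+ stops at anything that is not a letter, digit, or underscore.
word_tokens = re.findall(r"\w+", text)

# \d with counts picks out the simple date form.
date = re.search(r"\d{4}-\d{2}-\d{2}", text).group()

# \S+ keeps punctuation attached because it only splits on whitespace.
chunks = re.findall(r"\S+", text)

# \D removes everything that is not a digit.
digits_only = re.sub(r"\D", "", "tel: 555-0100")

print(word_tokens)
print(date, chunks, digits_only)
```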
Real-world practice
Put together what you just learned.
Try to write the pattern before opening the solution.
Match a US ZIP code (5 digits, optional -4)
Text examples:
- 94105
- 30301-1234
Don’t match phone numbers:
- 123456789
- 1234567890
Solution
^\d{5}(-\d{4})?$
5 digits, optional hyphen and 4 digits.
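One way to try the solution is a small sketch, assuming Python's re module. The function name is_zip is hypothetical, and the rejected inputs are plain digit strings that are too long for a ZIP code.

```python
import re

ZIP_RE = re.compile(r"^\d{5}(-\d{4})?$")

def is_zip(text: str) -> bool:
    """True for 5 digits, optionally followed by a hyphen and 4 digits."""
    return ZIP_RE.search(text) is not None

zip_results = {s: is_zip(s) for s in ["94105", "30301-1234", "123456789", "1234567890"]}
print(zip_results)
```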
European date
Match the European date format DD.MM.YYYY, allowing optional whitespace after each dot and optional leading zeros.
Don’t match US-style dates like 1999-01-01 or other formats.
Solution
^\d{1,2}\.\s*\d{1,2}\.\s*\d{4}$
1-2 digits, dot, optional whitespace, 1-2 digits, dot, optional whitespace, 4 digits.
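The date solution can be exercised with a sketch like the following, assuming Python's re module. The function name and the sample dates are hypothetical.

```python
import re

DATE_RE = re.compile(r"^\d{1,2}\.\s*\d{1,2}\.\s*\d{4}$")

def is_european_date(text: str) -> bool:
    """True if the text has the shape D.M.YYYY, with optional spaces after dots."""
    return DATE_RE.search(text) is not None

padded = is_european_date("01.02.2024")
spaced = is_european_date("1. 2. 2024")
us_style = is_european_date("1999-01-01")
short_year = is_european_date("1.2.24")

# The pattern checks shape only, so impossible dates still pass.
shape_only = is_european_date("99.99.9999")

print(padded, spaced, us_style, short_year, shape_only)
```

Range checks (day 1 to 31, month 1 to 12) are usually easier to do in code after the match than inside the pattern.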
Simple email
Solution
^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$
User part, @, domain, dot, TLD with 2+ letters.
In the real world, don’t use regex to validate an email address.
A 99% solution (nearly perfect) is ^[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?$.
It’s super complicated, but there are still cases where it’s wrong.
Just make sure the address contains an @ and a . and you’re fine.
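Both approaches, the simple pattern and the loose "contains @ and ." check, can be sketched as follows. This assumes Python's re module; the function name looks_like_email and the example.com addresses are placeholders.

```python
import re

SIMPLE_EMAIL_RE = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")

def looks_like_email(text: str) -> bool:
    """Loose sanity check: only requires an @ and a dot somewhere."""
    return "@" in text and "." in text

with_tld = SIMPLE_EMAIL_RE.search("user@example.com") is not None
without_tld = SIMPLE_EMAIL_RE.search("user@example") is not None
without_at = SIMPLE_EMAIL_RE.search("user.example.com") is not None

loose_ok = looks_like_email("user@example.com")
loose_rejected = looks_like_email("user.example.com")

print(with_tld, without_tld, without_at, loose_ok, loose_rejected)
```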
Hex color #RGB or #RRGGBB
No alpha.
Text examples:
- #09f
- #1A2B3C
Invalid inputs:
- #1234567890
- #zr4
- #iAZQ8n
- #1234
Solution
^#([A-Fa-f0-9]{3}|[A-Fa-f0-9]{6})$
Either 3 or 6 hex characters: 0 to 9 and a to f, in upper or lower case.
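The hex color solution can be run against the sample inputs listed in this exercise. A minimal sketch, assuming Python's re module; the function name is_hex_color is hypothetical.

```python
import re

HEX_RE = re.compile(r"^#([A-Fa-f0-9]{3}|[A-Fa-f0-9]{6})$")

def is_hex_color(text: str) -> bool:
    """True for #RGB or #RRGGBB with hex digits in either case."""
    return HEX_RE.search(text) is not None

valid_inputs = ["#09f", "#1A2B3C"]
invalid_inputs = ["#1234567890", "#zr4", "#iAZQ8n", "#1234"]

accepted_colors = [s for s in valid_inputs + invalid_inputs if is_hex_color(s)]
print(accepted_colors)
```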
POSIX vs Perl-style (PCRE) regex
Two big families:
- POSIX (BRE/ERE)
  - Uses character classes like [[:digit:]], [[:alpha:]]
  - Alternation | is in ERE (use egrep or grep -E)
  - BRE vs ERE differ in what you must escape
  - Leftmost-longest match rule
  - Fewer features (no non-greedy *?, limited backreferences)
- Perl/PCRE (the de facto standard)
  - Uses \d, \w, \s, groups, lookarounds, lazy quantifiers
  - Most common in languages and tools today (PCRE, JavaScript-inspired)
  - Behavior is what most tutorials refer to
Note: some features exist only in certain engines (for example, some browser versions). Perl/PCRE-style is the most portable default.
Tips on portability
- Prefer Perl/PCRE-style features for cross-tool consistency
- Avoid engine-specific extras unless you control the runtime
- For POSIX tools, use [[:digit:]], [[:space:]], etc.
- Test patterns in your editor or tool before committing
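As one illustration of why testing in the target engine matters, here is a sketch of a small difference in Python's re module: for text patterns, \d accepts Unicode decimal digits by default, while the explicit class [0-9] or the re.ASCII flag restricts it to 0 through 9. Other engines may behave differently, so treat this as an example of the kind of detail to check, not a general rule.

```python
import re

arabic_three = "\u0663"  # ARABIC-INDIC DIGIT THREE, a non-ASCII decimal digit

# Default behaviour for str patterns in Python: \d is Unicode-aware.
unicode_digit = re.fullmatch(r"\d", arabic_three) is not None

# re.ASCII narrows \d to the ASCII digits 0-9.
ascii_digit = re.fullmatch(r"\d", arabic_three, re.ASCII) is not None

# An explicit range means the same thing in every engine.
explicit_range = re.fullmatch(r"[0-9]", arabic_three) is not None

print(unicode_digit, ascii_digit, explicit_range)
```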
Summary
- Build from simple text
- Use anchors ^ and $
- Use . and [] to control what you match
- Group with () and choose with |
- Control counts with {}, ?, +, *
- Use \d, \w, \s and their opposites for shorthand
- Check engine docs for small differences (Perl/PCRE is the common base)