Most people meet regular expressions as a wall of symbols copied from a forum, tweak it until it works, and back away slowly. That's a shame, because the underlying idea is simple and the vocabulary is small. This guide builds a pattern from scratch and then names the handful of traps worth knowing.
A pattern is a description
A regular expression describes what a matching string looks like. The engine reads your subject text left to right and, at each position, asks, "does the pattern fit starting here?" Everything else (character classes, quantifiers, anchors) is just a richer way to write that description. Keep that picture in mind and the symbols stop being magic.
The fastest way to learn is to change one piece at a time and watch what matches. The regex tester highlights every match as you type, which turns the whole thing into a feedback loop instead of guesswork.
Five pieces cover most patterns
- Literals match themselves: cat matches the letters c-a-t.
- Character classes match one character from a set: \d is any digit, \w is a letter/digit/underscore, \s is whitespace, and [a-f] is your own range.
- Quantifiers say how many: * is zero-or-more, + is one-or-more, ? is optional, and {2,4} is a specific count.
- Anchors match a position rather than a character: ^ and $ are the start and end, and \b is a word boundary.
- Groups and alternation bundle and branch: (...) groups, and cat|dog matches either.
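To see the five pieces in action, here is a minimal sketch with one small check per building block. JavaScript is an assumption here (the article does not name a language), and the sample strings are made up for illustration.

```javascript
// One illustrative check per building block; the inputs are invented examples.
const checks = {
  // Literal: the letters c-a-t appear in order inside "concatenate".
  literal: /cat/.test("concatenate"),
  // Character class: one letter from a-f followed by one digit ("b7").
  charClass: /[a-f]\d/.test("room b7"),
  // Quantifier: two to four digits and nothing else.
  quantifier: /^\d{2,4}$/.test("123"),
  // Anchor: \b demands a word boundary, so "cat" inside a longer word fails.
  anchor: /\bcat\b/.test("concatenate"),
  // Group + alternation: either word, with an optional trailing s.
  alternation: /^(cat|dog)s?$/.test("dogs"),
};

console.log(checks);
// { literal: true, charClass: true, quantifier: true, anchor: false, alternation: true }
```

The one false is the instructive one: the same three letters match as a literal but not as a whole word, because \b checks the position around them rather than the characters themselves.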
Watch them combine into something useful — a rough date matcher:
\d{4}-\d{2}-\d{2} matches 2026-06-01
^\d{4}-\d{2}-\d{2}$ ...but only if that's the whole string
Read it as a description: four digits, a hyphen, two digits, a hyphen, two digits. Add the anchors and you also say "and nothing else in the string". That's the entire trick: you're spelling out the shape.
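The difference between the two versions is easiest to see side by side. The sketch below assumes JavaScript and uses made-up sample strings; the names loose and strict are just labels for this example.

```javascript
// The same date shape, without and with anchors.
const loose = /\d{4}-\d{2}-\d{2}/;
const strict = /^\d{4}-\d{2}-\d{2}$/;

// Invented inputs: a bare date, a date inside a sentence, and a wrong shape.
const samples = ["2026-06-01", "due 2026-06-01 at noon", "2026-6-1"];

const results = samples.map((text) => ({
  text,
  loose: loose.test(text),   // true if the shape appears anywhere
  strict: strict.test(text), // true only if the shape is the whole string
}));

console.log(results);
// "2026-06-01"             -> loose: true,  strict: true
// "due 2026-06-01 at noon" -> loose: true,  strict: false
// "2026-6-1"               -> loose: false, strict: false
```

It is "rough" because it only checks the shape: 2026-13-45 passes both patterns, since \d{2} accepts any two digits.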
Why .* grabs too much
Quantifiers are greedy by default: they match as much as possible, then give back only if the rest of the pattern fails. Run <.*> against <a><b> and it matches the whole thing, not just <a>, because .* swallowed everything before the final >.
Add a ? to make a quantifier lazy — match as little as possible: <.*?> stops at the first > and matches <a>. Greedy versus lazy is one of the most common "why is it matching that?" moments, and flipping a single ? usually fixes it.
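Here is the greedy and lazy pair run on the same input, as a small sketch assuming JavaScript. The last line shows a common alternative that is not from the text above: a negated class, [^>]*, which cannot run past a > in the first place.

```javascript
const text = "<a><b>";

// Greedy: .* runs to the end, then gives back just enough for the final >.
const greedy = text.match(/<.*>/)[0];

// Lazy: .*? grows one character at a time and stops at the first >.
const lazy = text.match(/<.*?>/)[0];

// With the g flag, the lazy version finds each tag separately.
const allLazy = text.match(/<.*?>/g);

// Alternative (an addition for illustration): a specific class instead of the dot.
const allNegated = text.match(/<[^>]*>/g);

console.log(greedy);     // <a><b>
console.log(lazy);       // <a>
console.log(allLazy);    // [ '<a>', '<b>' ]
console.log(allNegated); // [ '<a>', '<b>' ]
```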
Capturing and reusing parts
Parentheses don't just group — they capture. The text each group matched is available afterwards as $1, $2, and so on, which is what makes find-and-replace powerful. Swapping 2026-06-01 to 01/06/2026 is one replace:
pattern: (\d{4})-(\d{2})-(\d{2})
replacement: $3/$2/$1
Most engines also support named groups like (?<year>\d{4}) for readability, though the exact syntax varies by language.
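As a sketch of the date swap in code, the example below assumes JavaScript, where the replacement string uses $1-style references and named groups are referenced as $<name>. Other languages spell these differently, and the group names year, month, and day are only illustrative.

```javascript
const iso = "2026-06-01";

// Numbered groups: $1 is the year, $2 the month, $3 the day.
const swapped = iso.replace(/(\d{4})-(\d{2})-(\d{2})/, "$3/$2/$1");

// Named groups: same pattern, but each capture carries a label.
const datePattern = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/;
const swappedNamed = iso.replace(datePattern, "$<day>/$<month>/$<year>");

// Named captures are also available as properties on the match result.
const parts = iso.match(datePattern).groups;

console.log(swapped);      // 01/06/2026
console.log(swappedNamed); // 01/06/2026
console.log(parts.year, parts.month, parts.day); // 2026 06 01
```

The named version is longer, but the replacement string explains itself, which helps once a pattern has more than two or three groups.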
Things that bite everyone once
- Catastrophic backtracking. Nested quantifiers such as (a+)+$ can explode into millions of attempts on certain inputs and hang the program. Avoid overlapping quantifiers, anchor your pattern, and prefer specific classes to .*.
- The dot doesn't cross lines. By default . matches anything except a newline. If your text spans lines, add the s (dotall) flag.
- Forgetting to anchor. A pattern that "validates" an email will happily match it inside a longer junk string unless you anchor with ^ and $.
- Reaching for regex too soon. Don't parse HTML or deeply nested formats with it; use a real parser. Regex shines on flat, predictable text.
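The first three traps can be reproduced in a few lines. This is a sketch assuming JavaScript; the email pattern is deliberately simplistic and hypothetical, not a real validator, and the backtracking inputs are kept short on purpose so nothing hangs.

```javascript
// Trap 1: nested quantifiers. On these short inputs (a+)+ and a+ give the
// same answers; the flat form avoids the nested repetition that a
// backtracking engine can blow up on with long near-miss inputs.
const nested = /^(a+)+$/;
const flat = /^a+$/;
const shortInputs = ["a", "aaaa", "aaab", ""];
const sameAnswers = shortInputs.every((s) => nested.test(s) === flat.test(s));

// Trap 2: the dot stops at a newline unless the s (dotall) flag is set.
const block = "start\nend";
const dotDefault = /start.*end/.test(block);
const dotAll = /start.*end/s.test(block);

// Trap 3: an unanchored "validator" accepts junk around the match.
// Hypothetical, oversimplified email shape: something@something.something
const emailLoose = /[^\s@]+@[^\s@]+\.[^\s@]+/;
const emailStrict = /^[^\s@]+@[^\s@]+\.[^\s@]+$/;
const junk = "<<< junk bob@example.com more junk >>>";

console.log(sameAnswers);                         // true
console.log(dotDefault, dotAll);                  // false true
console.log(emailLoose.test(junk));               // true
console.log(emailStrict.test(junk));              // false
console.log(emailStrict.test("bob@example.com")); // true
```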
The four flags you'll actually use
g finds every match instead of stopping at the first. i makes matching case-insensitive. m (multiline) makes ^ and $ match at the start and end of every line, not just of the whole string. s (dotall) lets . match newlines too. You can toggle each of these on the tester and see the match set change immediately.
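The same effect can be seen in code by running one input through patterns that differ only in their flags. This sketch assumes JavaScript, where flags are written after the closing slash; the sample text is invented.

```javascript
const text = "cat Cat\nCAT cat";

// No flags: only the first match is reported.
const firstIndex = text.match(/cat/).index;

// g: every lowercase "cat".
const all = text.match(/cat/g);

// g + i: every "cat" regardless of case.
const allAnyCase = text.match(/cat/gi);

// Without m, ^ matches only at the start of the string; with m, at each line.
const firstWordOfString = text.match(/^\w+/g);
const firstWordOfLines = text.match(/^\w+/gm);

// Without s, .* cannot cross the newline between "cat" and "CAT"; with s it can.
const sameLine = /cat.*CAT/.test(text);
const acrossLines = /cat.*CAT/s.test(text);

console.log(firstIndex);            // 0
console.log(all);                   // [ 'cat', 'cat' ]
console.log(allAnyCase);            // [ 'cat', 'Cat', 'CAT', 'cat' ]
console.log(firstWordOfString);     // [ 'cat' ]
console.log(firstWordOfLines);      // [ 'cat', 'CAT' ]
console.log(sameLine, acrossLines); // false true
```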
That's the core of it. Regex isn't a secret language for wizards — it's a compact way to describe text, with a few sharp edges. Build patterns one piece at a time, test as you go, and keep them anchored and specific. The line noise turns into something you can read.