What is a Regular Expression?
A regular expression (regex or regexp) is a sequence of characters that defines a search pattern. The concept originated in theoretical computer science, and regex engines are now built into most programming languages for text processing, validation, and data extraction.
Common uses include form validation in web applications, log file analysis, search-and-replace in text editors, data parsing in ETL pipelines, and URL routing in web frameworks.
Understanding Regex Syntax
Literal Characters
Most characters match themselves literally. The pattern hello matches the string "hello" exactly. However, certain characters have special meanings and must be escaped with a backslash to match literally.
Special Characters (Metacharacters)
These characters have special meanings in regex: . ^ $ * + ? { } [ ] \ | ( ). To match them literally, escape with a backslash: \. matches a period, \* matches an asterisk.
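As a minimal sketch of literal matching and escaping, the following uses Python's standard re module (the choice of Python and the sample strings are assumptions for illustration; the patterns themselves are not Python-specific). It contrasts an unescaped dot with an escaped one, and uses re.escape to escape a whole string at once.

```python
import re

# An unescaped dot is a metacharacter: it matches any single character.
loose = re.findall(r"3.14", "3.14 3x14")
print(loose)    # ['3.14', '3x14']

# Escaping the dot restricts the match to a literal period.
strict = re.findall(r"3\.14", "3.14 3x14")
print(strict)   # ['3.14']

# re.escape() adds the backslashes for us, which is useful when the
# text to match comes from outside the program (hypothetical input here).
user_text = "1+1=2 (really?)"
pattern = re.escape(user_text)
found = re.search(pattern, "proof: 1+1=2 (really?)") is not None
print(found)    # True
```

Raw strings (the r prefix) keep Python from interpreting the backslashes before the regex engine sees them.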
Character Classes
Square brackets define a character class that matches any single character listed inside. [aeiou] matches any lowercase vowel. Ranges work too: [a-z] matches any lowercase letter, [0-9] matches any digit. A ^ immediately after the opening bracket negates the class: [^0-9] matches any single character that is not a digit.
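The character classes described above can be tried out with a short Python sketch (the sample strings are made up for illustration):

```python
import re

# [aeiou] matches one lowercase vowel at a time.
vowels = re.findall(r"[aeiou]", "regex")
print(vowels)        # ['e', 'e']

# [a-z]+ matches runs of lowercase letters; uppercase Z and B are skipped.
lower_runs = re.findall(r"[a-z]+", "Zone B7 shipped")
print(lower_runs)    # ['one', 'shipped']

# [^0-9] matches anything that is not a digit, so replacing those
# characters with nothing leaves only the digits.
digits_only = re.sub(r"[^0-9]", "", "tel: 555-0142")
print(digits_only)   # 5550142
```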
Shorthand Character Classes
Common character classes have shorthand forms: \d (digit), \w (word character: letters, digits, underscore), \s (whitespace such as spaces, tabs, and newlines). The uppercase versions negate them: \D (non-digit), \W (non-word character), \S (non-whitespace).
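A brief Python sketch of the shorthand classes, using an invented log line as input. One caveat that applies to Python 3 specifically: for text (str) patterns, \d and \w also match non-ASCII digits and letters unless the re.ASCII flag is passed.

```python
import re

line = "user_42 logged in at 09:15"   # hypothetical log line

digits = re.findall(r"\d+", line)
print(digits)     # ['42', '09', '15']

# \w includes the underscore, so user_42 stays in one piece.
words = re.findall(r"\w+", line)
print(words)      # ['user_42', 'logged', 'in', 'at', '09', '15']

# Splitting on \s+ separates the whitespace-delimited fields.
fields = re.split(r"\s+", line)
print(fields)     # ['user_42', 'logged', 'in', 'at', '09:15']

# \W is the negation of \w: here it finds the spaces and the colon.
non_word = re.findall(r"\W", line)
print(non_word)   # [' ', ' ', ' ', ' ', ':']
```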
Quantifiers
Quantifiers specify how many times the preceding element should match:
* — Zero or more times (greedy)
+ — One or more times (greedy)
? — Zero or one time (optional)
{n} — Exactly n times
{n,} — n or more times
{n,m} — Between n and m times
All of these quantifiers are greedy by default: they match as much text as possible. Add ? after any quantifier to make it lazy (non-greedy): .*? matches as few characters as possible.
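The difference between greedy and lazy matching is easiest to see side by side. The sketch below, in Python with made-up sample text, also exercises the counted quantifiers from the list above.

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: .* runs to the last > in the string.
greedy = re.findall(r"<.*>", html)
print(greedy)    # ['<b>bold</b> and <i>italic</i>']

# Lazy: .*? stops at the first > it can.
lazy = re.findall(r"<.*?>", html)
print(lazy)      # ['<b>', '</b>', '<i>', '</i>']

# ? makes the preceding element optional.
optional = re.findall(r"colou?r", "color colour")
print(optional)  # ['color', 'colour']

# {n} requires exactly n repetitions.
exact = re.fullmatch(r"\d{4}", "2024") is not None
too_short = re.fullmatch(r"\d{4}", "202") is not None
print(exact, too_short)   # True False

# {n,m} accepts n to m repetitions; \b keeps longer numbers from
# matching partially.
ranged = re.findall(r"\b\d{2,3}\b", "7 42 365 2024")
print(ranged)    # ['42', '365']
```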
Groups and Capturing
Parentheses create groups that can be quantified together and capture their match for later use:
(abc) — Capturing group: matches "abc" and saves it
(?:abc) — Non-capturing group: groups without saving
(?<name>abc) — Named capture group
\1, \2 — Backreferences to captured groups
$1, $2 — References in replacement strings
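The group constructs above can be sketched in Python as follows. Two details differ from the notation in the list and are specific to Python's re module: named groups are written (?P<name>...) rather than (?<name>...), and replacement strings refer to groups as \1, \2 rather than $1, $2. The sample strings are assumptions for illustration.

```python
import re

date = "2024-03-15"

# Numbered capturing groups: each pair of parentheses saves its match.
m = re.fullmatch(r"(\d{4})-(\d{2})-(\d{2})", date)
parts = m.groups()
print(parts)            # ('2024', '03', '15')

# Named groups (Python spelling: (?P<name>...)).
named = re.fullmatch(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})", date)
year = named.group("year")
print(year)             # 2024

# A non-capturing group can be quantified but is not saved.
nc = re.fullmatch(r"(?:ab)+(c)", "ababc")
nc_groups = nc.groups()
print(nc_groups)        # ('c',)

# Backreference: \1 must repeat whatever group 1 matched,
# which finds accidentally doubled words.
doubled = re.findall(r"\b(\w+) \1\b", "it is is a test test")
print(doubled)          # ['is', 'test']

# Group references in a replacement string (Python uses \1, not $1).
reordered = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", date)
print(reordered)        # 15/03/2024
```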
Lookahead and Lookbehind
Lookaround assertions match a position without consuming characters:
(?=abc) — Positive lookahead: assert "abc" follows
(?!abc) — Negative lookahead: assert "abc" doesn't follow
(?<=abc) — Positive lookbehind: assert "abc" precedes
(?<!abc) — Negative lookbehind: assert "abc" doesn't precede
Example: \d+(?= dollars) matches a run of digits only if it is followed by " dollars"; the text " dollars" itself is not part of the match.
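The four lookaround forms can be compared with a small Python sketch on invented sample text. Two points are worth noting. The negative lookahead needs \b around the digits, since otherwise the engine can backtrack to a shorter run of digits that passes the assertion. And Python's re module only accepts fixed-width lookbehind patterns, a restriction some other engines do not have.

```python
import re

amounts = "50 dollars, 30 euros, 20 dollars"

# Positive lookahead: digits followed by " dollars".
# The lookahead text is not included in the match.
in_dollars = re.findall(r"\d+(?= dollars)", amounts)
print(in_dollars)     # ['50', '20']

# Negative lookahead: whole numbers NOT followed by " dollars".
not_dollars = re.findall(r"\b\d+\b(?! dollars)", amounts)
print(not_dollars)    # ['30']

receipt = "cost $40, qty 12, fee $5"

# Positive lookbehind: digits preceded by a dollar sign.
prices = re.findall(r"(?<=\$)\d+", receipt)
print(prices)         # ['40', '5']

# Negative lookbehind: digits preceded by neither $ nor another digit.
quantities = re.findall(r"(?<![$\d])\d+", receipt)
print(quantities)     # ['12']
```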
Regex Flags
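Flags modify how the pattern is interpreted. The letters below follow JavaScript's notation; other languages offer equivalent options under different names: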
g (global) — Find all matches, not just the first
i (case-insensitive) — Ignore letter case when matching
m (multiline) — ^ and $ match line boundaries
s (dotAll) — Dot . matches newline characters too
u (unicode) — Enable full Unicode matching
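For comparison, here is a rough sketch of how the same behaviors look in Python, where flags are constants passed as an argument. Python has no g flag: re.search returns the first match and re.findall returns all of them. It also needs no u flag, because text (str) patterns are Unicode-aware by default in Python 3. The sample text is made up.

```python
import re

text = "Error: disk full\nerror: retry\nOK"

# First match versus all matches (the role of the g flag).
first = re.search(r"error", text, re.IGNORECASE).group()
print(first)            # Error

# i: case-insensitive matching.
all_ci = re.findall(r"error", text, re.IGNORECASE)
case_sensitive = re.findall(r"error", text)
print(all_ci)           # ['Error', 'error']
print(case_sensitive)   # ['error']

# m: ^ matches at the start of every line, not just the string.
line_starts = re.findall(r"^\w+", text, re.MULTILINE)
without_m = re.findall(r"^\w+", text)
print(line_starts)      # ['Error', 'error', 'OK']
print(without_m)        # ['Error']

# s: the dot also matches newlines, so one match can span lines.
no_dotall = re.search(r"Error.*OK", text) is None
with_dotall = re.search(r"Error.*OK", text, re.DOTALL).group()
print(no_dotall)            # True
print(with_dotall == text)  # True
```

Several flags can be combined with the | operator, for example re.IGNORECASE | re.MULTILINE.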