Regular Expression (“RegEx” or “RE”) Primer With Examples

 

©2006-2015 by Wayne Pollock, Tampa Florida USA.  All rights reserved.

Introduction

Regular expressions (or REs or RegExs or RegExps) are a way to concisely specify a group of text strings.  The Unix editor ed was about the first (Unix) program to provide REs.  Many later commands used this form of RE and their man pages would "see also ed(1)".  Over time folks wanted more expressive REs and new features were added.  The ed REs became known as basic REs or BREs, and the others became known as extended REs or EREs.

Suppose you needed to find a specific IPv4 address in the files under /etc.  This is easy to do; just specify the IP address as a string of text and do a search.  But what if you didn't know in advance which IP address you were looking for, only that you wanted to see all IP addresses in those files?  Even if you could, you wouldn't want to specify every possible IP address to some searching tool!  You need a way to specify all IP addresses in a compact form.  That is, you want to tell your searching tool to find anything that matches number.number.number.number.  This is the sort of task we use REs for.  You can specify a pattern (RE) for phone numbers, dates, credit-card numbers, email addresses, URLs, and so on.

The Good Enough Principle

With REs the concept of "good enough" applies.  Consider the pattern used above for IP addresses.  It will match any valid IP address, but also strings that look like 7654321.300.0.777 or 5.3.8.12.9.6 (possibly an SNMP OID).

To match only valid IPv4 addresses is possible but rarely worth the effort.  It is unlikely your search of /etc files will find such strings, and if a few turn up you can easily eye-ball them to determine if they are valid IP addresses.

It is possible to craft a more precise RE but in real life you only need an RE good enough for your purpose at hand.  If a few extra matches are caught you can usually deal with them.  (Of course, if making global search and replace commands, you will need to be more precise!)
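
The trade-off is easy to see with grep.  The following is only a sketch: the sample text and the variable names (sample, loose, octet, strict) are made up for illustration, and it assumes a grep that supports EREs via -E.  The loose pattern is number.number.number.number; the strict one is one possible way to insist on four numbers in the range 0 to 255.

```shell
# Sample text (hypothetical); only the first line holds a valid IPv4 address.
sample='host 192.168.1.10 up
bogus 7654321.300.0.777
oid 5.3.8.12.9.6'

# Good enough: number.number.number.number
loose='[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+'

# More precise: each number must be 0-255, and the address must not be
# glued to further digits or dots on either side.
octet='(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])'
strict="(^|[^0-9.])${octet}(\\.${octet}){3}([^0-9.]|\$)"

# -c prints the number of matching lines instead of the lines themselves.
loose_count=$(printf '%s\n' "$sample" | grep -Ec "$loose")
strict_count=$(printf '%s\n' "$sample" | grep -Ec "$strict")
echo "loose matched $loose_count lines, strict matched $strict_count"
```

The strict version is several times longer and harder to read, which is why the loose one is usually the better choice for a quick search.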

An RE is a pattern, or template, against which strings can be matched.  Strings either match the pattern or they don't.  If they do, parts of the matching string can be saved in named variables, which can be used later to either match more text, or to transform the matching string.

Pattern matching for text turns out to be one of the most useful and common operations to perform on files.  Over the years a large number of tools have been created that use REs, including most text editors, grep, sed, awk, and others.  The shell wildcards (“globs”) might be considered a type of RE.

While the idea of REs is standard, different tools may use slightly different syntaxes, or dialects.  Some of these tools also contain extensions that may be useful.  Perl's REs are about the most complex and useful dialect, and are sometimes referred to as PREs.  (See man pages for perlre(1), also perlrequick(1), and perlretut(1).)

Eventually POSIX stepped in and standardized RE syntax; the result is mostly compatible with the original ed REs but with many additions.  POSIX uses the terms BRE for “basic regular expression” and ERE for “extended regular expression”.  Some utilities use BREs and others use EREs.  See regex(7) for details, or the Regular Expressions chapter of the POSIX/SUS standard.

Most RE dialects work this way: one line (usually) is read into some buffer.  Next the RE is matched against the text in the buffer.  Matched means that some sequence of text in the buffer corresponds to the RE; the whole buffer doesn't need to match.  For example, if the buffer text is “AABBCC”, then the RE “AB” matches (the second and third characters).

In a programming environment such as perl or an editor (sed), if the RE matches then some additional steps (such as modification of the buffer) may be done.  (With a tool such as grep, each line of input is put into a buffer and each matching line is just printed.)  Because of this it sometimes matters exactly what text in the buffer matched.  (For example, a tool such as sed may modify or delete the matched text from the buffer.)

If more than one match is possible, an RE will match the one that starts earliest in the buffer (so the RE “AB” will match the first and second characters of the text “ABCABC”).  If more than one match starts on the same character, the RE will match the longest one.  (See Greedy Matching below.)
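
One way to see exactly which text matched is to have sed put brackets around it.  This is a small sketch; the variable names are made up, and it relies only on the standard sed “s” command, where “&” in the replacement stands for the matched text.

```shell
# Bracket the matched text so we can see exactly what the RE matched.
first=$(echo 'AABBCC' | sed 's/AB/[&]/')    # only one place "AB" can match
second=$(echo 'ABCABC' | sed 's/AB/[&]/')   # two candidates; the earliest wins
longest=$(echo 'tooon' | sed 's/to*/[&]/')  # several lengths; the longest wins

echo "$first"     # A[AB]BCC
echo "$second"    # [AB]CABC
echo "$longest"   # [tooo]n
```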

Top-down explanation (from regex(7) man page):

An RE is one or more branches separated with “|” and matches text if any of the branches matches the text.  A branch is one or more pieces concatenated together, and matches if the first piece matches, then the next matches from the end of the first match, and so on until all pieces have matched.  A piece is an atom optionally followed by a modifier:  “*”, “+”, “?”, or a bound (i.e., a range).  An atom is a single character RE or “(RE)”.
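
To make the terms concrete, here is a small ERE taken apart in the comments.  The pattern and the test words are invented for this sketch, which assumes grep -E for EREs and paste to join the output onto one line.

```shell
# An ERE with two branches separated by "|":
#   branch 1:  ^(ab)+c?$   atom "(ab)" with modifier "+", atom "c" with
#                          modifier "?", between the anchors "^" and "$"
#   branch 2:  ^x{2,3}$    atom "x" with the bound "{2,3}", again anchored
re='^(ab)+c?$|^x{2,3}$'

# Print the words that match either branch, joined on one line.
matched=$(printf '%s\n' ab ababc xx xxxx abcc c | grep -E "$re" | paste -sd ' ' -)
echo "$matched"   # ab ababc xx
```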

POSIX Regular Expression Syntax Chart With Examples
Each entry gives an RE form, its meaning, and one or more examples with the text each example matches.
POSIX Regular Expressions

RE:       c
Meaning:  The character “c”
Example:  J  matches  J

RE:       .
Meaning:  Any character (with some tools, not a newline)
Example:  .  matches  J   z   $

RE:       \c
Meaning:  The character c literally (escaped) when c is a metacharacter (such as “.”);  never end an RE with a single “\”
Example:  Mr\. Ed  matches  Mr. Ed

RE:       seq
Meaning:  A sequence of REs matches a string of text that matches each RE in turn
Example:  t.n  matches  tan   tBn   t%n

RE:       [list]  or  [range]
Meaning:  Called a character class, any one character in the list;  a range is only reliable if LC_COLLATE is set to POSIX or C
Examples: t[aeio]n  matches  tan   ten   tin   ton
          T[A-D]N   matches  TAN   TBN   TCN   TDN
          [wW]ayne  matches  wayne   Wayne

RE:       [^list]  or  [^range]
Meaning:  Any one character not in the list or range
Examples: t[^ou0123456789]n  matches  tan   ten   tin   (but not ton or t9n)
          T[^0-9a-z]N        matches  TAN   TBN   (but not TaN or T9N)

RE:       ^RE
Meaning:  Anchors RE to the beginning of a line (technically the ^ matches the null string at the beginning of the buffer, not embedded newlines)
Examples: ^t.n   matches  tons (but not wanton)
          \^t.n  matches  ^tons
          A^BC   matches  nothing as an ERE (as a BRE, where “^” is only special at the start, it matches the literal text A^BC)

RE:       RE$
Meaning:  Like “^”, anchors RE to match at the end of the text
Examples: t.n$   matches  wanton (but not tons)
          ^ton$  matches  ton (on a line by itself)
          ^$     matches  (an empty string)

RE:       RE*
Meaning:  Zero or more of RE
Example:  to*n  matches  tn   ton   toon   tooon
Basic Regular Expressions Only

RE:       \(RE\)
Meaning:  (A subexpression)  Matches the same as RE.  Also copies the matched text into a numbered register (clipboard); see backreferences below
Example:  \(to\)*n  matches  n   ton   toton   tototon

RE:       \n
Meaning:  (A backreference)  Matches the same text that the nth subexpression matched (counting opening parentheses)
Examples: \([BC]\)\1         matches  BB   CC   (but not BC)
          \(..\)\1           matches  BABA   ABAB   (but not ABBA)
          \(f\(.\)\2\)\1     matches  foofoo   feefee
          \(t[ao]n\)\1       matches  tantan   tonton   (but not tanton)
          \(rin\)\(-tin\)\2  matches  rin-tin-tin

RE:       RE\{m\}
Meaning:  Exactly m occurrences of RE
Examples: to\{2\}n  matches  toon (but not ton or tooon)
          .\{3\}    matches  ABC

RE:       RE\{m,\}
Meaning:  m or more occurrences of RE
Example:  to\{2,\}n  matches  toon   tooon   toooon

RE:       RE\{m,n\}
Meaning:  Between m and n occurrences of RE
Example:  to\{2,3\}n  matches  toon   tooon
Extended Regular Expressions Only

RE:       RE+
Meaning:  One or more occurrences of RE
Example:  to+n  matches  ton   toon   tooon

RE:       RE?
Meaning:  Zero or one occurrences of RE
Example:  to?n  matches  tn   ton

RE:       RE{m}
Meaning:  Exactly m occurrences of RE
Examples: to{2}n  matches  toon
          .{3}    matches  ABC

RE:       RE{m,}
Meaning:  m or more occurrences of RE
Example:  to{2,}n  matches  toon   tooon   toooon

RE:       RE{m,n}
Meaning:  Between m and n occurrences of RE
Example:  to{2,3}n  matches  toon   tooon

RE:       RE|RE
Meaning:  Either the left or right RE
Example:  Mr\.|Ms\.|Dr\.  matches  Mr.   Ms.   Dr.

RE:       (RE)
Meaning:  RE (the parens are used for grouping)
Example:  (Mr\.|Ms\.) Smith  matches  Mr. Smith     Ms. Smith
Some Extensions

RE:       \<RE  and  RE\>   (or  \bRE  and  RE\b)
Meaning:  RE only at the beginning or end of a word.  (While “\<” and “\>” are common, “\b” is sometimes used instead.)  Like other anchors these match the null string at the boundaries of words.
Examples: \<ton  or  \bton  matches  tons (but not wanton)
          ton\>  or  ton\b  matches  wanton (but not tons)
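
A few of the chart entries can be tried out directly.  This sketch assumes that plain grep uses BREs and grep -E uses EREs, as POSIX specifies; the word list and variable names are made up for the example.

```shell
words='ton
toon
tooon
tantan
tanton'

# The same bound written in each dialect: "t", two or three "o"s, then "n".
bre=$(printf '%s\n' "$words" | grep 'to\{2,3\}n' | paste -sd ' ' -)
ere=$(printf '%s\n' "$words" | grep -E 'to{2,3}n' | paste -sd ' ' -)

# A BRE backreference: \1 must repeat the exact text the subexpression matched,
# so "tantan" qualifies but "tanton" does not.
doubled=$(printf '%s\n' "$words" | grep '\(t[ao]n\)\1' | paste -sd ' ' -)

echo "BRE bound:     $bre"       # toon tooon
echo "ERE bound:     $ere"       # toon tooon
echo "backreference: $doubled"   # tantan
```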

Additional Notes

Some characters used to express REs (meta-characters) are only special if they appear in a specific context.  For instance, the “^” is special only if it is the first character of some RE, the “$” only if the last.  The “*” is not special if the first character in an RE.  A “{” followed by a character other than a digit is not the beginning of a bound.  A backslash is always special, so it is illegal to end an RE with one.
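
These context rules can be checked with grep, which uses BREs by default.  The sample strings below are invented, and the results shown are what the POSIX BRE rules call for; other dialects (including EREs) may treat the same patterns differently.

```shell
# Each grep below uses a BRE; -c prints the number of matching lines.

# "^" in the middle of a BRE is an ordinary character.
caret=$(printf '%s\n' 'A^BC' 'ABC' | grep -c 'A^BC')

# "$" that is not at the end of a BRE is an ordinary character.
dollar=$(printf '%s\n' 'cost$5' 'cost5' | grep -c 't$5')

# "*" as the first character of a BRE is an ordinary character.
star=$(printf '%s\n' '*bold*' 'bold' | grep -c '*bold')

echo "$caret $dollar $star"   # 1 1 1
```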

Special (or meta-) characters lose their meaning if escaped, that is preceded with a backslash (“\”) character.  For POSIX these are the characters in the following list:  “^.[$()|*+?{\”.  (In some dialects the reverse is true; meta-characters only have special meaning when escaped!)  Some special characters (in POSIX, all) lose their special meaning when used in a character class.

To include a literal “]” in the list, make it the first character (following a possible “^”).  To include a literal “-”, make it the first character (following a possible “^”) or the last character.
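
Here is a short sketch of those placement rules using grep; the sample lines and variable names are made up for illustration.

```shell
lines='array[0]
x-y
plain'

# "]" placed first and "-" placed last are both literal members of the list,
# so this counts the lines that contain a "]" or a "-".
bracket_or_dash=$(printf '%s\n' "$lines" | grep -c '[]-]')

# With a leading "^" the same list is negated: count the lines made up
# entirely of characters other than "]" and "-".
neither=$(printf '%s\n' "$lines" | grep -c '^[^]-]*$')

echo "$bracket_or_dash $neither"   # 2 1
```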

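In POSIX, all other metacharacters, including “\”, lose their special meaning in the list (in some dialects, such as Perl's, “\” still escapes the next character inside a list).  The list can include one or more predefined character classes.  In POSIX these look like: [:name:], where name is one of: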

 
alnum
alpha
blank (space or tab only)
cntrl
digit
graph (any printable character except a space)
lower
print (any printable character)
punct
space (any white-space character)
upper
xdigit (any hexadecimal digit)

[[:xdigit:]] is the same as [[:digit:]a-fA-F], which is the same (in the C locale) as [0-9a-fA-F].  Note the class names are only recognized inside a list, hence the doubled brackets.  In some RE dialects a backslash-character is used, for instance “\d” instead of “[[:digit:]]” for a digit.
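
The predefined classes can be tried with grep -E.  In this sketch the sample words are invented, and LC_ALL=C is set on each grep so that the classes and ranges mean the same thing on any system.

```shell
# Lines consisting only of hexadecimal digits, written two ways.
hex=$(printf '%s\n' 'DEADbeef' '12ab' 'xyz' | LC_ALL=C grep -Ec '^[[:xdigit:]]+$')
same=$(printf '%s\n' 'DEADbeef' '12ab' 'xyz' | LC_ALL=C grep -Ec '^[0-9a-fA-F]+$')

# Classes can be mixed with other list members: a letter or "_" first,
# then any number of letters, digits, or "_".
ident=$(printf '%s\n' '_tmp1' '9lives' 'name' | LC_ALL=C grep -Ec '^[[:alpha:]_][[:alnum:]_]*$')

echo "$hex $same $ident"   # 2 2 2
```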

Greedy Matching

In the event that an RE could match more than one substring of a given string, the RE matches the one starting earliest in the string.  If the RE could match more than one substring starting at that point, it matches the longest.  (This is often called greedy matching.)  Subexpressions also match the longest possible substrings, subject to the constraint that the whole match be as long as possible.  Perl supports both greedy and non-greedy (also known as reluctant, minimal, ungreedy, or generous) modes.  For example:

$ echo '12345' | sed 's/\([0-9]*\)\([0-9]*\)/|\1|\2|/'
|12345||

The above example shows how the first subexpression matched the whole buffer, while the second matched nothing.
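
Greedy matching often surprises people when “.*” is involved.  The sketch below (the sample text and variable names are made up) shows the surprise, along with a common workaround for tools that have no non-greedy mode: use a negated list so the match cannot run past the first closing character.

```shell
text='a<b>c<d>e'

# Longest match: ".*" runs all the way to the last ">" on the line.
greedy=$(echo "$text" | sed 's/<.*>/X/')

# "[^>]*" cannot cross a ">", so the match stops at the first one.
minimal=$(echo "$text" | sed 's/<[^>]*>/X/')

echo "$greedy"    # aXe
echo "$minimal"   # aXc<d>e
```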