Regular Expression (RegEx or RE) Primer With Examples

 

©2006-2007 by Wayne Pollock, Tampa Florida USA.  All rights reserved.

Introduction

Regular expressions (or REs) are a way to concisely specify a group of text strings.  The Unix editor ed was one of the first Unix programs to provide REs.  Many later commands used this form of RE, and their man pages would say "see also ed(1)".  Over time people wanted more expressive REs, and new features were added.  The ed REs became known as basic REs or BREs, and the others became known as extended REs or EREs.

Suppose you need to find a specific IPv4 address in the files under /etc.  This is easy to do; just specify the IP address as a string of text and do a search.  But what if you didn’t know in advance which IP address you were looking for, only that you wanted to see all IP addresses in those files?  Even if you could, you wouldn’t want to specify every possible IP address to some searching tool!  You need a way to specify all IP addresses in a compact form.  That is, you want to tell your searching tool to show anything that matches number.number.number.number.  This is the sort of task we use REs for.  You can specify a pattern (RE) for phone numbers, dates, credit-card numbers, email addresses, URLs, and so on.
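
As a rough sketch of the number.number.number.number idea, the ERE below can be handed to grep -E.  The helper name find_ips and the sample input are made up for this illustration, and the pattern is deliberately loose: it accepts any run of digits between the dots.

```shell
# find_ips is a hypothetical helper name used only for this example.
# It prints every input line that contains number.number.number.number.
find_ips() {
    grep -E '[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+'
}

# Sample input stands in for real files under /etc.
printf '%s\n' 'server 192.168.1.10' 'no address here' 'gw 10.0.0.1' | find_ips
```

The same pattern can be given to grep -E along with one or more file names to search real files.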

The Good Enough Principle

With REs the concept of "good enough" applies.  Consider the pattern used above for IP addresses.  It will match any valid IP address, but also strings that look like 7654321.300.0.777 or 5.3.8.12.9.6 (possibly an SNMP OID).

Matching only valid IPv4 addresses is possible but rarely worth the effort.  It is unlikely your search of /etc files will find such invalid strings, and if a few turn up you can easily eyeball them to determine if they are valid IP addresses.

It is possible to craft a more precise RE, but in real life you only need an RE good enough for your purpose at hand.  If a few extra matches are caught you can usually deal with them.  (Of course, if you are running global search-and-replace commands, you will need to be more precise!)
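
For comparison, here is one way a more precise pattern might look.  This is only a sketch: the names octet, strict_ip, and is_ipv4 are invented for the example, and the pattern assumes dotted-quad notation with no leading zeros.

```shell
# octet matches a decimal number from 0 to 255 (no leading zeros).
octet='(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])'
# Four octets separated by literal dots, anchored to the whole line.
strict_ip="^${octet}\\.${octet}\\.${octet}\\.${octet}\$"

# is_ipv4 is a hypothetical helper: exit status 0 if $1 is a valid dotted quad.
is_ipv4() {
    printf '%s\n' "$1" | grep -E -q "$strict_ip"
}

for s in 192.168.1.10 7654321.300.0.777 5.3.8.12.9.6; do
    if is_ipv4 "$s"; then echo "$s: valid"; else echo "$s: rejected"; fi
done
```

The stricter RE is several times longer than the loose one, which is the trade-off the good enough principle is about.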

An RE is a pattern, or template, against which strings can be matched.  Strings either match the pattern or they don’t.  If they do, parts of the matching string can be saved in named variables, which can be used later to either match more text, or to transform the matching string.

Pattern matching for text turns out to be one of the most useful and common operations to perform on files.  Over the years a large number of tools have been created that use REs, including nearly all text editors, grep, sed, awk, and others.  The shell wildcards can be considered a type of RE.

While the idea of REs is standard, different tools may use slightly different syntaxes, or dialects.  Some of these tools also contain extensions that may be useful.  Perl’s REs are about the most complex and useful dialect, and are sometimes referred to as PREs.  (See the man pages for perlre(1), perlrequick(1), and perlretut(1).)

Eventually POSIX stepped in and standardized RE syntax.  The standard is mostly compatible with the original ed REs but has many additions.  (POSIX keeps the two forms, which it calls basic and extended REs.)  However, few of the older tools changed to use the new syntax.  See regex(7) for details.

Most RE dialects work this way: one line is read into some buffer.  Next the RE is matched against it.  In a programming environment such as Perl or an editor (sed), if the RE matches then some additional steps (such as modification of the buffer) may be done.  With a tool such as grep a matching line is just printed.  Finally the cycle repeats.
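
The cycle can be seen by feeding the same lines to grep and to sed.  This is a small sketch; the helper name sample and its three lines of input are invented for the example.

```shell
# sample is a hypothetical helper that prints three lines of test input.
sample() { printf '%s\n' 'ton of bricks' 'bus stop' 'tin can'; }

# grep: each line is read, matched against t.n, and printed if it matches.
sample | grep 't.n'

# sed: each line is read, the first match of t.n is wrapped in brackets
# (& stands for the matched text), and every line is then printed.
sample | sed 's/t.n/[&]/'
```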

Top-down explanation (from regex(7) man page):

An RE is one or more branches separated with | and matches text if any of the branches match the text.  A branch is one or more pieces concatenated together, and matches if the first piece matches, then the next matches from the end of the first match, and so on until all pieces have matched.  A piece is an atom optionally followed by a modifier: *, +, ?, or a bound (i.e., a range).  An atom is a single character RE or (RE).
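
To make the terms concrete, the sketch below takes one ERE apart in its comments and tries it on a few strings.  The pattern and the helper name check are assumptions chosen for illustration only.

```shell
# One top-level branch made of four pieces:  ^   (ab|c)+   d?   $
# The atom (ab|c) is itself an RE with two branches, ab and c.
# The modifier + applies to that atom, and ? applies to the atom d.
pattern='^(ab|c)+d?$'

# check is a hypothetical helper: reports whether $1 matches the pattern.
check() {
    if printf '%s\n' "$1" | grep -E -q "$pattern"; then
        echo "$1: match"
    else
        echo "$1: no match"
    fi
}

check abd     # ab, then d
check cabc    # c, ab, c, and no d
check d       # fails: (ab|c)+ needs at least one occurrence
check abx     # fails: x is not allowed before the end of the line
```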

POSIX Regular Expression Syntax Chart With Examples
Basic Regular Expressions

RE:       c
Meaning:  The character c
Example:  J
Matches:  J

RE:       .
Meaning:  Any character (except a newline)
Example:  .
Matches:  J   z   $

RE:       \c
Meaning:  The character c literally (escaped) when c is a metacharacter (such as .).  Never end an RE with a single backslash.
Example:  Mr\. Ed
Matches:  Mr. Ed

RE:       seq
Meaning:  A sequence of REs matches a string of text that matches each RE in turn.
Example:  t.n
Matches:  tan   tBn   t%n

RE:       [list]   [range]
Meaning:  Called a character class; matches any one character in the list or range.  You can only use a range if LC_COLLATE is set to POSIX or C.
Example:  t[aeio]n
Matches:  tan   ten   tin   ton
Example:  T[A-D]N
Matches:  TAN   TBN   TCN   TDN
Example:  [wW]ayne
Matches:  wayne   Wayne

RE:       [^list]   [^range]
Meaning:  Any one character not in the list or range.
Example:  t[^ou0123456789]n
Matches:  tan   ten   tin   (but not ton or t9n)
Example:  T[^0-9a-z]N
Matches:  TAN   TBN   (but not TaN or T9N)

RE:       RE*
Meaning:  Zero or more of RE
Example:  to*n
Matches:  tn   ton   toon   tooon

RE:       ^RE
Meaning:  Anchors RE to the beginning of a line (technically the ^ matches the null string at the beginning of the buffer)
Example:  ^t.n
Matches:  tons (but not wanton)
Example:  \^t.n
Matches:  ^tons
Example:  2^2=4
Matches:  2^2=4 (only special in front)

RE:       RE$
Meaning:  Like ^, anchors RE to match at the end of a line
Example:  t.n$
Matches:  wanton (but not tons)
Example:  ^ton$
Matches:  ton (on a line by itself)
Example:  ^$
Matches:  (an empty line)

Extended Regular Expressions

RE:       RE+
Meaning:  One or more occurrences of RE
Example:  to+n
Matches:  ton   toon   tooon

RE:       RE?
Meaning:  Zero or one occurrences of RE
Example:  to?n
Matches:  tn   ton

RE:       RE{m}
Meaning:  Exactly m occurrences of RE
Example:  to{2}n
Matches:  toon

RE:       RE{m,}
Meaning:  m or more occurrences of RE
Example:  to{2,}n
Matches:  toon   tooon   toooon

RE:       RE{m,n}
Meaning:  Between m and n occurrences of RE
Example:  to{2,3}n
Matches:  toon   tooon

RE:       RE|RE
Meaning:  Either the left or right RE
Example:  Mr\.|Ms\.
Matches:  Mr.   Ms.

RE:       (RE)
Meaning:  RE (the parens are used for grouping and back-references).
Example:  (Mr\.|Ms\.) Smith
Matches:  Mr. Smith   Ms. Smith

RE:       (RE)RE\1
Meaning:  The \1 is a reference to the first group.  Groups are numbered by counting the opening parentheses of groups.
Example:  (t[ao]n)\1
Matches:  tantan   tonton (but not tanton)
Example:  (rin)(-tin)\2
Matches:  rin-tin-tin

RE:       \<RE   RE\>   \bRE   RE\b
Meaning:  RE only at the beginning or end of a word.  (While \< and \> are common, \b is sometimes used instead.)  Like other anchors, these match the null string at the boundaries of words.  (POSIX doesn't define word boundary matches for either BREs or EREs, but these are common extensions in GNU, Perl, and other utilities.)
Example:  \<ton   \bton
Matches:  tons (but not wanton)
Example:  ton\>   ton\b
Matches:  wanton (but not tons)
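
A few of the chart entries can be tried directly with grep -E.  The following is only a sketch: the helper names words and try and the word list are invented here, and grep prints any word that contains a match, not only words that match in full.

```shell
# words is sample data; try is a hypothetical helper that prints, on one
# line, the sample words that contain a match for the ERE given as $1.
words() { printf '%s\n' tn ton toon tooon tan t9n tons wanton; }
try() { words | grep -E "$1" | paste -s -d ' ' -; }

try 'to*n'                 # zero or more o between t and n
try 'to{2,3}n'             # two or three o
try 't[^ou0123456789]n'    # middle character not o, u, or a digit
try '^t.n'                 # t.n at the start of the line
try 't.n$'                 # t.n at the end of the line
```

Note that tons and wanton are reported for to*n because each contains ton; adding the ^ and $ anchors restricts where the match may occur.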

Additional Notes

Some characters used to express REs (meta-characters) are only special if they appear in a specific context.  For instance, the ^ is special only if it is the first character of some RE, the $ only if the last.  The * is not special if the first character in an RE.  A { followed by a character other than a digit is not the beginning of a bound.  A backslash is always special, so it is illegal to end an RE with one.

Special (or meta-) characters lose their meaning if escaped, that is, preceded with a backslash (\) character.  For POSIX these are the characters ^ . [ $ ( ) | * + ? { and \.  (In some dialects the reverse is true; meta-characters only have special meaning when escaped!)  Some special characters (in POSIX, all) lose their special meaning when used in a character class.
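
The effect of context and of escaping can be checked with a few BREs.  This is a sketch only; the helper name notes and its sample lines are made up for the example.

```shell
# notes is a hypothetical helper that prints three lines of test input.
notes() { printf '%s\n' 'Mr. Ed' 'MrX Ed' '2^2=4'; }

# Unescaped, the . matches any character, so both Mr lines are printed.
notes | grep 'Mr. Ed'

# Escaped, the . matches only a literal period.
notes | grep 'Mr\. Ed'

# The ^ is not the first character of this BRE, so it is an ordinary character.
notes | grep '2^2=4'
```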

To include a literal ] in the list, make it the first character (following a possible ^).  To include a literal -, make it the first character (following a possible ^) or the last character.
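
A short sketch of both placement rules follows.  The helper name marks and its sample lines are assumptions for illustration, and LC_ALL=C is set so that the a-z range behaves predictably.

```shell
# marks is a hypothetical helper that prints three lines of test input.
marks() { printf '%s\n' 'a[1]' 'x-y' 'plain'; }

# The list holds ] (placed first) and - (placed last):
# prints the lines that contain either character.
marks | LC_ALL=C grep '[]-]'

# Negated list: ] directly after the ^, then the range a-z, then - last.
# Prints lines containing a character that is none of these.
marks | LC_ALL=C grep '[^]a-z-]'
```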

Other metacharacters, including \ in POSIX, lose their meaning in the list.  (Some dialects, such as Perl's, still treat \ as an escape there.)  The list can include one or more predefined character classes.  In POSIX these look like [:name:], where name is one of:

 
alnum
alpha
blank (space or tab only)
cntrl
digit
graph (any printable character except a space)
lower
print (any printable character)
punct
space (any white-space character)
upper
xdigit (any hexadecimal digit)

A class name is written inside the brackets of a list, so [[:xdigit:]] is the same as [[:digit:]a-fA-F], which is the same (in the C locale) as [0-9a-fA-F].  In some RE dialects a backslash-character is used, for instance \d instead of [[:digit:]] for a digit.
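
As a small sketch of a predefined class in use, the function below tests whether a string consists only of hexadecimal digits (0-9, a-f, and A-F).  The name is_hex is hypothetical, and LC_ALL=C is set so the class has its C-locale meaning.

```shell
# is_hex is a hypothetical helper: exit status 0 if $1 is one or more
# hexadecimal digits and nothing else.
is_hex() {
    printf '%s\n' "$1" | LC_ALL=C grep -E -q '^[[:xdigit:]]+$'
}

for s in DEADbeef09 12G4; do
    if is_hex "$s"; then echo "$s: hex"; else echo "$s: not hex"; fi
done
```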

A back reference (\ followed by a non-zero decimal digit d) matches the same sequence of characters matched by the d-th parenthesized subexpression (numbering subexpressions by the positions of their opening parentheses, left to right).  For example: \([bc]\)\1 matches bb or cc but not bc.  (Note:  POSIX defines back references for BREs only, not for EREs, but many tools support them in EREs anyway.)
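
The bb/cc example can be tried with grep, which uses BREs by default.  The helper name doubles and the four input lines are made up for this sketch.

```shell
# doubles is a hypothetical helper: prints input lines in which a b or a c
# is immediately followed by the same letter again.
doubles() {
    grep '\([bc]\)\1'
}

printf '%s\n' bb cc bc cb | doubles
```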

In the event that an RE could match more than one substring of a given string, the RE matches the one starting earliest in the string.  If the RE could match more than one substring starting at that point, it matches the longest.  (This is often called greedy matching.)  Subexpressions also match the longest possible substrings, subject to the constraint that the whole match be as long as possible.  Perl supports both greedy and non-greedy (also known as reluctant, minimal, ungreedy, or generous) modes.  For example, in this greedy match the first group takes all the digits and leaves nothing for the second:

$ echo '12345' | sed 's/\([0-9]*\)\([0-9]*\)/|\1|\2|/'
|12345||
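
The rule that the earliest match wins can be surprising when an RE is able to match the empty string.  The following sketch uses made-up input; the variable names first and second are only for the example.

```shell
# [0-9]* can match zero digits, so the earliest match is the empty string
# at the very start of the line, and N is inserted there.
first=$(echo 'xx123yy456' | sed 's/[0-9]*/N/')

# Requiring at least one digit moves the earliest match to 123, and the
# longest-match rule makes it take all three digits.
second=$(echo 'xx123yy456' | sed 's/[0-9][0-9]*/N/')

echo "$first"
echo "$second"
```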