Regular expressions (or REs or
RegExs or RegExps) are a way
to concisely specify a group of text strings.
The Unix editor ed was among the first Unix programs
to provide REs.
Many later commands used this form of RE,
and their man pages would say "see also ed(1)".
Over time folks wanted more expressive REs and new
features were
added.
The ed
REs became known as basic
REs or BREs, and the others
became known as extended
REs or EREs.
Suppose you needed to find a specific IPv4 address in
the files under /etc?
This is easy to do; just specify the IP address as a
string of text and do a search.
But, what if you didn't know in advance which IP
address you were looking for,
only that you wanted to see all IP addresses
in those files?
Even if you could, you wouldn't want to specify every possible
IP address to some searching tool!
You need a way to specify all IP addresses in a compact
form.
That is, you want to tell your searching tool to find anything that
matches number.number.number.number.
This is the sort of task we use REs for.
You can specify a pattern (RE) for phone numbers,
dates, credit-card numbers, email addresses, URLs,
and so on.
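As a rough sketch of the idea, the pattern number.number.number.number can be written as an extended RE and handed to grep. The sample text and the variable names sample and ip_re are made up for illustration, and the -o option (print only the matched text) is a widely available extension rather than something every grep is guaranteed to have.

```shell
# Hypothetical sample input standing in for the contents of files under /etc.
sample='nameserver 10.0.0.1
# no address on this line
gateway 192.168.1.254 metric 1'

# One or more digits, then a literal dot, repeated to give four numbers.
ip_re='[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+'

# -E selects extended REs; -o prints only the part of each line that matched.
printf '%s\n' "$sample" | grep -E -o "$ip_re"
```

With this input the command prints 10.0.0.1 and 192.168.1.254, one per line.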
The Good Enough Principle
With REs the concept of "good enough" applies. Consider the pattern used above for IP addresses. It will match any valid IP address, but also strings that look like 7654321.300.0.777 or 5.3.8.12.9.6 (possibly an SNMP OID).
To match only valid IPv4 addresses is possible but rarely worth the effort. It is unlikely your search of /etc files will find such strings, and if a few turn up you can easily eyeball them to determine if they are valid IP addresses.
It is possible to craft a more precise RE, but in real life you only need an RE good enough for your purpose at hand. If a few extra matches are caught you can usually deal with them. (Of course, if you are making global search-and-replace changes, you will need to be more precise!)
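The trade-off can be sketched with the example strings mentioned here. The variable names and the "tighter" pattern are illustrative assumptions, not a recommended validator; the tighter pattern still accepts out-of-range values such as 999.999.999.999, which is often good enough.

```shell
# Candidate strings, one per line.
candidates='192.168.1.1
7654321.300.0.777
5.3.8.12.9.6'

loose='[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+'
# Slightly tighter: each number is limited to one to three digits.
tighter='[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}'

# The loose pattern finds a match somewhere in all three lines (prints 3).
printf '%s\n' "$candidates" | grep -E -c "$loose"

# With -x the whole line must match, which rejects the two bogus lines
# and prints only 192.168.1.1.
printf '%s\n' "$candidates" | grep -E -x "$tighter"
```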
An RE is a pattern, or template, against which strings can be matched. Strings either match the pattern or they don't. If they do, parts of the matching string can be saved (in numbered or named variables, depending on the tool), which can be used later to either match more text, or to transform the matching string.
Pattern matching for text turns out to be one of the most useful and
common operations to perform on files.
Over the years a large number of tools have been created that use
REs, including most text editors, grep, sed, awk, and others.
The shell wildcards (“globs”) might be considered a type of
RE.
While the idea of REs is standard, different tools may
use slightly different syntaxes, or dialects.
Some of these tools also contain extensions that may be useful.
Perl's REs are about the most complex and useful
dialect, and are sometimes referred to as PREs.
(See the man pages perlre(1), perlrequick(1), and perlretut(1).)
Eventually POSIX stepped in and standardized
RE syntax; it is mostly compatible with the original
ed REs but with many additions.
POSIX uses the terms BRE for “basic regular
expression” and ERE for “extended regular
expression”.
Some utilities use BREs and others use EREs.
See regex(7) for details, or the
“Regular Expressions” chapter of the POSIX/SUS standard.
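A small illustration of the two POSIX dialects, assuming a grep that uses BREs by default and EREs with -E: the same "two or more a's" pattern is spelled differently in each.

```shell
# BRE: the bound is written with escaped braces, \{ and \}.
printf 'a\naa\naaa\n' | grep -x 'a\{2,\}'

# ERE: the bound is written with plain braces.
printf 'a\naa\naaa\n' | grep -E -x 'a{2,}'
```

Both commands print aa and aaa; the -x option requires the whole line to match.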
Most RE dialects work this way: one line (usually)
is read into some buffer.
Next the RE is matched against the text in the buffer.
Matched means any sequence of text in the buffer corresponds
to the RE.
The whole buffer doesn't need to match.
For example, if the buffer text is “AABBCC”,
then the RE “AB” matches
(the second and third characters).
In a programming environment such as perl
or an editor (sed), if the RE matches then some
additional steps (such as modification of the buffer) may be
done.
(With a tool such as grep, each line of input is
put into a buffer and each matching line is just printed.)
Because of this it sometimes matters exactly what text in the
buffer matched.
(For example, a tool such as sed may modify or delete
the matched text from the buffer.)
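The difference between "the line matched" and "this text matched" can be seen with a minimal sketch using the AABBCC example. The angle brackets are only markers added here to make the matched text visible.

```shell
# grep prints the whole line when any part of it matches.
echo 'AABBCC' | grep 'AB'          # prints AABBCC

# sed changes only the matched text; & stands for whatever the RE matched.
echo 'AABBCC' | sed 's/AB/<&>/'    # prints A<AB>BCC
```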
If more than one match is possible an RE will match
the one that starts earliest in the buffer (so the RE
“AB” will match the first and second characters
of the text “ABCABC”).
If more than one match starts on the same character the
RE will match the longest one.
(See Greedy Matching below.)
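Both rules can be checked with sed, again using angle brackets purely as visual markers around the matched text.

```shell
# Two places where AB could match; the earliest one is chosen.
echo 'ABCABC' | sed 's/AB/<&>/'    # prints <AB>CABC

# Several matches for a* start at the first character; the longest wins.
echo 'aaab' | sed 's/a*/<&>/'      # prints <aaa>b
```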
Top-down explanation (from the regex(7) man page):
An RE is one or more branches separated with “|” and matches text if any of the branches match the text. A branch is one or more pieces concatenated together, and matches if the first piece matches, then the next matches from the end of the first match, until all pieces have matched. A piece is an atom optionally followed by a modifier: “*”, “+”, “?”, or a bound (i.e., a range such as “{2,5}”). An atom is a single character RE or “(RE)”.
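To make the terminology concrete, here is a made-up ERE broken down in the comments into branches, pieces, and atoms. The pattern and the test words are illustrative only.

```shell
# Hypothetical pattern with two branches separated by |:
#   colou?r   pieces: c, o, l, o, u? (atom u, modifier ?), r
#   (ab)+c    pieces: (ab)+ (atom (ab), modifier +), c
re='colou?r|(ab)+c'

# -x requires the whole line to match, so each word is tested in full.
printf 'color\ncolour\ncolouur\nababc\nac\n' | grep -E -x "$re"
```

This prints color, colour, and ababc; colouur and ac match neither branch.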
Some characters used to express REs (meta-characters)
are only special if they appear in a specific context.
For instance, in a BRE the “^” is special only if it is
the first character of some RE, the “$” only if the last.
The “*” is not special if the first character in an RE.
A “{” followed by a character other than a digit
is not the beginning of a bound.
A backslash is always special, so it is illegal to end an
RE with one.
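These context rules can be tried out with sed, which uses BREs by default. This is a small sketch; the replacement text X is arbitrary.

```shell
# In a BRE, ^ anchors only when it comes first; elsewhere it is literal.
echo 'a^b' | sed 's/^a/X/'     # anchor: prints X^b
echo 'a^b' | sed 's/a^b/X/'    # literal ^: prints X

# Likewise $ anchors only when it comes last.
echo 'a$b' | sed 's/b$/X/'     # anchor: prints a$X
echo 'a$b' | sed 's/a$b/X/'    # literal $: prints X

# A leading * has nothing to repeat, so it is literal.
echo '*x' | sed 's/*x/X/'      # prints X
```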
Special (or meta-) characters lose their meaning
if escaped, that is, preceded with a backslash
(“\”) character.
For POSIX these are the characters in the following list:
“^.[$()|*+?{\”.
(In some dialects the reverse is true; meta-characters only have
special meaning when escaped!)
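A short sketch of escaping in both directions, assuming grep with POSIX EREs (-E) and BREs (the default). BRE is one dialect where some characters, such as parentheses and braces, are special only when escaped.

```shell
# ERE: an unescaped . matches any character; \. matches only a dot.
printf 'a.b\naxb\n' | grep -E -c 'a.b'    # prints 2
printf 'a.b\naxb\n' | grep -E 'a\.b'      # prints a.b

# BRE: ( is literal, while \( \) form a group and \{ \} a bound.
echo 'f(x)' | grep 'f(x)'                 # prints f(x)
echo 'abab' | grep -x '\(ab\)\{2\}'       # prints abab
```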
Some special characters (in POSIX, all) lose their special meaning
when used in a character class, that is, a list of characters
enclosed in “[” and “]”.
To include a literal “]” in the list,
make it the first character (following a possible “^”).
To include a literal “-”, make it the first
character (following a possible “^”)
or the last character.
In POSIX the other metacharacters, including “\”, lose their
meaning in the list; in some other dialects, such as Perl's,
“\” stays special there.
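The placement rules for “]” and “-” can be sketched with grep; the input strings are made up for the demonstration.

```shell
# ] placed first in the list is literal: this class matches ] or x.
printf 'a]b\naxb\nayb\n' | grep 'a[]x]b'     # prints a]b and axb

# - placed last (or first) is literal: this class matches a digit or -.
printf '1-2\n1+2\n' | grep '1[0-9-]2'        # prints 1-2

# After a leading ^ the same rules apply: anything except ] or -.
printf 'a]b\na-b\nazb\n' | grep 'a[^]-]b'    # prints azb
```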
The list can include one or more predefined
character classes.
In POSIX these look like: [:name:],
where name is one of:
alnum
alpha
blank   (space or tab only)
cntrl
digit
graph   (any printable character except a space)
lower
print   (any printable character, including a space)
punct
space   (any white-space character)
upper
xdigit  (any hexadecimal digit)
[[:xdigit:]] is the same as [[:digit:]a-fA-F],
which is the same (in the C locale) as [0-9a-fA-F].
In some RE dialects a backslash-character is used,
for instance “\d” instead of “[[:digit:]]” for a digit.
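A brief sketch of the POSIX classes in use. Note the doubled brackets: the class name, with its own brackets, sits inside a bracketed list. The input words are arbitrary.

```shell
# A class on its own inside a list.
printf 'abc\n123\nx9\n' | grep -E -x '[[:digit:]]+'    # prints 123
printf 'abc\n123\nx9\n' | grep -E -x '[[:alpha:]]+'    # prints abc

# A class mixed with another list member (hex digits or underscore).
printf 'dead_beef\nDEADBEEF\nxyz\n' | grep -E -x '[[:xdigit:]_]+'
```

The last command prints dead_beef and DEADBEEF, since xdigit covers both upper- and lowercase hexadecimal letters.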
In the event that an RE could match more than one substring of a given string, the RE matches the one starting earliest in the string. If the RE could match more than one substring starting at that point, it matches the longest. (This is often called greedy matching.) Subexpressions also match the longest possible substrings, subject to the constraint that the whole match be as long as possible. Perl supports both greedy and non-greedy (also known as reluctant, minimal, ungreedy, or generous) modes. For example:
$ echo '12345' | sed 's/\([0-9]*\)\([0-9]*\)/|\1|\2|/'
|12345||
The above example shows how the first subexpression matched the whole buffer, while the second matched nothing.
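As a variation on the same command, and only as a sketch, limiting the first subexpression with a bound shows how the split between the two groups changes once the first group can no longer take everything.

```shell
# Greedy first group takes all the digits; the second group is left empty.
echo '12345' | sed 's/\([0-9]*\)\([0-9]*\)/|\1|\2|/'        # prints |12345||

# First group limited to exactly two digits; the rest goes to the second.
echo '12345' | sed 's/\([0-9]\{2\}\)\([0-9]*\)/|\1|\2|/'    # prints |12|345|
```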