Lecture 9 — awk

Awk is a filter command (added to Unix in the late 1970s).  It is named after its three co-inventors: Aho, Weinberger, and Kernighan.  Awk proved so popular that it was extended with many useful features about five years later.  Today Gnu provides gawk, and other dialects such as nawk (new awk) exist and are POSIX compliant.  Gnu awk (gawk) is the most popular version and includes many very useful features missing from POSIX awk, including regex backreferences, TCP/IP and PostgreSQL DB commands, and many minor improvements.

The oldest version of awk dates from the 1970s.  Awk version 2 was invented in the mid-1980s.  POSIX awk is based on awk v2.  However many systems provide multiple versions (for backward compatibility) and the version called “awk” on your system may be the original version (sometimes called oawk), version 2/POSIX (sometimes called nawk), or the Gnu version (often also called gawk).  There are other versions too.

Until Perl came along, awk was the most powerful filter available.  (And since Perl isn’t part of POSIX you may not find it on all systems; awk is always available.)  Unlike most *nix filters that do a single task, awk (like sed) is a multi-tasker.  You can use awk instead of a pipeline of many simpler utilities.  Awk is well-suited for many tasks such as generating reports, validating data, managing small text databases (e.g., address book, calendar, rolodex), document preparation (e.g., producing an index), extracting data, sorting data, etc.

Among awk’s more interesting features is its ability to automatically break the line into fields, to do math, and to perform a full set of programming language tasks: if, loops, variables, functions, etc.  These features make awk very useful for both one-liner scripts and for more complete programs.  Awk is normally used to process data files and to produce reports from them or to re-arrange the files.

Awk is used this way:  awk [options] 'script' [argument ...].  The script consists of one or more awk statements.  Typically, the script is enclosed in single quotes (and may be several lines).  For longer scripts you can also specify a file containing the script with the -f file option (which can be repeated; all the scripts are concatenated).  The arguments are filenames just like for other filter commands.

Actually, the arguments may be either filenames or variable assignments.  (So avoid using filenames with an equals-sign in them!)  Such assignments are processed just before reading the following file(s).  (See also the ‑v assignment option, which processes the assignment before anything else.)
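For example (a sketch; the data file names are made up):

    awk -v debug=1 'debug { print FILENAME, region, $0 }' east.txt region=west west.txt

Here debug is set before anything runs (the -v option), while region is null while east.txt is read and becomes "west" just before west.txt is read.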

Awk has a cycle like that of sed:  A line (record) is read, then broken into words (fields).  Next each awk command is applied, in order, to the line, if it matches.  Then the cycle repeats.

Each awk statement has two parts:

          pattern         { action }   semicolon_or_newline

A missing pattern means to apply the action to all lines.  A pattern with no action means to print any matching lines (like grep).
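For example (a sketch; data.txt is a made-up file):

    awk '/error/' data.txt                  # no action: print matching lines (like grep)
    awk '{ print $1 }' data.txt             # no pattern: act on every line
    awk '/error/ { print $1 }' data.txt     # both: field one of matching lines only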

A statement can appear on a single line and no space is needed between the pattern and action (or between statements).  For readability, usually you put one statement per line, with spaces or tabs between the pattern and action.  The open curly brace of an action must be on the same line as the statement’s pattern.

Statements must be separated by a newline or by a semicolon.  (Gnu awk doesn’t require a semicolon or newline between statements; e.g.: awk '{print}{print}' instead of '{print};{print}').

An awk statement may also be a function definition. You can use those in actions.  (It doesn’t matter where in the script you define these; the whole script is read before it is run.)  Function definitions look like this:

function name(parameter_list) { statements }

Discuss style: one-liners, multiple short statements, or long actions.  Bad style:

awk '{print};function foo(){print "foo"};$0=="abc"{foo();next}' FILE
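The same script is much easier to read spread over several lines (the behavior is identical; it doesn’t matter where the function definition goes):

awk '
    function foo() { print "foo" }
    { print }
    $0 == "abc" { foo(); next }
' FILE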

Awk variables don’t have to be declared and can hold either numbers or strings.  Many variables are pre-defined, e.g., RS is the record separator (normally a newline).  Awk also provides one-dimensional arrays but the index is a string.  (See below.)

The value of a variable will be converted as needed to/from string, numeric, and Boolean values.  To force a variable to be treated as a number, add 0 to it; to force it to be treated as a string, concatenate it with the null (empty) string.  When converting a string to a number only a leading number is used.  A string without a leading number converts to zero.  true/false values:  An expression that evaluates to the number zero or null string is considered false.  Anything else means true.  (Somewhat surprising is that "0" converts to true, not false!)
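A short sketch demonstrating these rules:

awk 'BEGIN {
    x = "3.5 inches"; print x + 0       # prints 3.5 (leading number used)
    y = "inches";     print y + 0       # prints 0 (no leading number)
    n = 7;            print n "" "!"    # concatenation forces a string: prints 7!
    if ("0") print "the string \"0\" is true"
}'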

String literals are enclosed in double-quotes and can use common (C) backslash escapes: \n, \t, \\, etc.

Awk reads input up to the RS value (record separator).  This is normally a newline but can be any single character.  If RS is set to null (“RS=""”) then a blank line is the record separator.  Also a newline becomes an additional field separator.
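For example, a file of blank-line-separated records can be handled like this (a sketch; the file name and layout are made up):

awk 'BEGIN { RS = ""; FS = "\n" }   # paragraph mode: each line is one field
     { print $1 ": " $NF }          # first line and last line of each record
' addresses.txt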

Next awk splits the line into fields using the value of FS as the field separator.  FS may be a single character, a single space (the default, meaning any run of white-space (tab, space, or newline) separates fields, with leading and trailing white space ignored), or some ERE.  Compare the default with “FS="[ ]"” (a single literal space):

$ echo ' 1  2  3 ' | awk '
BEGIN {FS=" " }
{ printf NF ": " $1
  for (i=2; i<=NF; ++i)  printf "|" $i
  print ""
}'

3: 1|2|3

(With FS set as “FS="[ ]"” instead, the output is “7: |1||2||3|”.)

Once the line (record) is parsed, awk sets the variables $1, $2, ... to each field’s value.  $0 is the whole line.  NF is the number of fields.  You can use $expression to refer to any field.  Using $NF and $(NF-1) is common.

Changing NF will add/remove fields from the line.  Assigning to a non-existing field adds fields to the line (and resets NF; skipped fields contain null strings).
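For example:

$ echo 'a b c' | awk '{ $2 = "X"; print }'
a X c

$ echo 'a b c' | awk '{ $5 = "e"; print NF; print }'
5
a b c  e

(Note the two spaces before “e”: the new, empty field $4 is joined with OFS on each side when $0 is rebuilt.)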

Awk sets NR and FNR to the line (record) number; FNR starts over for each file.  You can set the OFS and ORS (output field/record separator) characters too.

After parsing the input, awk checks (in order) each statement in the script, to see if the pattern matches.  The pattern is usually an ERE but can be any Boolean expression.  If true then the action is executed.  When the end of the script is reached, awk starts a fresh cycle.

AWK patterns may be one of the following:

BEGIN    (All statements with this pattern are run before reading any data)

END      (All such statements run in order after all data is read.)

Expressions                     (generally used to test field values):

/ERE/    (matches against whole record/line)

text ~ /ERE/
text !~ /ERE/
lhs == rhs
    (or !=, <, <=, >, >=, in, etc.)

pattern && pattern

pattern || pattern

pattern ? pattern : pattern

 (pattern)       for grouping

! pattern

pattern1, pattern2        an inclusive range of lines

The EREs are the same as those used by egrep (unlike the BREs of grep and sed, metacharacters such as +, ?, |, and parentheses don’t need a backslash).
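A few pattern examples (a sketch; the field meanings are made up):

    NR == 1                          # the header (first) line only
    $3 > 100 && $1 ~ /^ab/           # field tests combined with &&
    /^BEGIN DATA$/, /^END DATA$/     # an inclusive range of lines
    ! /^#/                           # lines that are not comments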

Summary of pre-defined variables:

·       RS, FS       Record separator, field separator

·       OFS, ORS     print x, y outputs x OFS y followed by ORS

·       NR, FNR      Current record number; FNR restarts at 1 for each file

·       FILENAME     Current file being processed (when reading standard input, gawk shows this as “-”)

·       ARGC, ARGV, ENVIRON  Used to access parameters and the environment;  cmd line args (minus the script itself) are in ARGV[1]..ARGV[ARGC-1]

·       RLENGTH, RSTART  Set by match function; see below

·       SUBSEP       ary[x,y] is the same as ary[x SUBSEP y]; see arrays below

The most useful actions include print and printf:

print [comma separated list of expressions] - adds ORS at end, OFS between expressions

printf format, list of expressions that match place-holders in format:

    { printf "%-15s %03d\n", name, total }

(This will print the name in 15 columns, padded with blanks on the right, then the total in 3 columns, padded on the left with zeros.  The output will be nicely lined up in columns.)  Any print or printf can be followed with a redirection:

          print stuff | "some shell command"
and
          print stuff > "some file"     (>> works too)

(Don’t forget the quotes!)  Each print/printf maintains a separate stream, so you can redirect some output to one place, and other output to a different place.
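For example, this sketch splits the input into two files (the file names are made up) while a sorted copy of field one goes to standard output:

    $3 >= 0 { print > "good.txt" }
    $3 <  0 { print > "bad.txt" }
            { print $1 | "sort -u" }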

Operators are similar to those in C (and Java, JavaScript, Perl, ...), with these additions:

·       Any space between two strings means to concatenate those strings.

·       A “^” means exponentiation.

·       A “$numeric_exp” means a reference to a field.

·       A “~” (“!~”) is used to (not) match, e.g.,  var [!]~ /RE/.

·       The expression “val in array” is true if array[val] has been defined.

It is easy to forget that white-space is the concatenation operator, which has low priority.  So:

  print "result: " 1 + 2 3 + 4  # prints result: 37

Arrays — Awk supports one-dimensional arrays that use strings for the subscripts.  (These are often called associative arrays.)  You also use “in” with a special form of the for loop to iterate over each item in the array (var is set to each index):

for ( var in array ) statement

To simulate a 2-d array you used to have to use string concatenation, something like “array[i "," j]”.  Modern/POSIX awk allows us to use “array[i, j]”, which is the same as “array[i SUBSEP j]”.  (Show two-d-arrays.awk.)

When using the in operator with these pseudo-2-d arrays, use parentheses around the subscripts, e.g., “(x,y) in ary”.

To remove an element, use delete array[index].  (The statement “delete array” erases the whole array; originally a gawk extension, it was added to POSIX in 2012.  For older awks, use “split("", ARRAY)” to empty an array.)
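Associative arrays make counting easy.  This sketch counts how many times each word appears in the input and prints the counts, largest first:

    awk '{ for (i = 1; i <= NF; i++) count[$i]++ }
         END { for (word in count) print count[word], word | "sort -nr" }' file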

You can also use if, while, do statement while, for (init; test; incr), for (var in array), break, continue, exit, block-statements with { and }, next, nextfile, and getline [var].

Using shell variables in awk:  You can use the ENVIRON array to access awk’s environment.  You can access command line arguments with the ARGV array (indexed from 1 to ARGC-1).  A more common solution is to create a shell (not awk) script that runs  awk '... '"$1"' ...' to access shell’s $1.  Complex quoting can result!  (“awk '$1 == "'"$1"'" {...}'”).
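Using -v is usually cleaner than splicing shell variables into the script.  A small wrapper sketch (the script’s name and behavior are made up; note that -v and command-line assignments process backslash escapes in the value):

    #!/bin/sh
    # lookup: print lines whose first field equals the given value
    value="$1"; shift
    awk -v value="$value" '$1 == value' "$@"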

In addition, there are a slew of standard functions for math (int, exp, log, sqrt, rand, etc.) and for string manipulation (see the man page for details):

int(n)      Returns the integer part of n.

srand([num]), rand()     Generates random numbers; the only function in POSIX that can do so (without using C).  The srand function sets the random seed for rand.  It defaults to using the current time.  Each call to rand produces the next pseudo-random number in the sequence, in the range [0..1).  (This means including zero but excluding 1.)  Starting with the same seed value, you always get the same sequence.

                Often you need random integers in a given range.  To obtain a random integer from [0..n)  use something like this:

                    srand(); num = int(rand() * n)

               To have the numbers start at (say) 1 instead of zero, just add 1.

On Linux, you can use the shuf command for random numbers, e.g., “shuf -i 1-10”.

split(string, array [,ERE])         Splits the string into fields, storing each in the array.  If you don’t specify an ERE to use, the current value of FS is used instead to split the fields.  Note the array indexes start with 1, not 0.

sub(ERE, repl [,string])  Searches through string for the first occurrence of the ERE and replaces it with the text repl.  If the string is omitted $0 is used.  This works like the sed command s/ERE/repl/, except no back-references.  (Can use an “&” in repl to mean the text that matched.)

gsub(ERE, repl [,string]) The same as sub, but every occurrence is replaced.

length([string])      The length in characters (not bytes!) of the string, or $0 if string is omitted.

index(string, substring)   Returns the position in string of substring, or zero if it doesn’t occur.

match(string, ERE)         Returns the position of ERE in string, or zero if it doesn’t occur.  This function also sets the variables RSTART and RLENGTH to the starting position in string that matched the ERE, and the length in characters of the matched text.

substr(string, start [, length])    Returns the substring of string starting at position start, of length characters (if there are that many).  If length is omitted, the rest of the string is returned.  Note the first character is at position 1, not 0.  (It is common to use RSTART and RLENGTH with substr, right after using match().  This is useful since POSIX awk doesn’t support backreferences.)
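For example, to pull the text between double quotes out of each line without back-references (a sketch):

    { if (match($0, /"[^"]*"/))
        print substr($0, RSTART + 1, RLENGTH - 2)   # the text between the quotes
    }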

getline    This function has several forms.  It reads the next record of data, or reads from a specified file, or from the output of a pipeline, into either $0 (resetting NF etc.) or into some specified variable.
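The common forms (foo.txt and the variable names are made up):

    getline                     # next record into $0 (updates NF, NR, FNR)
    getline line                # next record into the variable line
    getline line < "foo.txt"    # read one line from a file
    "date" | getline now        # read one line of output from a command

Remember to close() files and commands when done with them (see close below).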

tolower(string), toupper(string), sprintf(format, args)    These have the obvious meanings.

close(stream)         Close is used when printing to a file or pipe, or using getline from a file or pipe.  (E.g., if your code includes > "foo" or |"cmd", you would use close("foo") or close("cmd")).

Gnu awk has the useful non-standard asort (sort an array by values), asorti (sort an array’s indexes), and gensub (like gsub but with back-references; note the double backslash in this example):

$ echo abc |awk '
{ s = gensub(/.*(b).*/, "x\\1y", "g"); print s }'

xby

Or to swap “<a,b>” to “<b,a>”:

$ echo '<a,b>' |awk '
 {s=gensub(/<(.*),(.*)>/,"<\\2,\\1>", 1); $0=s};1'

<b,a>

The match function also supports capture groups in Gnu awk.

Note that unlike sub and gsub, gensub doesn’t modify the string (optional 4th argument, $0 by default) but returns the result.  Also note the need to double up the backslashes in the replacement string (do that with sub and gsub too).

Examples are the best way to learn.  Awk one-liners are very handy for solving tasks that would otherwise require long pipelines of sed, cut, sort, etc.  Try to convert your other filter pipelines to awk.  Examples:

Explain this output:

$ awk 'BEGIN{print length( 0123 )}'
2

$ awk 'BEGIN{print length( 123 )}'
3

$ awk 'BEGIN{print length( "0123" )}'
4

$ echo 0123 | wc -c
5

Display the last X characters from every line:

awk -v X=5 '{printf("%s\n", substr($0, length($0)-X+1))}' file

Print and sort the login names of all users (from /etc/passwd):

BEGIN { FS = ":" }; { print $1 | "sort" }

Count lines in a file:     END { print NR }

Common AWK idiom: Work like sed (change lines that match, and output all):
  
awk ' /.../ { ... };1'
(The “1” is a pattern that matches anything, and the default action of print is done.  A non-zero integer expression for the pattern means “true”.)
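For example, to upper-case every line containing “error” and pass all other lines through unchanged:

    awk '/error/ { $0 = toupper($0) };1' file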

Precede each line by its number in the file:  { print FNR, $0 }

Display only the options of any command from its man page (bin/showoptions):

    man -s 1 "$*" |col -bx |awk '/^[ ]*-/,/^$/' |less

Print the first and fifth fields from /etc/passwd in the opposite order:

BEGIN { FS = ":" }; { print $5, $1 }

Summarize data.  (Show interrupt_cnt.awk.)  Note POSIX awk doesn’t have any built-in sort functions.  (Gnu awk does.)  The common solution is to run the output of awk through a pipe into the sort utility.  It is also possible to define a sort function in awk in about 15 lines:

function qsort (A, left, right, i, last) {
  if (left >= right)   # do nothing if array size < 2
    return
  swap(A, left, left + int((right-left+1)*rand()))
  last = left          # A[left] is now partition element
  for (i = left+1; i <= right; i++)
    if (A[i] < A[left]) swap(A, ++last, i)
  swap(A, left, last)
  qsort(A, left, last-1)
  qsort(A, last+1, right)
}

function swap (A, i, j, t) {
  t = A[i]; A[i] = A[j]; A[j] = t
}

Process INI files with awk: validate, display sorted list of name=values.  (Show cfg.awk.)

Print the last record of a file where each record is a block of lines that starts with “START_STRING”:

   tac file |awk '1;/START_STRING/{exit}' |tac

Address book lookup script, with fields defined for first, last, phone, etc.

Users logged in from HCC (and those logged in, but not from HCC).  (Process who output.)

Reformat the output of some command.  (Modify df output to skip %used column, and any non-disk lines.  Show df.awk and df2.awk.)

Print the IP address for a given host.  Parse the output of nslookup:

  nslookup foo.com |awk '$4~/^foo.com/{print $6}' RS=''

(Parsing the output of “host -t a wpollock.com” might be easier.)

Process Apache error log to find the top 10 “File does not exist” files and the web page referrer (with the bad link):  (Convert this sed script to awk)

$ sed -e \
'/File does not exist/s/^.* exist: \(.*\)$/\1/' \
-e '/, referer/s/^\(.*\), referer.*$/\1/' error_log \
|sort |uniq -c |sort -nr |head
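One possible awk version (a sketch, not necessarily the only way): select the “File does not exist” lines, trim them the same way the sed script does, and count with an array:

    awk '/File does not exist/ {
           sub(/^.* exist: /, "")       # keep the part after "exist: "
           sub(/, referer.*$/, "")      # drop the referer part
           count[$0]++
         }
         END { for (f in count) print count[f], f | "sort -nr" }
    ' error_log | head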

Define and use a function to clear an array (See function_demo.awk):

      { list[$1] = $2 }

  END { for ( name in list )
           print name, list[name] | "sort"
        clear( list )
      }

  function clear( array )
  {  for ( indx in array )  delete array[indx]
  }

There are some incompatibilities between Gnu awk and POSIX awk.  If necessary, you can make “awk” an alias for “gawk --posix”, and use gawk when you want to use the Gnu extensions.  Gnu awk also has an option “--lint” to check your script and display warnings for questionable code.

Print a range of lines from a syslog log file:  To display all log entries related to xinetd between two times:  /date1/,/date2/ {if ($5 ~ /^xinetd/) print}

Using a pair of patterns to select a range of lines only works if you know the exact dates that will appear in the file.  Otherwise you need a more complex solution.  (Show show-range.awk <show-range.dat.)
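One approach (a sketch, not the show-range.awk solution) is to compare the time stamps instead of matching them exactly.  Traditional syslog lines begin “Mon DD HH:MM:SS host program...”, so within a single day the times compare correctly as strings:

    awk -v start="09:00:00" -v end="17:00:00" \
        '$3 >= start && $3 <= end && $5 ~ /^xinetd/' /var/log/messages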

From the output of elfdump, print only the “Section Header[...]” blocks that do not have “SHF_WRITE” in them:

  awk 'BEGIN {RS=""; ORS="\n\n"} !/SHF_WRITE/'

Sort a file, except for the header:

  awk 'NR==1; NR>1{print|"sort"}; END{close("sort")}'

(Or use the gawk sorting features.)

Given a text file with some sections (sets of lines) delimited with, say, HEAD and FOOT, output the whole file with the delimited lines having '#' in front of them.

awk '/FOOT/{s="";next};{print s $0};/HEAD/{s="#"}' file

(Note the HEAD line itself is printed without the “#”, and the FOOT line is not printed at all; remove the “next” if you want to keep it.)

Print the month number, given a month name:

awk -v m=Apr '
BEGIN{print (match("JanFebMarAprMayJunJulAugSepOctNovDec", m)+2)/3}'

That doesn’t fail well with an invalid month.  Another approach is to build an array, using month names as indexes:

awk -v m=Apr 'BEGIN { month["Jan"]=1; month["Feb"]=2;
 month["Mar"]=3; month["Apr"]=4; month["May"]=5;
 month["Jun"]=6; month["Jul"]=7; month["Aug"]=8;
 month["Sep"]=9; month["Oct"]=10; month["Nov"]=11;
 month["Dec"]=12;
print month[m] } '
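The table can also be built in a loop with split, which is shorter and harder to get wrong (a sketch):

awk -v m=Apr 'BEGIN {
    n = split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", names)
    for (i = 1; i <= n; i++) month[names[i]] = i
    print month[m] + 0    # prints 0 for an invalid month name
}'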

When to use awk or sed or plain old shell or something else?

Sometimes you can solve a task using different tools.  There are times when one solution is “better” than another.  But most of the time, it won’t matter!

Any tool could be used when you only need to process a small amount of data, say hundreds to a few thousand lines.  If you have more than that, the performance of the shell may become noticeable.  This is because the shell generally processes text using code such as:

while read LINE
do : something with line
done < file

The something with line above often involves invoking external (non-built-in) utilities, such as tr, or even a sub-shell.  Creating a new process for every line can cause noticeable performance loss.  This is why script writers prefer using shell built-ins, even if it makes the script uglier to read.

Complex tasks, even for small amounts of data, may be best written in awk or sed.  This is because shell scripts can be tricky (many “dark corners”) and the best features are non-portable.  Portability matters (when your organization moves from Solaris to HP-UX, AIX, BSD, or Linux).  A portable script is also less likely to break when handed unexpected data.

sed is best for simple, line-at-a-time transformations (using “s”) over a small range of lines.  One reason is that sed’s BREs support back-references.

awk supports extended regular expressions but not backreferences.  (Note that Gnu awk does have some support for that.)  While you can achieve the same effect using several awk statements, a single sed script with an RE using back-references is simpler.  In addition, awk does a lot of parsing for every line of input (field splitting).  This is handy when you need it but does make awk slower than sed when you don’t need that feature.

Perl is much more powerful than awk and has better regular expressions than any standard utility, but Perl is slower (in most cases) than awk or sed, is non-standard, and is much more difficult to master.  Still, there are times when Perl’s extra features are handy, and you don’t need to master the whole Perl language to use it for powerful one-liners.  Perl is discussed next.  (There are other languages too, such as Python or Ruby.  They have similar features and issues to Perl.)

Here’s an example task, solved with shell, sed, and awk:  Write a script that displays the previous and following line to any line containing “$1”:

Shell (show “context.sh en afile” also with “three” or “ne”):

TEXT="$1"
shift
SHOWNEXT=0
PREVIOUS=
cat "$@" | while IFS= read -r LINE; do
  [ "$SHOWNEXT" = 1 ] && printf '%s\n---\n' "$LINE"
  SHOWNEXT=0
  case $LINE in
    *${TEXT}*) printf '%s\n%s\n' "$PREVIOUS" "$LINE"
               SHOWNEXT=1
               ;;
  esac
  PREVIOUS="$LINE"
done

sed (show “context.sed en afile”):

TEXT="$1"
shift

sed -n -e '/'"$TEXT"'/{x;1!p;g;$!N;p;$!a\
---
;D;}' -e 'h' -- "$@"

awk (show “context.awk en afile”):

BEGIN { text = ARGV[1]; delete ARGV[1]; shownext = 0 }
shownext  { print; shownext = 0; print "---" }
$0 ~ text { print previous; print; shownext = 1 }
          { previous = $0 }

To me it is obvious that awk is the way to go even if there isn’t a lot of data to be processed.  If portability isn’t a concern and you have Gnu grep, it has options for this task (grep -B1 -A1, or -C1).  (Note the solutions aren’t 100% as they will sometimes display an extra “---” after the last match.  Fixing that is left as an exercise for the reader.)

[Posted on comp.unix.shell by Joe Young <j.joeyoung@gmail.com> on 10/28/2010, as “Find Almost Uniq Lines in Sales Report”]

“I have a sales report.  And it prints several hundreds of lines like this:

1 Sales restaurant from 2010091009 and period is: 009 open and the store_id 04 20100910
2 Sales restaurant from 2010091009 and period is: 009 checking and the store_id 04 20100910
3 Sales takeaway from 2010091009 and period is: 009 filling and the store_id 04 20100910
4 Sales takeaway from 2010091009 and period is: 009 open and the store_id 04 20100910
5 Sales takeaway from 2010091009 and period is: 009 open and the store_id 04 20100910
6 Sales takeaway from 2010091009 and period is: 009 open and the store_id 04 20100910
7 Sales takeaway from 2010091007 and period is: 007 open and the store_id 10 20100910

Fields are: record-number, store-type [text:from ...], sales-number [text:and period is:...], period-number, store-status [text before: and the store_id ...], store_id, and the date.

Problem: many (but by no means all) lines are almost total duplicates, with the exception of their record number, which is always unique.  So I need to [discard duplicates].  I can’t just strip the numbers off, sort and uniq the remainder, and then make new record numbers.  (The record number needs to be authentic, but it doesn’t matter which record number is used from the duplicate records.)

Solution 1:

   awk '!a[substr($0,index($0, " ")+1)]++' file

This is another variant of a popular awk idiom of removing duplicates.  The base for that code is:

   awk '!a[key]++'

which is equivalent to the (clearer):

    awk 'a[key] == 0 { print } { a[key]++}'

The above prints all the lines having a key that has not been seen before.  “key” is a generic term; in practice you can use the whole line, a certain field, or in general any expression.  Here the “key” is the part of the line following the first space.  That’s the substring selected by substr($0,index($0, " ")+1).  Removing the “+1” would include a leading space in the key.

Solution 2:

   sort -uk2 file

This sorts uniquely on the full line from the 2nd field on, ignoring the first field (the record number we want to ignore).