(POSIX) Shell Command Line Processing Outline

Whether reading input from a file (a script) or from the keyboard, the processing steps are the same.  A few extra steps are done on most modern shells when the shell is run interactively.  These include displaying prompts, history expansion, and line editing; these features don't apply to scripts.  Also, the shell that is run first when a user logs in is called a login shell.  A login shell runs a login script (other shells don't) before printing a shell prompt.  Finally, some shells have a restricted mode, which disables a number of features.  The idea is to provide only a restricted shell to guest users and to system accounts that need to run shell scripts.

Most shells accept command line arguments that change their behavior to a strictly POSIX compliant mode (or restricted, login, or interactive mode).

POSIX shells process input by following the POSIX Shell Command Language Introduction steps.  Modern shells such as Bash, Korn shell, Z shell, etc., add several extra steps to support their extra features.  The shell does a number of distinct chores, in this order:

  1. (Input)  The shell reads its input either from a file, from the -c command option, or from stdin (standard input).  Note if the first line of a file of shell commands starts with the characters #!, the results are unspecified (because some other utility is then responsible for reading the file).
  2. (Tokenizing)  The shell breaks the input into tokens: words and operators separated by spaces (or some meta-characters).  This is a complex step!  One important part is quoting; this is used to remove the special meaning of meta-characters or words to the shell.  Quoting can be used to preserve the literal meaning of the following special characters (and prevent reserved words from being recognized as such):
    
      | & ; <  > (  ) $ ` \ " ' <space> <tab> <newline>
    

    The following characters need to be quoted under certain circumstances:

    
      *  ?  [  #  ~  =  %
    

    Ignoring quoting for the moment, tokenizing can be explained this way:  The shell reads input one character at a time until a token is recognized.  The special characters listed above are tokens by themselves.  Words are just word tokens; however, at the start of a command some words are recognized as keywords such as if, while, fi, and so on.  These same words are not keywords if seen elsewhere on the command line, or if quoted.  This explains why you need a semicolon or newline in front of then and fi in this example:

    if test -r foo; then ...; fi
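
    Position (or quoting) defeats keyword recognition; for example, here the words are not at the start of a command, so they are ordinary word tokens:

    echo if then fi    # prints: if then fi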
    

    Word tokens are separated by white-space.  (The white-space separators are not themselves tokens.)  Finally, there is the maximum munch rule, used to resolve the ambiguous case when some sequence of characters may be interpreted as either a single token or two tokens.  Consider these examples:

    & &                2 tokens
    &&                 1 token
    date&&cal          3 tokens
    echo hello 2> foo  4 tokens: echo, hello, 2>, foo
    echo hello 2>foo   the same 4 tokens
    echo hello2>foo    4 tokens: echo, hello2, >, foo
    

    Quoted characters are always (parts of) word tokens, never operators or keywords.  The quoting mechanisms are single-quotes ('x'), double-quotes ("x", inside which '$', '\', and '`' are still special), and the escape character (\x, when x is a meta-character), which is sometimes called escaping.

    From the SUS:  The backslash shall retain its special meaning as an escape character [inside double-quotes] only when followed by one of the following [5] characters [...]:  $ ` " \ newline
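
    For example (using printf, since some echo implementations do their own backslash processing):

    printf '%s\n' "\$HOME \\ \x"    # prints: $HOME \ \x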

    (Note '!' acts weirdly (history expansion) in Bash inside of double quotes on the command line, but not in a shell script, since history is turned off there.)

    Another part of tokenizing is line joining.  If a line ends with (an unquoted) backslash, the \newline is skipped and the next line is joined to the current one.  The \newline doesn't separate tokens!  (Line joining applies inside double-quotes too.)
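
    For example, these two physical lines are joined into one command:

    echo one \
    two                # same as: echo one two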

    The end of a command is marked by a control operator token, one of:

    &  &&  (  )  ;  ;;  newline  |  ||  (and EOF)
    

    These are assigned a precedence, to resolve the ambiguity of (say):

    false && echo one ; echo two # What's the output?
    

    (You can use command grouping to make this do what you want.)
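
    For example, grouping makes the && apply to both echo commands:

    false && echo one ; echo two      # prints: two
    false && { echo one; echo two; }  # prints nothing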

  3. Command parsing is next.  The shell examines all the tokens read and breaks them up into one or more commands.  The shell recognizes several kinds of commands:
    • simple commands (a sequence of optional variable assignments and redirections, in any sequence, optionally followed by words and more redirections, terminated by a control operator such as a ;, a newline, etc.),
    • pipelines (one or more commands separated by |),
    • AND-OR lists (one or more pipelines separated by either && or ||),
    • lists (a sequence of one or more AND-OR lists separated by ; or &),
    • compound commands (a grouped command, loop, if statement, case statement, etc.), and
    • function definitions.

    (Technically speaking, there are simple commands, function definitions, and everything else is a compound command.)

    Note that only simple commands can be preceded by variable assignments.  These are put into the environment of that command only.  So if FOO=one (and FOO is exported) then:

    FOO=two echo $FOO  # prints "one"!
    FOO=two sh -c 'echo $FOO'  # prints "two"!
    FOO=two eval echo \$FOO  # prints "two"!
    FOO=two for i in 1 2; do echo $i; done  # error!
    FOO=two (echo $FOO)  # error!
    (FOO=two env) | grep FOO  # prints "FOO=two"!
    (FOO=two w | env) | grep FOO  # prints "FOO=one"!
    

    Some simple commands are shell built-in commands, but these are in no way different from other, non-built-in utilities (the term isn't even used in the standard).  Any utility may be built-in (test, echo, and printf are common examples) or not.  But some other simple commands are called special built-in commands, which must be built in.

    There are two things that make special built-ins different from other utilities: the shell exits when a special builtin encounters certain (syntax) errors, and variable assignments preceding a special built-in persist after the builtin completes.  (Using the command command with a special built-in command suppresses both of those.)  The special builtins are: break, colon (:), continue, dot (.), eval, exec, exit, export, readonly, return, set, shift, times, trap, and unset.
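
    A quick demonstration of the assignment-persistence difference, assuming a strictly POSIX shell such as dash (Bash behaves this way only in its POSIX mode):

    unset x
    x=1 true; echo "x=$x"          # prints "x=": true is a regular utility
    x=2 : ; echo "x=$x"            # prints "x=2": colon is a special built-in
    x=3 command : ; echo "x=$x"    # prints "x=2": command suppresses the rule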

    At this point the shell will process each simple command separately, in order.  If the shell hasn't read in enough input to find the end of the command, it goes back to the preceding (tokenizing) step and reads in more input.  For each simple command the following is done:

    a. The words that are recognized as either variable assignments (name=value) or redirections (“[n]op word”) are removed and saved for later processing in steps c and d.
    b. The remaining words (that are not variable assignments or redirections) are expanded as described below in steps 4 through 7.  If any fields remain following their expansion, the first field shall be considered the command name and the remaining fields are the arguments for the command.
    c. Redirections (found in step a) are performed next.  If the redirection operator (“op”) is “<<” or “<<-” (a here doc), the word that follows op has quote removal done; it is unspecified whether any other expansions occur.  For the other redirection operators, the word that follows op is subject to tilde expansion, parameter expansion, command substitution, arithmetic expansion, and quote removal.
    d. Each variable assignment (found in step a) is expanded for tilde expansion, parameter expansion, command substitution, arithmetic expansion, and quote removal prior to assigning the value.
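
    For example, the word after a redirection operator undergoes the expansions named in step c:

    N=5; echo hi > /tmp/log$N    # creates /tmp/log5: the word after > was expanded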

    Note that field splitting is never done on variable assignments!  As long as the name=value is recognized as a single word in step 2, any expansions done on the value will result in a single word.  Consider:

    foo=*     # no quotes needed
    foo='x date'
    bar=$foo  # no quotes needed
    

    The order of steps c and d may be reversed when processing special built-ins.

    What is the output of the following?

    x=y echo $x
    x=y x=z sh -c 'echo $x'
    x=y : | x=z echo $x
    x=y : | x=z sh -c 'echo $x'
    env x=y echo $x
    
  4. (Expansions)  The shell performs several types of expansions on different parts of each command, resulting in a list of fields (or words), some to be treated as a command and the rest as arguments and/or pathnames (the command's parameter list).  An expansion that occurs within a single word expands to a single field; only field splitting or pathname expansion can create multiple fields from a single word.  The single exception to this rule is the expansion of the special parameter @ within double-quotes.

    The expansions are done in this order (and are discussed in detail later):

    a. Alias expansion is done if the command name word of a simple command is determined to be an unquoted, valid alias name.
    b. Tilde expansion  (Expands words of the form ~username to the absolute pathname of the home directory for username.  If the word is a bare tilde (~) it expands to the absolute pathname of the current user's home directory.)
    c. Parameter expansion  (Expands words that start with a dollar sign ($) to the value of the named parameter or variable.  Note if no such parameter is found this expands to nothing.  Also note that words of the form $(stuff) are treated specially.)
    d. Command substitution  (Expands $(embedded command line) and `embedded command line` by recursively processing and running the embedded command, and replacing the embedded command line with its standard output.)
    e. Arithmetic expansion  (Expands words of the form $((expression)).)
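
    Taken together (and assuming no aliases are involved), a single command line can trigger most of these expansions at once:

    echo ~ "$HOME" $(date) $((2 + 3))
    # tilde, parameter, command substitution, and arithmetic expansions
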
  5. Field splitting is performed on the fields generated by the previous step.  This is because what was one word before expansions may result in multiple words after expansions, for example:
    FILES="file1 file2"
    ls $FILES  # results in: ls file1 file2
    

    But since tokenizing was already done, file1 file2 would remain a single word and cause an error!  So the expanded tokens (which are called fields) need to be split into separate words.  (Demo: IFS=; FILES='foo bar'; ls $FILES)

    Running IFS=: echo $PATH doesn't work as you might expect; PATH is expanded before the new IFS setting takes effect (the assignment appears only in echo's environment).  You can use eval for this:  IFS=: eval echo \$PATH

    Field splitting is controlled by the parameter IFS.  If set to null (i.e. IFS="") no field splitting is done.  Otherwise the shell treats each character of IFS as a field delimiter.  The results of unquoted expansions are split into fields (words): runs of IFS white space count as a single separator (and leading or trailing white space is skipped), while each non-white-space IFS character delimits a field wherever it appears (so adjacent ones produce empty fields).

    If IFS is unset then the default delimiters are <space>, <tab>, and <newline>.  For example, the input:

    <newline><space><tab>foo<tab><tab>bar<space>
    

    yields two fields, foo and bar.
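
    Since each non-white-space IFS character delimits a field, you can (for example) count the components of PATH; a sketch, saving and restoring IFS around the demo:

    old_IFS=$IFS
    IFS=:
    set -- $PATH            # unquoted expansion is split on ':'
    echo "$# directories"
    IFS=$old_IFS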

    Brace Expansions: While not part of POSIX, most modern shells also do brace expansion.  Brace expansion generates arbitrary strings, not filenames.  There are two forms.  A word of the form pre{list}post will expand into one word for each item in list.  For example:

    vi file{1,2,3,}.txt
    

    will expand to the line:

    vi file1.txt file2.txt file3.txt file.txt
    

    Note pre may not be $.  The list must include at least one unquoted comma.

    The second form uses a range instead of a list between the braces:

    vi file{1..10}.txt
    

    A range can be integers or characters (e.g. ls /dev/sd{a..g}).

    Depending on the shell, you may or may not be able to use shell special characters in list without quoting them; it depends on when the shell does brace expansion.  Bash does brace expansion between steps 4a and 4b; Zsh does it after step 4e (and field-splitting isn't done!); and ksh does it between steps 5 and 6.
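
    One way to see Bash's ordering (brace expansion happens before parameter expansion):

    echo {a,b}              # prints: a b
    x='{a,b}'; echo $x      # prints: {a,b} -- too late for brace expansion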

    Ksh supports more elaborate forms as well.  Also, Zsh only permits a range of numbers.

  6. Pathname (or wildcard) expansion is done next (unless the set option -f is in effect).  Note that unlike the other expansions, this can produce multiple words.  Also, this is the only expansion done after field-splitting.  So if “*.txt” expands out to several filenames, each name is one word even if it contains spaces.
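
    For example (assuming the current directory contains some .txt files):

    set -f; echo *.txt    # prints: *.txt (pathname expansion disabled)
    set +f; echo *.txt    # prints the matching filenames
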
  7. Quote removal is performed next.  This simple step just removes the shell quoting characters, as they are no longer needed.  (If the complete expansion for a word results in an empty field, that empty field is deleted from the expanded command unless the original word contained single-quote or double-quote characters.)
  8. I/O redirection is performed.  Then any redirection operators and their operands are removed from the parameter list.
  9. The shell is (finally!) ready to execute the command (which may be a function, built-in, executable file, or script).  It sets up the environment first, giving the names of the arguments as positional parameters numbered 1 to n, and the name of the command (or the name of the script) as the positional parameter numbered 0.  The environment for the new command is initialized from the shell's current environment, modified by any I/O redirections and variable assignments made.
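
    For example, sh -c shows the positional parameters being set up:

    sh -c 'echo $0 $1 $2' myname one two    # prints: myname one two
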
  10. After starting the command the shell optionally waits for the command to complete (unless it was started in the background) and collects the exit status, setting the special parameter "$?" to that value.
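
    For example:

    sleep 3 &    # started in the background; the shell doesn't wait
    wait $!      # now wait for it explicitly ($! holds its PID)
    echo $?      # prints: 0 (sleep's exit status)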

Note the history mechanism works at some point (via the readline library on Linux), but it is not part of POSIX.  Since some characters have special meaning to readline (such as '^' and '!'), these may appear to be meta-characters sometimes and not meta-characters at other times.  It depends on the shell in use and its history configuration, on whether readline is used, and on your ~/.inputrc file (which is used to configure readline).  Apparently readline knows about single-quote and backslash quoting, but doesn't recognize double-quotes.