Brief Overview of Locales for Unix and Linux

A locale is a definition of language (and encoding, e.g. UTF-8), time, currency, and other number formats that vary by language and geographical region.  Related formats are grouped into categories.  *nix systems include a number of environment variables (one per category) that you can use to pick these data formats by specifying a locale for each.

The settings in a locale reflect the cultural rules of a language and geographic region (i.e., a country or territory) for formatting data.  A locale name looks like lang[_region][.encoding][@variant], for example en_US.utf8.  Only the lang part is required.  The POSIX (or C) locale is always defined, but others may or may not be defined (installed) on any given system.  A locale can also be an absolute pathname to a file produced by the localedef utility.
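As an illustration of the name format, here is a minimal Python sketch that splits a locale name into its four parts.  The names LocaleName and parse_locale_name are hypothetical, and the pattern is a simplification: it only mirrors the lang[_region][.encoding][@variant] shape described above and does not check whether the locale is installed.

```python
import re
from typing import NamedTuple, Optional


class LocaleName(NamedTuple):
    """Parts of a locale name; only lang is required."""
    lang: str
    region: Optional[str]
    encoding: Optional[str]
    variant: Optional[str]


# Hypothetical pattern for lang[_region][.encoding][@variant].
_LOCALE_RE = re.compile(
    r"(?P<lang>[^_.@/]+)"
    r"(?:_(?P<region>[^.@/]+))?"
    r"(?:\.(?P<encoding>[^@/]+))?"
    r"(?:@(?P<variant>[^/]+))?"
)


def parse_locale_name(name: str) -> LocaleName:
    """Split a locale name into parts; absent optional parts become None."""
    if name.startswith("/"):
        # An absolute pathname refers to a localedef output file, not a name.
        raise ValueError("absolute pathname, not a locale name: %r" % name)
    match = _LOCALE_RE.fullmatch(name)
    if match is None:
        raise ValueError("not a valid locale name: %r" % name)
    return LocaleName(**match.groupdict())


print(parse_locale_name("en_US.utf8"))
print(parse_locale_name("C"))
```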

The POSIX categories and the environment variables for each are:

LC_CTYPE
Character classification (letters, digits, ...) and case conversion.
LC_COLLATE
Collation (sorting) order.
LC_MONETARY
Monetary formatting.
LC_NUMERIC
Numeric, non-monetary formatting.
LC_TIME
Date and time formats (but not time zones).
LC_MESSAGES
Formats of informative and diagnostic messages and interactive responses.  (Related to NLSPATH.)

(Additional categories such as LC_ADDRESS or LC_PAPER may be available on some systems.)  If some LC_* variable is not set, the value of LANG is used to define its locale.  If LC_ALL is set, that value overrides any other LC_* and LANG settings.
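The precedence rules above can be sketched as a small Python function.  The name effective_locale is hypothetical.  Two assumptions are built in: an empty value is treated the same as an unset variable, and "C" is used as the fallback when nothing is set (the actual default is up to the system).

```python
from typing import Mapping


def effective_locale(category: str, env: Mapping[str, str]) -> str:
    """Return the locale that applies to one category, e.g. "LC_TIME".

    Order of precedence: LC_ALL, then the category's own variable, then LANG.
    """
    for variable in ("LC_ALL", category, "LANG"):
        value = env.get(variable, "")
        if value:  # assumption: empty string counts as unset
            return value
    return "C"  # assumption: fall back to the C locale when nothing is set


env = {"LANG": "en_US.UTF-8", "LC_TIME": "de_DE.UTF-8"}
print(effective_locale("LC_TIME", env))     # de_DE.UTF-8 (own variable)
print(effective_locale("LC_COLLATE", env))  # en_US.UTF-8 (from LANG)

env["LC_ALL"] = "C"
print(effective_locale("LC_TIME", env))     # C (LC_ALL overrides)
```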

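You can see the current values used for each category, and details on each installed locale, by using the locale command.  Run with no arguments, it prints the current setting of each category; locale -a lists the names of the available locales.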
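A program can inspect the same per-category settings from inside the process.  The following Python sketch uses the standard locale module; the function name current_locale_settings is hypothetical.  It first asks the C library to adopt the environment's settings, and keeps the existing ones if the environment names a locale that is not installed.

```python
import locale

# The six POSIX categories listed above.
CATEGORY_NAMES = (
    "LC_CTYPE", "LC_COLLATE", "LC_MONETARY",
    "LC_NUMERIC", "LC_TIME", "LC_MESSAGES",
)


def current_locale_settings() -> dict:
    """Map each available category name to the locale currently in effect."""
    try:
        # An empty string means "take the settings from the environment".
        locale.setlocale(locale.LC_ALL, "")
    except locale.Error:
        pass  # the environment names a locale that is not installed here

    settings = {}
    for name in CATEGORY_NAMES:
        category = getattr(locale, name, None)  # not every platform has all six
        if category is not None:
            # Passing None queries the setting without changing it.
            settings[name] = locale.setlocale(category, None)
    return settings


for name, value in current_locale_settings().items():
    print(name, "=", value)
```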

To portably set your locale, it is best to set the LC_ALL environment variable to C (or POSIX).  Setting only (for example) LC_COLLATE has two problems:  it is ineffective if LC_ALL is also set, and it has undefined behavior if LC_CTYPE (or LANG if LC_CTYPE is unset) is set to an incompatible value.  For example, you get undefined behavior if LC_CTYPE is ja_JP.PCK and LC_COLLATE is en_US.UTF-8.
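To show why a fixed locale gives predictable results, here is a small Python sketch that sorts strings under the C locale's collation rules.  The function name sort_in_c_locale is hypothetical.  It changes only LC_COLLATE within the current process and restores the previous setting afterwards; this is an in-process analogue of the advice above, not a replacement for setting LC_ALL in the environment.

```python
import locale


def sort_in_c_locale(words):
    """Sort strings using the collation order of the C locale."""
    previous = locale.setlocale(locale.LC_COLLATE, None)  # query only
    locale.setlocale(locale.LC_COLLATE, "C")  # the C locale is always defined
    try:
        return sorted(words, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)  # restore prior setting


print(sort_in_c_locale(["b", "A", "a", "B"]))  # ['A', 'B', 'a', 'b']
```

In the C locale, upper-case ASCII letters sort before lower-case ones.  Other locales may order these strings differently, which is one reason output can differ between systems.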

Most shell scripts probably should set LC_ALL to POSIX at the top of the script.  (You may want to set TZ to UTC0 as well, especially for utilities that record a date in the current timezone such as diff and tar.)
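The same advice applies when a program launches other utilities.  This Python sketch builds an environment with LC_ALL and TZ fixed as described above and passes it to a child process.  The names posix_environment and run_portably are hypothetical, and the child command shown is only a stand-in that echoes the two variables back.

```python
import os
import subprocess
import sys


def posix_environment(base=None):
    """Copy an environment and pin the locale and time zone."""
    env = dict(os.environ if base is None else base)
    env["LC_ALL"] = "POSIX"  # overrides every other LC_* variable and LANG
    env["TZ"] = "UTC0"       # keeps recorded dates independent of local time
    return env


def run_portably(argv):
    """Run a command with the pinned environment and return its output."""
    result = subprocess.run(
        argv, env=posix_environment(),
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# Stand-in child process: print the two variables it actually received.
child = "import os; print(os.environ['LC_ALL'], os.environ['TZ'])"
print(run_portably([sys.executable, "-c", child]).strip())
```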

The standard utility iconv can be used to convert between (compatible) text encodings.  Use iconv -l to list all available encodings on your system.
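The idea behind such a conversion can be sketched in Python.  Note that this uses Python's own codecs, not iconv, so the set of encoding names it accepts may differ from what iconv -l reports.  The function name transcode is hypothetical.

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Re-encode bytes from one text encoding to another.

    Raises UnicodeDecodeError if data is not valid in the source encoding,
    and UnicodeEncodeError if the target encoding cannot represent a
    character (that is, the encodings are not compatible for this text).
    """
    text = data.decode(source)   # bytes in the source encoding -> text
    return text.encode(target)   # text -> bytes in the target encoding


latin1_bytes = b"caf\xe9"        # "café" encoded as ISO 8859-1 (Latin-1)
print(transcode(latin1_bytes, "latin-1", "utf-8"))  # b'caf\xc3\xa9'
```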