Overview of Text and Font Concepts for Programmers

Scripts and Languages

Originally, written language was created to record spoken language.  A script (or writing system or character set) in this context is a means of doing so.  Note that a single script can be used to represent many languages; for example, the Latin (or Roman) script is the set of common symbols used to represent English, French, and many other languages.  The symbols in a script are called characters and can represent either sounds or ideas.  For example, the characters of the Latin script are the 26 uppercase and lowercase letters (plus some accent marks rarely used in English) that represent individual sounds (vowels and consonants), plus some other characters, such as the digits, that represent ideas.  Text is a sequence of characters/symbols from some script.  (Text can include characters from several scripts, and even represent sounds and ideas from different languages, but that is rare.)

Although you can represent any language with the Latin script (that is what commonly happens in chat rooms, email, and other on-line communications), it is preferable to communicate in a given language using its native script (the script designed for that language).  But placing non-English material online is a complex problem; computers were initially designed with English in mind.  The different types of scripts include alphabets (such as Latin), syllabaries, and ideographic (logographic) scripts such as Chinese.

Encoding of a Script's Characters into Text

No matter what writing system is used, when text is stored on a computer it is stored as a series of numbers.  The translation from a character to a number (and back again) is called encoding.  Every unique character from every supported script is assigned a unique number, called its code point.  Early computers encoded text in one of two ways, EBCDIC or ASCII.  (Eventually ASCII became dominant.)  These encoding schemes assigned numbers to each upper and lower case letter, each digit, and some punctuation marks and other symbols.  (In addition, some numbers were assigned special meanings to control the Teletype machine, the most popular early interactive input/output system.  Those numbers are called control characters.)

While fine for English-speaking computer programmers, ASCII did not support other languages well, due to a lack of accented characters and math, business, and engineering symbols.  Strange rules were needed to represent accented letters; for example, an 'E' character followed by a '~' character would be combined into a single accented character such as Ë.  This complicates text-processing software, making it hard to compare, sort, or even calculate the length of a string of text characters.

As computing became internationalized, ASCII wasn't enough.  Accented characters and other types of punctuation marks (and even a few additional letters and symbols) were needed for European languages.  Also, the (then) new dot-matrix printers were capable of printing a wider range of characters, but ASCII had no code points for them.

(Adapted from The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky.)

ASCII only defines code points from 0 to 127, but stores each one in a byte (8 binary digits).  Each byte can hold numbers (code points) from 0 to 255, so the extra 128 numbers could be used to represent extra characters.  The problem was deciding which code points to use for which characters.  Everyone had a different answer, and there were many incompatible encoding schemes in use!  Thus a text document written on one type of computer or word processor (with some particular encoding) may or may not be readable on another computer, with another program, or on a printer.

In the 1980s, when the IBM-PC was produced, it had something that came to be known as the OEM character set, which provided some accented characters for European languages, extra symbols, and a bunch of line-drawing characters (horizontal and vertical lines, arrows, etc.).  These characters were assigned numbers between 128 and 255, so they didn't affect ASCII and yet still fit into a single byte.  When PCs became widespread outside of the United States, all kinds of different OEM character sets were dreamed up, all of which used the top 128 characters differently.  For example, on some PCs the character code 130 would display as é, but on computers sold in Israel it was the Hebrew letter Gimel (ג), so when Americans sent their résumés to Israel they would arrive as rגsumגs.  In some cases, such as Russian, there were lots of different ideas of what to do with the upper 128 characters, so you couldn't even reliably interchange Russian documents.
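
To make the problem concrete, here is a minimal sketch in Java showing the same single byte decoded under three different legacy code pages.  (The IBM437 and IBM862 charsets ship with most JDKs but are technically optional extended charsets, so their availability is an assumption.)

    import java.nio.charset.Charset;

    public class CodePageDemo {
        public static void main(String[] args) {
            byte[] data = { (byte) 130 };   // a single byte with the value 130 (0x82)

            // The same byte means different characters under different code pages:
            System.out.println(new String(data, Charset.forName("IBM437")));       // é  (original IBM PC set)
            System.out.println(new String(data, Charset.forName("IBM862")));       // ג  (Hebrew Gimel)
            System.out.println(new String(data, Charset.forName("windows-1252"))); // ‚  (yet another character)
        }
    }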

Eventually these extra OEM characters (beyond ASCII) became standardized in the ANSI standard.  In this standard, numbers below 128 were ASCII.  The values from 128 up were standardized into a large number of different code pages.  (The original IBM OEM characters became known as code page 437, or CP-437, or even DOS characters.)  So, for example, in Israel DOS used a code page called 862, while Greek users used 737.  They were the same below 128 but different from 128 up.  The national versions of MS-DOS had dozens of these code pages, handling everything from English to Icelandic, and there were even a few multilingual code pages that could do Esperanto and Galician on the same computer.  But getting, say, Hebrew and Greek on the same computer was a complete impossibility unless you wrote your own custom program that displayed everything using bitmapped graphics, because Hebrew and Greek required different code pages with different interpretations of the high numbers.

Microsoft supports many different encodings, but with no automatic conversion between them: Windows 98 used CP-1252 or CP-437 or some other encoding, variously known as DOS Text or ANSI or some similar name.  In other countries completely different encodings are used, especially where the alphabet is different (e.g., in Japan or Russia).

An expanded encoding was standardized (along with others) by the ISO and is today known as ISO Latin I (ISO-8859-1).  Many text files are compatible with this encoding and most printers support the characters defined.  Note the code points from 0 to 127 are the same for ASCII as for Latin I.  All the additional characters are assigned code points in the range 128 to 255.  Thus an ASCII file is also a legal Latin I file.

While adequate for most western alphabets (those based on the Latin script), ISO Latin I is still limited: very few math, science, business, or other symbols are present, which limits its usefulness for magazines and other publishing.  And the global Internet reaches beyond Europe, to countries that use different alphabets or a non-alphabetic script.

Ultimately a single encoding, used universally by everyone, is needed that supports characters for all languages (plus a rich set of symbols).  Various national organizations and businesses got together and defined such an encoding.  This list of characters is called the Unicode standard.  This encoding defines well over 100,000 code points, but the first 256 code points are the same as for Latin I.  This encoding was standardized by the ISO as ISO-10646.  The set of characters that are assigned code points is referred to as the Universal Character Set (or UCS).

ISO-10646 came first but didn't define as many code points as needed.  Since the year 2000, Unicode and ISO-10646 have been developed in tandem so that they match exactly.

Every so often characters from a new script are assigned code points and a new version is produced.  Currently (2009) Unicode 5.1 is the latest version.  Well over 1 million code points are reserved by Unicode and ISO-10646, but many are unassigned to allow room for future growth.

Besides assigning (encoding) characters from various scripts to code points, Unicode defines the properties of each character: whether it is a letter, a digit, etc.

Rather than store or transmit each code point as a 4-byte number, code points are translated to code units using a method (confusingly) also known as an encoding.  (Technically this is called a character encoding scheme, while translating characters to code points is just an encoding scheme.  But everyone calls both encoding.)

Unicode always uses hexadecimal numbers and writes them with the prefix U+, so the numerical code point for the character A is written as U+0041.  (In many programming languages, Unicode values are written in strings using the prefix \u, as in \u0041.)
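
As a quick illustration of the notation, here is a tiny Java sketch showing that the escape \u0041 and the character A are the same code point:

    public class CodePointDemo {
        public static void main(String[] args) {
            char a = '\u0041';                          // the escape \u0041 is the letter A
            System.out.println(a);                      // prints: A
            System.out.println((int) 'A');              // prints: 65 (0x41 in hex)
            System.out.printf("U+%04X%n", (int) 'A');   // prints: U+0041
        }
    }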

While a code point is just a number, the encoded form of a character has a variable length: the encoding (of Unicode code points to code units) uses one byte for some characters and two or more bytes for others.  (The more common characters use one byte each, naturally.)

The most common scripts had their characters assigned code points from 0 to 64K (0xFFFF).  Each of these code points can be represented with exactly two bytes.  This is a common enough situation that this set is known as the Basic Multilingual Plane (or BMP).  Supplementary characters are those with code points outside the BMP, in the range U+10000 to U+10FFFF.
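
A short Java sketch makes the distinction visible; the musical G clef symbol (U+1D11E) is used here only as a handy example of a supplementary character:

    public class SupplementaryDemo {
        public static void main(String[] args) {
            String gClef = new String(Character.toChars(0x1D11E));  // U+1D11E, outside the BMP

            System.out.println(Character.isSupplementaryCodePoint(0x1D11E));  // true
            System.out.println(Character.isSupplementaryCodePoint(0x0041));   // false (A is in the BMP)
            System.out.println(gClef.length());                           // 2 (two 16-bit char units)
            System.out.println(gClef.codePointCount(0, gClef.length()));  // 1 (but only one character)
        }
    }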

If a document or web page uses fewer than 256 different characters, it could encode the text (the code points) using one byte per code unit (and thus each code unit is one code point, the number representing one character).  While the most commonly used code units are bytes, 16-bit (2-byte) or 32-bit (4-byte) integers can also be used for internal processing.  There are a number of such multi-byte character encoding schemes popular today.  UTF-32, UTF-16, and UTF-8 are character encoding schemes for the coded character set of the Unicode standard.  But ASCII, Latin I, Big5 (Traditional Chinese), and other encodings covering only a subset of Unicode's characters are still common.

Unicode files sometimes begin with a special sequence of bytes known as the byte order mark, or BOM.  This can be used to identify the endianness of text, indicate that the data is Unicode text, and even identify the character encoding scheme used.  (See the BOM FAQ.)  Note that in general it isn't possible to determine the encoding scheme used; a BOM can help in many cases, but it will confuse some software (some web browsers will try to display the BOM, for example).
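
Here is a minimal Java sketch that checks whether a file starts with a UTF-8 BOM (the three bytes EF BB BF); the file name example.txt is just a placeholder:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class BomCheck {
        public static void main(String[] args) throws IOException {
            byte[] b = Files.readAllBytes(Paths.get("example.txt"));  // placeholder file name

            // A UTF-8 BOM is EF BB BF; UTF-16 files instead start with FE FF or FF FE.
            boolean utf8Bom = b.length >= 3
                    && (b[0] & 0xFF) == 0xEF && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF;
            System.out.println("UTF-8 BOM present: " + utf8Bom);
        }
    }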

Some common (character) encoding schemes are:

UTF-32 simply represents each Unicode code point as the 32-bit integer of the same value.  It's clearly the most convenient representation for internal processing, but uses significantly more memory than necessary if used as a general text representation.

UTF-16 uses sequences of either one or two unsigned 16-bit code units to encode Unicode code points.  Values U+0000 to U+FFFF are encoded in one 16-bit unit with the same value.  Supplementary characters are encoded using two code units, the first from the high-surrogates range (U+D800 to U+DBFF), the second from the low-surrogates range (U+DC00 to U+DFFF).  The values U+D800 to U+DFFF are reserved for use in UTF-16; no characters are assigned to them as code points.  This means software can tell for each individual code unit in a string of text whether it represents a one-unit character or whether it is the first or second unit of a two-unit character.  This is a significant improvement over some traditional multi-byte character encodings, where the byte value 0x41 could mean the letter A or be the second byte of a two-byte character.  (UTF-16 is a common native encoding, and is what is used by the Java char type and by many other programming languages.)
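
The surrogate mechanism is easy to see from Java, whose strings are UTF-16 internally; the emoji U+1F600 is used here purely as an example of a supplementary character:

    public class SurrogateDemo {
        public static void main(String[] args) {
            String s = new String(Character.toChars(0x1F600));  // U+1F600, a supplementary code point

            // One character, but two UTF-16 code units (a surrogate pair):
            System.out.printf("U+%04X%n", (int) s.charAt(0));   // U+D83D (high surrogate)
            System.out.printf("U+%04X%n", (int) s.charAt(1));   // U+DE00 (low surrogate)
            System.out.println(Character.isHighSurrogate(s.charAt(0)));  // true
            System.out.println(Character.isLowSurrogate(s.charAt(1)));   // true
        }
    }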

The encoding of BMP characters (code points) only, using one 2-byte code unit each, is called UCS-2.  Until the year 2000 there were no supplementary code points defined in Unicode or ISO-10646.  According to Wikipedia, the People's Republic of China (PRC) ruled in 2000 that all computer systems sold in its jurisdiction would have to support the GB18030 character encoding scheme.  This required computer systems intended for sale in the PRC to handle characters beyond the BMP, so UCS-2 was no longer sufficient.

UCS-2 was later extended to permit one 2-byte code unit for characters in the BMP, and two 2-byte code units for the others.  This popular encoding is known as UTF-16 and is used internally by most operating systems today.  Microsoft calls this the Unicode encoding.  The two code units that together represent one supplementary code point in UTF-16 are called a surrogate pair.

UTF-8 uses sequences of one to four bytes to encode Unicode code points.  Code points U+0000 to U+007F are encoded in one byte, U+0080 to U+07FF in two bytes, U+0800 to U+FFFF in three bytes, and U+10000 to U+10FFFF in four bytes.  UTF-8 is designed so that the byte values 0x00 to 0x7F always represent code points U+0000 to U+007F (the Basic Latin block, which corresponds to the ASCII character set).  These byte values never occur in the representation of other code points, a characteristic that makes UTF-8 convenient to use in software that assigns special meanings to certain ASCII characters.
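
A small Java sketch shows how many bytes UTF-8 uses for characters in each of these ranges (assuming the source file itself is saved and compiled as UTF-8):

    import java.nio.charset.StandardCharsets;

    public class Utf8LengthDemo {
        public static void main(String[] args) {
            System.out.println("A".getBytes(StandardCharsets.UTF_8).length);             // 1 (U+0041)
            System.out.println("é".getBytes(StandardCharsets.UTF_8).length);             // 2 (U+00E9)
            System.out.println("€".getBytes(StandardCharsets.UTF_8).length);             // 3 (U+20AC)
            System.out.println("\uD83D\uDE00".getBytes(StandardCharsets.UTF_8).length);  // 4 (U+1F600)
        }
    }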

There is no general way to mark a text file as using a certain encoding.  You just have to know the encoding used for any text.  To help with this, every operating system has a default or native encoding.  When reading text files (or standard input) you need to translate from the native encoding to your program's internal encoding, typically UTF-16.  When writing output to a text file, a translation from the internal encoding to the native encoding is needed.  (Otherwise things like the copyright symbol end up looking funny.)  Of course most software allows the program to specify the character encoding to use when reading or writing files, so you don't need to use the native encoding.  (I generally prefer UTF-8.)  In addition, XML files, web pages, and some other types of text documents can specify the encoding that was used in a header (special data at the top of the document).
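
For example, here is a minimal Java sketch that reads and writes text files with an explicitly specified encoding rather than the platform's native one; the file names input.txt and output.txt are just placeholders:

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.List;

    public class EncodingIO {
        public static void main(String[] args) throws IOException {
            Path in  = Paths.get("input.txt");    // placeholder file names
            Path out = Paths.get("output.txt");

            // Read as UTF-8 (not the platform default), then write it back out, again stating UTF-8 explicitly.
            List<String> lines = Files.readAllLines(in, StandardCharsets.UTF_8);
            Files.write(out, lines, StandardCharsets.UTF_8);
        }
    }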

For any computer you can easily enter the characters represented by keys on the keyboard.  But how do you enter accented letters or other characters that don't have keys of their own?  On Windows you can enter these using the ALT key and the numeric keypad digits.  Holding down the ALT key, you then type the code point number (in decimal) using the keypad.  For example, to enter é you would type ALT+130, and for ¿ you would type ALT+168.

Some applications have a short-cut way to enter characters with accents.  In Microsoft Word for example you can type CTRL+' (control+apostrophe), and then some letter such as e or E.  For ü you can type CTRL+: and then u.  (Word and other applications support many such shortcuts.)  Keys such as CTRL+' are sometimes called dead keys since nothing appears immediately when you type; the key appears dead.

You can also look up the ALT number for any character using the Character Map accessory on Windows.  That also allows you to copy and then paste those characters.  In addition it will tell you the Unicode (hex) value for that character (which corresponds to the ALT number, except that the ALT number is in decimal, not hex).

Displaying characters: Fonts and Glyphs

The term font is vague and different people use it in different ways.  For now, assume a font is an array of tiny graphics created by an artist to share a particular look, each mapped to a letter, digit, or other character from some script.  (A set of glyphs designed with a particular, consistent look is known as a typeface.  For the real story search the Internet and visit Unicode.org.)  These graphics are called glyphs.  Glyphs are to characters what numerals are to numbers: a visual representation of an abstract concept (e.g., the letter A).  Many different glyphs can represent the same character; they just look different.  (The Latin script's capital letter A might look like any of the following, which are all different glyphs for this one character: A, A, A, and A.)  Even more confusing is the fact that different characters from different scripts may have similar or identical glyphs, for example O (Latin/Roman capital Oh), 0 (zero), Օ (Armenian capital Oh), and ◯ (Circle).

As a final confusion, not all glyphs represent single characters from some script.  I'm not talking here about dingbats or other non-alphabetic characters.  In an effort to improve the appearance of written text, some sequences of characters are represented by a single glyph, called a ligature.  For Latin script letters this is commonly done for the lowercase letter f when followed by certain other letters: waffles can be written as waﬄes, and file as ﬁle.

Type designers knew some things about how humans read text, and devised serif fonts, whose letter shapes are composed of lines (or strokes) of varying thickness with small extra bits (the serifs) on the ends of the strokes.  Text in such fonts is much easier to read, and pretty much all books and magazines use serif fonts for body text.  (As should you!)

Text without the extra bits, and often drawn with lines of constant thickness, is called sans-serif (sans is French for without); such fonts are used for attention-grabbing text such as headings and captions.

In the early computer era, a single screen font was usually built into terminals.  Printers were based on daisy-wheel or line-printer technology that, again, supported a single font.  These early computer screens and printers were limited to drawing each character in the same sized rectangular block.  Such fonts are called mono-spaced since all characters take up the same amount of horizontal space.  This leads to an uneven appearance, as fat letters such as 'm' take the same space as skinny letters such as 'i'.  As the technology grew more sophisticated, computers and printers became capable of displaying traditional fonts, called proportional.  In these fonts each character takes only as much horizontal space as it needs, giving the text an even appearance.  (Are you reading this in a mono-spaced or proportional font?  Look at this to decide:  MMMMiiii.)

So a font can be either proportional or mono-spaced.  It can have serifs or be sans-serif.  That's four possibilities, but fonts can have other attributes as well, such as the heaviness of the strokes (e.g., bold) or whether the letters are upright (roman) or slanted (italic).  There are actually many attributes that define fonts.
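
In Java's AWT API, for instance, a few of these attributes can be combined when requesting a font; this is just a small sketch using the logical "Serif" family:

    import java.awt.Font;

    public class FontAttributes {
        public static void main(String[] args) {
            Font plain      = new Font("Serif", Font.PLAIN, 12);         // 12-point upright serif font
            Font boldItalic = plain.deriveFont(Font.BOLD | Font.ITALIC); // same family, different attributes

            System.out.println(plain.getFontName());
            System.out.println(boldItalic.isBold() + " " + boldItalic.isItalic());  // true true
        }
    }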

OpenType fonts carry a classification system known as PANOSE to describe font characteristics, and CSS font properties specify them numerically.  For example, the weight of a font can be specified in CSS as one of these 9 values (from lightest to heaviest): 100, 200, 300, 400, 500, 600, 700, 800, 900.  400 usually corresponds to a font's normal weight (and 700 to bold), but there is no standard mapping of vendor terms such as demi or book to these numbers.

There are a couple of other issues you should know about.  One is that the size of fonts is usually measured in points, which are about 1/72 of an inch.  This unit worked well with early font technology since dot-matrix printers and computer monitors had 72 pixels to the inch.  (Horizontally, anyway; monitors often use rectangular pixels that are taller than they are wide.)  A shortcut was taken for fonts where the font designers assumed 1 point = 1 pixel.  Today's monitors use much smaller pixels, spaced closer together; the number of pixels per inch is called the monitor's DPI (dots per inch).  This is why, when you increase a monitor's resolution, most fonts come out looking tiny.  Some software is smart enough to correct for that.

Another issue is that different font files store the glyph data in different formats.  Your software must be able to read the format or it can't use the font.  Some common formats include TrueType, OpenType, and PostScript Type 1 fonts.  (FreeType is a popular library for reading and rendering such font files; web pages are also gaining support for a new font-packaging format called WOFF (Web Open Font Format).)

Microsoft created a set of fonts that it hoped would be widely distributed with all operating systems.  Known as the core web fonts, these are included with Windows and Mac OS X, and they are freely downloadable for Linux.  The collection includes 10 typefaces:  the popular Verdana and Georgia, reworked versions of Times and Courier, Trebuchet MS, Andale Mono (which has distinctive glyphs for commonly confused letters such as oh and zero), Impact, the Helvetica-esque Arial, the Webdings dingbat font, and the seldom-used Comic Sans.  Besides these, the Java runtime includes the Lucida family of fonts, making them available on any system with Java installed.

These typefaces were specifically designed for screen use and have since become the most commonly used typefaces on the Web.  While quite serviceable, such a small set of fonts is limiting to designers.  Newer web browsers support downloadable fonts using CSS or JavaScript, such as those from openfontlibrary.fontly.org.

To take advantage of downloadable fonts, Google has released a large number of great-looking fonts you can easily use in any web page.  (I've used several in this document.)  See the Google Font Directory; its getting-started guide gives a good introduction to using these fonts.

Not all font files use Unicode to map characters to glyphs.  (The ones that do are usually called Unicode fonts.)  For these reasons you may install some font and find that some software can't use it, even though other software on your system can.

No readily available font includes all Unicode characters and symbols.  About the most complete font available is Arial Unicode MS, which includes glyphs for over 50,000 Unicode characters.  (That's still only a fraction of the characters defined!)
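
In Java you can ask whether a given font actually has a glyph for a particular character; this sketch assumes Arial Unicode MS is installed (if it isn't, AWT silently substitutes a default font):

    import java.awt.Font;

    public class GlyphCoverage {
        public static void main(String[] args) {
            Font font = new Font("Arial Unicode MS", Font.PLAIN, 12);  // assumes this font is installed

            System.out.println(font.canDisplay('A'));             // does the font have a glyph for A?
            System.out.println(font.canDisplay(0x1D11E));         // ...and for a supplementary character?
            System.out.println(font.canDisplayUpTo("résumé ג"));  // -1 if every character has a glyph
        }
    }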

There is a whole lot more to the story including ligatures, kerning, leading, and other fascinating (to me anyway) facts and history.  (Did you know that originally printers (human ones) traveled with cases containing little wooden (or soft metal) font blocks?  The capital letters were used much less often than the others and were stored in the top or upper part of the case while the rest were kept in the more convenient lower part of the case, and that's how we got the terms lowercase and uppercase letters.  Is that interesting or what?)

From spoken language to a written script of characters, to encoding, to fonts: a long journey.  But this is just the start!  Every language has rules for how to combine the characters in a script into meaningful words or symbols.  (For example, is text written left-to-right, then top-to-bottom; right-to-left, then top-to-bottom; or top-to-bottom, then right-to-left?)  Every language has rules for capitalization, sort order of text, where lines can be broken, etc.

Every culture has rules for using the language, including such items as how to represent numbers, times, dates, names, addresses, and other data.  It's not sufficient to just use Unicode and UTF-8; to have a program that can be used by everyone world-wide, you must represent the strings of text (the messages) in each supported language.  You must determine the language and cultural aspects of the user and format dates, numbers, etc., correctly for them.
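
As a small illustration, here is a Java sketch formatting the same number and date for two different locales (the locales chosen are arbitrary examples):

    import java.text.DateFormat;
    import java.text.NumberFormat;
    import java.util.Date;
    import java.util.Locale;

    public class LocaleDemo {
        public static void main(String[] args) {
            double n = 1234567.89;
            Date now = new Date();

            // The same number and date, formatted according to two different locales:
            System.out.println(NumberFormat.getInstance(Locale.US).format(n));       // 1,234,567.89
            System.out.println(NumberFormat.getInstance(Locale.GERMANY).format(n));  // 1.234.567,89
            System.out.println(DateFormat.getDateInstance(DateFormat.LONG, Locale.US).format(now));
            System.out.println(DateFormat.getDateInstance(DateFormat.LONG, Locale.FRANCE).format(now));
        }
    }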

This is the subject of internationalization and localization, often abbreviated as I18N and L10N (because in English, the word Internationalization is the letter I, then 18 more letters, then an n; and similarly for localization.)  More information on this topic for Java programmers can be found at Locales and I18N Text (draft).