Originally written language was created to record spoken language. A script (or writing system or character set) in this context is a means of doing so. Note a single script can be used to represent many languages; for example the Latin (or Roman) script is the set of common symbols used to represent English, French, and many other languages. The symbols in a script are called characters and can represent either sounds or ideas. For example the characters of the Latin script are the 26 upper and lower case letters (plus some accent marks rarely used in English) that represent individual sounds (vowels and consonants), plus some other characters such as the digits that represent ideas. Text is a sequence of characters/symbols from some script. (Text can include characters from several scripts, and even represent sounds and ideas from different languages, but that is rare.)
Although you can represent any language with the Latin script (that is what commonly happens in chat rooms, email, and other on-line communications), it is preferable to be able to communicate in a language using its native script (the script designed for that language). But placing non-English material online is a complex problem; computers were initially designed with English in mind. Some of the different types of scripts include:
Scripts in which vowel marks are used to indicate which vowels follow the consonant are called syllabic alphabets (also known as abugidas or alphasyllabaries). Closely related are abjads, or consonant alphabets, in which mainly the consonants are written and vowel marks are optional. These are true alphabets in the sense that consonants and vowels are independently written, but the vowels and consonants combine to form complex written characters. Examples include Hindi (Devanagari), Thai, Tibetan, Korean Hangul, Hebrew, and many others.
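In logographic scripts, characters represent ideas or words rather than sounds, so the same character can be pronounced differently in different languages. (For example, the digit 1 can be pronounced as "one" in English, "uno" in Spanish, "un" in French, "ein" in German, and so on.) Japanese Kanji (based on Chinese) is one example. The majority of characters in the Chinese script are semanto-phonetic compounds: they include a semantic element, which represents or hints at their meaning, and a phonetic element, which shows or hints at their pronunciation.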
No matter what the writing system used, when text is stored on a computer it is stored as a series of numbers. The translation from a character to a number (and back again) is called encoding. Every unique character from every supported script is assigned a unique number, called its code point. Early computers encoded text in one of two ways, EBCDIC or ASCII. (Eventually ASCII became dominant.) These encoding schemes assigned numbers to each upper and lower case letter, each digit, and some punctuation marks and other symbols. (In addition some numbers are assigned special meaning to control the Teletype machine, the most popular early interactive input/output system. Those numbers are called control characters.)
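The character-to-number mapping described above can be tried directly. This is a minimal sketch using Python's built-in `ord` and `chr`; the sample string and the variable names are only illustrative.

```python
# Map characters to their numeric code points and back again.
text = "Hi!"
code_points = [ord(ch) for ch in text]              # character -> number
round_trip = "".join(chr(n) for n in code_points)   # number -> character

# Control characters are ordinary numbers too; 7 is the ASCII "bell".
bell = chr(7)

print(code_points)
print(round_trip == text)
```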
While fine for English-speaking computer programmers, ASCII did not support other languages well, due to a lack of accented characters and math, business, and engineering symbols. Strange rules were needed to represent accented letters; for example an 'E' character followed by a '"' character would be combined into a single Ë character. This complicates text processing software, making it hard to compare, sort, or even calculate the length of a string of text characters.
As computers became internationalized ASCII wasn't enough. Accented characters and other types of punctuation marks (and even a few additional letters and symbols) were needed for European languages. Also, the (then) new dot-matrix printers were capable of printing a wider range of characters but were limited to ASCII.
(Adapted from The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky.)
ASCII only defines code points from 0 to 127, but stores each one in a byte (8 binary digits). Each byte can hold numbers (code points) from 0 to 255, so the extra 128 numbers could be used to represent extra characters. The problem is which code points to use for which character. Everyone had a different answer, and there were many incompatible encoding schemes in use! Thus a text document written on one type of computer or word processor (with some type of encoding) may or may not be readable on another computer, or with another program or printer.
In the 1980s, when the IBM-PC was produced, it had something that came to be known as the OEM character set which provided some accented characters for European languages, extra symbols, and a bunch of line drawing characters (horizontal and vertical lines, arrows, etc.). These characters were assigned numbers between 128 and 255, so they didn't affect ASCII and yet still fit into a single byte. When PCs became widespread outside of the United States all kinds of different OEM character sets were dreamed up, which all used the top 128 characters differently. For example on some PCs the character code 130 would display as é, but on computers sold in Israel it was the Hebrew letter Gimel (ג), so when Americans would send their résumés to Israel they would arrive as rגsumגs. In cases such as with Russian, there were lots of different ideas of what to do with the upper-128 characters, so you couldn't even reliably interchange Russian documents.
Eventually these extra OEM characters (beyond ASCII) became standardized in the ANSI standard. In this standard, numbers below 128 were ASCII. The values from 128 up were standardized into a large number of different code pages. (The original IBM OEM characters became known as code page 437, or CP-437, or even DOS characters.) So for example in Israel, DOS used a code page called 862 while Greek users used 737. They were the same below 128 but different from 128 up. The national versions of MS-DOS had dozens of these code pages, handling everything from English to Icelandic, and they even had a few multilingual code pages that could do Esperanto and Galician on the same computer. But getting, say, Hebrew and Greek on the same computer was a complete impossibility unless you wrote your own custom program that displayed everything using bitmapped graphics, because Hebrew and Greek required different code pages with different interpretations of the high numbers.
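The code page problem is easy to reproduce. The sketch below assumes Python's standard `cp437` and `cp862` codecs as stand-ins for the DOS code pages mentioned above; it decodes the same byte value under each and then replays the résumé example.

```python
import unicodedata

raw = bytes([130])  # one byte with the value 130, above the ASCII range

as_cp437 = raw.decode("cp437")  # original IBM PC code page
as_cp862 = raw.decode("cp862")  # Hebrew DOS code page

# The same byte names two unrelated characters.
print(unicodedata.name(as_cp437))
print(unicodedata.name(as_cp862))

# Text written with one code page and read with another is silently garbled.
resume_bytes = "r\u00e9sum\u00e9s".encode("cp437")   # "résumés"
garbled = resume_bytes.decode("cp862")
```

Only the bytes above 127 change meaning; the ASCII letters r, s, u, and m survive the trip intact.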
Microsoft supports many different encodings, but with no automatic conversion between them: Windows 98 uses CP-1252 or CP-437 or some other encoding, all known as DOS Text or ANSI or some such similar name. In other countries completely different encodings are used, especially where the alphabet is different (e.g., Japan or Russia).
An expanded encoding was standardized (along with others) by the ISO and is today known as ISO Latin I (ISO-8859-1). Many text files are compatible with this encoding and most printers support the characters defined. Note the code points from 0 to 127 are the same for ASCII as for Latin I. All the additional characters are assigned code points in the range 128 to 255. Thus an ASCII file is also a legal Latin I file.
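A short sketch of that compatibility, assuming Python's `ascii` and `latin-1` codecs: bytes produced as ASCII decode unchanged as Latin I, while an accented letter needs a value above 127 and so cannot be stored as ASCII at all. The sample strings are only illustrative.

```python
# Every ASCII file is already a valid Latin I file.
ascii_bytes = "Plain text".encode("ascii")
same_text = ascii_bytes.decode("latin-1")

# The reverse is not true: an accented letter lives in the range 128 to 255.
accented = "caf\u00e9"                      # "café"; é is code point 233
latin1_bytes = accented.encode("latin-1")   # one byte per character

try:
    accented.encode("ascii")
    fits_in_ascii = True
except UnicodeEncodeError:
    fits_in_ascii = False
```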
While adequate for most western alphabets (those based on the Latin script), ISO Latin I is still limited: Very few math, science, business, or other symbols are present which limits the use for magazines and other publishing. The global Internet reaches beyond Europe to countries that use different alphabets or use a non-alphabetic script.
Ultimately what is needed is a single encoding, used by everyone, that supports characters for all languages (plus a rich set of symbols). Various national organizations and businesses got together and defined such an encoding. This list of characters is called the Unicode standard. This encoding defines well over 100,000 code points, but the first 256 code points are the same as for Latin I. This encoding was standardized by the ISO as ISO-10646. The set of characters that are assigned code points is referred to as the Universal Character Set (or UCS).
ISO-10646 came first but didn't define as many code points as needed. Since the year 2000 Unicode and ISO-10646 have been developed in tandem so they match exactly.
Every so often characters from a new script are assigned code points and a new version is produced. Currently (2009) Unicode 5.1 is the latest version. Well over 1 million code points are reserved by Unicode and ISO-10646, but many are unassigned to allow room for future growth.
Besides assigning (encoding) characters from various scripts to code points, Unicode defines the properties of each: a letter, a digit, etc.
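These per-character properties can be inspected with Python's standard `unicodedata` module. The sketch below looks up the general category of a few sample characters (the two-letter codes such as `Lu` for "letter, uppercase" come from the Unicode standard) and the numeric value of a non-Latin digit.

```python
import unicodedata

# General category of a few sample characters.
categories = {ch: unicodedata.category(ch) for ch in "Aa7$ "}

# Digits from other scripts carry the same "decimal digit" property.
arabic_three = "\u0663"                     # ARABIC-INDIC DIGIT THREE
arabic_category = unicodedata.category(arabic_three)
arabic_value = unicodedata.digit(arabic_three)

print(categories)
```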
Rather than store or transmit each code point as a 4 byte number, code points are translated to code units using a method (confusingly) also known as an encoding. (Technically this is called a character encoding scheme, while translating characters to code points is just an encoding scheme. But everyone calls both encoding.) Unicode always uses hexadecimal numbers and writes them with the prefix U+, so the numerical code point for the character A is written as U+0041. (In many programming languages, Unicode values are written in strings using the prefix \u, as in \u0041.)
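As an illustration of the two notations, the sketch below uses a hypothetical helper, `u_plus`, to format a character's code point in the U+ style, and shows that Python's `\u` escape names the same character.

```python
def u_plus(ch):
    """Format one character's code point in U+ notation (hypothetical helper)."""
    return f"U+{ord(ch):04X}"

# The \u escape in a string literal names a character by its code point.
letter_a = "\u0041"

print(u_plus("A"))
print(letter_a == "A")
```

Code points above U+FFFF need more than four hex digits, which Python writes with the longer `\U0001F600` form.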
While a code point is just a number, code units can have a variable length, with the encoding (of Unicode code points to code units) of some characters using one byte, some using two or more bytes. (The more common characters use one byte each, naturally.)
The most common scripts had their characters assigned code points from 0 to 65,535 (0xFFFF). Each of these code points can be represented with exactly two bytes. This is a common enough situation that this set is known as the Basic Multilingual Plane (or BMP). Supplementary characters are those with code points outside the BMP, in the range U+10000 to U+10FFFF.
If a document or web page uses fewer than 256 different characters
it could encode the text (the code points) using one byte per code unit
(and thus, each code unit is one code point, the number representing one
character).
While the most commonly used code units are bytes, 16-bit (2 byte)
or 32-bit (4-byte) integers can also be used for internal processing.
There are a number of such multi-byte character encoding schemes
popular today.
UTF-32, UTF-16, and UTF-8 are character encoding schemes for the coded character set of the Unicode standard. But ASCII, Latin I, Big5 (Taiwanese), and other sub-sets of Unicode are still common.
Unicode files sometimes begin with a special sequence of bytes known as the byte order mark, or BOM. This can be used to identify the endianness of text, indicate that the data is Unicode text, and even identify the character encoding scheme used. (See the BOM FAQ.) Note that in general it isn't possible to determine the encoding scheme used; a BOM can help in many cases but will confuse some software (some Web browsers will try to display the BOM, for example).
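A BOM check can be sketched with the constants in Python's standard `codecs` module. The function name `sniff_bom` and the table order are this example's own choices; the UTF-32 marks are tested first because the little-endian UTF-32 BOM begins with the same two bytes as the UTF-16 one.

```python
import codecs

# Longest marks first: BOM_UTF32_LE starts with the bytes of BOM_UTF16_LE.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(data):
    """Return (encoding name or None, data with any BOM removed)."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, data[len(bom):]
    return None, data   # no BOM: the encoding must be known some other way

sample = codecs.BOM_UTF16_BE + "hi".encode("utf-16-be")
encoding, payload = sniff_bom(sample)
print(encoding, payload.decode(encoding))
```

A missing BOM proves nothing, which is why the function returns `None` rather than guessing.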
Some common (character) encoding schemes are:
UTF-32 simply represents each Unicode code point as the 32-bit integer of the same value. It's clearly the most convenient representation for internal processing, but uses significantly more memory than necessary if used as a general text representation.
UTF-16 uses sequences of either one or two unsigned 16-bit code units to encode Unicode code points. Values U+0000 to U+FFFF are encoded in one 16-bit unit with the same value. Supplementary characters are encoded using two code units, the first from the high-surrogates range (U+D800 to U+DBFF), the second from the low-surrogates range (U+DC00 to U+DFFF). The values U+D800 to U+DFFF are reserved for use in UTF-16; no characters are assigned to them as code points. This means software can tell for each individual code unit in a string of text whether it represents a one-unit character or whether it is the first or second unit of a two-unit character. This is a significant improvement over some traditional multi-byte character encodings, where the byte value 0x41 could mean the letter A or be the second byte of a two-byte character. (UTF-16 is a common native encoding, and is what is used by Java's char type and by many other programming languages.)
The encoding of BMP characters (code points) only, as 2-byte code units, is called UCS-2. Until the year 2000 there were no supplementary code points defined in Unicode or ISO-10646. According to Wikipedia, the People's Republic of China (PRC) ruled in 2000 that all computer systems sold in its jurisdiction would have to support the GB18030 character encoding scheme. This required computer systems intended for sale in the PRC to move beyond the BMP, so UCS-2 wasn't sufficient anymore.
UCS-2 was later extended to permit one 2-byte code unit for characters in the BMP, and two 2-byte code units for the others. This popular encoding is known as UTF-16 and is used internally by most operating systems today. Microsoft calls this the Unicode encoding. The two code units that make up a supplementary code point in UTF-16 are called a surrogate pair.
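The surrogate arithmetic can be sketched as follows. The function names are this example's own; the ranges are the ones given above. A supplementary code point has 0x10000 subtracted from it, and the remaining 20 bits are split into two 10-bit halves.

```python
def to_surrogates(code_point):
    """Split a supplementary code point into a (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000          # 20 significant bits remain
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low

def classify(unit):
    """Tell what a single 16-bit code unit is, without looking at its neighbors."""
    if 0xD800 <= unit <= 0xDBFF:
        return "high surrogate"
    if 0xDC00 <= unit <= 0xDFFF:
        return "low surrogate"
    return "BMP character"

pair = to_surrogates(0x1F600)
print([hex(unit) for unit in pair])
```

The pair can be checked against Python's own `utf-16-be` encoder, which produces the same two code units for U+1F600.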
UTF-8 uses sequences of one to four bytes to encode Unicode code points. Code points U+0000 to U+007F are encoded in one byte, U+0080 to U+07FF in two bytes, U+0800 to U+FFFF in three bytes, and U+10000 to U+10FFFF in four bytes. UTF-8 is designed so that the byte values 0x00 to 0x7F always represent code points U+0000 to U+007F (the Basic Latin block, which corresponds to the ASCII character set). These byte values never occur in the representation of other code points, a characteristic that makes UTF-8 convenient to use in software that assigns special meanings to certain ASCII characters.
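The four length classes can be observed with Python's `utf-8` codec. The sample characters below are this example's own picks, one from each range.

```python
samples = [
    "A",            # U+0041, in the range U+0000 to U+007F
    "\u00e9",       # U+00E9, in the range U+0080 to U+07FF
    "\u20ac",       # U+20AC, in the range U+0800 to U+FFFF
    "\U0001F600",   # U+1F600, in the range U+10000 to U+10FFFF
]
lengths = [len(ch.encode("utf-8")) for ch in samples]

# Bytes of a multi-byte sequence never fall in the ASCII range 0x00 to 0x7F.
multi_byte = "\u20ac".encode("utf-8")
no_ascii_bytes = all(b >= 0x80 for b in multi_byte)

print(lengths)
```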
There is no way to mark a text file as using a certain encoding. You just have to know the encoding used for any text. To help with this, every operating system has a default or native encoding. When reading text files (or standard input) you need to translate the native encoding to your program's internal encoding, typically UTF-16. When writing output to a text file, a translation from the internal encoding to the native encoding is needed. (Or things like the copyright symbol end up funny looking.) Of course most software allows the program to specify the character encoding to use when reading or writing files, so you don't need to use the native encoding. (I generally prefer UTF-8.) In addition XML files, web pages, and some other types of text documents can specify the encoding that was used in a header (special data at the top of the document).
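The "funny looking" copyright symbol is easy to reproduce. This sketch writes a file as UTF-8 and reads it back twice, once with the right encoding and once pretending it is CP-1252; the file name and temporary directory are only illustrative.

```python
import os
import tempfile

text = "\u00a9 2009"                        # copyright sign, then a year
path = os.path.join(tempfile.mkdtemp(), "notice.txt")

# Name the encoding explicitly instead of relying on the native one.
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

with open(path, encoding="utf-8") as f:
    right = f.read()                        # decoded as it was written

with open(path, encoding="cp1252") as f:
    wrong = f.read()                        # same bytes, wrong assumption
```

The two UTF-8 bytes of the copyright sign are each read as a separate CP-1252 character, so a stray accented capital A appears in front of it.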
For any computer you can easily enter the characters represented by keys on the keyboard. But how do you enter accented letters or other characters that don't have keys of their own? On Windows you can enter these using the ALT key and the numeric keypad digits. Holding down the ALT key, you then type the character's number (in decimal) using the keypad. Numbers typed without a leading zero are looked up in the OEM code page (CP-437 on U.S. systems). For example, to enter é you would type ALT+130, and for ¿ you would type ALT+168.
Some applications have a short-cut way to enter characters with accents. In Microsoft Word for example you can type CTRL+' (control+apostrophe), and then some letter such as e or E. For ü you can type CTRL+: and then u. (Word and other applications support many such shortcuts.) Keys such as CTRL+' are sometimes called dead keys since nothing appears immediately when you type; the key appears dead.
You can also look up the ALT number for any character using the Character Map accessory on Windows. That also allows you to copy and then paste those characters. In addition it will tell you the Unicode (hex) value for that character. (For characters in the Latin I range this matches the ALT number typed with a leading zero, such as ALT+0233 for é, except that the ALT number is in decimal and not hex.)
The term font is vague and different people use it in different ways. For now assume a font is an array of tiny graphics, created by an artist to share a particular look and mapped to letters, digits, and other characters from some script. These graphics are called glyphs. (A set of glyphs designed with a particular, consistent look is known as a typeface. For the real story search the Internet and visit Unicode.org.) Glyphs are to characters what numerals are to numbers: a visual representation of an abstract concept (e.g. the letter A).
Many different glyphs can represent the same character; they just look different. (The Latin script's capital letter A might be drawn with or without serifs, upright or slanted, light or bold; these are all different glyphs for this one character.) Even more confusing is the fact that different characters from different scripts may have similar or identical glyphs, for example O (Latin/Roman capital Oh), 0 (zero), Օ (Armenian capital Oh), and ○ (circle).
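Although these four look nearly alike on screen, they are distinct characters with distinct code points. A quick way to see this, assuming Python's standard `unicodedata` module, is to ask for each one's official name; the characters are written as escapes here so the listing does not depend on the font used to display it.

```python
import unicodedata

lookalikes = ["O", "0", "\u0555", "\u25cb"]

# Same-looking glyphs, four different characters.
names = [unicodedata.name(ch) for ch in lookalikes]
code_points = [f"U+{ord(ch):04X}" for ch in lookalikes]

for cp, name in zip(code_points, names):
    print(cp, name)
```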
As a final confusion, not all glyphs represent characters from any script. I'm not talking here about ding-bats or other non-alphabetic characters. As an effort to improve the appearance of written text, some sequences of characters are represented by a single glyph (a ligature). For Latin script letters this is commonly done for the lowercase letter f when followed by certain other letters: waffles can be written as waﬄes, and file as ﬁle.
Type designers knew some things about how humans read text, and devised serif fonts, whose letter shapes are composed of lines (or strokes) of varying thickness with small extra bits on the ends of the strokes. Text in such fonts is much easier to read and pretty much all books and magazines use serif fonts for body text. (As should you!) Fonts without the extra bits, often drawn with lines of constant thickness, are called sans-serif (sans is French for without) and are used for attention-grabbing text such as headings and captions.
In the early computer era usually a single screen font was built into terminals. Printers were based on daisy-wheel or line-printer technology, which again supported a single font. These early computer screens and printers were limited to drawing each character in the same sized rectangular block. Such fonts are called mono-spaced since all characters take up the same amount of horizontal space. This leads to an uneven appearance as fat letters such as 'm' take the same space as skinny letters such as 'i'. As the technology grew more sophisticated computers and printers became capable of displaying traditional fonts called proportional. In these fonts the space between the characters is the same, giving the text an even appearance. (Are you reading this in a mono-spaced or proportional font? Look at this to decide: MMMMiiii.)
So a font can be either proportional or mono-spaced. It can have serifs or be sans-serif. That's four possibilities, but fonts can have other attributes such as heaviness of the strokes (e.g., bold) or if the letters are straight (roman) or slanted (italics). There are actually many attributes that define fonts. OpenType fonts and CSS font properties use a system known as PANOSE to specify font characteristics. For example, the weight of a font can be specified as one of these 9 values (from lightest to heaviest): 100, 200, 300, 400, 500, 600, 700, 800, 900. 400 usually corresponds to a font's normal weight, but there is no standard mapping of terms such as bold or demi to these numbers.
There are a couple of other issues you should know about. One is that the size of fonts is usually measured in points, which are about 1/72 of an inch. This unit worked well with early font technology since dot-matrix printers and computer monitors had 72 pixels to the inch. (Horizontally anyway; monitors often use rectangular pixels that are taller than they are wide.) A shortcut was taken for fonts where the font designers assumed 1 point = 1 pixel. Today's monitors can use much smaller pixels that are spaced closer together. The number of pixels per inch is called the monitor's DPI (dots per inch). This is why when you increase a monitor's resolution, most fonts come out looking tiny. Some software is smart enough to correct for that.
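The correction amounts to one line of arithmetic. This is a sketch assuming exactly 72 points to the inch; the function name `points_to_pixels` and the sample DPI values are this example's own.

```python
def points_to_pixels(points, dpi):
    """Pixels needed to draw text of the given point size at the given DPI."""
    return points * dpi / 72     # 72 points per inch (an approximation)

# The "1 point = 1 pixel" shortcut only holds on a 72 DPI display.
at_72 = points_to_pixels(12, 72)
at_96 = points_to_pixels(12, 96)
at_144 = points_to_pixels(12, 144)

print(at_72, at_96, at_144)
```

Software that draws a 12 point font with 12 pixels on a 144 DPI display produces text half the intended height.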
Another issue is that different font files store the glyph data in different formats. Your software must be able to read the format or it can't use the font. Some common formats include TrueType, OpenType, and PostScript Type 1 fonts. (There are other standards for font files such as FreeType; Web pages will get support in HTML 5 for a new font system called WOFF (Web Open Font Format).)
Microsoft created a set of fonts that it hoped would be widely distributed with all operating systems. Known as the core web fonts, these are included with Windows and Mac OS X, and they are freely downloadable for Linux. The collection includes 10 typefaces: the popular Verdana and Georgia, reworked versions of Times and Courier, Trebuchet MS, Andale Mono (has distinctive glyphs for commonly confused letters such as oh and zero), Impact, the Helvetica-esque Arial, the Webdings dingbat font, and the seldom-used Comic Sans. Besides these the Java runtime includes the Lucida family of fonts, making them available on any system with Java installed.
These typefaces were specifically designed for screen use and have since become the most commonly used typefaces on the Web. While quite serviceable, such a small set of fonts is limiting to designers. Newer web browsers support downloadable fonts using CSS or JavaScript, such as those from openfontlibrary.fontly.org.
To take advantage of downloadable fonts, Google has released a large number of great-looking fonts you can easily use in any web page. (I've used several in this document.) See the Google Font Directory; its getting started guide is a good introduction to using these fonts.
Not all font files use Unicode to translate characters to glyphs. (The ones that do are usually called Unicode Fonts.) For these reasons you may have installed some font and find that some software can't use it, even if other software on your system can.
No readily available fonts include all Unicode characters and symbols. About the most complete font available is Arial Unicode MS, which includes glyphs for over 50,000 Unicode characters. (That's still well short of the number of characters defined!)
There is a whole lot more to the story including ligatures, kerning, leading, and other fascinating (to me anyway) facts and history. (Did you know that originally printers (human ones) traveled with cases containing little wooden (or soft metal) font blocks? The capital letters were used much less often than the others and were stored in the top or upper part of the case while the rest were kept in the more convenient lower part of the case, and that's how we got the terms lowercase and uppercase letters. Is that interesting or what?)
From spoken language to a written script of characters, to encoding, to fonts: a long journey. But this is just the start! Every language has rules for how to combine the characters in a script into meaningful words or symbols. (For example, is text written left-to-right, then top-to-bottom, or right-to-left, then top-to-bottom, or top-to-bottom, then right-to-left?) Every language has rules for capitalization, sort order of text, where lines can be broken, etc.
Every culture has rules for using the language, including such items as how to represent numbers, times, dates, names, addresses, and other data. It's not sufficient to just use Unicode and UTF-8; to have a program that can be used by everyone world-wide, you must represent the strings of text (the messages) in each supported language. You must determine the language and cultural aspects of the user and format dates, numbers, etc., correctly for them. This is the subject of internationalization and localization, often abbreviated as I18N and L10N (because in English, the word Internationalization is the letter I, then 18 more letters, then an n; and similarly for localization).
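The counting behind those abbreviations can be checked in a few lines. The helper name `abbreviate` is this example's own, not a standard function.

```python
def abbreviate(word):
    """First letter, count of the letters in between, last letter."""
    middle = len(word) - 2
    return f"{word[0]}{middle}{word[-1]}".upper()

i18n = abbreviate("internationalization")   # 20 letters, 18 in the middle
l10n = abbreviate("localization")           # 12 letters, 10 in the middle

print(i18n, l10n)
```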
More information on this topic for Java programmers can be found at
Locales and I18N Text (draft).