Understanding Fonts

The term font is vague, and different people use it in different ways.  For now, assume a font is an array of tiny graphics created by an artist to share a particular look, and to map to letters, digits, and other characters.  (For the real story, search the Internet, visit Unicode.org, and see the javadoc API comments for java.awt.Font.)  These graphics are called glyphs.  Glyphs are to characters what numerals are to numbers: a visual representation of an abstract concept (e.g., the letter A).  Many different glyphs can represent the same character; they just look different.

Type designers have long known some things about how humans read text, and devised serif fonts: letter shapes composed of lines (or strokes) of varying thickness, with small extra lines on the ends of the strokes.  Text in such fonts is much easier to read, and nearly all books and magazines use serif fonts for body text.  (As should you!)

Fonts without the extra lines, often drawn with strokes of constant thickness, are called sans-serif (sans is French for without) and are used for attention-grabbing text such as headings and captions.

In the early computer era a single screen font was usually built into terminals.  Printers were based on daisy-wheel or line-printer technology, which again supported a single font.  These early computer screens and printers were limited to drawing each character in the same-sized rectangular block.  Such fonts are called mono-spaced, since all characters take up the same amount of horizontal space.  This leads to an uneven appearance, as fat letters such as 'm' take the same space as skinny letters such as 'i'.  As the technology grew more sophisticated, computers and printers became capable of displaying traditional fonts, called proportional.  In these fonts the space between the characters is the same, giving the text an even appearance.  (Are you reading this in a mono-spaced or proportional font?  Look at this to decide:  MMMMMMllllll.)

So a font can be either proportional or mono-spaced.  It can have serifs or be sans-serif.  That's four possibilities, but fonts can have other attributes such as heaviness of the strokes (e.g., bold) or if the letters are straight (roman) or slanted (italics).

Early Unix systems used X Window fonts named for each of 14 possible font attributes.  You could find a bold, 12 point (one point is roughly 1/72 of an inch) font by doing a directory listing with a wildcard pattern matching those attributes, which would turn up the matching font files.  Nowadays fonts have names such as Helvetica or Bookman DemiBold, which is much less helpful when searching.

When web designers or Java programmers set a font to use, they typically don't know what fonts are available on the user's system.  So if you guess at a font name such as Arial, it may or may not be available.  However, all home computers ship with a set of fonts standard for that platform.

To solve the problem for Java programmers, for every platform Sun supports they picked 3 available fonts (these are actual installed fonts, called physical fonts) and gave them names programmers can use in their programs.  These names are called logical font names, since the names reflect the type of the font: Serif, SansSerif, and Monospaced.

So in a program you can choose the system-specific mono-spaced font for code listings, sans-serif for headings and captions, and serif for body text, without needing to know the real (physical) name of the font.
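For example, a program can request all three logical fonts without knowing which physical fonts back them; a minimal sketch using the standard logical font names:

```java
import java.awt.Font;

public class LogicalFonts {
    public static void main( String[] args ) {
        Font body    = new Font( "Serif",      Font.PLAIN, 12 );  // body text
        Font heading = new Font( "SansSerif",  Font.BOLD,  18 );  // headings, captions
        Font code    = new Font( "Monospaced", Font.PLAIN, 12 );  // code listings
        // These print the logical names, not the physical fonts behind them:
        System.out.println( body.getFamily() + ", " + heading.getFamily()
                            + ", " + code.getFamily() );
    }
}
```

The same Font objects work anywhere a font is accepted, such as Component.setFont.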

Most home computer systems also have a distinct look and feel for pop-up dialog boxes and other system elements (e.g., window titles).  You can use these as well: Sun has kindly defined the system standard fonts for dialog boxes under the Dialog and DialogInput logical font names, so you can make your dialogs appear native.

Remember each font is a collection, an array or vector of graphics known as glyphs.  (It's more than that really.)  Each glyph is identified by a number.  For example a capital A glyph in any font that has such a glyph is identified by the number 65.  Of course anyone can make their own collection of glyphs as a font, and identify any glyph with any number.  For Java and most software today the mapping of numbers to glyphs is the one defined by the Unicode standard.  The numbers are called code points.
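You can see this character-to-number mapping directly in Java, where a char value is its Unicode code point:

```java
public class CodePointDemo {
    public static void main( String[] args ) {
        char c = 'A';
        System.out.println( (int) c );               // prints 65, the code point for capital A
        System.out.println( Character.getName(65) ); // prints LATIN CAPITAL LETTER A
    }
}
```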

Unicode has defined numbers for many thousands of glyphs, and it is unlikely you have a single font file with every glyph defined by Unicode.  This can be a problem, since you don't always know if the font you're using has a glyph for all the symbols, arrows, smiley faces, Greek letters, math and engineering symbols, etc., that you might want to use.  If that is a problem, you can find a font that can display the characters you need.  In Java, use code such as this:

 for (Iterator<Font> i = fontList.iterator(); i.hasNext(); ) {
    Font f = i.next();
    if ( ! f.canDisplay( '\u25B6' ) )
       i.remove();  // discard fonts lacking a glyph for this character
 }

(See UnicodeSymbols.java for a sample applet, with source, that does this.)  Note that a font will claim it can display a character if the character falls in the font's covered range, even if there is no glyph for it!
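To check a whole string rather than one character at a time, Font also provides canDisplayUpTo; a small sketch (the \u25B6 triangle is just an example character):

```java
import java.awt.Font;

public class CheckString {
    public static void main( String[] args ) {
        Font f = new Font( "Serif", Font.PLAIN, 12 );
        String text = "Press \u25B6 to play";
        int firstBad = f.canDisplayUpTo( text );  // index of first undisplayable char, or -1
        if ( firstBad == -1 )
            System.out.println( "Font can display the whole string" );
        else
            System.out.println( "No glyph for the character at index " + firstBad );
    }
}
```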

There are a couple of other issues you should know about.  One is that font sizes are usually defined in points, which should be about 1/72 of an inch.  This unit worked well with early font technology, since dot-matrix printers and computer monitors had 72 pixels to the inch.  (Horizontally, anyway; monitors often use rectangular pixels that are taller than they are wide.)  A shortcut was taken for fonts where the font designers assumed 1 point = 1 pixel.  Today's monitors use much smaller pixels (often 1/96 of an inch) that are spaced closer together.  This spacing is called the monitor's DPI (dots per inch), and is why most fonts come out looking tiny when you increase a monitor's resolution.  Some fonts are smart enough to correct for that, so when you specify a font size that's what you'll see.  For other fonts you (the programmer or web designer) must manually adjust for the different DPI of screens and even of printers.  (The Java AWT toolkit provides a method to find the DPI value, allowing a Java programmer to adjust manually if needed.)
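The AWT method in question is Toolkit.getScreenResolution(); a minimal sketch of adjusting a nominal point size by the reported DPI (note it throws HeadlessException if no display is attached):

```java
import java.awt.Toolkit;

public class DpiDemo {
    public static void main( String[] args ) {
        // Reported dots per inch of the default screen:
        int dpi = Toolkit.getDefaultToolkit().getScreenResolution();
        // On a 96 DPI screen a nominal 12 point font needs 16 pixels
        // to cover the intended 12/72 of an inch:
        float pixels = 12f * dpi / 72f;
        System.out.println( dpi + " DPI; 12 pt is about " + pixels + " pixels" );
    }
}
```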

Another point about using font names:  on some (older) systems you may not be able to pick a single physical font that contains all the Unicode characters, or even a significant number of them.  This is because many physical fonts only define 256 glyphs.  In this case the logical font name may actually refer to several physical fonts that are stitched together to define far more glyphs than any one standard font available on your system.  For example, with Java 6 on Windows XP, the fontconfig.properties file defines the logical font serif to use the physical font Times New Roman for the standard 256 alphabetic glyphs, and the font MS Mincho for any Japanese glyphs.  By using the logical font name Serif you can specify any Latin, Chinese, Hebrew, Japanese, or Korean glyph and it will display correctly, even though there isn't a single physical font containing all those glyphs in the Windows XP standard set of fonts.  If you used the physical font Times New Roman and your text contained the Unicode number for some Japanese glyph, it wouldn't display correctly.  (Usually the system displays a square or question mark in these cases.)

Still another issue is that different font files store the glyph data in different formats.  Your software must be able to read the format or it can't use the font.  Currently Java can read TrueType and OpenType font formats, but probably not other formats such as PostScript Type 1 fonts.  On the other hand, older Unix/Linux systems seem to recognize only PostScript fonts and not TrueType or OpenType.  (Today most systems can use OpenType.  See www.prepressure.com/fonts/basics/history for a quick but thorough history of computer font formats.)

In addition, not all font files use Unicode to label the glyphs, yet most software needs Unicode code points to identify the glyphs.  For this reason you may install some font and find that some software can't use it while other software can.

A final issue is one of encoding text.  Text is composed of a series of numbers that identify glyphs.  This text doesn't change if you change the font you use to render (i.e., draw) the glyphs.  The problem is that a Unicode number can take up to 4 bytes.  So how should the text string ABC be stored?  It might be 4 bytes per number, or two, or even one, with special rules to handle large numbers.

This is called the text encoding.  Internally Java uses two-byte numbers, with a special convention to represent characters that Unicode defines with numbers too large to fit in two bytes (65,536 and up).  This is why the Java data type char is a two-byte value.  However, most operating systems historically used one-byte numbers.  This is because before Unicode, most Western fonts contained fewer than 256 glyphs, more than enough for the Latin alphabet commonly used in the US and the UK.  When reading or writing text, a program must pick the proper encoding.
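A short sketch of how the same three-character string occupies different numbers of bytes under different encodings:

```java
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    public static void main( String[] args ) {
        String s = "ABC";
        // One byte per character in UTF-8 for ASCII text:
        System.out.println( s.getBytes( StandardCharsets.UTF_8 ).length );   // 3
        // Two bytes per character in UTF-16, plus a 2-byte byte-order mark:
        System.out.println( s.getBytes( StandardCharsets.UTF_16 ).length );  // 8
        // Decoding must use the same encoding the bytes were written with:
        System.out.println( new String( s.getBytes( StandardCharsets.UTF_8 ),
                                        StandardCharsets.UTF_8 ) );          // ABC
    }
}
```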

Unfortunately there is no easy way to tell what encoding to use, or what encoding is used.  If you use a web browser the web page (which is just text) contains a header stating what encoding is used.  Try changing that and see the results, especially for curly quotes and bullets.

Some Technical Terms Defined

A character is just an abstract minimal unit of text.  It doesn't have a fixed shape (that would be a glyph of some font), and it doesn't have an intrinsic value.  The letter A is a character, and so is € (the symbol for the common currency of Germany, France, and numerous other European countries).

A glyph is an element of writing.  Two or more glyphs may represent the same symbol, called a grapheme or character.  Glyphs may also be ligatures (compound characters) or diacritics (accent or other marks).

A character set is a collection of characters.  For example, the Han character set is the set of characters originally invented by the Chinese, which have been used to write Chinese, Japanese, Korean, and Vietnamese.  Other character sets you might have heard of include ASCII, ISO Latin 1 (also called by its number, ISO-8859-1), Unicode, or Microsoft's cp-1252.  A character set is often defined by the symbols used in some writing system (or script), such as English.

A code point is a number used in a character set to identify each character.  A character set with such numbers is called a coded character set.  Code point numbers are usually referred to simply as code points.  Note that a coded character set defines a range of code points but may not assign characters to every code point in that range.  (So if you used a for loop to generate all Unicode code points and displayed them, some would be undefined no matter what font you use.)  This assignment of numbers to characters is sometimes called an encoding, but that term has other uses.
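Java can tell you whether Unicode assigns a character to a code point, independent of any font; a small sketch (U+0378 is an unassigned code point in the Greek block):

```java
public class UndefinedDemo {
    public static void main( String[] args ) {
        System.out.println( Character.isDefined( 'A' ) );     // true: an assigned code point
        System.out.println( Character.isDefined( 0x0378 ) );  // false: in range, but unassigned
        // Count the assigned code points in the whole Unicode range:
        int defined = 0;
        for ( int cp = 0; cp <= Character.MAX_CODE_POINT; ++cp )
            if ( Character.isDefined( cp ) )
                ++defined;
        System.out.println( defined + " code points are defined in this JRE's Unicode version" );
    }
}
```

The count varies by JRE version, since each tracks a different Unicode release.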

A typeface is a design for a set of glyphs for one or more character sets in one or more sizes.  All the glyphs in a typeface are designed with stylistic unity.  Put another way, each typeface comprises a coordinated set of glyphs.  (For example Arial is a typeface.) 

A font is traditionally defined as a set of glyphs in a single size and style of a particular typeface, for some character set.

A font style usually refers to a characteristic of a font such as italics or bold.  In fact a typeface is a collection of fonts all with the same design; fonts are sold or distributed in sets named for the typeface; for example, Lucida is a set of eight fonts.  Because of this, a typeface today is usually called a font family instead.

After the introduction of computer fonts a broader definition of font evolved.  Today a font is no longer size-specific but still refers to a single style.  Today (and in the computer industry) the term font means the same thing as typeface.  For example you could today refer to Arial as a font or as a typeface, but few outside of the print industry use the term typeface anymore.  (The term font family is still commonly used.)

Early Unicode versions defined fewer than 65,000 characters and code points, so using an unsigned two-byte number for char worked well.  Unicode version 4 defined 96,382 characters and code points, the first version to require more than two bytes each.  The set of code points that still fit into a two-byte word is called the Basic Multilingual Plane (or BMP).  The others are called supplementary characters.  (Remember, this only applies to Unicode.)

Unicode is an evolving standard.  The current version (5.1 as of 2009) has defined 240,295 code points, of which 100,713 are assigned characters; the rest are private-use characters and non-characters (code points with specific meanings but which don't correspond to a character).

In a string of characters you need to record the Unicode code points.  However, few (or no) operating systems deal directly with Unicode's 4-byte code points.  Even Java only uses two-byte numbers.  A character encoding scheme (often just called an encoding) is used to translate a series of code points (that is, a string of text from the Unicode character set) into a series of code units of one to four bytes per code point.  For example, the UTF-32 character encoding scheme uses one 4-byte code unit per code point; basically this stores the raw code points.

The UTF-16 encoding scheme uses 2-byte code units.  It stores code points from the BMP as is.  The supplemental code points are translated into pairs of code units, whose values fall in the range 55,296 (0xD800) to 57,343 (0xDFFF), which Unicode reserves for this purpose.  (So when reading a file in UTF-16 you can look at any code unit and see whether it is a BMP code point or part of a two-code-unit (4-byte) value.)  Most common is the UTF-8 character encoding scheme, which uses 1-byte code units and represents each character with 1 to 4 code units.
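Java strings use UTF-16 internally, so a supplementary code point occupies two char code units (a surrogate pair); a sketch using U+1D11E, MUSICAL SYMBOL G CLEF:

```java
public class SurrogateDemo {
    public static void main( String[] args ) {
        // U+1D11E lies outside the BMP, so it needs a surrogate pair.
        String clef = new String( Character.toChars( 0x1D11E ) );
        System.out.println( clef.length() );                            // 2 UTF-16 code units...
        System.out.println( clef.codePointCount( 0, clef.length() ) );  // ...but 1 character
        System.out.printf( "%04X %04X%n",
                           (int) clef.charAt(0), (int) clef.charAt(1) );  // D834 DD1E
        System.out.println( Character.isHighSurrogate( clef.charAt(0) ) );  // true
    }
}
```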

Other character encoding schemes are common as well, such as ISO-8859-1.  (Note that encoding schemes often use the same name as a standard character set; this is not a coincidence!)

Microsoft created a set of fonts that it hoped would be widely distributed with all operating systems.  Known as the core web fonts, these are included with Windows and Mac OS X, and they are freely downloadable for Linux.  The collection includes 10 typefaces:  the popular Verdana and Georgia, reworked versions of Times and Courier, Trebuchet MS, Andale Mono (which has distinctive glyphs for commonly confused letters such as oh and zero), Impact, the Helvetica-esque Arial, the Webdings dingbat font, and the seldom-used Comic Sans.  Besides these, the JRE includes the Lucida family of fonts.

These typefaces were specifically designed for screen use and have since become the most commonly used typefaces on the Web.  While quite serviceable, such a small set of fonts is limiting to designers.  Newer web browsers support downloadable fonts using CSS or JavaScript, such as those from openfontlibrary.fontly.org.

There is a whole lot more to the story, including ligatures, kerning, leading, and other fascinating (to me anyway) facts and history.  (Did you know that originally printers (human ones) traveled with cases containing little wooden or lead font blocks?  The capital letters were used much less often than the others and were stored in the top or upper part of the case, while the rest were kept in the more convenient lower part of the case, and that's how we got the terms lowercase and uppercase letters.  Is that interesting or what?)

Using Fonts in Java

Modern AWT does include classes and methods to list all the installed fonts, so you can look for specific (physical) fonts.  However, in AWT you are limited to using only the fonts chosen for the logical font names!  With Swing you can use any available font:

import java.awt.*;

public class ShowFonts {
   public static void main ( String [] args ) {
      Font[] fonts = GraphicsEnvironment.getLocalGraphicsEnvironment().getAllFonts();
      for ( int i = 0; i < fonts.length; ++i ) {
         System.out.print( fonts[i].getFontName() + " : " );
         System.out.print( fonts[i].getFamily() + " : " );
         System.out.println( fonts[i].getName() );
      }
      System.out.println( "\n\n\tAvailable Fonts:\n" );
      String[] names = GraphicsEnvironment.getLocalGraphicsEnvironment().getAvailableFontFamilyNames();
      for ( int i = 0; i < names.length; ++i )
         System.out.println( names[i] );
   }  // end of main
}

Still, it may be a problem not to know the actual font used by some logical font name, as different proportional fonts can break lines in different places and mess up the carefully planned appearance of your application or applet.  For this reason the JRE ships with a set of related fonts called Lucida.  These physical fonts are available in all Sun JREs and include mono-spaced, sans-serif, and serif versions.  (Look in .../jre/lib/fonts on your system.)

In short Sun has identified three of the fonts on each platform and given them logical names.  You can use one of these three or the platform-specific dialog fonts (making five logical font names in all), or pick some actual font name (a physical font) and hope it is available.  It will be if you pick Lucida and you have a Sun JRE.

Actually using fonts in Java is easy:

  Font titleFont = new Font( name, style, size );

where name is a logical font name or the name of a real font installed on your system, style is one of:

Font.PLAIN, Font.BOLD, Font.ITALIC, or Font.BOLD + Font.ITALIC

and size is the size specified in points, which is probably really pixels on most monitors.  Note the size is the average height of the alphabetic glyphs.

The styles are limiting: you can't specify a demi weight, or slanted instead of italic, in Java.  However, most physical font files are named for the actual style: Regular, DemiBold, or Bold.  So you can pick a physical font name including the whole style part of the name.
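Within Java's styles, you can also derive new Font objects from an existing one; a sketch (the name "Lucida Sans DemiBold" is only a hypothetical example of naming a physical font by its full styled name, so check what is actually installed):

```java
import java.awt.Font;

public class StyleDemo {
    public static void main( String[] args ) {
        Font base = new Font( "Serif", Font.PLAIN, 12 );
        // Derive variants without repeating the name and size:
        Font big      = base.deriveFont( 24f );                      // same style, 24 point
        Font emphatic = base.deriveFont( Font.BOLD | Font.ITALIC );  // same size, bold italic
        System.out.println( emphatic.isBold() + " " + emphatic.isItalic() );
        // For weights Java's style constants can't express, name the
        // styled physical font directly (hypothetical example name):
        Font demi = new Font( "Lucida Sans DemiBold", Font.PLAIN, 12 );
        System.out.println( demi.getFontName() );
    }
}
```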

OpenType fonts and CSS font properties use a system known as PANOSE to specify font characteristics.  For example, the weight of a font can be specified as one of these 9 values (from lightest to heaviest): 100, 200, 300, 400, 500, 600, 700, 800, 900.  400 usually corresponds to a font's normal weight, but there is no standard mapping of terms such as bold or demi to these numbers.