Part 1:  Storage Volumes, Partitions, and Slices

Storage Volume Overview

A disk contains a number of consecutive sectors and normally appears as a single storage volume.  Using partitioning, a single disk can present the OS with multiple volumes, commonly called partitions or slices.  Each appears as a separate disk volume (drive) to the users.  Using RAID or other techniques, multiple disks and partitions can also appear as a single volume.  A storage volume allocates space to files in units called logical disk blocks, which are some multiple of the sector size.

UFS (Unix file system) and some other FSes (filesystems) support allocating fractions of a disk block, called fragments.  Using fragments can save space when you have many small files, but accessing them takes longer (more work is needed).

The type of a storage volume is called a filesystem (and in fact storage volumes are commonly just called filesystems).  Examples include FAT, ext3, and UFS.  The  disk block(s) at the start of a storage volume hold filesystem info such as the size, time since last check, mount count, volume label, and other data.  It also holds tunable (changeable) parameters.  This is known as the superblock.  This vital data is copied to memory when a disk is mounted.

Modern Unix and Linux filesystems divide a storage volume into multiple cylinder groups (or block groups or extents), typically about 16 cylinders per group.  The part of the kernel that manages storage (confusingly called the file system!) makes every attempt to keep all the blocks of a single file within one group; keeping blocks close to each other makes disk access more efficient (reduces head movement).  Unless the disk is very full (>90%) no fragmentation occurs.  The system also keeps a copy of the superblock at the start of every group.

Fun facts:  Unix filesystems reserve about 5% of the available space for root only use.

The number of inodes is fixed when some types of filesystem are created; more modern filesystem types allow the inode table to start small and grow as needed.

The superblock, cached inodes, and other data are kept in memory, and only “flushed” to disk every so often (e.g. 30 seconds, or when the disk is otherwise idle).

The umount command runs sync to flush this data; the periodic flushing may not happen right away, so you can run sync yourself when shutting down a system by hand.

If you foolishly use fsck on a mounted filesystem, a later sync will re-corrupt the repaired on-disk copy from the (stale) in-memory copy!  (Since a mounted filesystem will likely cache some data in memory, the on-disk version will always appear to be corrupted to fsck.)

Note that disks may have (large) internal write caches, so sync doesn’t guarantee the data is flushed to disk.

Partitions and Slices

There are different schemes for partitioning a disk.  For workstations DOS partitions (a “DOS” or “MBR” disk) are common (including for Linux).  Sadly there is no formal standard for “DOS” partitions, the most common disk type!  (Typical MS tactic.)  Other OSes (Solaris, *BSD) use their own partitioning scheme, inside one DOS (primary) partition on IA.  The concepts are the same for all schemes.

Older Sparc servers use “VTOC” and new ones “EFI” partitioning schemes.  Macintosh uses “APM” (Apple Partition Map) for PowerPC Macs and a GPT for x86 Macs.  The Sun schemes as well as the more common (and vendor-neutral) GPT scheme are discussed below.

The scheme used matters since the BIOS (or equivalent) as well as the boot loader must be able to determine the type and location of partitions on a disk.  This is why you can’t dual-boot Windows on Mac hardware: the Windows loader assumes DOS partitions and doesn’t understand GPT partitions.

The MBR (Master Boot Record) is the first sector (512 or 4096 bytes) on a DOS disk.  It contains 446 bytes of boot code and a 64-byte partition table or map.  The last 2 bytes mark the sector as a DOS MBR (0xAA55).  Each entry in this table says where the partition starts and ends (or its size), and its type.  (You can restore the MBR boot code without wiping out partition information using the DOS command FDISK /MBR.)  Otherwise you can restore a previous MBR if you have one saved in a file.
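On Linux you can also save and restore the MBR yourself with dd (a sketch; the device and file names are examples):

    # dd if=/dev/sda of=/root/mbr.bin bs=512 count=1    # save the MBR (boot code + partition table)
    # dd if=/root/mbr.bin of=/dev/sda bs=446 count=1    # restore the boot code only, leaving the table alone
    # dd if=/root/mbr.bin of=/dev/sda bs=512 count=1    # restore boot code and partition table together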

Most systems start all partitions on a cylinder boundary.  Those that don’t can confuse other OSes in a dual-boot system.  This does mean that often there is unused space (cylinder 0 only uses the first sector for the MBR).  Some fancy boot loaders use this space, as do some viruses.

Each partition contains one or more boot blocks, which contain information about that partition, an OS specific loader (may be put in MBR too), and other info (e.g., the location of the OS program a.k.a. OS image).

The number of DOS partitions per disk was originally four.  Later special extended types were invented.  The current scheme allows you to mark one of the four primary partitions as extended.  This extended partition includes all remaining sectors not used by the first three partitions (you don’t have to have all three used, or any).  The first sector of this extended partition holds an MBR-like record with two entries: one for a logical partition and the second for another extended partition covering the rest of the disk.  Inside that you’ll find another such record pointing to another logical partition and yet another extended partition.

Each of the logical partitions is preceded by an MBR, forming a linked list of logical partitions.  In theory there is no limit to this but in practice the BIOS and OS will have limits on how many partitions total will be seen.  These limits may differ for IDE and SCSI (for no good reason; just a limit in the device drivers) but both are around 15-32 partitions per disk.  Linux allows 15 partitions per disk (now the same for SCSI as the same device driver is used for both in Linux 2.6; 63 for IDE if the old driver is used).

Unix servers don’t use the DOS disk scheme, but use a related concept known as disk slices.  Some OSes confusingly call slices “partitions”.  Annoyingly, many Unix documents and man pages mix up the terms partition and slice.  You have been warned!

On x86 (IA32 and maybe IA64), in order to co-exist with Windows and Linux these Unix systems use one primary DOS partition in which they use their own partitioning scheme.  Logical DOS partitions are invisible to them.  Unix may be able to use other DOS primary partitions if they have FAT FSes in them.  The Unix partitioning scheme doesn’t have an MBR but a similar data structure called the “disk label”, which contains the partition table/map.

The term disk label can be confusing.  Each disk has a disk label which may be called the MBR, VTOC, or EFI.  However each storage volume can also have a label (and a type).  This is technically the volume label but is often called the disk label.  A volume label is 8 bytes in length and often contains the mount-point pathname.

The fdisk utility (both Linux and Solaris) is used to label a DOS disk with an MBR.  One of these primary partitions is then divided into slices by labeling that partition using the format command to write the VTOC or EFI disk label to it.  The Solaris format command used to define the slices has a sub-command to label the disk, but you must remember to use it!
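The workflow looks roughly like this (a sketch of the interactive menus, not exact output; the device name is an example):

    # fdisk /dev/rdsk/c0t0d0p0    (x86 only: create the Solaris FDISK partition)
    # format                      (select the disk from the menu)
    format> partition             (enter the slice-editing menu)
    partition> modify             (define the slice sizes)
    partition> label              (write the VTOC; the step people forget!)
    partition> quit
    format> quit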

The partition map on a Unix disk is called a volume table of contents (VTOC).  Unlike partitions, slices can overlap.  For example slice 1 might contain disk cylinders 0-19000, slice 2 200-19000, slice 3 100-1000, etc.  It is up to the SA to make sure that filesystems are sized not to overlap when placed in different slices.  Planning filesystem sizes and types is the same for both schemes.

Newer Unix systems use EFI disk labels instead of VTOC.  This supports larger volumes than VTOC or DOS (over 2TB), doesn’t have the danger of overlapping volumes as with VTOC, doesn’t use disk geometry (uses LBA), doesn’t reserve slice 2 for the whole disk, and has other features.  However it can’t be used for IDE disks, and isn’t supported by some Solaris GUI tools.

Partitions (and slices) usually hold filesystems, which are data structures that organize files.  Putting a filesystem into a partition is referred to as formatting the partition (or disk); you can use mkfs and other programs (newfs) to do this formatting on Linux and Unix.  The kind of FS used is known as the partition type: ext[234], swap (not a FS), ...  Since partitions/slices usually hold filesystems, these two terms are often used interchangeably.

Each filesystem in a partition can be labeled with a type, a name (or volume label), and a unique ID (UUID).  You can view these using the blkid command; the partition type is changed with fdisk, while the label and UUID are changed with filesystem tools such as e2label or tune2fs.  Note that many filesystem utilities use labels or UUIDs to identify partitions rather than the device pathname, so it is important to get this right when setting up partitions!  Windows will ignore partitions with an unknown type but other OSes don’t care about the type.  This fact can be used to “hide” some partitions from Windows on a dual-boot system.
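For example, on Linux (a sketch; the device name and label are examples):

    # blkid /dev/sda1                 # show the filesystem TYPE, LABEL, and UUID
    # e2label /dev/sda1 boot          # set the volume label of an ext2/3/4 filesystem
    # tune2fs -U random /dev/sda1     # assign a new random UUID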

Unix systems can detect primary partitions (Solaris calls these FDISK partitions) on a DOS partitioned disk.  Each DOS disk is expected to contain a single Solaris FDISK partition as physical partition 1, usually for the entire disk.  Solaris then writes a VTOC in the start of that, and uses slices.  If any of the other partitions contain FAT (Solaris docs call this “DOS” but the type used is “pcfs”), NTFS, ext2, or ext3, they can be mounted and used (only pcfs supports r/w), otherwise they are ignored.

“Up until Solaris 7 you could newfs a non-Solaris FDISK partition and have a UFS filesystem on it.  The command line parsing of mkfs_ufs broke in Solaris 8, and hasn’t worked since — you can’t feed the necessary parameters in even though the man page says you should be able to do so.”
— Andrew Gabriel [Solaris Guru], comp.unix.solaris post on 11/27/07.

The Solaris partition in turn can be divided into (up to) 8 slices using the VTOC on Sparc, 16 on x86.  Each has a definite (historical) purpose and some can overlap others—it is up to the admin to make sure the filesystems in each do not overlap!  The partitions were originally lettered ‘a’ through ‘h’, but modern Solaris numbers them 0–7.  Slice 2 by default is reserved to refer to the whole Solaris partition (again, usually this is the whole disk).

By convention (and assumed by some disk management utilities) some of these slices are reserved for special uses.  On a bootable partition/disk the root (“/”) filesystem should be slice ‘0’.  Slice ‘1’ is for swap, and slice ‘2’ refers to the whole partition/disk and can be used for backups.  Slice ‘3’ may be used to hold a bootable OS image for clients with a different architecture, but today slice 3 often contains a copy of the root filesystem used for live upgrade when you don’t have mirrored disks.  Slice ‘7’ is for /home (/export/home), and slice ‘5’ is usually used for /opt.  This leaves slices 4 and 6 available for your own ideas.

Slice 7 is usually /export or /export/home.  Solaris is intended as a server, so users are expected to use NFS to auto-mount their “real” home directory on /home when they login on any host.  If this host is not the one containing the actual home directories, an /export/home is auto-mounted on /home if NFS is used.  Only create /home directly if a host is not networked and you don’t want to use an auto-mounter.

On x86 hardware Solaris uses 10 slices (0-9) instead.  Slice ‘8’ is used in the booting process, and slice ‘9’ is used to map out bad disk blocks.  (So you don’t get any extra slices.)

This limit on slices is a hardship for modern systems that use large disks and could use more but smaller filesystems for added security.  To address this Sun has created a “next generation” filesystem type that also deals with logical volume management.  It is called ZFS (discussed below).  Currently ZFS can’t be used for the root filesystem or related filesystems (e.g., /var, /tmp, ...).

The default Solaris installer creates small root and swap slices, and all remaining disk space goes into slice 7 (/export).  This install will fail if you add optional software packages (the GUI, OpenOffice, etc.) to your system (most go into /opt) or have a lot of RAM!

Solaris versions <11 include many features that complicate planning your disk map.  These include using multiple FDISK partitions, using logical volume management (Solaris supports three different LVM systems, with different restrictions), Software RAID, and ZFS.  Other features such as “live upgrade” can also constrain your disk map.  (After carefully researching this on the Internet, reading Sun documentation, and after several discussions with Solaris experts, I suggest hiring a consultant from Sun to set this up for your organization.)

Solaris 11 and/or OpenSolaris addresses all these issues, as you can just use ZFS for everything.

You can use an un-named slice (e.g., slice 4 or 5) to hold a ZFS pool from which you can allocate many ZFS file systems easily.  You can also create a zpool from a non-Solaris FDISK partition of type “OTHER OS”, at least on x86 systems.

Or, you can use Solaris Volume Management (SVM) for a slice, and allocate many UFS file systems from that.  SVM needs an additional slice (#7) reserved to hold meta (or virtual) device data, often called a metaDB replica (size: 32MB is fine).  Since ZFS is still not usable for many purposes, SVM is currently recommended (2008) unless you have an entire non-boot disk for the zpool.  See also VxVM (Veritas VM), a commercial LVM solution available for Solaris, HP-UX, and others.

There are differences in Solaris disk use for x86 and for Sparc.  For example, while SPARC boot blocks can be contained within the root slice, the x86 boot stuff (grub) is bigger so a boot slice ‘8’ is normally created.  (Although this space is unused on non-boot drives, it is almost always left in place.)  SPARC VTOC has 8 slices, x86 VTOC has 16.  format can’t manipulate the slices above #7 very well but the fmthard command can.

GPT Partitioning Scheme

Besides DOS and Unix (VTOC), another type of partition scheme is designed for use on 64 bit systems from Intel (IA64, Itanium) and Sun (new Sparc).  Such systems don’t have any BIOS, but instead have EFI (Extensible Firmware Interface) that serves a similar purpose.  (See UEFI.org for more info; note Intel is a member and now uses that too.)  EFI uses a partitioning scheme called the GUID Partition Table (GPT).  While GPT itself doesn’t limit the number of partitions, MS and other OSes (?) have a limit of 128 GPT partitions per disk.  (GPT supports huge disks too.)  You’d think such disks would be called GPT disks but the common term is EFI disk (as opposed to DOS disk).  Apple hardware uses EFI, which is why you can’t install unmodified MacOS X on most PCs.  Many newer motherboards include EFI with a “legacy” BIOS mode.

EFI disks can’t be used on a boot disk for Solaris 10.  (?not sure this is still true?)

Logical Volume Management (LVM)

Sun calls this Solaris Volume Management (SVM).  Sun, HP-UX, others often use Veritas volume management.

LVM allows multiple disks (and/or partitions) to be grouped into huge “virtual disks” called volume groups (VGs), which are named collections of specially prepared disks or partitions called physical volumes.  VGs can grow and shrink by adding/removing physical volumes to/from them.  While you can create many VGs, one may be sufficient.  VGs appear to be block devices, similar to other disks such as /dev/hda.  In fact each VG can be referred to by the name /dev/VG_name.

The VGs can be subdivided (I hesitate to say partitioned) into logical volumes (LVs — think of these as the partitions or slices of a virtual disk).  VGs don’t use either the DOS scheme or the VTOC scheme, so your OS needs a special device driver to use these.  (No BIOS currently can!)

Each LV can hold a filesystem.  Note each LV must fit entirely within a single VG.  With LVM the LVs (and the filesystems in them) can be grown by allocating more disk space to them, in chunks called physical extents (usually 4MB each).  While LVs should be planned as discussed previously for partitioning, there is less danger if you guess wrong as you can always grow the LV later!
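On Linux the LVM2 commands look roughly like this (a sketch; the device, VG, and LV names are examples):

    # pvcreate /dev/sdb1 /dev/sdc1          # prepare partitions as physical volumes
    # vgcreate vg01 /dev/sdb1 /dev/sdc1     # group them into a volume group
    # lvcreate -L 20G -n data vg01          # carve out a 20 GB logical volume
    # mkfs -t ext3 /dev/vg01/data           # put a filesystem in the LV
    # mkdir -p /srv/data
    # mount /dev/vg01/data /srv/data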

Note growing/shrinking an LV is a separate operation from growing/shrinking the filesystem within that LV.  (There are exceptions.)  Also, not all filesystems types support growing and/or shrinking at all.  In those cases you need to backup the files to an appropriate type of backup or archive, create a new filesystem in the changed LV, then restore the files.
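Continuing the sketch above, growing the LV and then the filesystem inside it are two separate steps:

    # lvextend -L +5G /dev/vg01/data        # grow the LV by 5 GB
    # resize2fs /dev/vg01/data              # then grow the ext3 filesystem to fill the enlarged LV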

LVs have a number of parameters that can be set (and most can be changed later) that can affect disk I/O performance, including extent size, chunk size, stripe size and stripe set size, and read-ahead.

LVM is complex to set up (and varies among Unixes) but is compatible with RAID.  (Discuss choice between RAID 0 and LVM striping.)  Note that if / (root) is an LV, then you must have a non-LV /boot as it is the BIOS that finds boot loaders, and BIOS doesn’t understand LVM.

LVM (and ZFS) Tips [from sun.com/bigadmin/features/articles/zfs_overview.jsp]

·         A small number of storage pools (or volume groups) with many disks works better than a large number of small pools/VGs.

·         Use whole disks not partitions.

·         Try to have DB data use one controller and DB indexes use another.

·         ZFS filesystems aren’t a limited resource.  You can create as many of these as you want.  One thing you must avoid is splitting the root filesystem as this will prevent live upgrade from working.  You might have one filesystem for each Zone, projects get their own, and anything that benefits from different filesystem properties gets its own.  A good example of the latter is media requiring an uncompressed filesystem.  You can also split backups: low-volume but important data can be sent off-site, bulk data to a local mirror.

·         With lots of ZFS filesystems, to make snapshots for backups there is a recursive option, so you don’t need to make a separate snapshot of each one (see the sketch after this list).  (LVM doesn’t have such an option currently.)

·         ZFS filesystems inherit properties from their parent ZFS filesystem.  So you can make a change to the parent and all the children will pick it up automatically.
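For example, one command can snapshot a filesystem and all of its descendants (a sketch; the pool and snapshot names are examples):

    # zfs snapshot -r tank@nightly-2011-01-15     # recursive snapshot of tank and every child filesystem
    # zfs list -t snapshot                        # list the snapshots just taken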

Part 2:  Filesystems

File Systems and Disk Formatting

Filesystem has several meanings.  One is the collection of files and directories on some host; the whole tree from “/”.  Another is the type (structure) of a storage volume: ext2, FAT, etc.  It can also refer to the type of storage or its access: disk-based, networked, or virtual.  Today hard disks are by far the most commonly used data storage devices and an SA must know a lot about them.

Formatting a disk has two meanings.  A low-level format is done once at the factory (though it can be redone to map out bad blocks).  Once that is done you can partition the disk and add a filesystem (a high-level format) to the disk (or a partition).  Partitioning is done with fdisk (or a GUI equivalent) on all systems (use mdadm, or metainit on Solaris, to create RAID virtual disks), while high-level formatting is done with mkfs (or a filesystem-specific tool such as format or newfs).

Common FS types: FAT, vfat (pcfs or dos on some OSes), UFS (Common to many Unix flavors; a.k.a. FFS on some OSes), ZFS (Sun’s successor to UFS and open sourced), ext2, ext3, and ext4 (current Linux standard), ReiserFS, JFS, XFS, VxFS and freevxfs (SCO, Solaris, ...), HFS, HFS+, HFSX, JHFSX (Macintosh), HPFS (OS/2).

Fedora 11+ uses ext4 by default, but its version of grub doesn’t support booting from an ext4 partition!  Make sure to use ext3 for /boot.  Newer grub (11/2009) has many new features, including ext4 support.

FAT is used on most flash and other small removable media.  For CDs the types are iso9660 (Linux) and hsfs (Solaris, High Sierra FS).  The default is 8.3 ASCII names.  The Rock Ridge extension allows long filenames and symlinks, the Joliet extension allows Unicode, and El Torito allows bootable CDs.  DVDs use udfs (a.k.a. UDFS), ...  Note extent-based FSes rarely need defragmenting; they work well regardless until nearly full.

All filesystems suffer from data corruption.  To check and repair them you use the fsck utility for that type of filesystem.  (See page 117 for details.)

Choosing a type of Filesystem

This depends on how it will be used.  Not all filesystems can dynamically grow and/or shrink.  Not all support ACLs, quotas, attributes or extended attributes (needed for SE Linux).  Some favor speed over reliability (in the event of an unexpected system failure, not all filesystems recover well) or reliability over speed.  Some support large numbers of small files better, or small numbers of large files better.

Use the fstyp command on Solaris (non-x86) to determine the filesystem type of some disk (or partition).  For Linux and some others, use “file ‑s /dev/sda” and/or “parted ‑l /dev/sda”.

Modern filesystem types use journaling to eliminate the need for fsck after a power failure.  Ext4 is the safest of these; ReiserFS and XFS require a UPS to be safe but are much faster, especially with lots of smaller files.  JFS works well for larger files (>16-256GB, the limit for ext3), is okay for smaller ones, and is safer than XFS or ReiserFS.  (Note the Linux 2.6 file size limit of 2TB regardless of FS type.)

Solaris 10 on x86 supports most common filesystem types.  You may need to install additional software packages to use them:  FSWpart and FSWfsmisc.  To mount such filesystems you need to learn the device name and filesystem type for each partition on each disk.  The prtpart x86 tool shows this information.  (Solaris fdisk doesn’t work well for DOS disks.)  The mount command needs the FS type; Solaris names these pcfs, ntfs, and ext2fs (both ext2 and ext3).  All but DOS (pcfs) are currently supported as read-only.

Journaling filesystems depend on the journal entries being written in the correct order.  But in order to increase benchmark numbers most disk drives have on-board caching and use disk write reordering.  Experiments have shown that there is a 1 out of 10 chance of filesystem corruption after an uncontrolled shutdown.  Linux 2.6 kernel places write barriers at intervals to prevent such corruption, but if possible data write reordering should be turned off (hdparm ‑W 0) when using a journaling filesystem.  Using RAID striping makes this situation worse as stripes can become inconsistent. 

Most journaling filesystems only use the journal for metadata, not the contents of files.  So if the system crashes while writing a large file to disk, the journal will make sure the hard link and block counts are correct, but you still only have half a file; the data is corrupted.  Ext3 and ext4 are unusual because they can journal all data as well (see the data= modes below).  However this hurts performance.

Remember, a filesystem is not a database!  “File systems are optimized very differently from data bases. Databases have transactions that can be committed or rolled back if the database or the application decides to abort a transaction. In contrast, file systems do not support the concept of rollback or undo logs, and one of the reasons for this is in order to get very high performance levels, ...”  [Ted Ts’o blog post].

Ext3 and ext4 are most suited to non-server, commodity PCs.  Servers using RAID and UPSes may choose a type for performance over protection.  Recently some mobile devices switched to ext4 (e.g., Android).  This has caused a problem with some applications (e.g., Firefox) which frequently update various databases on every mouse click (awesome bar), causing annoying delays.

Other filesystem types commonly used include vfat (pcfs on Solaris; matches nearly all Microsoft FAT filesystem types, while ntfs matches NTFS) for small flash drives and floppies, iso9660 (hsfs = High Sierra on Solaris) for data CD-ROMs, and udf, the successor to iso9660, used on DVDs and flash drives >32GB (the maximum for the FAT standard that flash drives use).  See mkisofs, growisofs, cdrecord (run it first with -scanbus to get the device number triplet; it only supports SCSI so you must use the IDE-SCSI adapter modules) (Solaris: cdrtools), and mkdosfs.  Linux supports JFFS2, UBIFS, and LogFS for flash drives; MS has exFAT (a.k.a. FAT64); there are others too.

Without a UPS, when the power fails not all parts of the computer stop functioning at the same time.  As the voltage starts dropping on the +5 and +12 volt rails, certain parts of the system may last longer than other parts.  For example, the DMA controller, hard drive controller, and hard drive unit may continue functioning for several hundred milliseconds, long after the DIMMs, which are very voltage sensitive, have gone crazy and are returning totally random garbage.  If this happens while the filesystem is writing critical sections of the filesystem metadata, you can corrupt the FS beyond hope of recovery.  Ext3 is designed to be recoverable in this situation; other, higher performance FSes are not. — adapted from Ted Ts’o, linuxmafia.com/faq/Filesystems/reiserfs.html

(Point out wikipedia.org/wiki/Comparison_of_file_systems article.)

UFS is the modern version of the BSD FFS (1983).  A UFS volume is composed of the following parts:

·         a few blocks at the beginning of the partition reserved for boot blocks (which must be initialized separately from the filesystem).

·         a superblock, containing a magic number identifying this as a UFS filesystem, and some other vital numbers describing this filesystem’s geometry and statistics and behavioral tuning parameters.

·         a collection of cylinder groups.  All data for a directory and its contents are kept (if possible) on one group, minimizing fragmentation.  Each cylinder group has the following components:

o   a backup copy of the superblock

o   a cylinder group header, with statistics, free lists, etc, about this cylinder group, similar to those in the superblock

o   a number of inodes, each containing file attributes

o   a number of data blocks

UFS was so popular it was adopted by most Unix vendors, who sadly made many incompatible changes to the basic idea.  Among the things changed was the inode structure.

Inodes (for modern UFS and similar FS types) store 15 pointers to data blocks in an inode; the first 12 point directly to data, the 13-th is an indirect block (the whole block contains pointers to data blocks), the 14-th is a doubly indirect block, and the 15-th is a triply indirect block.  This is not efficient for large files.
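As a rough worked example (assuming 4 KB blocks and 4-byte block pointers, so one indirect block holds 1024 pointers; these numbers are illustrative):

    12 direct pointers:           12 * 4 KB                 =  48 KB
    1 single-indirect pointer:    1024 * 4 KB               =   4 MB
    1 double-indirect pointer:    1024 * 1024 * 4 KB        =   4 GB
    1 triple-indirect pointer:    1024 * 1024 * 1024 * 4 KB =   4 TB

Reaching data near the end of a very large file therefore costs up to three extra block reads per data block, which is one reason extent-based filesystems (described below) handle large files better.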

EXT2 is fastest since it doesn’t use journaling but is very reliable.  Recovery can be slow (tens of minutes to hours per filesystem, depending on size and speed of hardware).  It is based on the original “ext” FS from 1992, which in turn was based on the Minix FS, which was based on BSD’s UFS.  This is a 32 bit FS which limits max size.  Ext2 FSes can be grown or shrunk as needed.

JFS was one of the first journaling filesystems.  It was developed by IBM for AIX Unix in 2001.  Unlike ext* this is a 64-bit FS designed for high performance servers.  It differs from traditional *nix filesystems by allowing several independent disk writes to be collected and committed as one large “group” transaction; fewer transactions means fewer journal updates and better performance.  Small directory data can be saved directly in the inode itself (if it will fit), causing half as many disk accesses.  Large directories are not a simple list but have directory entries organized as a “B+ Tree”, a data structure that can be searched efficiently.  Another difference is the inode table isn’t fixed in size when the FS is created; rather inodes are allocated as needed.  JFS FSes can be grown but not shrunk.  JFS also supports different block sizes on a per-file basis.

XFS was developed by SGI (Silicon Graphics, Inc.) for IRIX (a popular Unix in the ‘90s).  In many ways it is similar to JFS.

EXT3 is a tweaked version of ext2.  The major change was to add journaling.  (In fact it is easy to convert ext2 to ext3 and vice-versa.)  Unlike most FSes that provide journaling, ext3 can journal all writes to disk, not just “metadata” (inode and directory writes).  This provides more reliability than other journaling FSes but slows down data writes.  However the SA can control this with a setting (using tune2fs or a mount option).  Set “data=journal” to journal everything.  Set “data=writeback” to journal just metadata.  Use the default of “data=ordered” to improve performance while providing almost as much safety as “journal” mode.  With this option the FS will write data to the disk (as a transaction) after making the journal entry for the affected metadata, but before writing the metadata itself.
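For example (a sketch; the device and mount point names are examples):

    # tune2fs -o journal_data /dev/sdb1              # make data=journal the default mount option for this FS
    # mount -o data=writeback /dev/sdb1 /mnt/data    # or pick a mode per mount: journal metadata only (fastest, least safe)
    # mount -o data=ordered /dev/sdb1 /mnt/data      # the default compromise mode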

EXT4 is based on ext3.  It uses 48-bit addresses rather than 32-bit.  It also changes the way large files are managed.  Instead of the UFS inode pointer structure, the idea of “extents” (first used with JFS and later with XFS and NTFS, among others) provides better performance for large files, such as those used with DBMSes.  Overall ext4 is faster than ext3 but just as reliable.  Ext4 originally didn’t support the ext3 “data=ordered” mode, but Ted Ts’o (reluctantly) added this back.

An extent based filesystem doesn’t allocate space for files a block at a time, but in large variable-sized chunks called extents.  Inodes contain a list of pointers to extent descriptor blocks, organized as a B-Tree for fast access.

ReiserFS (a.k.a. Reiser3) and Reiser4 are highly innovative journaling filesystems with many, many performance improvements.  However they lack the reliability of other FS types.  Once considered promising, their future is uncertain after the lead developer was convicted of murder in 2008.

Other types supported on Linux include MSDOS (for FAT16), VFAT (for FAT32; pcfs on Solaris), NTFS, iso9660 (hsfs on Solaris), UDF (used on data DVDs and large Flash disks), NFS and SMBFS (networking filesystems), and special purpose FSes such as GVFS or those for cluster use (GFS, GPFS).

Not available as of 2009 is BtrFS (“Butter FS”) which should have features comparable to Sun’s ZFS.  It is being actively developed by Oracle and should work with their CRFS (SMB/NFS replacement).  While available now (2010) and used in MeeGo Linux, it probably won’t become widely used for another year or so.  Theodore Ts’o has stated that ext4 is a stop-gap and that Btrfs is the way forward, having “a number of the same design ideas that reiser3/4 had”.

Theodore Ts’o (the primary developer of ext2/3/4 FS) suggested, “People who really like reiser4 might want to take a look at btrfs; it has a number of the same design ideas that reiser3/4 had — except (a) the filesystem format has support for some advanced features that are designed to leapfrog ZFS, (b) the maintainer is not a crazy man and works well with other [Linux] developers (free hint: if your code needs to be reviewed to get in, and reviewers are scarce; don’t insult and abuse the volunteer reviewers as Hans did — Not a good plan!).”  kerneltrap.org/mailarchive/linux-kernel/2008/8/1/2780064

In addition to performance (for large numbers of small files or a small number of large files), reliability (journaling metadata and/or file data), and max size, there are other differences between filesystem types: if an FS can grow or shrink, support for POSIX data (owner, group, permissions (mode), etc.), support for attributes, for extended attributes, for NFSv4 attributes/permissions, for MAC security (e.g., SE Linux) labels, support on various OSes, built-in LVM/RAID features, and quota support.  How important these features are depends on the expected use of the file system.

ZFS

ZFS is Sun’s open sourced newest FS, and is available for Linux and other Unix systems too.  (Note to self: make a web resource).  The name originally stood for Zettabyte File System but is no longer considered an acronym for anything.  The name was a misnomer since ZFS can manage 256 quadrillion zettabytes (10^33) but there is no SI unit prefix for that.  (The only higher prefix is yotta, 10^24.)  ZFS is a complete redesign of filesystem concepts and includes LVM and RAID built-in, as well as automatic mounting, growing and shrinking as needed, and nested filesystems.  It is very scalable, supporting very large files and directories.  It also includes disk scrubbing, a sort of continuous fsck.

Simple ZFS commands replace many LVM and other commands (mount, mkfs, ...)  [June 2006 issue of Login; has ZFS article.]

ZFS organizes physical devices into logical pools called storage pools.  Storage pools can be sets of disks striped together with no redundancy (RAID 0), mirrored disks (RAID 1), striped mirror sets (RAID 1 + 0), or striped with parity (RAID-Z, really RAID 5). Additional disks can be added to pools at any time but they must be added with the same RAID level as the pool was created with.  As disks are added to pools, the additional storage is automatically used from that point forward.
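For example (a sketch; the pool and device names are examples):

    # zpool create tank mirror c0t0d0 c0t1d0          # a mirrored (RAID 1) pool
    # zpool add tank mirror c0t2d0 c0t3d0             # grow it with another mirror pair (same RAID level)
    # zpool create data raidz c1t0d0 c1t1d0 c1t2d0    # a single-parity RAID-Z pool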

ZFS file systems will grow (to the size of their storage pools) automatically.  If you define more than one ZFS file system in a single pool, each shares access to all the unused space in the pool.  As any one file system uses space, that space is reserved for that file system until the space is released back to the pool by removing file(s).  (You can place a maximum size on a filesystem, confusingly called a quota.)
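For example (a sketch; the dataset names are examples):

    # zfs set quota=10G tank/home/alice          # cap this filesystem at 10 GB of the pool
    # zfs set reservation=2G tank/home/alice     # guarantee it at least 2 GB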

ZFS file systems are not necessarily managed in the /etc/vfstab file.  Special logical device files can be constructed on ZFS pools and mounted using vfstab, but the common way to mount a ZFS file system is to simply define it against a pool.  All defined ZFS file systems automatically mount at boot time unless configured not to.

The default mount point for a ZFS file system is based on the name of the pool and the name of the file system.  For example, a file system named data1 in pool indexes would mount as /indexes/data1 by default.  This default can be overridden.
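Using those example names (a sketch):

    # zfs create indexes/data1                        # mounts at /indexes/data1 by default
    # zfs set mountpoint=/srv/data1 indexes/data1     # override the default mount point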

Use the format command to determine the list of available devices, then create a pool with: zpool create pool-name [configuration] device-file ...

Where configuration may be mirror, raidz, etc.  Once created a default ZFS FS is created and mounted as /pool-name.  To change FS parameters or to create additional FSes in the pool, use the zfs command:

   # zfs create pool-name/fs-name

To monitor your pools use zpool list.  Likewise use zfs list (or mount).  Other monitoring sub-commands are also useful.
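For example (a sketch; the pool name is an example):

    # zpool list               # capacity and health summary for every pool
    # zpool status -v tank     # per-device state, error counts, and scrub/resilver progress
    # zfs list                 # every ZFS filesystem, its space used, and mount point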

Please refer to the man pages, zfs(1M) and zpool(1M), for more detailed information.  Additional documentation may be found at oracle.com/technetwork/indexes/documentation/index.html#sys_sw (formerly docs.sun.com).  Another resource is the OpenSolaris ZFS Community.

[from: sun.com/bigadmin/features/articles/zfs_part1.scalable.jsp]

ZFS is a combination of file system and volume manager; the file system-level commands require no concept of the underlying physical disks because of storage pool virtualization (i.e., LVM).  ZFS is a journaling FS that uses transactions to modify the data.  All of the transactions are atomic so data is never left in an inconsistent state.

ZFS only performs copy-on-write (COW) operations.  This means that the blocks containing the in-use data on disk are never modified.  The changed information is written to new blocks; the block pointer to the in-use data is only moved once the write transactions are complete.  This happens all the way up the file system block structure to the top block, called the uberblock.

If the machine were to suffer a power outage in the middle of a data write, no corruption occurs because the pointer to the “good” data is not moved until the entire write is complete.  (Note: The pointer to the data is the only thing that is moved.)  This eliminates the need for journaling data blocks, and any need for fsck (or mirror resync) when a machine reboots unexpectedly.  (Ext3 journals all data; ZFS, ReiserFS, and others that only journal some data aren’t as safe, but are faster.)

To avoid accidental data corruption ZFS provides memory-based end-to-end checksumming.  Most checksumming file systems only protect against bit rot because they use self-consistent blocks where the checksum is stored with the block itself.  In this case, no external checking is done to verify validity.  This won’t catch things like:

·         Phantom writes where the write is dropped on the floor

·         Misdirected reads or writes where the disk accesses the wrong block

·         DMA parity errors between the array and server memory or from the driver, since the checksum validates the data inside the array

·         Driver errors where the data winds up in the wrong buffer inside the kernel

·         Accidental overwrites such as swapping to a live file system

With ZFS the checksum is not stored in the block but next to the pointer to the block, all the way up to the uberblock.  Only the uberblock has a self-validating SHA-256 checksum.  All block checksums are done in server memory, so any error up the tree is caught including the aforementioned misdirected reads and writes, parity errors, phantom writes, and so on.

In the past, the burden on the CPU would have bogged down the machine, but these days CPU technology and speed are advanced enough to check disk transactions on the fly.  Not only does ZFS catch these problems, but in a mirrored or RAID-Z (Really ZFS RAID-5) configuration the data is self-healing.  One of the favorite Sun demonstrations showcasing data self-healing is the following use of dd where c0t1d0s5 is one half of a mirror or a RAID-Z file system:

dd if=/dev/urandom of=/dev/dsk/c0t1d0s5 bs=1024 count=100000

This writes garbage on half of the mirror, but when those blocks are accessed, ZFS performs a checksum and recognizes that the data is bad.  ZFS then checksums the other copy of the data, finds it to be valid, and resilvers the bad block on the corrupted side of the mirror instead of panicking because of data corruption.

In a RAID-Z configuration, ZFS sequentially checks for the block on each disk and compares the parity checksum until it finds a match.  When a match is found, ZFS knows it’s found a block of valid data and fixes all other bad disks.  The resilvering process is completely transparent to the user who never even realizes that a problem had occurred.

ZFS constantly checks for corrupt data in the background via a process called scrubbing.  The administrator can also force a check of an entire storage pool by running the command zpool scrub.  This should be done via cron 1-2 times a month.
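A root crontab entry such as this would do (a sketch; the pool name and schedule are examples):

    # scrub the pool "tank" at 3:00 a.m. on the 1st and 15th of each month
    0 3 1,15 * * /usr/sbin/zpool scrub tank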

A DBMS such as MySQL or PostgreSQL can be implemented on ZFS, eliminating redundant journaling, data validation, atomic writes, etc.

Fedora 11 makes btrfs (“Butter FS”), the next-generation Linux filesystem, available as a technology preview.  Btrfs is similar to ZFS in many ways.  To enable btrfs, pass icantbelieveitsnotbtr as a boot option.

Part 3.  Storage (Disk) and Related Technology

The Storage Hierarchy  [From queue.acm.org, "Flash Today", by Adam Leventhal, September 24, 2008]

Primary storage can be summarized as a unit of storage, or more precisely as a controller containing CPUs and DRAM, attached to disk drives.  The disks are the primary repository for data while some memory (DRAM) acts as a very fast cache.  Client software communicates with the storage via read and write operations.  Read operations are always synchronous in that the client is blocked until the operation is serviced, whereas write operations may be either synchronous or asynchronous.  For example, video streams may write data blocks asynchronously and verify only at the end of the stream that all data has been quiesced; databases typically use synchronous writes to ensure that every transaction has been committed to stable storage.

The speed of a synchronous write is limited by the latency of nonvolatile storage, as writes must be committed before they can be acknowledged.  Read operations first check in the DRAM cache, which can provide fast service times.  But cache misses must also wait for the disk.  Since it’s common to have working sets larger than the amount of cache available, even the best prefetching algorithms (a technique to anticipate the next read and fetch it into the cache) will leave many read operations blocked on the disk.

The common technique today to reduce latency is to use 15,000-RPM drives rather than 10,000- or 7,200-RPM drives.  This will improve both read and write latency, but only by a factor of two or so.  This can be pricey.  For example, a 10-TB data set on a 7,200-RPM drive (from a major vendor, at 2008 prices) would cost about $3,000 and dissipate 112 watts; the same data set on a 15,000-RPM drive would cost $22,000 and dissipate 473 watts.  The additional cost and power overhead make this an unsatisfying solution.

A better way to improve the performance of synchronous writes is to add NVRAM (nonvolatile RAM) in the form of battery-backed DRAM, usually on a PCI card.  Writes are committed to this NVRAM ring buffer and immediately acknowledged to the client, while the data is asynchronously written out to the drives.  This technique allows for a tremendous improvement for synchronous writes, but NVRAM is expensive, batteries can fail (or leak or even explode), and the maximum practical size of NVRAM buffers tends to be small (2-4 GB)—small enough that workloads can fill the entire ring buffer before it can be flushed to disk.

This is where using flash memory (a SSD) for an NVRAM ring buffer is becoming popular.  However while achieving good write bandwidth is fairly easy, the physics of flash dictate that individual writes exhibit relatively high latency.  It’s possible, however, to insert a small DRAM write cache on top of the NVRAM buffer, treating it as nonvolatile by adding a capacitor that in case of power loss provides the necessary power to flush outstanding data in the DRAM to the flash cache.

Hard Disks

(Show hardware graphic: Spindles, platters, heads.)  Sectors and cylinders (tracks) and heads (platter faces) = disk geometry (“CHS”).  Discuss speed (5,400–15,000 rpm; seek time).  Each read/write chunk is one sector (512 bytes on all disks up to 2011; starting in 2011 all disks will have 4096 byte sectors).  (Cylinder/track 0 is the outermost one on any magnetic disk.)

One block (a.k.a. a cluster) is smallest chunk of disk that can be allocated by the OS to a file, so one block is smallest file size.  On Sun systems (UFS) a block is 512 bytes (1 sector) of data, on Linux (ext[23]) it is 1024.  (On Reiser4FS it is 1 byte!)

The 512 byte sector size dates from the earliest IBM floppy disk standards.  But there are problems with this size.  Newer disks use weaker signals to record the data, so more parity bits are needed per sector.  Currently (2010) each 512 byte sector has 40 bytes reserved for the low-level formatting “start of sector” mark, and 40 bytes reserved for parity.  (Prior to 2004 only 25 parity bytes were needed.)

New disks will use 4096-byte sectors (8 “old” sectors).  This greatly reduces the overhead, even though the new disks will use 100 bytes per sector for parity.

Using 4kB sectors will make many operations faster and take less power.  This is because the logical block/cluster size for NTFS and some other filesystems is also 4kB.  Also, on x86 processors, the page size of memory is 4kB.  So most disk operations are already 4kB.

Most newer BIOSes (and EFI) can support the new sector size and new OSes can too.  Note that Windows XP won’t support the new disks!  Western Digital will ship disks with an “emulation mode” to support older firmware and OSes.  In this mode performance may suffer.  It would probably be better to use a virtual machine for WinXP.

The IDEMA has mandated the change starting in 2011.  It wouldn’t hurt to stock-pile some older type disks if you need them!

Unformatted Capacity Versus Usable Capacity

Disks are sold by their total unformatted capacity.  Both low-level and high-level formatting a disk take a significant fraction of that space, as does the spare sector list (used for mapping out bad sectors), boot code, hidden or reserved areas on the disk, RAID and LVM metadata, etc.  Since most filesystems use blocks (or clusters) to allocate files, a file of size one block plus one byte takes two blocks of disk space.  With many small files the usable space can be less than half the unformatted capacity.  For example I have a 1GB flash drive with the standard FAT FS on it.  After formatting there are 977 MB available.  I put about 200 small files on it, leaving about 700 MB reported free.  However I can’t put even one more file on it!

Some OSes allow you to adjust the block size in some filesystems to reduce wastage (but reduces I/O speed.)  Others permit packing multiple file fragments into blocks.

Disk Geometry and LBA  (Logical Block Addressing)

The number of sectors per track varies with the radius of the track on the platter.  The outermost tracks are larger and can hold more sectors than the inner ones.  The location of sectors is staggered as well, for efficiency.

But disk geometry is just a triple: #cylinders, #heads, #sectors (“CHS”).  The growth of disk sizes means modern disks must lie about their true geometry.  Rather than use CHS addressing it is common to use Logical Block Addressing (LBA mode), in which each sector is given a number starting from zero.  The OS uses LBA rather than CHS addresses, which the disk then translates to the actual sector.

Some OSes use the BIOS to access disks (which is also used at power up), and (older) BIOS uses CHS addressing.  So the BIOS must know the official (but fake) geometry setting.  (Even then the disk must translate the official geometry to the real geometry!)  Other OSes access disks directly with LBAs and don’t care.

Prior to LBA the combined limitations of BIOS and ATA restricted the useful unformatted capacity of IDE hard disks on IA PCs to 504 megabytes (528 million bytes):

      1024 cyls * 16 heads (tracks/cyl) * 63 sectors/track * 512 bytes/sector

Later BIOSes and ATA disks use LBA mode to work around those limits, by faking the geometry and translating to the official one.  (This still leaves a BIOS disk size limit of 1024 cylinders * 63 sectors per track * 256 heads * 512 bytes per sector = 8 gigabytes; such older BIOS can’t boot from a partition beyond the first 8 GBs.)  This is one motivation for the modern BIOS replacements (e.g., EFI).

Modern OSes (including Windows and Linux) are not affected by this since these OSes use direct LBA-based calls and do not use BIOS hard disk services.  Also modern BIOSes support larger disks and LBA.  Older BIOS limits affect booting: /boot below cylinder 1024.  (LBA needs to change for 2011, for the 4kB sectors.)

RAM Disks (“ramdisks”) are a section of RAM that is used as a filesystem (and thus that RAM is not available for other purposes).  These can be used to speed access to programs and other files, used while booting, or to check a disk.  A ramdisk can double the battery life on a laptop!  Many servers need to create many short-lived files quickly (such as PHP session files) and a ramdisk is the best choice for that.  All systems support at least one ramdisk driver (or type); Linux supports several, some for special purposes.  For example you will see a ramdisk mounted at /dev/shm for POSIX shared memory.

Modern Linux supports several types of ramdisks:  “ramfs” and the newer “tmpfs” which uses both swap and RAM.  This is mostly useful for /tmp.  It can also be used to hold security-sensitive documents that shouldn’t be written to actual disk.  Note you don’t format ramfs or tmpfs filesystems!

The main difference between ramfs and tmpfs is that ramfs uses physical RAM only and if that runs out your system can crash.  Tmpfs uses virtual RAM (so it can use swap space as needed).  Also tmpfs allows you to specify a maximum size to which the RAM disk is allowed to grow.

By itself a ramdisk is useful for temporary data files.  In some cases you want to save the contents of a ramdisk to a file, and restore the state of that ramdisk from a file when mounting it.  Such a file is usually called a disk image file.  This ramdisk plus image file technique is used during the boot process.

Unix and Linux can mount an image file as if it were a disk.  This is similar to using a ramdisk and copying an image file to/from it, but more convenient.
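For example, on Linux (a sketch; the file and mount point names are examples):

    # dd if=/dev/zero of=/tmp/disk.img bs=1M count=64    # create an empty 64 MB image file
    # mke2fs -F -j /tmp/disk.img                         # put an ext3 FS in it (-F: it’s a plain file, not a device)
    # mkdir -p /mnt/img
    # mount -o loop /tmp/disk.img /mnt/img               # mount the image via a loop device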

An initial ramdisk is often needed during the boot process.  This is initialized from an image file.  It is used as an initial root filesystem and contains some required /dev files, /lib files, init, etc.  To make a boot ramdisk image file for Linux with all the right stuff in it, use the command:

     mkinitrd filename OS_Version  # old Linux command
     dracut /boot/initramfs-$(uname -r).img $(uname -r)

For example:   mkinitrd /boot/initrd-2.4.9-31.img 2.4.9-31

Then edit the boot loader’s config file (grub.conf) and add this line:

     initrd /initramfs-2.6.31.12-174.2.3.fc12.i686.img

An initrd image is a gzip compressed “filesystem”; really it’s just a “cpio” archive file!  The system doesn’t “mount” these images; it just extracts their contents into some already created (and mounted) ramdisk.  Once the ramdisk is populated from the image, the script /init (or /linuxrc) runs if possible.  This can be used to load USB or SCSI drivers, for example for CD or floppy drives.   (Show how to examine one:  # gunzip ‑c /boot/initrd... >/tmp/initrd.cpio; mkdir /tmp/img; cd /tmp/img; cpio ‑i </tmp/initrd.cpio; less init)

Newer Linux systems have a type of ramdisk (“initramfs”) called “rootfs” that is always mounted.  It is used to ensure there is always something mounted (so the kernel doesn’t have to check for an empty mount list).  rootfs is also used during booting of a Linux 2.6 kernel, used as the initial ramdisk.  When the real root FS is ready to be mounted, rootfs is then emptied of files (to free up the RAM).  The system switches to the real root filesystem using a command usually called switch_root or pivot_root.  The new root is mounted right on top of rootfs.

Creating RAM disks is easy.  For  Solaris: “ramdiskadm ‑a mydisk 40m” will create /dev/ramdisk/mydisk, which you must format and then mount as normal.  To create and use a ramdisk on Linux (note no formatting needed):

# mkdir /mnt/ramdisk
# mount -t ramfs none /mnt/ramdisk

# mount -t tmpfs -o size=32m tmpfs /mnt/ramdisk
# cd /mnt/ramdisk; vi foo; ...
# cd; umount /mnt/ramdisk

Live CDs and other image files must often be compressed to fit the image on the media.  A common technique is to use “squashfs” compression.  If the file command shows some image file as this type, use “unsquashfs image” to uncompress it; then file will show the correct type (you need to know the type to mount it).

The ramdisk starts out small and grows as needed.  With tmpfs you can optionally specify a maximum size.  This is a good thing to do since if a system runs out of virtual memory, ugly things will happen!

When a ramdisk is unmounted (via umount), all files in it are lost.

You can specify a max size with tmpfs, with the mount option size=size (the default is half the size of physical RAM).  The size is in bytes but you can add a k, m, or g suffix.  You can also specify a max number of inodes with nr_inodes=number.  (The default is half the number of physical RAM pages.)
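For example (a sketch):

    # mount -t tmpfs -o size=512m,nr_inodes=50k tmpfs /tmp    # cap this tmpfs at 512 MB and 50,000 inodes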

Ramdisks are rarely used anymore just for speed, since the Unix virtual memory system is so efficient.  They are used for initial booting, for /tmp on Solaris and other OSes, for security, to mount (potentially a large number of) disk image files, and only occasionally for efficiency.

On Solaris you can use a RAM disk for a RAID 1 mirror.  This can be useful if an application is mostly reading from the disk.  In this case you can change the read policy for the mirror to first read from the RAM disk.

More common than a real ramdisk is to mount an image file (.img, .iso) using a loop device.  A “jukebox” can be created by putting many image files on a large disk and mounting them all.  (Project 4, filesystems, discusses how to use image files.)

Historical note:  Linux creates some fixed-size ramdisks at boot time:

   ls -l /dev/ram*;  dmesg | grep -i ram

shows the system creates 16 RAM disks (4096 KB each by default).  To set the size use the kernel (boot) parameter ramdisk_size=size-in-KB.  Unlike ramfs/tmpfs these are not initialized.  Before you can mount one of these you need to format it with mkfs.

If planning on using ramdisks (say for /tmp) you will need a larger swap partition (and possibly more RAM) than the standard guidelines suggest.

Solid State disks (SSD)

This is a new technology that isn’t widely used as of 2011.  It’s just a large amount of non-volatile RAM and works like a USB flash disk.  If/when it gets popular and/or cheaper it will mean many changes to how storage is managed: no seek time issues, different block sizes, etc.  (Of course there will be other issues!)

Flash memory can be implemented using either NAND or NOR logic.  These designations refer to the way the flash cells are arranged.  NOR flash allows for random access and is best suited for random access memory, while NAND must be treated as blocks and is ideal for persistent storage.  NAND flash is the cheaper and more common variety.

There are two types of NAND flash: SLC (single-level cell) and MLC (multilevel cell).  SLC stores a single binary value in each memory cell, using one voltage level for “0” and another for “1”.  MLC supports four or, recently (2010), eight distinct voltage level values per memory cell, corresponding to two or three bits of storage.  Because of its improved longevity and performance, the conventional wisdom is that SLC is best suited for enterprise (i.e., non-consumer-grade) solutions.

NAND Flash disks are fast to read but are written in large blocks which must be erased before each write.  Each block has a short lifetime as well; SLC flash is typically rated to sustain 1 million program/erase cycles per block.

Why should you care about the technology behind SSDs?  Solid state drives have two problems that force them to deal with data differently than hard disk drives do: they can erase data only in larger chunks than they can write it, and their storage cells can only be written a certain number of times (10,000 is standard as of 2009) before they start to fail.  This makes tasks such as modifying files much harder for SSDs than HDDs.

An SSD can only delete large chunks of data at once, usually between 512KB and 2MB depending on the drive.  To rewrite a small file (say using vi) a solid state drive has to copy everything on the 512KB block except for the deleted data to memory, erase the entire 512KB chunk the file is in, and then rewrite all of it again along with the new version of the file.  In SSD circles this is called garbage collection: recognizing that a file is old and invalid, removing it, and rewriting it with good data (many drives will collect little files to modify and write in big chunks, but the idea is the same).  But such garbage collection significantly reduces the speed SSDs are known for, because reading, modifying, and rewriting is much slower than a simple write.  Also, the SSD doesn’t recognize deleted files and continuously rewrites them during this garbage collection process.  (A new SSD command, TRIM, can be sent by the OS to help an SSD recognize deleted files, but not all SSDs or OSes use this yet (2011).)

As flash cells are used, they lose their ability to record and retain values.  Because of the limited lifetime, flash devices must take care to ensure that cells are stressed uniformly so that “hot” (frequently used) cells don’t cause premature device failure.  This is done through a process known as wear leveling (or write-leveling).  Just as disk drives keep a pool of spare blocks for bad-block remapping, flash devices typically present themselves to the operating system as significantly smaller than the amount of raw flash, to maintain a reserve of spare blocks (and pre-erased blocks for performance).

Wear-leveling has security implications: you can’t guarantee data has been erased when the cell holding that data may have been duplicated before the erase.  Most flash devices are also capable of estimating their own remaining lifetimes so systems can anticipate failure and take action to protect the remaining good blocks.

It takes about 1-2 ms to erase a block, but writing to erased flash requires only around 200-300 µs.  For this reason flash devices try to maintain a pool of previously erased blocks so that the latency of a write is just that of the program operation.  Read operations are much faster: approximately 25 µs for 4k.  By comparison, raw DRAM is even faster, able to perform reads and writes in much less than a microsecond, while a disk access (a few milliseconds) is about twice as slow as even a flash block erase.

Flash storage costs about $10-$35 per gigabyte for an SLC flash device compared with around $100 per gigabyte for DRAM (2010).  Disk drives are still much cheaper than flash, less than $1 per gigabyte for 7200-RPM drives and about $3 per gigabyte for 15,000-RPM drives.

Directly Attached Storage (DAS)

DAS used to mean the disks were inside the computer enclosure.  Today’s servers are small form-factor rack-mounted servers or blade servers, and in either case the disks are often not enclosed by the computer itself.  With DAS you are limited to having the disks in the same cabinet.  With DAS the disk attaches to an interface (SCSI or IDE); a cable connects that to a host interface (the controller), which usually allows more than one disk to be attached to the host computer.

DAS is also used for RAID devices to attach several disks to a special RAID controller.  This in turn is attached to the host computer as if it were a simple disk.

Disk controllers

There are two common types of controllers: IDE (or EIDE or ATA) and SCSI.

The controllers support many options (we’ll learn some of these later).  One to know about now is a write cache, which should be turned off for reliability.  However leaving it on (the usual default) dramatically increases performance.

Solaris provides the tool fwflash(1M) to examine and load (or flash) firmware on some models of HBAs (which are sometimes referred to as host channel adaptors, or HCAs).

IDE/ATA

Allows two devices per IDE controller (technically called a channel), one referred to as the master and the other as the slave (even though neither controls the other).  It is common to have two controllers per workstation, referred to as the primary and the secondary.  ATA disks must be within 18" of the controller, so sometimes the controller is connected to the host by a cable to an HBA.

IDE became EIDE, then ATA (Advanced Technology Attachment, a.k.a. PATA, ATAPI, and UDMA) to support CD-ROM and other devices.  Not as fast (~133 MBps) or reliable as SCSI, but cheaper.  The 18" maximum cable length makes it suitable only for DAS.  The controller is usually integrated into the motherboard.  To date there have been 6 versions of the ATA standard (plus some unofficial ones and the newer SATA standards).

ATA (since version 3) allows sneaky commands that permit the disk to hide some space from the OS, using a Host Protected Area (HPA) and Device Configuration Overlay (DCO).  At boot time the OS (or vendor software) can access the HPA and then lock it down, so vendors can hide data (such as recovery images or diagnostics) there.  The HPA and DCO are hard, but not impossible, to access with special software.

Serial ATA

(SATA or S-ATA) is the successor to IDE, which was retroactively renamed Parallel ATA (PATA) to distinguish it from Serial ATA.  At ~300 MBps it is more than double the speed of the older IDE (parallel ATA) disks.  Modern workstations use SATA.  The cables are much smaller (versus the 40+ pin ribbon cable).  To the OS, each disk appears as the master on a separate IDE controller.

With older hardware, parallel cables were faster since you could send 8 (or more) bits simultaneously, compared to serial (which may also need extra bits for timing and framing).  But with modern electronics the rate at which data can be clocked is much higher, and interference between the parallel wires puts a limit on their top speed.  Serial cables are now capable of higher speeds than parallel cables, and are cheaper and easier to work with.

SCSI

A SCSI device includes a “controller”.  The SCSI controller is misnamed, as it is really only a bridge connecting the SCSI bus to the host bus.  It is correctly referred to as a host bus adaptor (HBA).  The SCSI bus can connect disks or any similar devices: CD-ROM, DVD, tape, etc.  These devices are peers on the bus, much like hosts with Ethernet NICs on a network.

The SCSI controller is physically part of the disk you purchase, unlike ATA disks, which rely on a controller (channel) on the host.  This makes SCSI disks more expensive than ATA disks.  With SANs the actual SCSI disks may be far away, and a special type of NIC, an “HBA”, is used to talk to them.

With IDE/EIDE/ATA/ATAPI the device driver sends commands to the controller, which sends ATA (ATAPI) commands to the disk (by loading data registers, then issuing a command).  In contrast, with SCSI the device driver sends SCSI commands directly to the device (disk); the SCSI controller or HBA simply forwards those commands onto the bus (and sends any reply back to the device driver).

Many devices (8 to more than 95, depending on the SCSI variant) can be daisy-chained together, with each end of the chain needing termination.  The electrical specifications for a SCSI bus require each end of the bus to be properly terminated.  You must use the appropriate type of terminator for your bus: passive, HVD, or LVD.  If the controller is driving only an internal bus or only an external bus, it will usually provide termination either automatically or via BIOS configuration.

If you mix wide and narrow devices on one SCSI bus, be aware that the termination for the narrow devices may occur in a different place from the termination for the wide devices.

SCSI IDs

Each device is assigned a unique SCSI-ID (or SCSI address), with the controller usually assigned ID 7.  (The boot disk usually gets ID 0.)  Older, “narrow” SCSI has 8 IDs (0–7), while “wide” SCSI has 16 (0–15).  These IDs may be set automatically or manually.  You may have to manually assign SCSI-IDs to avoid conflicts!

Devices on a SCSI bus have a priority.  The higher IDs have higher priority.  The extra 8 IDs for wide SCSI all have lower priority than 0.  So the priority order of IDs is: 8 (lowest), 9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3, 4, 5, 6, 7 (highest).  Devices that cannot tolerate delays (such as CD or DVD recorders) or slower devices (such as tape drives) need high priority IDs to ensure they get sufficient service.

Devices such as RAID controllers may present a single ID to the bus but may incorporate several disks.  In addition to the ID, SCSI addressing allows a Logical Unit Number (LUN).  Tapes and single disk drives either do not report a LUN or report a LUN of 0.

A SCSI adapter may support more than one SCSI cable or channel, and there may be multiple SCSI adapters in a system. The full ID of a device therefore consists of an adapter number, a channel number, a device ID and a LUN.

Devices such as CD recorders using IDE-SCSI emulation and USB storage devices will also appear to have their own adapter.

SCSI LUNs

A SCSI device can have up to 8 ‘sub-devices’ contained within itself.  (Some SCSI busses support more than this: 32, 128, or 254.)

The most common example is a RAID controller (single SCSI ID) with each disk (or more commonly a logical volume) in the array assigned a different Logical Unit Number (or LUN).  Another example is a CD‑ROM jukebox that handles more than one disk at a time.  Each CD/DVD is addressed as a LUN of that device.  Most devices (such as hard disks) contain only one drive and get assigned LUN zero.  Some of these devices internally ignore LUNs and if queried will report the same device for any LUN.  Beware of auto-detection!

Each unique SCSI-ID/LUN gets a device in /dev.  On Linux, disks will be named /dev/sd?, tape drives will be /dev/st#, and CD and DVD drives will be /dev/sr# (deprecated) or /dev/scd#.

Example: a software RAID of two disks will appear as /dev/md0 (the RAID’s logical device) plus /dev/sda and /dev/sdb (the disks within the array).  Usually you only access the logical device, but you may need to access the member disks to re-mirror or to get status.  Of course, if using hardware RAID you won’t see the sub-devices at all on your system.

All SCSI devices regardless of type get assigned /dev/sg# as well.  You can use the Linux sg_map command to see which devices correspond  to the generic ones.

Note: A RAID controller plugs into the SCSI chain and the disks plug into the RAID controller’s internal bus, which may or may not be SCSI (so as to use cheaper ATA disks).  The RAID controller usually won’t report the LUNs to the SCSI driver in your kernel.

Linux provides a large number of commands to query and control SCSI devices, all starting with “sg”.  Try (as root) “sginfo device”.

SCSI Types

SCSI is faster, more flexible, and more expandable than IDE and is used almost universally on servers (e.g., HP Proliant, Dell PowerEdge), but usually is considered too expensive for workstations.  Note that a disk must have the correct connectors as well as the correct on-board software to talk to an HBA (often called the controller).

SCSI is very fast (>300 MBps), reliable, and allows longer cables than IDE.  There are several incompatible SCSI standards in use today.  The devices/connectors are marked with different symbols (SE, LVD, and DIFF), but it is up to you not to mix incompatible ones (some devices are marked SE/LVD and can use either).  (Mixing types can cause serious damage and risk electrical fire!)  Types:

SCSI-1            8-bit bus, 5 MBps, bulky Centronics connector.

SCSI-2            Same as SCSI-1 but with Micro-D 50 pin connector.

Fast                 8-bit bus, 10 MBps, Micro-D 68 pin connector.

Ultra                8-bit bus, 20 MBps, Micro-D 50 pin connector.

Ultra2              8-bit bus, 40 MBps, Micro-D 50 pin connector.

Ultra3              (a.k.a. Ultra-160) 8-bit bus, up to 160 MBps.

Ultra-320         320 MBps.

Ultra-640         (a.k.a. Fast-320) 640 MBps, but limited cable length and number of devices.

Wide               16-bit bus, 10 MBps (same clock rate as SCSI-1), Micro-D 68 pin connector.

There are also Fast Wide, Ultra Wide (a.k.a. SCSI-3), Wide Ultra2, ...

iSCSI              Ultra-3 SCSI commands carried over TCP/IP and Ethernet instead of a SCSI bus.

Serial SCSI (SAS) and other new developments: not really SCSI busses, but they use the SCSI command set to communicate with devices.

Although many of these busses support high speed by default, the clock rate will automatically adjust down to the speed of the slowest device on the bus.  So use two controllers (one for slower tape/CD-ROM devices, one for fast disks).  You can manually set the speed to adjust for longer-than-normal cable lengths or flaky devices.

Controller Protocol Example: SCSI Commands

An OS must use a controller’s protocol to tell it what to do.  In SCSI terminology, communication takes place between an initiator (typically the host) and a target (typically one of the disks).  The initiator sends a command to the target which then responds.  SCSI commands consist of a one byte operation code followed by five or more bytes containing command-specific parameters.  At the end of the command sequence the target returns a Status Code byte which is usually 00h for success, 02h for an error (called a Check Condition), or 08h for busy.

There are 4 types of SCSI commands:  N (non-data), W (writing data from initiator to target), R (reading data), and B (bidirectional).  There are about 60 different SCSI commands in total.  (ATA commands are simpler.)  Here are a few (a way to try some of them from Linux follows the list):

·         Test unit ready      - “ping” the device to see if it responds

·         Inquiry                  - return basic device information

·         Request sense        - give any error codes from the previous command

·         Send diagnostic and Receive diagnostic results - run a simple self-test, or a specialized test defined in a diagnostic page

·         Start/Stop unit

·         Read capacity       - return storage capacity

·         Read

·         Write
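On Linux, the sg3_utils package (the same family as the sginfo and sg_map commands mentioned elsewhere in these notes) lets you send several of these commands by hand.  A small sketch, assuming your device appears as /dev/sg0:

# sg_turs /dev/sg0         # TEST UNIT READY: “ping” the device
# sg_inq /dev/sg0          # INQUIRY: vendor, model, revision
# sg_readcap /dev/sg0      # READ CAPACITY: number of blocks and block size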

In all cases (both DAS and remote storage) the older parallel interfaces (P-ATA) are going away in favor of serial cables and interfaces.  The reasons are simple: parallel cables are bulky, expensive, slower, and have greater distance limitations than serial cables.

RAID

RAID is a technique of using multiple disks to improve performance, fault tolerance, or both.  Each level defines some combination of striping, replication, and parity.  Striping means writing logically consecutive sectors to different disks, which speeds things up.  Replication (mirroring) makes multiple copies of data on different disks.  Parity data allows the RAID controller to detect when some disk contains corrupted data; with enough parity data the controller can determine which disk is bad, and the system can continue by using the remaining good disk(s).

Level number definitions vary.  [ Show (!)  www.acnc.com/raid.html ]

RAID Level    Description

0             Striping: spread data over disks.  Fast reads and writes, poor reliability.

1             Mirror: duplicate data on 2 or more disks.  (Duplex means a mirror with one controller per disk.)  Requires OS support.  Fast reads, good reliability, very expensive.

0+1 (01)      Mirror of RAID 0 sets.  Reliability and cost of RAID 1 with faster reads.

1+0 (10)      Striping across RAID 1 sets.  Same as 0+1 but more reliable.

2, 3, 4       Striping + parity (slight variations in the type of parity).  Cheaper than mirroring with similar reliability; slower performance due to the dedicated parity disk.

5             Striping + (distributed) parity.  Can survive any one disk failure with degraded performance; otherwise (nearly) the performance of striping with the reliability of mirroring, but cheaper.  Most popular.

6             RAID 5 + extra parity.  Can survive any two disk failures, with degraded performance.  Popular for larger RAID systems used in SANs.

Only RAID levels 0, 1, 1+0, and 5 are standard.  All these extra disks can be expensive!  Low-cost servers may have just two (identical) disks, and can opt for striping (performance) or mirroring (safety).  With four disks you can stripe across mirrored sets, or mirror each stripe set.

Parity data used for standard RAID is known as n+1 parity.  Each bit of each byte of each block of the data disks (there may be more than two) is XOR-ed to produce parity bits (except for RAID 2).  For example, assume RAID 3 (stripe + parity): if D1B1b1 = 1011 0010 and D2B1b1 = 1100 1111, then PB1b1 = 0111 1101.  If a disk fails, its data can be recalculated from the others.  For example, if D2 fails: D2B1b1 = D1B1b1 XOR PB1b1 = 1011 0010 XOR 0111 1101 = 1100 1111.
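You can check the same XOR arithmetic at the shell; a small sketch using bash arithmetic and bc, with the byte values from the example above (bc drops the leading zero when printing in binary):

# D1=$((2#10110010)); D2=$((2#11001111))
# P=$((D1 ^ D2))
# echo "obase=2; $P" | bc            # parity byte:  1111101  (i.e., 0111 1101)
# echo "obase=2; $((D1 ^ P))" | bc   # rebuild the “failed” D2:  11001111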

RAID may be supported in hardware or software.  Don’t use IDE for this.  (Note: the book omits duplexing from its definition of RAID 1.)  Software RAID above level 0 will impact performance (software RAID 5 is roughly 23% slower).

If a disk in a RAID set (using mirroring) fails and you replace it, the rebuild process will put a lot of load on the remaining disk(s), which may then fail due to the excess load!

Using RAID storage systems in a data center (with SAN or NAS), it pays to have huge disks.  The RAID system then defines logical volumes and assigns each a “LUN”.  This is done in a vendor-specific way but efforts are underway to create a usable open standard for this (4/08).  Different servers are assigned different LUNs.  The OS sees each LUN as a disk to be partitioned and formatted.

The Linux software RAID driver is called “md” (multi-device); see mdadm and consider turning off init.d/mdmonitor if not using software RAID.  See also mdmpd.  On Debian see mkraid.  On Solaris, metainit, metattach.  Note most hardware RAID controllers are incompatible with SMART device monitoring.

Replacing a bad disk from a hardware RAID set:

Nearly all servers have status lights on the disks, although they may not always indicate a failure.  Usually you can determine which disk failed by examining the logs (which report the /dev/dsk/cXtXdX notation), then finding the matching entry in the output of the RAID controller software.  On Solaris this usually means looking at the output of “cfgadm -la” and running:
        cfgadm -c unconfigure matching_disk_entry
This shuts the drive down and trips the LED indicator.  Once the drive is replaced with a good one, run:
        cfgadm -c configure matching_disk_entry
(Depending on your system (e.g., Sunfire) you run the Solaris luxadm command instead.)

If running Red Hat on HP Proliants, you merely need to look for the failed drive with the bright red light on, pull it out, and slide a new one in.  This is picked up by the hardware RAID controller and the operating system never even notices!

Software RAID Example: A RAID 1 array formatted with ext3:

# mdadm --create /dev/md0 --level=raid1 --raid-devices=2 /dev/sda1 /dev/sdb1
# mkfs -t ext3 /dev/md0

The mdadm configuration file tells an mdadm process running in monitor mode how to manage hot spares, so that they’re ready to replace any failed disk in any mirror.  See spare groups in the mdadm man page.
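A minimal sketch of such a configuration (the array entries and the spare-group label are made up; see mdadm.conf(5) for the exact syntax on your system):

ARRAY /dev/md0 level=raid1 num-devices=2 UUID=... spare-group=mirrors
ARRAY /dev/md1 level=raid1 num-devices=2 UUID=... spare-group=mirrors
MAILADDR root

With both arrays in the same spare group, a hot spare added to either one can be moved by the monitor to whichever mirror loses a disk.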

The startup and shutdown scripts are easy to create.  The startup script simply assembles each RAID 1 mirrored pair, assembles each RAID 0, and starts an mdadm monitor process; the shutdown script stops the mdadm monitor, stops the RAID 0s and, finally, stops the mirrors.  (A sketch follows.)
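A rough sketch of what such scripts boil down to (the device names are hypothetical and error checking is omitted):

# startup: assemble the mirrors, then the stripe on top, then start a monitor
mdadm --assemble /dev/md0 /dev/sda1 /dev/sdb1
mdadm --assemble /dev/md1 /dev/sdc1 /dev/sdd1
mdadm --assemble /dev/md2 /dev/md0 /dev/md1
mdadm --monitor --scan --daemonise --pid-file=/var/run/mdmonitor.pid --mail root

# shutdown: the reverse order
kill "$(cat /var/run/mdmonitor.pid)"
mdadm --stop /dev/md2
mdadm --stop /dev/md0
mdadm --stop /dev/md1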

Modern disks aren’t as reliable as many believe.  A recent study (“Data corruption in the storage stack: a closer look” by Bairavasundaram et al., ;login:, June 2008, pp. 6-14) analyzed 1.53 million disks over 41 months and found that around 4% of disks develop errors that can be detected by computing checksums (and comparing them with stored checksums).  Of the 400,000 errors found, 8% were discovered during RAID reconstruction, creating the possibility of data loss even for RAID 5.  For this reason data centers use RAID 6 or better in SAN/NAS systems.

LVM and (Software) RAID Recovery  [From LJ article #8948 June 06 pp.52-ff]

Without LVM or RAID, recovery is as simple as attaching the disk to a different host, mounting its partitions, and copying the data.  Using an external USB disk enclosure makes this task easier: you slip the disk into the enclosure and plug it into the recovery host.

But suppose you have this scenario: two identical hard disks set up as software RAID 1 (a disk mirror).  On this system you have three md devices (mirrored partitions): md0 (sda1, sdb1), md1 (sda2, sdb2), and md2 (sda3, sdb3).  You can see this setup using fdisk ‑l /dev/sda; cat /proc/mdstat.  Use md0 for /boot, md1 for swap, and md2 for LVM (VolGroup00).  The LVM partition holds a single LV for / (root).

To recover you must somehow activate the RAID partitions, restore the VG and LV data, and then mount the resulting LV.  The problem is much harder when you use default names for your VGs and LVs, since the recovery host will see two VGs (and LVs) with identical names!  (It helps to have a specially set up recovery host that doesn’t use RAID or LVM, or at least uses different names.)

On the recovery host, use mdadm to scan the disk for the UUIDs of the RAID partitions.  Use that information to edit /etc/mdadm.conf to show those RAID devices, including the (non-existent) second disk.  Then use mdadm to activate the disk as a split mirror (a RAID 1 array without its second disk).  The commands are similar to this (assuming the disk appears as sda on the recovery system):

# mdadm --examine --scan /dev/sda1 /dev/sda2 \
  /dev/sda3 >> /etc/mdadm.conf;  vi /etc/mdadm.conf

This will add the following 6 lines to the file:

ARRAY /dev/md2 level=raid1 num-devices=2 UUID=...
   devices=/dev/sda3

ARRAY /dev/md1 level=raid1 num-devices=2 UUID=...
   devices=/dev/sda2

ARRAY /dev/md0 level=raid1 num-devices=2 UUID=...
   devices=/dev/sda1

Merge each “devices=” line into its ARRAY entry so you end up with 3 entries (an indented line in mdadm.conf continues the previous line), and add “,missing” to the end of each:

DEVICE partitions

ARRAY /dev/md2 level=raid1 num-devices=2 UUID=...
   devices=/dev/sda3,missing

ARRAY /dev/md1 level=raid1 num-devices=2 UUID=...
   devices=/dev/sda2,missing

ARRAY /dev/md0 level=raid1 num-devices=2 UUID=...
   devices=/dev/sda1,missing

Activate the RAID devices (ignoring the missing disk) with: mdadm -A -s.

Next you need to make the system recognize the VGs.  If the recovery system doesn’t have a VG with the same name in use, you can just run vgchange ‑a y VolGroup00.  Otherwise (a name clash) you must hand-edit the LVM metadata to change the VG name.  If you have a backup of the old /etc/lvm/backup/VolGroup00, you are in luck: edit that file and change the “00” to some number unused on the recovery host (say “01”).

Otherwise you can copy the metadata from the disk itself.  The block you copy contains the needed text plus a lot of binary data, which you will strip out.  There may be more than one text block; use the one with the most recent time stamp.  Change “VolGroup00” to “VolGroup01”:

# SECTOR_SIZE=512  # for older disks; >2011 disks use 4096
# dd if=/dev/md2 bs=$SECTOR_SIZE count=255 skip=1 \
  of=/tmp/VolGroup01
# vi /tmp/VolGroup01

The result should be similar to other files in /etc/lvm/backup:

VolGroup01 {
id = "some-long-string"
seqno = 1
status = ["RESIZEABLE", "READ", "WRITE"]
extent_size = 65536
max_lv = 0
max_pv = 0

physical_volumes {

pv0 {
id = "some-long-string"
device = "/dev/md2"

status = ["ALLOCATABLE"]
pe_start = some-number
pe_count = some-number
}

}

Finally we can recover the VG, then the LV, and then mount the result:

# vgcfgrestore -f /tmp/VolGroup01 VolGroup01   # or use the edited backup file
# vgscan
# pvscan
# vgchange -a y VolGroup01
# lvscan
# mount /dev/VolGroup01/LogVol00 /mnt
# ...
# vgcfgbackup

DAS vs. remote storage: With very high speed LANs available, it becomes possible to have the CPUs (blade servers) in one place and the disks in another (rack).  This can be useful in many common situations.  For one thing vibration affects disk speed by as much as 50% so disks often need more expensive racks that dampen noise and vibration.  Special racks can efficiently hold, power, and cool many individual disks, and provide RAID controllers for them.

A more important reason is that separating the storage from the hosts provides you with great flexibility — you can easily share disks among many hosts and allocate storage easily to any host without buying bigger disks or adding more controllers per host.  You can use different types of disks for different applications (e.g., web server must be fast; data warehouse must be reliable but can be slower).

Having all the disks in one place means expensive network backup can be eliminated.  Special backup systems can be expensive (e.g., a tape library with a robotic arm), and an enterprise may only be able to afford one of those, so all disks must be at that location, whereas the servers can be in other rooms, buildings, or even other campuses (up to several kilometers away when using optical cabling).

A number of different remote storage technologies are available, with different trade-offs of initial and operational costs, distances allowed, speed, reliability, and security and management issues.

Network Attached Storage (NAS)

NAS attaches disks to a NAS head server, which is accessed as a file server across the LAN.  Windows systems map a drive letter to a NAS volume (via SMB/CIFS shares); Unix/Linux servers typically use NFS.  This solution is cheap to deploy as it can use your existing network(s).  (Historically this is how Novell servers worked: workstations had limited or no disks and used a file server.)

A problem with this approach is that your network is used for other things and thus there are issues of security, reliability, and bandwidth.  (Qu: what issues?  Ans: if network fails or stalls, applications may crash or workstations/servers may not operate at all.  If you open a large file you can swamp the network.  File permissions mean little if files are sent across a network where anyone can capture them.)  These issues can be addressed by using encryption and bandwidth management, or a private LAN (say in a server farm or cluster).

Companies such as Network Appliance and EMC make large data center NAS systems.  Linksys and others make smaller ones for SOHO use.

Storage Area Networks

A SAN uses a dedicated network, often Fibre Channel, carrying SCSI commands to connect servers to RAID and/or JBOD systems and backup devices.  This solution mimics DAS (from the host’s point of view).  SANs support very high speeds but can be expensive, due to the extra cabling (and possibly bridging) required.  With SANs the storage is made available as volumes (virtual/logical disks) called LUNs, and each is accessed by a single server (at a time).  SANs are often used with clusters, but that requires a cluster file system, which permits multiple hosts to mount the same LUN simultaneously.  SANs also support access control, to determine which servers can access which LUNs.

ATA over Ethernet

AoE is a cheap but effective replacement for Fibre Channel or iSCSI technology, using ATA disks connected to standard Ethernet NICs.

Serial Attached SCSI  (SAS)

SAS is a newer standard with the benefits of Fibre Channel but supporting a wider range of devices (e.g., SATA and SAS drives), and at lower cost.  SAS supports cascading expanders (much like a USB hub can be plugged into another hub).  SAS uses serial communication instead of the parallel method found in traditional SCSI devices, but uses SCSI commands for interacting with SAS devices.

SAS supports up to 16K addressable devices in an SAS domain and point to point data transfer speeds up to 3 Gbit/s, but is expected to reach 10 Gbit/s in the next few years.  The SAS connector is much smaller than traditional parallel SCSI connectors allowing for the popular 2.5 inch drives.

An SAS domain is a collection of one or more drives plus a controller (service delivery subsystem).  Each domain is assigned a unique ID by the IEEE (much like it does for MAC addresses).

Part 4:  Mounting and Managing Storage Volumes

Storage Management

With very high capacity storage in use, an SA needs to consider keeping spare disks available.  An SA will need to determine storage needs (usable space needed now and for the near term) and set an SLA for backups, recovery, and space-increase requests.  The storage devices will need monitoring.  The SA will also need to determine access policies and decide whether quotas should be used.  Since buying new storage is often determined by financial and political factors, you should assign LUNs for one group/department from a single storage unit, rather than grouping customers with similar needs per storage unit.

Device Naming (and Partition Numbering) Schemes:

All hardware has a physical name which shows its location in the device hierarchy.  This is a tree-like view.  Each device is identified by its position on some bus; each bus is connected to another bus, up to the top of the system.  Physical devices are found by probing the buses to see what’s plugged into them.  This information is collected by the kernel and exposed through sysfs (Linux) mounted at /sys, or devfs (Solaris) mounted at /devices.

The physical names are rarely used.  Instead software uses the device logical names defined under /dev.

Internally the kernel uses instance names to refer to devices, such as “eth0” or “tty0”.  Solaris has a file, /etc/path_to_inst to show which instance names map to which physical names.  Linux has no such file; you need to use various ls* commands and look under /sys.  Often the error logs show only instance names.

Linux: /dev/sdXY for SCSI disks.  (Note X has no relationship to the SCSI ID.)

/dev/hdXY:  X=a, b, ... for each IDE controller slot; Y=1, 2, ... for each partition.

Linux now uses the SCSI code to handle all disks, so “hdXY” is no longer used.

/dev/sda4 = Zip drive (Zip disks always use partition 4 for the whole disk).

/dev/fdX = floppy X=0, 1, 2, ...  (Note fd0H1440 for different formats)

/dev/ttySX, X=0, 1, 2, ... serial ports

/dev/lpX = X=0, 1, 2, ... parallel ports

/dev/loopX = X=0, 1, 2, ... loopback device (for mounting image files)

/dev/sgX, X=0, 1, 2, ...   Generic SCSI devices.

(These include “fake” SCSI devices such as USB scanners, cameras, joysticks, mice, and sometimes flash drives.  Some USB devices pretend to be SCSI disks and have sdX names instead.)

Modern systems use device filesystems (such as udev) and these can organize the device names differently, usually in a deep hierarchy (find /dev -type d).  However the traditional names are still provided, as symlinks.

FreeBSD Device Names

BSD names devices after the device driver used: /dev/acdn for IDE CD drives, /dev/adnspl for ATA and SATA (IDE) disks (n = drive number, p = slice number (a.k.a. DOS partition), l = BSD partition letter (a.k.a. Solaris slice)); /dev/cdn for SCSI CDs and /dev/danspl for SCSI disks.  For example, /dev/ad0s1a is the first ATA disk, first DOS slice, BSD partition “a”.  As on Solaris, BSD partition a is for the root slice, b is for the swap slice, and c is for the whole disk.  As with Solaris you have 8 BSD partitions (slices) per DOS slice (Linux partition), letters a to h.

Solaris Device Names  (for Solaris 10 and older, not using zfs)

Actual device names are in /devices (the Solaris equivalent of /sys), but software uses the links from /dev.

Solaris has software RAID using “md” virtual disks.  To see what disk /dev/md/dsk/d33 really is: metastat .../d33.

The naming scheme for Solaris disks is:

IDE:   /dev/[r]dsk/cAdBsC  (A=controller #, B=disk #, C=slice #).

SCSI:  /dev/[r]dsk/cAtBdCsD  (A=controller #, B=SCSI-ID (device #), C=LUN # (usually zero), D=slice #), and /dev/sdnx (n=disk number, x=slice letter).  Solaris historical partitions: a=root, b=swap, ...

Example:  /dev/dsk/c0t0d0s1  or  /dev/sd0b.

On x86 systems Solaris uses “pP” instead of “sD” in the name to refer to partitions rather than slices; “P” is the partition number (zero means whole disk).  p1 to p4 are the 4 primary “fdisk” partitions, P > 4 refers to logical partitions.  Ex: c0t0d0p2.

/dev/lofi/X = X=0, 1, 2, ... loopback device (for mounting image files)

How GRUB Names Devices

The floppy disk is named as (fd0) — first floppy.  GRUB can only reference a single network interface as (nd), and this is almost always the interface the BIOS probed and configured via DHCP.  It is also possible to configure a network interface by booting GRUB from floppy or other local media.

Hard disk names start with hd and a number, where 0 maps to BIOS disk 0x80 (first disk enumerated by the BIOS), 1 maps to 0x81, and so on.

A second number can be used to specify one of the partitions identified by fdisk (starting with 0 for grub 1, or 1 for grub 2).

(hd0,4) specifies the first extended partition of the first hard disk drive.  The partition numbers for extended partitions are counted from 4 regardless of the actual number of primary partitions on your hard disk.

(hd1,a) Means the BSD “a” partition (slice) of the second hard disk.  If you need to specify which pc slice number should be used, use something like this: “(hd0,0,a)”.  If the pc slice number is omitted GRUB searches for the first pc slice which has a BSD “a” partition.
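For context, here is a minimal GRUB legacy menu.lst stanza using this naming (the kernel path and root device are made up):

# boot from the first partition of the first BIOS disk
title  Linux
root   (hd0,0)
kernel /vmlinuz-2.6 ro root=/dev/sda1
initrd /initrd.img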

Mounting Volumes

Discuss mount, umount, and /etc/fstab (BSD too; Solaris is different and uses /etc/vfstab).  Each system has a slightly different format.  The Linux fields (a sample entry follows the list):

1.      device (/dev/hda2) or filesystem label (LABEL=/home).  Using labels works well with removable drives (and is default for Red Hat).  Use the blkid command to view the volume labels and UUIDs.

2.      mount point (/home).

3.      type (can be auto).

4.      list of mount options to know: noauto (only mount manually), ro (read only), noatime (useful for flash drives), unhide (iso9660 only), nodev, nosuid, noexec (enhances security), defaults (same as rw, suid, dev, exec, auto, nouser, async), sync (write changes immediately, useful with removable media), remount (change mount options without actually unmounting and remounting), acl (enable ACLs on the filesystem), usrquota, grpquota (enable quota processing and limit enforcement), ...  These options affect who (other than root) can mount a device, and only apply if used in the fstab file: user (allow anyone to mount, only that user to unmount), users (allow anyone to mount or unmount), owner (allow the owner of the device to mount and unmount), and group (allow members of the device’s group to mount and unmount).  See man mount for a complete list of options, including filesystem-specific options.

5.      1 or 0 (used by dump program to control which FSes to backup, 1=yes).

6.      0, 1, 2, ... (used by fsck: 0=don’t check (e.g., FAT fs), else check from low to high, same num means can be checked in parallel, so common use is: 1=root fs, 2=other FSes).
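A sample Linux fstab entry showing the six fields in order (the label, devices, and mount points are made up):

# device        mount point   type  options                  dump  fsck
LABEL=/home     /home         ext3  defaults,acl,usrquota    1     2
/dev/sdb1       /media/usb    auto  noauto,user,noatime      0     0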

/etc/mtab holds the list of mounted filesystems, similar to /proc/mounts.  (Show; point out the two root filesystems in /proc/mounts due to the initrd.)

NFS and mount.  Describe “fake” mounted filesystems: tmpfs, procfs, sysfs (replacement for proc), devfs (/etc/devfsd.conf), udev (repl. for devfs), ...

Other commands: fuser -cu mntPt (POSIX; limited but useful functionality), lsof +D dir (Gnu), df, du, [rep]quota (discussed in detail later).  Gnu has -h option for nicer output units; BSD setenv BLOCKSIZE=k or m.

Auto‑mounters (Solaris: /home is controlled by auto-mounter, turn this off or you can’t use /home)  (Demo: vi /etc/auto.*; init.d/autofs start; ls /misc/floppy).

Important disk-related commands: [g]parted, [cs]fdisk (see page 17), mkfs (mkfs.{ext2,msdos}, mke2fs, mkdosfs), mkswap, fsck (e2fsck), lost+found, badblocks, tune*fs and *tune (show and edit the superblock; for UFS on Solaris, use tunefs), debugfs.  (Use:
tune2fs -i #[d|w|m], ‑T YYYYMMDD[[HHMM]SS] | now, and also ‑c max_count, ‑C current_count.)

Running e2fsck with the ‑c option invokes the ext[234] tool badblocks.  This reads the disk, and any bad blocks found are added to the filesystem’s bad-block list, so those blocks will never be used for other purposes.  Note this is different from the low-level formatting that maps out bad blocks at the disk drive level.

fsck

All data on disks suffers from corruption over time.  This can happen from power outages/glitches, EMF (including from sun-spots!), software errors in device drivers and elsewhere in the kernel, and aging hardware that simply breaks down over time.  This corruption is sometimes called bit-rot.  To address this, a system can use journaling, redundant data, checksums, and RAID hardware.  In spite of all that, errors can still occur.

Periodically a filesystem can be checked (“fsck” means filesystem check), and in some cases automatically repaired.  If a filesystem isn’t unmounted cleanly, fsck will be run on it automatically at the next boot.

ZFS can scan all checksums for errors (and, where possible, repair them), a process called disk scrubbing (see zpool scrub).  Most systems only check when you run fsck.

To fsck a filesystem it must be unmounted or mounted read-only.  If a filesystem is in use there is no guarantee the on-disk data is valid (due to RAM caching); worse, the automatic sync-ing of the cache will re-corrupt the filesystem.  To check the root FS, boot with it mounted read-only: pass the “ro” (F10: readonlyroot) kernel option via GRUB, or boot from a live CD and then run fsck.  This usually means server down-time during the check.

Fun fsck facts:

·         fsck may require many passes to fix a disk.  You must keep running fsck until it reports no more errors found.

·         Even a journaling filesystem can become corrupted (when the journal file itself has errors).  Usually however, fsck on such systems will do nothing if the journal is okay and will run very quickly.

·         After repairing a disk you must reboot at once, or the parts of the filesystem cached in memory will be written back to the disk and corrupt it again.

It can take from seconds to hours to run fsck!  In some cases you can use an option to have fsck check filesystems in parallel (except for root).  For journaling filesystems (such as ext3), fsck by default applies any outstanding journal transaction in its log file.  If that completes successfully then fsck will report no errors and stop.  (You can add an option to force a full check anyway.)

There is a point in the boot process when the root volume is still mounted read-only, and nothing else is mounted but the system is initialized sufficiently to run fsck.  The trick is to get the system to pause at this point, or to automatically run fsck for you at this point.

Some Unixes (e.g., Solaris) have a shutdown(8) option to force fsck to run at the next boot; Linux doesn’t.  The easy way on most Linux systems is to “touch /forcefsck” and reboot, then delete that file.  Use tune2fs to tweak the superblock settings that force fsck to run on a schedule (an example follows).
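For instance, assuming /dev/sda2 holds an ext3 filesystem (the device name is made up), you could force a check every 30 mounts or once a month, whichever comes first:

# tune2fs -c 30 -i 1m /dev/sda2
# tune2fs -l /dev/sda2 | grep -iE 'mount count|check'    # verify the new settings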

One way is to edit the boot script that runs, to execute /bin/sh at the right point.  Then run fsck and other commands.  When that shell exits the boot process continues.  (Then remove the /bin/sh command from that script.)

The boot script on Fedora (/etc/rc.d/rc.sysinit) checks for various mount options and for the presence of special files, to run fsck at the right time.  This is certainly the easiest way to do this!  Reading the script with “less” and searching for “fsck” reveals that you can add the kernel argument forcefsck in GRUB to force an fsck.

You can un-mount (with “umount”) most filesystems if they’re not busy.  But you may find some filesystems are busy (the one holding /var/log/* for example), and those can’t be un-mounted until you stop (“kill”) the processes using them.  One way to find those processes is:    fuser /var/log/*

As for the root filesystem, you can remount it as read-only with the correct mount options.  The command is:    mount ‑no remount,ro /

Now you can run fsck, then reboot normally.

One point to note is that the output of “mount” won’t show the root filesystem as read-only (“ro”); it will still show it as “rw”!  This is because that status is saved in the file /etc/mtab, which is updated when you run mount.  But once you change the root filesystem to read-only, /etc/mtab can’t be updated, so the old “rw” status is shown.  However, the kernel does know it is mounted read-only; view /proc/mounts to see this.
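You can see the discrepancy directly after the remount (note that on newer distributions /etc/mtab may be a symlink to /proc/mounts, in which case the two agree):

# grep ' / ' /etc/mtab         # may still claim “rw”
# grep ' / ' /proc/mounts      # the kernel’s view shows “ro”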

You can resize a disk partition with parted (the best and safest practice is to boot from a gparted live CD).  You can resize many filesystem types, including ext2/3 (use resize2fs).  To grow a filesystem, first grow the partition; to shrink a partition, first shrink the filesystem in it.  (And always back up first!)  Newer NTFS volumes (since ’03) contain unmovable files, which can prevent resizing; use Diskeeper (diskeeper.com) or PerfectDisk (raxco.com) to defragment first.  Review the FHS (www.pathname.com/fhs).

See also Solaris filesystem(5) & Linux hier(7).

An image file can be attached to a loopback device, which in turn appears like a disk to the system and can be mounted.  On Linux you would use:
    losetup /dev/loop0 myfs.img, then mount /dev/loop0.

On Solaris you would use lofiadm ‑a myfs.img (which attaches myfs.img to /dev/lofi/num).  Then mount this loopback device someplace:

            mount -F hsfs -o ro /dev/lofi/1 /mnt  # Solaris CD-ROM (ISO) image
            mount -F ufs -o ro /dev/lofi/2 /mnt   # Std. Solaris image
            mount -t ext2 /dev/loop0 /mnt          # Linux image
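On Linux, mount can also set up the loop device for you, so the separate losetup step can be skipped (a convenience, not a different mechanism):

# mount -o loop,ro -t ext2 myfs.img /mnt
# umount /mnt        # also detaches the loop device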

/dev/cdrom, /dev/dvd and others are common symlinks to system standard devices.  Using standard names, many commands don’t need to care which device is actually used.  However, the SA must make sure these links exist and are correct.

Make swap space available with mkswap, enable with swapon/swapoff.  (Usually swap entries in fstab are enabled early in the boot process by the system boot scripts.)
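A minimal sequence for adding a swap partition by hand (the partition name is made up):

# mkswap /dev/sdb2         # write the swap signature
# swapon /dev/sdb2         # start using it now
# swapon -s                # list active swap areas

Then add a line such as “/dev/sdb2  swap  swap  defaults  0 0” to fstab so it is enabled at boot.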

SMART (Self-Monitoring, Analysis, and Reporting Technology)

Most ATA and SCSI disks today support SMART.  It can be used to monitor your disk drives and alert the SA to (potential) disk problems.  You must have a disk and OS that support SMART; get the tools from smartmontools.sourceforge.net.  These include smartd (the daemon that monitors disks) and smartctl (to query disk information and set some disk flags).

Control smartd by editing /etc/smartd.conf.  With smartctl the most important option is -H, which gives a quick overall health check.  Use ‑i to see disk information; this shows whether the disk supports SMART and whether it is enabled.

smartctl -{iH} device and smartd can be used to check, control and monitor disks (demo).  SMART is incompatible with most RAID systems.
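A few typical smartctl invocations (the device name is made up), plus the sort of line you might put in /etc/smartd.conf; treat these as a sketch and check the man pages for your version:

# smartctl -i /dev/sda      # is SMART available and enabled?
# smartctl -s on /dev/sda   # turn SMART on if it is off
# smartctl -H /dev/sda      # overall health self-assessment

A matching /etc/smartd.conf entry might be “/dev/sda -a -m root” (monitor the device and mail root on trouble).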

SMART support varies widely between drives.  Often SMART data is kept in a small RAM buffer, and unless you read it frequently you can get a false picture!

Using Removable Storage Media

From the GUI, use the floppy (today: flash drive), CD, DVD, ... icons (after inserting the media) to access the disk.  (The GUI automatically mounts the media.)  Right-click the icon and choose “eject” or just “unmount” before removing the media!

From the command line use the mtools package: DOS commands prepended with “m”, e.g. mdir a: or mcopy a:foo.txt.  (You can configure which drive letter means which device.)

You can also use an automounter:  ls /misc/floppy.

Finally, you can just use mount and umount.  Normally only root and users logged in at the console can access removable media.  Floppy device: /dev/fd0.  A flash (thumb) drive is usually /dev/sda1.  A CD drive is often /dev/cdrom and a DVD drive is usually /dev/dvd.

Rack Mounted Servers

While not part of this course, you should know there are standards for building reliable data/computing centers.  These include EIA/TIA 568B, the Network Equipment Building Standard (NEBS) in the U.S., and the European Telecommunications Standards Institute (ETSI) standards.  Today there are excellent books and on-line resources for this, even certifications and degrees.  (See the Data Center Univ. by APC.)  In this course we will discuss rack-mounting and a few related topics only.

Racks can be free-standing (floor mounts) or wall mounts.  They may be completely enclosed or completely open.  Some racks use only two supports (centered on the left and right sides of the shelves) and others use 4 corner supports.  The primary design criteria are:

1.      Access to equipment — Various enclosures, locks, and latches restrict access.

2.      Airflow — Cabinets are designed to be placed side-to-side, so airflow is vertical, with vents and mounting brackets for fans.  The shelving needs to support this too.

3.      Mounting brackets — Mounting brackets have mounting holes at standard spacing and are a standard distance apart, to allow a variety of equipment to be installed in several configurations.

4.      Grounding — The mounting brackets are conductive, acting as grounding strips for the cabinet and equipment, allowing the whole cabinet to be connected to the building ground.

5.      Cable access — The bottom of the cabinet is usually open, allowing external cables to drop through a raised floor.

6.      Noise reduction — Built into some cabinets.

The most common type of rack-mount cabinet is known as “EIA standard”.  Most rack-mounted computer equipment is standardized to a 19" width, and the internal width of these enclosures is likewise the EIA standard 19".  There are many different server rack sizes (heights and depths) and different types of shelving systems that can be put into these enclosures.  The form of the modern cabinet is standardized by the Electronic Industries Alliance so that any EIA-standard equipment can be placed in any manufacturer’s EIA-standard cabinet.

All computer equipment heights are measured in units called rack units or RUs or simply “U’s”.  One U (“1U”) equals 1.75 inches.  When shopping for rack mount cabinets you will see references to 10u, 12u, 25u, etc. cabinets.

For example, when looking at a 25U rack-mount cabinet, you need to choose shelving that will fit into this cabinet for your specific application.  The shelves for these cabinets are also rated in U’s.  If you choose shelving rated at 5 U’s, then this particular server rack will accommodate 5 such shelves; if you choose shelves rated at 12 U’s each, then this enclosure will only accommodate 2 of them.

Note most racks reserve some space for a power supply, fans, etc.

Airflow is a big issue when it comes to enclosed rack-mount cabinets.  Dissipating heat becomes critical when housing multiple devices inside a single cabinet.  This is certainly something you will need to take into consideration when purchasing your server rack cabinet.  You will also want to consider noise, power supplies, locks, and cable management.  It is often best to use separate racks for hard disk arrays, with extra noise and vibration damping.