NAS, SAN, and AoE

Static web content, sound and video files, graphics, application server code, and other data are often needed by a number of servers.  Using a central disk storage system that all servers can access can make a lot of sense in these cases.  (For one thing, the individual web servers can be smaller and cheaper, often reduced to a blade server.)  Such a centralized storage system is fundamental to cluster and grid computing solutions.  (HCC uses a rack-mounted disk storage device capable of operating dozens of plug-in hard drives.)

Benefits of Centralized Storage

Administering all of the storage resources in high-growth and mission-critical environments can be daunting and very expensive.  Centralizing data storage operations and their management can dramatically reduce the management costs and complexity of these environments while providing significant technical advantages.

The primary benefits are large increases in storage performance, reliability, and scalability.  Performance of centralized storage can be much higher than that of traditional direct attached storage because of the very high data transfer rates of the interfaces used to connect devices in a SAN (such as Fibre Channel).

Further performance gains, such as load balancing and LAN-free backup, arise from this flexible architecture.

Storage reliability and availability can also be greatly enhanced using techniques such as:

·        Redundant I/O paths

·        Server clustering

·        Run-time data replication (local and/or remote)

Adding storage capacity and other storage resources can usually be accomplished easily, without the need to shut down or even quiesce (stop disk activity without a complete shutdown) the server(s) or their client networks.

These features can quickly add up to large cost savings, fewer network outages, painless storage expansion, and reduced network loading.

Storage management software can provide many features, including:

·        Storage Management

·        Storage Monitoring (including "phone home" notification features)

·        Storage Configuration

·        Redundant I/O Path Management

·        LUN Masking and Assignment

·        Serverless Backup (a.k.a. 3rd party copying)

·        Data Replication (both local and remote)

·        Shared Storage (including support for heterogeneous platform environments)

·        RAID configuration and management

·        Volume and file system management (creating software RAID volumes on JBODs (just a bunch of disks), changing RAID levels “on-the-fly”, spanning disk drives or RAID systems to form larger contiguous logical volumes, file system journaling for higher efficiency and performance)

A number of technologies exist to allow outside-the-box disk storage.  However, no matter which storage solution is chosen, some technologies are common to all.  These include SCSI and RAID.

RAID on Linux is managed by md (multi-disk) devices.  Note this applies to kernel-managed software RAID only, and should only be used for RAID 0 (striping).  (And since LVM supports striping directly, you don’t even need this!)  Hardware RAID is seen by the kernel as a single disk.  The mdadm configuration file tells an mdadm process running in monitor mode how to manage the hot spares, so that they’re ready to replace any failed disk in any mirror.  See spare groups in the mdadm man page.

The startup and shutdown scripts for RAID are easy to create.  The startup script simply assembles each mirrored pair (RAID 1), assembles each RAID 0, and starts an mdadm monitor process.  The shutdown script stops the mdadm monitor, stops the RAID 0s and, finally, stops the mirrors.
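
The start and stop order described above can be sketched as a dry run that only prints the commands.  This is a minimal sketch, not a finished init script: the array names, the number of mirrors, and the pid file path are assumptions made for illustration, and it presumes the arrays are already described in the mdadm configuration file.

```shell
#!/bin/sh
# Dry-run sketch: each function prints the commands it would run.
# Array names and the pid file path are hypothetical.
MIRRORS="/dev/md1 /dev/md2 /dev/md3 /dev/md4"   # RAID 1 pairs
STRIPE="/dev/md0"                               # RAID 0 across the mirrors
PIDFILE="/var/run/mdadm-monitor.pid"

raid_start() {
    for m in $MIRRORS; do
        echo "mdadm --assemble $m"      # members are looked up in mdadm.conf
    done
    echo "mdadm --assemble $STRIPE"
    echo "mdadm --monitor --scan --daemonise --pid-file=$PIDFILE"
}

raid_stop() {
    echo "kill \$(cat $PIDFILE)"        # stop the monitor first
    echo "mdadm --stop $STRIPE"         # then the stripe
    for m in $MIRRORS; do
        echo "mdadm --stop $m"          # finally the mirrors
    done
}

raid_start
raid_stop
```

Printing the commands first makes it easy to review the ordering before running anything as root.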

For years SCSI has been providing a high-speed, reliable method for data storage.  Although there have been many different SCSI standards in the past, it is now the storage technology of choice.  RAID is the standard way to provide integrity and availability to data.  Depending on the RAID level, it stores multiple copies of the data across several disks (mirroring) or adds parity information.  (A JBOD, by contrast, adds capacity but provides no redundancy of its own.)

DAS (Direct Attached Storage)

Direct attached storage is the term used to describe a storage device that is directly attached to a host system.  The simplest example of DAS is the internal hard drive of a server computer, though storage devices housed in an external box come under this banner as well.  For one example, my first Macintosh computer used SCSI external busses to connect disks with the host.  This option is still available but misses the benefits of centralized, off-box storage.  Each such disk is still attached to a single server.

SANs (Storage Area Networks)

A storage area network (SAN) is a dedicated network that is separate from other LANs and WANs.  It generally serves to interconnect the storage-related resources that are connected to one or more servers.  It is often characterized by its high interconnection data rates (Gigabits/sec) between member storage peripherals and by its highly scalable architecture.

Fibre Channel serves as the de facto standard being used in most SANs.  Fibre Channel is an industry-standard interconnect and high-performance serial I/O protocol that is media independent and supports simultaneous transfer of many different protocols.  Additionally, SCSI interfaces are frequently used as sub-interfaces between internal components of SAN members, such as between raw storage disks and a RAID controller.

Fibre Channel is a technology used to interconnect storage devices, allowing them to communicate at very high speeds (up to 10Gbps in future implementations).  As well as being faster than more traditional storage technologies like SCSI, Fibre Channel also allows devices to be connected over a much greater distance.  In fact, Fibre Channel can be used over distances of up to six miles.  (The storage devices are connected to a Fibre Channel switch using either multimode or single-mode fiber optic cable.  Multimode is cheaper and is used for short distances (up to 2 kilometers); single mode is used for longer distances.)  This allows devices in a SAN to be placed in the most appropriate physical location.

As with many IT technologies, SANs depend on new and developing standards to ensure seamless interoperability between their member components.  SAN hardware components such as Fibre Channel hubs, switches, host bus adapters, bridges and RAID storage systems rely on many adopted standards for their connectivity.

NAS (Network Attached Storage)

NAS follows a client/server design.  A single hardware device, often called the NAS box or NAS head, acts as the interface between the NAS and network clients.  (Occasionally some sort of gateway server is used in the middle.)

These NAS devices require no monitor, keyboard or mouse.  They generally run an embedded operating system rather than a full-featured NOS.  One or more disk (and possibly tape) drives can be attached to many NAS systems to increase total capacity.  Clients always connect to the NAS head, however, rather than to the individual storage devices.

Clients generally access a NAS over an Ethernet connection.  The NAS appears on the network as a single "node", identified by the IP address of the head device.

A NAS can store any data that appears in the form of files, such as email boxes, Web content, remote system backups, and so on. Overall, the uses of a NAS parallel those of traditional file servers.

The attraction of NAS is that in an environment with many servers running different operating systems, storage of data can be centralized, as can the security, management, and backup of the data.  An increasing number of companies already make use of NAS technology, if only with devices such as CD-ROM towers (stand-alone boxes that contain multiple CD-ROM drives) that are connected directly to the network.

Some of the big advantages of NAS include:

·        Expandability (Need more storage space?  Just add another NAS device)

·        Fault tolerance (In a DAS environment, a server going down means that the data that that server holds is no longer available.  With NAS, the data is still available on the network and accessible by clients.  Fault tolerant measures such as RAID can be used to make sure that the NAS device does not become a point of failure.)

·        Security  (NAS devices either provide file system security capabilities of their own, or allow user databases on a NOS to be used for authentication purposes.)

NAS devices operate independently of network servers and communicate directly with the client.  This means that in the event of a network server failure, clients will still be able to access files stored on a NAS device.

The NAS device maintains its own file system and accommodates industry standard network protocols such as TCP/IP to allow clients to communicate with it over the network.  To facilitate the actual file access, NAS devices will accommodate one or more of the common file access protocols including SMB, CIFS, HTTP, and NFS.

SAN versus NAS

At a high level, Storage Area Networks (SANs) serve the same purpose as a NAS system.  A SAN supplies data storage capability to other network devices.

A SAN is a network of storage devices that are connected to each other and to a server (or cluster of servers), which acts as an access point to the SAN.  (In some configurations a SAN is also connected to the network, which makes it appear more like a NAS.)

SANs use special switches as a mechanism to connect the devices.  These switches look a lot like a normal Ethernet networking switch.

Traditional SANs differed from traditional NAS in several ways.  SANs are small separate networks dedicated to storage devices, while a NAS device is simply a storage subsystem that is connected to the LAN network media like any other server.

SANs often utilized Fibre Channel rather than Ethernet, and a SAN often incorporated multiple network devices or “endpoints” on a self-contained or “private” LAN, whereas NAS relied on individual devices connected directly to the existing public LAN.  The traditional NAS system is a simpler network storage solution, effectively a subset of a full SAN implementation.

The distinction between NAS and SAN has grown fuzzy in recent times, as technology companies continue to invent and market new network storage products.  Today’s SANs sometimes use Ethernet, NAS systems sometimes use Fibre Channel, and NAS systems sometimes incorporate private networks with multiple endpoints.

The primary differentiator between NAS and SAN products now boils down to the choice of network protocol.  SAN systems transfer data over the network in the form of disk blocks (fixed-sized file chunks, using low-level storage protocols like SCSI) whereas NAS systems operate at a higher level with the file itself.

(Today the most common vendor for NAS/SAN products is NetApp.  Another is Exanet.)

AoE (ATA over Ethernet)

AoE is seen as a replacement for traditional SANs using Fibre Channel, and for iSCSI (SCSI over TCP/IP), which is itself a replacement for Fibre Channel.  AoE converts parallel ATA signals to serialized Ethernet format, which enables ATA disk storage to be remotely accessed over an Ethernet LAN in an ATA (IDE) compatible manner.  Think of AoE as replacing the IDE cable in the computer with an Ethernet cable.  A big advantage of AoE is that it makes use of standard, inexpensive ATA (IDE) hard drives commonly used in desktop PCs.

Each AoE packet carries a command for an ATA drive or the response from the ATA drive.  The AoE driver (in the OS kernel) performs AoE and makes the remote disks available as normal block devices, such as /dev/etherd/e0.0, just as the IDE driver makes the local drive at the end of your IDE cable available as /dev/hda.  The driver retransmits packets when necessary, so the AoE devices look like any other disks to the rest of the kernel.

In addition to ATA commands, AoE has a simple facility for identifying available AoE devices using query config packets. That's all there is to it: ATA command packets and query config packets.
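
The device naming mentioned above (/dev/etherd/e0.0) follows a shelf.slot pattern, which a small sketch can make concrete.  The helper names aoe_dev and shelf_devs are hypothetical, and the sketch only builds path names; it does not talk to the aoe driver or send any query config packets.

```shell
#!/bin/sh
# aoe_dev SHELF SLOT: path the Linux aoe driver gives to that target
aoe_dev() {
    printf '/dev/etherd/e%d.%d\n' "$1" "$2"
}

# shelf_devs SHELF COUNT: paths for slots 0..COUNT-1 on one shelf
shelf_devs() {
    slot=0
    while [ "$slot" -lt "$2" ]; do
        aoe_dev "$1" "$slot"
        slot=$(( slot + 1 ))
    done
}

shelf_devs 0 10     # the ten blades of a hypothetical shelf 0
```

On a host with the aoe driver loaded, the same names would be expected to appear under /dev/etherd once the devices have been discovered.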

AoE security is provided by the fact that AoE is not routable.  You easily can determine what computers see what disks by setting up ad hoc Ethernet networks (say using VLANs).  Because AoE devices don’t have IP addresses, it is trivial to create isolated Ethernet networks.

The following example setup is adapted from an excellent Linux Journal article:

You buy some equipment, paying a bit less than $6,500 US for all of the following:

·         One dual-port gigabit Ethernet card to replace the old 100Mb card in your server.

·         One 26-port network switch with two gigabit ports.

·         One Coraid EtherDrive shelf and ten EtherDrive blades.

·         Ten 400GB ATA drives.

The shelf of ten blades takes up three rack units.  Each EtherDrive blade is a small computer that performs the AoE protocol to effectively put one ATA disk on the LAN.  Striping data over the ten blades in the shelf results in about the throughput of a local ATA drive, so the gigabit link helps to use the throughput effectively.  Although you could put the EtherDrive blades on the same network as everyone else, it is better to put the storage on its own network, connected to the server's second network interface, eth1, for security and performance.

Use a RAID 10 configuration (striping over mirrored pairs, a.k.a. RAID 1+0).  Although this configuration doesn’t result in as much usable capacity as a RAID 5 configuration, RAID 10 maximizes reliability, minimizes the CPU cost of performing RAID and has a shorter array re-initialization time if one disk should fail.  The RAID 10 in this case has four stripe elements, each one a mirrored pair of drives.
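
The capacity trade-off can be worked through with a little arithmetic.  This sketch assumes eight of the 400GB drives are in the array (four stripe elements, each a mirrored pair) and ignores metadata overhead; the function names are hypothetical.

```shell
#!/bin/sh
# Usable capacity in GB, given DISKS and SIZE_GB per disk.
raid10_capacity() { echo $(( $1 / 2 * $2 )); }     # half the disks hold mirror copies
raid5_capacity()  { echo $(( ($1 - 1) * $2 )); }   # one disk's worth goes to parity

echo "RAID 10: $(raid10_capacity 8 400) GB"
echo "RAID 5:  $(raid5_capacity 8 400) GB"
```

Under these assumptions RAID 10 gives 1600 GB against 2800 GB for RAID 5, which is the capacity given up in exchange for the reliability and rebuild-time advantages described above.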

Use a JFS filesystem on a logical volume.  The logical volume resides, for now, on only one physical volume.  That physical volume is the RAID 10 block device.

The RAID 10 is created from the EtherDrive storage blades in the Coraid shelf using Linux software RAID.  Later, you can buy another full shelf, create another RAID 10, make it into a physical volume and use the new physical volume to extend the logical volume where the JFS lives.
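
One way the layering just described (mirrored pairs, a stripe over them, a physical volume, a logical volume, and JFS on top) might be built is sketched below as a dry run that prints commands rather than executing them.  The volume group, logical volume and mount point names are taken from the expansion listing later in these notes; the md device numbers, the shelf number 0, and the choice to give the logical volume all free extents are assumptions.

```shell
#!/bin/sh
# Dry-run sketch: prints a plausible build sequence, runs nothing.
build_plan() {
    i=0
    mirrors=""
    while [ "$i" -lt 4 ]; do
        a=$(( i * 2 )); b=$(( i * 2 + 1 ))
        md=/dev/md$(( i + 1 ))
        # one RAID 1 per pair of blades on (hypothetical) shelf 0
        echo "mdadm --create $md --level=1 --raid-devices=2 /dev/etherd/e0.$a /dev/etherd/e0.$b"
        mirrors="$mirrors $md"
        i=$(( i + 1 ))
    done
    # RAID 0 over the four mirrors gives the RAID 10
    echo "mdadm --create /dev/md0 --level=0 --raid-devices=4$mirrors"
    echo "pvcreate /dev/md0"
    echo "vgcreate ben /dev/md0"
    echo "lvcreate --extents 100%FREE --name franklin ben"
    echo "mkfs.jfs /dev/ben/franklin"
    echo "mount /dev/ben/franklin /bf"
}

build_plan
```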

Minor Device Numbers

A program that wants to use a device typically does so by opening a special file corresponding to that device.  A familiar example is the /dev/hda file.  An ls -l command shows two numbers for /dev/hda, 3 and 0.  The major number is 3 and the minor number is 0.  The /dev/hda1 file has a minor number of 1, and the major number is still 3.

Until kernel 2.6, the minor number was eight bits in size, limiting the possible minor numbers to 0 through 255.  Nobody had that many devices, so the limitation didn’t matter.  Now that disks have been decoupled from servers, it does matter, and kernel 2.6 uses 20 bits for the minor device number.

Having 1,048,576 values for the minor number is a big help to systems that use many devices, but not all software has caught up.  If glibc or a specific application still thinks of minor numbers as eight bits in size, you are going to have trouble using minor device numbers over 255.

To help during this transitional period, the AoE driver may be compiled without support for partitions.  That way, instead of there being 16 minor numbers per disk, there's only one per disk.  So even on systems that haven’t caught up to the large minor device numbers of 2.6, you still can use up to 256 AoE disks.
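
The numbers in this section follow directly from the width of the minor number field.  The short sketch below reproduces them; the function names are hypothetical and the only inputs are the bit widths and minors-per-disk figures quoted above.

```shell
#!/bin/sh
# minor_values BITS: how many distinct minor numbers fit in BITS bits
minor_values() { echo $(( 1 << $1 )); }

# max_disks BITS MINORS_PER_DISK: how many disks those minors can cover
max_disks() { echo $(( (1 << $1) / $2 )); }

echo "8-bit minors:  $(minor_values 8)"
echo "20-bit minors: $(minor_values 20)"
echo "8-bit, 16 minors per disk: $(max_disks 8 16) disks"
echo "8-bit, 1 minor per disk:   $(max_disks 8 1) disks"
```

With partition support (16 minors per disk) an eight-bit minor number covers only 16 disks, which is why compiling the driver without partitions, at one minor per disk, raises the limit to 256.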

LVM now needs a couple of tweaks made to its configuration.  For one, it needs a line with types = [ "aoe", 16 ] so that LVM recognizes AoE disks.  Next, it needs md_component_detection = 1, so the disks inside RAID 10 are ignored when the whole RAID 10 becomes a physical volume.
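
The two settings named above belong in the devices section of the LVM configuration file.  The sketch below just prints such a fragment for review; the function name lvm_fragment is hypothetical, and a real lvm.conf will already contain a devices section into which these lines would be merged by hand.

```shell
#!/bin/sh
# Print a sample devices section; nothing is written to /etc.
lvm_fragment() {
    cat <<'EOF'
devices {
    # accept AoE block devices, with up to 16 partitions each
    types = [ "aoe", 16 ]
    # ignore disks that are components of an md array
    md_component_detection = 1
}
EOF
}

lvm_fragment
```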

Expanding Storage

To expand the filesystem without unmounting it, set up a second RAID 10 array, add it to the volume group and then grow the logical volume and the filesystem:

# after setting up a RAID 10 for the second shelf
# as /dev/md5, add it to the volume group
vgextend ben /dev/md5
vgdisplay ben | grep -i 'free.*PE'
# grow the logical volume and then the jfs
lvextend --extents +88349 /dev/ben/franklin
mount -o remount,resize /bf

Throughput Estimation

In general, you can estimate the throughput of a collection of EtherDrive blades by considering how many stripe elements there are.  For RAID 10, there are half as many stripe elements as disks, because each disk is mirrored on another disk.  For RAID 5, there effectively is one disk dedicated to parity data, leaving the rest of the disks as stripe elements.

The expected read throughput is the number of stripe elements times 6MB/s.  That means if you bought two shelves initially and constructed an 18-blade RAID 10 instead of an 8-blade RAID 10, you would expect to get a little more than twice the throughput.
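
The estimate above is easy to check with shell arithmetic.  This sketch uses only the rule of thumb given in the text (stripe elements times 6MB/s); the function names are hypothetical, and real throughput will depend on the network and the drives.

```shell
#!/bin/sh
raid10_stripes() { echo $(( $1 / 2 )); }   # each stripe element is a mirrored pair
raid5_stripes()  { echo $(( $1 - 1 )); }   # one disk's worth of parity
read_mbps()      { echo $(( $1 * 6 )); }   # 6MB/s per stripe element

small=$(read_mbps "$(raid10_stripes 8)")
large=$(read_mbps "$(raid10_stripes 18)")
echo "8-blade RAID 10:  $small MB/s"
echo "18-blade RAID 10: $large MB/s"
```

The 8-blade array comes out at 24 MB/s and the 18-blade array at 54 MB/s, a ratio of 2.25, which matches "a little more than twice the throughput".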

Sharing Disk Storage Between Hosts

What would happen if another host had access to the storage network?  Could that second host mount the JFS filesystem and access the same data?  The short answer is, "Not safely".  JFS, like ext3 and most filesystems, is designed to be used by a single host.  For these single-host filesystems, filesystem corruption can result when multiple hosts mount the same block storage device.  The reason is the buffer cache, which is unified with the page cache in 2.6 kernels.

Linux aggressively caches filesystem data in RAM whenever possible in order to avoid using the slower block storage, gaining a significant performance boost.  You've seen this caching in action if you've ever run a find command twice on the same directory.

Some filesystems are designed to be used by multiple hosts.  Cluster filesystems, as they are called, have some way of making sure that the caches on all of the hosts stay in sync with the underlying filesystem.  GFS is a great open-source example.  GFS uses cluster management software to keep track of who is in the group of hosts accessing the filesystem.  It uses locking to make sure that the different hosts cooperate when accessing the filesystem.

By using a cluster filesystem such as GFS, it is possible for multiple hosts on the Ethernet network to access the same block storage using ATA over Ethernet.  There's no need for anything like an NFS server, because each host accesses the storage directly, distributing the I/O nicely.  But there's a snag.  Any time you're using a lot of disks, you're increasing the chances that one of the disks will fail.  Usually you use RAID to take care of this issue by introducing some redundancy. Unfortunately, Linux software RAID is not cluster-aware.  That means each host on the network cannot do RAID 10 using mdadm and have things simply work out.

Cluster software for Linux is developing at a furious pace.  I believe we'll see good cluster-aware RAID within a year or two.  Until then, there are a few options for clusters using AoE for shared block storage.  The basic idea is to centralize the RAID functionality.  You could buy a Coraid RAIDblade or two and have the cluster nodes access the storage exported by them.  The RAIDblades can manage all the EtherDrive blades behind them.  Or, if you're feeling adventurous, you also could do it yourself by using a Linux host that does software RAID and exports the resulting disk-failure-proofed block storage itself, by way of ATA over Ethernet.  Check out the vblade program for an example of software that exports any storage using ATA over Ethernet.
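
As a rough illustration of the do-it-yourself option, the sketch below prints the kind of vblade command a RAID host might run to export its array.  The shelf and slot numbers, the interface, and the device are assumptions chosen for the example, and the helper name vblade_cmd is hypothetical; check the vblade documentation before relying on the exact argument order.

```shell
#!/bin/sh
# vblade_cmd SHELF SLOT IFACE DEVICE: print the export command (dry run)
vblade_cmd() {
    echo "vblade $1 $2 $3 $4"
}

# Export a software RAID device as shelf 1, slot 0 on the storage network.
vblade_cmd 1 0 eth1 /dev/md0
```

Cluster nodes using the aoe driver would then be expected to see this export as /dev/etherd/e1.0.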


Because ATA over Ethernet puts inexpensive hard drives on the Ethernet network, some sysadmins might be interested in using AoE in a backup plan.  Often, backup strategies involve tier-two storage: storage that is not quite as fast as on-line storage but also is not as inaccessible as tape.  ATA over Ethernet makes it easy to use cheap ATA drives as tier-two storage.

But with hard disks being so inexpensive and seeing that we have stable software RAID, why not use the hard disks as a backup medium?  Unlike tape, this backup medium supports instant access to any archived file.  To perform the backup safely on a live system, use LVM’s snapshot abilities.
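
A snapshot-based backup of a live logical volume might proceed along the lines below.  This is a dry-run sketch that only prints the commands; the snapshot name, its 2G size, the mount point and the archive path are all assumptions, and the volume names come from the expansion listing earlier in these notes.

```shell
#!/bin/sh
VG=ben
LV=franklin
SNAP=franklin_snap      # hypothetical snapshot name
MNT=/mnt/snap           # hypothetical mount point

snapshot_backup_plan() {
    echo "lvcreate --snapshot --size 2G --name $SNAP /dev/$VG/$LV"
    echo "mount -o ro /dev/$VG/$SNAP $MNT"
    echo "tar -C $MNT -czf /backup/$LV.tar.gz ."   # copy from the frozen view
    echo "umount $MNT"
    echo "lvremove -f /dev/$VG/$SNAP"
}

snapshot_backup_plan
```

The snapshot gives the backup a consistent point-in-time view while the original volume stays mounted and in use.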

Several new backup software products are taking advantage of filesystem features for backups.  By using hard links, they can perform multiple full backups with the efficiency of incremental backups.  BackupPC and rsync-based backup tools are two examples.
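
The hard-link idea can be demonstrated on a scratch directory.  This is a toy sketch of the technique, not how any particular product implements it: unchanged files in the second backup are hard links to the first, so both backups look complete while only changed files take new space.  The directory and file names are hypothetical.  (rsync offers a --link-dest option for the same purpose.)

```shell
#!/bin/sh
work=$(mktemp -d)
mkdir "$work/src" "$work/backup.0" "$work/backup.1"
echo "unchanged" > "$work/src/a.txt"
echo "version 1" > "$work/src/b.txt"

# First full backup: plain copies.
cp "$work/src/a.txt" "$work/src/b.txt" "$work/backup.0/"

# Only b.txt changes before the next backup.
echo "version 2" > "$work/src/b.txt"

# Second "full" backup: link what is unchanged, copy what is new.
for f in "$work"/src/*; do
    name=$(basename "$f")
    if cmp -s "$f" "$work/backup.0/$name"; then
        ln "$work/backup.0/$name" "$work/backup.1/$name"
    else
        cp "$f" "$work/backup.1/$name"
    fi
done

# inode FILE: print the inode number, to show which files share storage
inode() { ls -i "$1" | awk '{ print $1 }'; }
echo "a.txt inodes: $(inode "$work/backup.0/a.txt") $(inode "$work/backup.1/a.txt")"
echo "b.txt inodes: $(inode "$work/backup.0/b.txt") $(inode "$work/backup.1/b.txt")"
```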

Links:  (Used extensively for the AoE part of this page!)