NAS, SAN, and AoE
Static web content, audio and video media, application server code, and other data are often needed by a number of servers. Using a central disk storage system that all servers can access can make a lot of sense in these cases. (For one thing, the individual web servers can be smaller and cheaper, often reduced to a blade server.) Such a centralized storage system is fundamental to cluster and grid computing solutions. (HCC uses a rack-mounted disk storage device capable of operating dozens of plug-in hard drives.)
Benefits of Centralized Storage
Administrating all of the storage resources in high-growth and mission-critical environments can be daunting and very expensive. Centralizing data storage operations and their management has many benefits. Centralizing storage can dramatically reduce the management costs and complexity of these environments while providing significant technical advantages.
Large increases in storage performance, state-of-the-art reliability, and scalability are the primary benefits. Performance of centralized storage can be much higher than that of traditional direct attached storage because of the very high data transfer rates of the electrical interfaces used to connect devices in a SAN (such as Fibre Channel).
Further performance gains arise from the flexible architecture, which enables features such as load balancing and LAN-free backup.
Even storage reliability and availability can be greatly enhanced using techniques such as:
· Redundant I/O paths
· Server clustering
· Run-time data replication (local and/or remote)
Adding storage capacity and other storage resources can be accomplished easily, usually without the need to shut down or even quiesce (stop disk activity on, without a complete shutdown) the server(s) or their client networks.
These features can quickly add up to large cost savings, fewer network outages, painless storage expansion, and reduced network loading.
Storage management software can provide many features, including:
· Storage Management
· Storage Monitoring (including "phone home" notification features)
· Storage Configuration
· Redundant I/O Path Management
· LUN Masking and Assignment
· Serverless Backup (a.k.a. 3rd party copying)
· Data Replication (both local and remote)
· Shared Storage (including support for heterogeneous platform environments)
· RAID configuration and management
· Volume and file system management (creating software RAID volumes on JBODs (just a bunch of disks), changing RAID levels “on-the-fly”, spanning disk drives or RAID systems to form larger contiguous logical volumes, file system journaling for higher efficiency and performance)
A number of technologies exist to provide outside-the-box disk storage. No matter which storage solution is chosen, however, some technologies are common to all. These include SCSI and RAID.
RAID on Linux is managed by md (multi-disk) devices. Note this applies to kernel-managed software RAID only, and should only be used for RAID 0 (striping). (And since LVM supports striping directly, you don't even need this!) Hardware RAID is seen by the kernel as a single disk. The mdadm configuration file tells an mdadm process running in monitor mode how to manage the hot spares, so that they're ready to replace any failed disk in any mirror. See spare groups in the mdadm man page.
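A minimal sketch of such a configuration might look like the following; the device names, array names, and spare-group label are assumptions, not taken from any real system:

```
# /etc/mdadm.conf (sketch; devices and array names are assumptions)
DEVICE /dev/etherd/e0.*
ARRAY /dev/md1 devices=/dev/etherd/e0.0,/dev/etherd/e0.1 spare-group=shelf0
ARRAY /dev/md2 devices=/dev/etherd/e0.2,/dev/etherd/e0.3 spare-group=shelf0
# A hot spare attached to md1 may migrate to md2 (and vice versa)
# because the two arrays share the same spare-group.
```

The monitor is then started with mdadm --monitor --scan --daemonise, which watches every array in the file and moves spares between arrays in the same spare-group as disks fail.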
The startup and shutdown scripts for RAID are easy to create. The startup script simply assembles each mirrored pair (RAID 1), assembles each RAID 0, and starts an mdadm monitor process. The shutdown script stops the mdadm monitor, stops the RAID 0s and, finally, stops the mirrors.
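A minimal sketch of such a pair of scripts, combined into one file with start/stop functions. The md and AoE device names are assumptions, and the DRY_RUN=1 switch (which prints the commands instead of executing them) is added purely for safe experimentation:

```shell
#!/bin/sh
# Sketch of RAID startup/shutdown scripts (device names are assumptions).
# With DRY_RUN=1 the commands are printed instead of executed.
run() { if [ "$DRY_RUN" = "1" ]; then echo "$@"; else "$@"; fi; }

start_raid() {
    run mdadm --assemble /dev/md1 /dev/etherd/e0.0 /dev/etherd/e0.1   # first mirror
    run mdadm --assemble /dev/md2 /dev/etherd/e0.2 /dev/etherd/e0.3   # second mirror
    run mdadm --assemble /dev/md0 /dev/md1 /dev/md2                   # RAID 0 over the mirrors
    run mdadm --monitor --scan --daemonise                            # watch for failed disks
}

stop_raid() {
    run pkill -f "mdadm --monitor"   # stop the monitor first
    run mdadm --stop /dev/md0        # then the RAID 0
    run mdadm --stop /dev/md1        # finally the mirrors
    run mdadm --stop /dev/md2
}
```

Hooking start_raid and stop_raid into an init script's start and stop cases completes the picture.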
For years SCSI has been providing a high-speed, reliable method for data storage. Although there have been many different SCSI standards in the past, it is now the storage technology of choice. RAID is a standard way to provide integrity and availability for data: depending on the RAID level, multiple copies of the data are stored across several disks and/or parity information is kept. (JBOD, by contrast, simply presents a collection of disks without redundancy.)
DAS (Direct Attached Storage)
Direct attached storage is the term used to describe a storage device that is directly attached to a host system. The simplest example of DAS is the internal hard drive of a server computer, though storage devices housed in an external box come under this banner as well. For example, my first Macintosh computer used an external SCSI bus to connect disks to the host. This option is still available but misses the benefits of centralized, off-box storage: each such disk is still attached to a single server.
SANs (Storage Area Networks)
A storage area network (SAN) is a dedicated network that is separate from other LANs and WANs. It generally serves to interconnect the storage-related resources that are connected to one or more servers. It is often characterized by its high interconnection data rates (Gigabits/sec) between member storage peripherals and by its highly scalable architecture.
Fibre Channel is the de facto standard used in most SANs. Fibre Channel is an industry-standard, high-performance serial I/O interconnect that is media independent and supports simultaneous transfer of many different protocols. Additionally, SCSI interfaces are frequently used as sub-interfaces between internal components of SAN members, such as between raw storage disks and a RAID controller.
Fibre Channel is a technology used to interconnect storage devices, allowing them to communicate at very high speeds (up to 10Gbps in future implementations). As well as being faster than more traditional storage technologies like SCSI, Fibre Channel also allows devices to be connected over a much greater distance. In fact, Fibre Channel can be used up to six miles. (The storage devices are connected to the Fibre Channel switch using either multimode or single-mode fiber optic cable. Multimode is used for short distances (up to 2 kilometers) and is cheaper; single mode is used for longer distances.) This allows devices in a SAN to be placed in the most appropriate physical location.
As with many IT technologies, SANs depend on new and developing standards to ensure seamless interoperability between their member components. SAN hardware components such as Fibre Channel hubs, switches, host bus adapters, bridges and RAID storage systems rely on many adopted standards for their connectivity.
NAS (Network Attached Storage)
NAS follows a client/server design. A single hardware device, often called the NAS box or NAS head, acts as the interface between the NAS and network clients. (Occasionally some sort of gateway server is used in the middle.)
These NAS devices require no monitor, keyboard or mouse. They generally run an embedded operating system rather than a full-featured NOS. One or more disk (and possibly tape) drives can be attached to many NAS systems to increase total capacity. Clients always connect to the NAS head, however, rather than to the individual storage devices.
Clients generally access a NAS over an Ethernet connection. The NAS appears on the network as a single "node" that is the IP address of the head device.
A NAS can store any data that appears in the form of files, such as email boxes, Web content, remote system backups, and so on. Overall, the uses of a NAS parallel those of traditional file servers.
The attraction of NAS is that in an environment with many servers running different operating systems, storage of data can be centralized, as can the security, management, and backup of the data. An increasing number of companies already make use of NAS technology, if only with devices such as CD-ROM towers (stand-alone boxes that contain multiple CD-ROM drives) that are connected directly to the network.
Some of the big advantages of NAS include:
· Expandability (Need more storage space? Just add another NAS device)
· Fault tolerance (In a DAS environment, a server going down means that the data that that server holds is no longer available. With NAS, the data is still available on the network and accessible by clients. Fault tolerant measures such as RAID can be used to make sure that the NAS device does not become a point of failure.)
· Security (NAS devices either provide file system security capabilities of their own, or allow user databases on a NOS to be used for authentication purposes.)
NAS devices operate independently of network servers and communicate directly with the client. This means that in the event of a network server failure, clients will still be able to access files stored on a NAS device.
The NAS device maintains its own file system and accommodates industry standard network protocols such as TCP/IP to allow clients to communicate with it over the network. To facilitate the actual file access, NAS devices will accommodate one or more of the common file access protocols including SMB, CIFS, HTTP, and NFS.
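As an illustration of the client side, the fstab entries below sketch how a Linux host might mount such shares at boot over NFS and SMB/CIFS. The NAS address, export path, share name, and mount points are all made-up examples:

```
# /etc/fstab entries for a hypothetical NAS at 192.168.1.50
# (address, export, share name, and mount points are assumptions)
192.168.1.50:/vol/web   /mnt/web      nfs   defaults        0 0
//192.168.1.50/web      /mnt/web-smb  cifs  username=guest  0 0
```

Both entries reach the same NAS head; only the file access protocol differs.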
SAN versus NAS
At a high level, Storage Area Networks (SANs) serve the same purpose as a NAS system. A SAN supplies data storage capability to other network devices.
A SAN is a network of storage devices that are connected to each other and to a server (or cluster of servers), which acts as an access point to the SAN. (In some configurations a SAN is also connected to the network, which makes it appear more like a NAS.)
SANs use special switches as a mechanism to connect the devices. These switches look a lot like a normal Ethernet networking switch.
Traditional SANs differed from traditional NAS in several ways. SANs are small, separate networks dedicated to storage devices, while a NAS device is simply a storage subsystem that is connected to the LAN network media like any other server.
SANs often utilized Fibre Channel rather than Ethernet, and a SAN often incorporated multiple network devices or “endpoints” on a self-contained or “private” LAN, whereas NAS relied on individual devices connected directly to the existing public LAN. The traditional NAS system is a simpler network storage solution, effectively a subset of a full SAN implementation.
The distinction between NAS and SAN has grown fuzzy in recent times, as technology companies continue to invent and market new network storage products. Today’s SANs sometimes use Ethernet, NAS systems sometimes use Fibre Channel, and NAS systems sometimes incorporate private networks with multiple endpoints.
The primary differentiator between NAS and SAN products now boils down to the choice of network protocol. SAN systems transfer data over the network in the form of disk blocks (fixed-size chunks of data, using low-level storage protocols like SCSI), whereas NAS systems operate at a higher level, with the file itself.
Today the most common vendor for NAS/SAN products is NetApp. Another is Exanet.
AoE (ATA over Ethernet)
AoE is seen as a replacement for traditional SANs using Fibre Channel, and for iSCSI (SCSI over TCP/IP), which is itself a replacement for Fibre Channel. AoE encapsulates ATA commands and data in Ethernet frames, which enables ATA disk storage to be remotely accessed over an Ethernet LAN in an ATA (IDE) compatible manner. Think of AoE as replacing the IDE cable in the computer with an Ethernet cable. A big advantage of AoE is that it makes use of standard, inexpensive ATA (IDE) hard drives commonly used in desktop PCs.
Each AoE packet carries a command for an ATA drive or the response from the ATA drive. The AoE driver (in the OS kernel) performs AoE and makes the remote disks available as normal block devices, such as /dev/etherd/e0.0, just as the IDE driver makes the local drive at the end of your IDE cable available as /dev/hda. The driver retransmits packets when necessary, so the AoE devices look like any other disks to the rest of the kernel.
In addition to ATA commands, AoE has a simple facility for identifying available AoE devices using query config packets. That's all there is to it: ATA command packets and query config packets.
AoE security is provided by the fact that AoE is not routable. You easily can determine what computers see what disks by setting up ad hoc Ethernet networks (say using VLANs). Because AoE devices don’t have IP addresses, it is trivial to create isolated Ethernet networks.
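On the client side, discovery is driven by those query config packets. A minimal sketch, assuming the aoe kernel module and the aoetools userland package are installed and that the storage network is reachable:

```shell
modprobe aoe        # load the AoE driver
aoe-discover        # broadcast query config packets to find AoE targets
aoe-stat            # list the devices that answered, e.g. e0.0 ... e0.9
ls /dev/etherd/     # the same devices, as block special files
```

Only hosts on the same Ethernet segment (or VLAN) see anything at all, which is the security model described above.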
The example setup that follows is adapted from an excellent Linux Journal article: http://www.linuxjournal.com/article/8149
You buy some equipment, paying a bit less than $6,500 US for all of the following:
· One dual-port gigabit Ethernet card to replace the old 100Mb card in the server.
· One 26-port network switch with two gigabit ports.
· One Coraid EtherDrive shelf and ten EtherDrive blades.
· Ten 400GB ATA drives.
The shelf of ten blades takes up three rack units. Each EtherDrive blade is a small computer that performs the AoE protocol to effectively put one ATA disk on the LAN. Striping data over the ten blades in the shelf results in about the throughput of a local ATA drive, so the gigabit link helps to use the throughput effectively. Although you could put the EtherDrive blades on the same network as everyone else, you decide to put the storage on its own network, connected to the server's second network interface, eth1, for security and performance.
Use a RAID 10 (striping over mirrored pairs) configuration, a.k.a. RAID 1+0. Although this configuration doesn't result in as much usable capacity as a RAID 5 configuration, RAID 10 maximizes reliability, minimizes the CPU cost of performing RAID, and has a shorter array re-initialization time if one disk should fail. The RAID 10 in this case has four stripe elements, each one part of a mirrored pair of drives.
Use a JFS filesystem on a logical volume. The logical volume resides, for now, on only one physical volume. That physical volume is the RAID 10 block device.
The RAID 10 is created from the EtherDrive storage blades in the Coraid shelf using Linux software RAID. Later, you can buy another full shelf, create another RAID 10, make it into a physical volume, and use the new physical volume to extend the logical volume where the JFS lives.
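A sketch of how that build might look with Linux software RAID and LVM. The blade device names follow the /dev/etherd/e0.N convention from above, and the volume group ("ben") and logical volume ("franklin") names match the expansion listing later in this section; the exact layout (four mirrored pairs plus two hot spares) is an assumption:

```shell
# Four mirrored pairs (RAID 1) from eight of the ten blades;
# the remaining two blades can be kept as hot spares.
mdadm -C /dev/md1 -l 1 -n 2 /dev/etherd/e0.0 /dev/etherd/e0.1
mdadm -C /dev/md2 -l 1 -n 2 /dev/etherd/e0.2 /dev/etherd/e0.3
mdadm -C /dev/md3 -l 1 -n 2 /dev/etherd/e0.4 /dev/etherd/e0.5
mdadm -C /dev/md4 -l 1 -n 2 /dev/etherd/e0.6 /dev/etherd/e0.7

# Stripe (RAID 0) over the mirrors to form the RAID 10
mdadm -C /dev/md0 -l 0 -n 4 /dev/md1 /dev/md2 /dev/md3 /dev/md4

# LVM on top: one physical volume, volume group "ben",
# logical volume "franklin", and a JFS filesystem on it
pvcreate /dev/md0
vgcreate ben /dev/md0
lvcreate -l 100%FREE -n franklin ben
mkfs -t jfs /dev/ben/franklin
mkdir -p /bf && mount /dev/ben/franklin /bf
```

Because the logical volume sits on LVM rather than directly on md0, it can later be extended onto a second shelf's RAID 10 without touching the filesystem layout.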
A program that wants to use a device typically does so by opening a special file corresponding to that device. A familiar example is the /dev/hda file. An ls -l command shows two numbers for /dev/hda, 3 and 0. The major number is 3 and the minor number is 0. The /dev/hda1 file has a minor number of 1, and the major number is still 3.
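You can see the same pair of numbers on any device node. /dev/null is a convenient example because it exists on every Linux system, always with major number 1 and minor number 3:

```shell
# The major and minor numbers appear where a file size normally would
ls -l /dev/null
# stat can print them directly (%t and %T give them in hexadecimal)
stat -c 'major=%t minor=%T' /dev/null
```

The major number selects the driver; the minor number tells that driver which device (or partition) is meant.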
Until kernel 2.6, the minor number was eight bits in size, limiting the possible minor numbers to 0 through 255. Nobody had that many devices, so the limitation didn't matter. Now that disks have been decoupled from servers, it does matter, and kernel 2.6 uses 20 bits for the minor device number.
Having 1,048,576 values for the minor number is a big help to systems that use many devices, but not all software has caught up. If glibc or a specific application still thinks of minor numbers as eight bits in size, you are going to have trouble using minor device numbers over 255.
To help during this transitional period, the AoE driver may be compiled without support for partitions. That way, instead of there being 16 minor numbers per disk, there's only one per disk. So even on systems that haven't caught up to the large minor device numbers of 2.6, you still can use up to 256 AoE disks.
LVM now needs a couple of tweaks made to its configuration. For one, it needs a line with types = [ "aoe", 16 ] so that LVM recognizes AoE disks. Next, it needs md_component_detection = 1, so the disks inside the RAID 10 are ignored when the whole RAID 10 becomes a physical volume.
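The corresponding fragment of /etc/lvm/lvm.conf would look something like the following (only the relevant lines of the devices section are shown):

```
devices {
    # recognize AoE block devices, with up to 16 partitions each
    types = [ "aoe", 16 ]
    # ignore disks that are components of an md (software RAID) array
    md_component_detection = 1
}
```

Without the second setting, LVM could mistake an individual mirror member for a physical volume instead of using the assembled RAID 10.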
Expanding Storage
To expand the filesystem without unmounting it, set up a second RAID 10 array, add it to the volume group, and then increase the filesystem.
[Listing 3]
# after setting up a RAID 10 for the second shelf
# as /dev/md5, add it to the volume group
vgextend ben /dev/md5
vgdisplay ben | grep -i 'free.*PE'
# grow the logical volume and then the jfs
lvextend --extents +88349 /dev/ben/franklin
mount -o remount,resize /bf
Throughput Estimation
In general, you can estimate the throughput of a collection of EtherDrive blades by considering how many stripe elements there are. For RAID 10, there are half as many stripe elements as disks, because each disk is mirrored on another disk. For RAID 5, there effectively is one disk dedicated to parity data, leaving the rest of the disks as stripe elements.

The expected read throughput is the number of stripe elements times 6MB/s. That means if you bought two shelves initially and constructed an 18-blade RAID 10 instead of an 8-blade RAID 10, you would expect to get a little more than twice the throughput.
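The arithmetic is simple enough to sketch in shell; the 6MB/s per stripe element figure is the rule of thumb stated above:

```shell
blades=18                             # two shelves' worth of RAID 10 blades
stripe_elements=$((blades / 2))       # RAID 10: half the blades hold mirror copies
echo "$((stripe_elements * 6)) MB/s"  # expected read throughput
```

For the 8-blade array the same arithmetic gives 4 stripe elements and 24 MB/s, so the 18-blade array's 54 MB/s is indeed "a little more than twice" as fast.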
What would happen if another host had access to the storage network? Could that second host mount the JFS filesystem and access the same data? The short answer is, "Not safely". JFS, like ext3 and most filesystems, is designed to be used by a single host. For these single-host filesystems, filesystem corruption can result when multiple hosts mount the same block storage device. The reason is the buffer cache, which is unified with the page cache in 2.6 kernels.
Linux aggressively caches filesystem data in RAM whenever possible in order to avoid using the slower block storage, gaining a significant performance boost. You've seen this caching in action if you've ever run a find command twice on the same directory.
Some filesystems are designed to be used by multiple hosts. Cluster filesystems, as they are called, have some way of making sure that the caches on all of the hosts stay in sync with the underlying filesystem. GFS is a great open-source example. GFS uses cluster management software to keep track of who is in the group of hosts accessing the filesystem. It uses locking to make sure that the different hosts cooperate when accessing the filesystem.
By using a cluster filesystem such as GFS, it is possible for multiple hosts on the Ethernet network to access the same block storage using ATA over Ethernet. There's no need for anything like an NFS server, because each host accesses the storage directly, distributing the I/O nicely. But there's a snag. Any time you're using a lot of disks, you're increasing the chances that one of the disks will fail. Usually you use RAID to take care of this issue by introducing some redundancy. Unfortunately, Linux software RAID is not cluster-aware. That means each host on the network cannot do RAID 10 using mdadm and have things simply work out.
Cluster software for Linux is developing at a furious pace. I believe we'll see good cluster-aware RAID within a year or two. Until then, there are a few options for clusters using AoE for shared block storage. The basic idea is to centralize the RAID functionality. You could buy a Coraid RAIDblade or two and have the cluster nodes access the storage exported by them. The RAIDblades can manage all the EtherDrive blades behind them. Or, if you're feeling adventurous, you also could do it yourself by using a Linux host that does software RAID and exports the resulting disk-failure-proofed block storage itself, by way of ATA over Ethernet. Check out the vblade program (see Resources) for an example of software that exports any storage using ATA over Ethernet.
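As a sketch, vblade takes a shelf number, a slot number, a network interface, and the block device to export; the shelf/slot numbers and device here are arbitrary examples:

```shell
# Export /dev/md0 as AoE shelf 1, slot 0 on interface eth1.
# Clients on that Ethernet segment then see it as /dev/etherd/e1.0.
vbladed 1 0 eth1 /dev/md0
```

(vbladed is the daemonizing wrapper shipped with vblade; plain vblade with the same arguments runs in the foreground.)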
Because ATA over Ethernet puts inexpensive hard drives on the Ethernet network, some sysadmins might be interested in using AoE in a backup plan. Often, backup strategies involve tier-two storage: storage that is not quite as fast as on-line storage but also is not as inaccessible as tape. ATA over Ethernet makes it easy to use cheap ATA drives as tier-two storage.
But with hard disks being so inexpensive, and seeing that we have stable software RAID, why not use the hard disks as a backup medium? Unlike tape, this backup medium supports instant access to any archived file. To perform the backup safely on a live system, use LVM's snapshot abilities.
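A sketch of that approach, reusing the ben/franklin names from the earlier example; the snapshot size, mount point, and backup destination are assumptions:

```shell
# Create a snapshot with 1G reserved for copy-on-write changes
lvcreate --snapshot --size 1G --name franklin-snap /dev/ben/franklin

# Mount the frozen view read-only and copy it to tier-two storage
mkdir -p /mnt/snap
mount -o ro /dev/ben/franklin-snap /mnt/snap
rsync -a /mnt/snap/ /backup/bf/

# The snapshot is only needed for the duration of the backup
umount /mnt/snap
lvremove -f /dev/ben/franklin-snap
```

The snapshot gives the backup a consistent point-in-time image even while the live filesystem keeps changing.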
Several new backup software products are taking advantage of filesystem features for backups. By using hard links, they can perform multiple full backups with the efficiency of incremental backups. Check out the BackupPC and rsync backup links in the on-line Resources for more information.
http://www.enterprisestorageforum.com/technology/features/article.php/947551
http://compnetworking.about.com/od/itinformationtechnology/l/aa070101b.htm
http://www.webopedia.com/TERM/N/network-attached_storage.html
http://www.enterprisestorageforum.com/sans/features/article.php/990871
http://en.wikipedia.org/wiki/ATA_over_Ethernet
http://www.linuxjournal.com/article/8149 (Used extensively for the AoE part of this page!)