Backups and Archives

A frequently cited statistic: 90% of all companies that suffer catastrophic data loss (a disk crash) are out of business within one year.

Backups are made for the purpose of rebuilding a system that is identical to the current one.  Backups are thus for recovery, not for transferring data to another system.  They do not need to be portable.  In this sense “backup” is used to mean a complete backup of an entire system: not just regular files but all owner, group, date, and permission info for all files, links, /dev entries, some /proc entries, etc.

Backups can be used not only for “bare metal” backup and recovery, but also to install many identical computers.  Clonezilla allows you to do just that, similar to the commercial product “Ghost”.  Clonezilla saves and restores only the used blocks on the hard disk.  This increases cloning efficiency.  In one example, Clonezilla was used to clone one computer to 41 other computers simultaneously.  It took only about 10 minutes to clone a 5.6 GiB system image to all 41 computers via multicasting.

Archives are for transferring data to other systems, or making copies of files for historical or legal purposes.  As such, they should be portable so that they may be recovered on new systems when the original systems are no longer available.  For example, it should be possible for an archive of the files on a Solaris Unix system to be restored on an AIX Unix, or even a Linux system.  (Within limits, this portability should extend to Windows and Macintosh systems as well.)

Most of the time, the two terms are used interchangeably.  (In fact, the above definitions are not universally agreed upon!)  In the rest of this document, the term “backup” will be used to mean either a backup or an archive as defined above.  Most real-world situations call for archives, since the other objects (such as /dev entries) rarely if ever change on a production server once the system is installed and configured.  A single “backup” is usually sufficient.  For home users, the original system CDs often serve as the only backup needed; all other backups are of modified files only and hence are “archives”.

Using RAID is not a replacement for regular backups!  (Imagine a disaster such as a fire on your site, an accidentally deleted file, or corrupted data.)

Creating backup policies (which include several sub-policies, discussed below) can be difficult.  Keep in mind the requirements of the organization, often specified in an SLA (service level agreement).  Make sure users/customers are aware of what gets backed up and what doesn’t, how to request a restore, and how long it might take to restore different data after various problems (very old data, hardware destroyed by a fire, a DB record accidentally deleted yesterday, ...).

(Show this example SLA from Technion University.)

Most people underestimate how slow a restore operation can be.  It is often 10-20 times longer to restore a file than to back one up.  (One reason: operating systems are usually optimized for read operations, not write operations.)

You must also worry about the security of your backups.  Have a clear policy on who is allowed to request a restore and how to do so, or else one user might request a restore of another user’s files.  In some cases this may be allowed, say for a manager or auditor.  (In a small organization where everyone knows everyone, this is not likely to be a problem.)

An example SLA: Customers should be able to recover any file version (with a granularity of a day) from the past 6 months, and any file version (with a granularity of a month) from the past 3 years.  Disk failure should result in no more than 4 hours of down-time, with at worst 2 business days of data loss.  Archives will be full backups each quarter and kept forever, with old data copied to new media when the older technology is no longer supported.  Critical data will be kept on a separate system with hourly snapshots between 7:00 AM and 7:00 PM, with a midnight snapshot made and kept for a week.  Users have access to these snapshots.  Database and financial data has a different SLA for compliance reasons.

Granularity refers to how often backups are made.  For example, suppose backups are made each night at midnight.  If some user edits a file six times over the last two days, they can’t get back any of those intermediate versions; only the copies taken at midnight.  In some cases, you want finer granularity (e.g., versions for each hour, or every single version) and in other cases, coarser granularity is fine (versions every month).

Types and Strategies of Backups and Archives

It is possible to back up only a portion of the files (and other objects, in the case of a backup) on your systems.  In fact there are three types of backups (or archives):

1.    Full (also known as “epoch” or “complete”) - everything gets backed up.

2.    Incremental - backup everything that has been added or modified since the last backup of any type (either incremental or full).

3.    Differential - backup everything that has been added or modified since the last full backup.  Differentials can be assigned levels: level 0 is a full backup and level n is everything that has changed since the last level n-1 backup.

(Many people don’t bother to distinguish between incremental and differential.)  A system administrator must choose a backup strategy (a combination of types) based on several factors.  The factors to consider are safety, required down time, cost, convenience, and speed of backups and recovery.  These factors vary in importance for different situations.  Common strategies include full backups with incremental backups in-between, and full backups with differential backups in-between (a two-level differential).  Sometimes a three level differential is used but rarely more levels.  (You never use both incremental and differential backups as part of a single strategy.)  The strategy of using only full backups is rarely used.

With modern backup software, the differences between the strategies mentioned above aren’t that large.  Incrementals take less time to back up and more time to restore (since several different backup media may be needed) compared with differential backups (where at most two media, the last differential and the last full backup media, are used to recover a file).  Full backups take a huge amount of time to make but recovery is very fast.  Note most commercial software keeps a special catalog (directory) file that is reset for each full backup, and keeps track of which incremental tape (or disk or whatever) holds which files.  This file is read from the last incremental tape during a restore, to determine exactly which tape to use to recover some file.

The Backup Schedule

The frequency of backups (the backup schedule) is another part of the policy.  In some cases, it is reasonable to have full backups daily and incremental backups several times a day.  In other cases a full backup once a year with weekly or monthly incremental backups could be appropriate.  A common strategy suitable for most corporate environments would be monthly full backups and daily differential backups.  (Another example might be quarterly full (differential level 0) backups, with monthly level 1 differentials, and daily level 2 differentials.)  However more frequent full backups may save tapes (as the incremental backups near the end of the cycle may be too large for a single tape).

Note that in some cases there will be legal requirements for backups at certain intervals (e.g., the SEC for financial industries, the FBI for defense industries, or regulations for medical/personal data).  Depending on your backup software, you may need to bring the system partially or completely off-line during the backup process.  Thus there is a trade-off between convenience and cost, versus the safety of more frequent backups.

In a large organization, it may not be possible to perform a full backup on all systems on the same weekend.  A staggered schedule is needed, where (say) 1/4 of the servers get backed up on the first Sunday of the month, 1/4 the second Sunday, and so on.  Each server is still being backed up monthly but not all on the same day of the month.

Be aware that small changes to the schedule can result in dramatic changes in the amount of backup media needed.  For example, suppose you have 4 GB to back up under this SLA: a full backup every 4 weeks (28 days) with differential backups in between.  Now assume the differential backup grows 5% per day for the first 20 days (80% has changed) and stays the same size thereafter.  Some math reveals that doing full backups each week (which still meets the SLA) will use about a third as much tape as the 28-day cycle, in this case.

Good schedules require a lot of complex calculation to work out (and still meet the SLAs).  Modern backup software (such as Amanda) allows one to specify the SLA and will create a schedule automatically.  A dynamic schedule will be adjusted automatically depending on how much data is actually copied for each backup.  Such software will simply inform the SA when to change the tapes in a jukebox.

On a busy (e.g., database) server, downtime will be the most critical factor.  In such cases consider using LVM snapshots, which quickly make a read-only copy of a logical volume using very little extra disk space.  You can then back up the snapshot while the rest of the system remains up.
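
A minimal sketch of the snapshot approach; the volume group (vg0), logical volume (home), and snapshot size here are made-up examples:

mkdir -p /mnt/snap
lvcreate --snapshot --size 1G --name homesnap /dev/vg0/home  # copy-on-write snapshot
mount -o ro /dev/vg0/homesnap /mnt/snap
tar czf /backup/home-$(date +%F).tar.gz -C /mnt/snap .  # back up the frozen view
umount /mnt/snap
lvremove -f /dev/vg0/homesnap  # discard the snapshot when done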

Another strategy is called disk-to-disk-to-tape, in which the data to be backed up is quickly copied to another disk and then written to the slower backup medium later.

Other Policies

Deciding what to back up is part of your policy too.  Are you responsible for backing up the servers only?  The boss’s workstation?  All workstations?  (Users need to know!)  Network devices (e.g., routers and switches)?  It may be appropriate to use a different backup strategy for user workstations than for servers, for different servers, or even different partitions/directories/files of servers.

An often overlooked item is the MBR (or GPT).  Make a copy of the MBR with:

dd if=/dev/sda of=/tmp/MBR.bak bs=$SECTOR_SIZE count=1  # SECTOR_SIZE is usually 512
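
A GPT is larger than one sector and keeps a second copy at the end of the disk, so a single-sector dd isn’t sufficient for it.  If the gdisk package is installed, one approach (the device name is an example):

sgdisk --backup=/tmp/gpt-table.bak /dev/sda       # save the partition table
sgdisk --load-backup=/tmp/gpt-table.bak /dev/sda  # restore it later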

Another part of your backup policy is determining how long to keep old backups around.  This is called the backup retention policy.  In many cases it is appropriate to retain the full backups indefinitely.  In some cases, backups should be kept for 7 to 15 years (in case of legal action or an IRS audit).  In some cases, you must not keep certain data for too long or you may face legal penalties.

Such records are often useful for more than disaster recovery.  You may discover your system was compromised months after the break-in.  You may need to examine old files when investigating an employee.  You may need to recover an older version of your company’s software.  Such records can help if legal action (either by your company or by someone else suing your company) occurs.

Since the Enron scandal (2001) and the Microsoft scandals (when corporate officers had emails subpoenaed by the DoJ), a common new policy is “if it doesn’t exist it can’t be subpoenaed!”  These events led to a revision of the FRCP:

FRCP — the Federal Rules of Civil Procedure

These include rules for the handling of ESI (Electronically Stored Information) when legal action (e.g., a lawsuit) is imminent or already underway.  You must suspend normal data destruction activities (such as reusing backup media); possibly make “snapshot” backups of various workstations, log files, and other ESI; classify the ESI as easy or hard to produce, and estimate the cost to produce the hard ESI (which the other party must pay); work out a “discovery” (of evidence) plan; and actually produce the ESI in a legally acceptable manner.  An SA should consult the corporate lawyers well in advance to work out the procedures.

It is important to decide where to store the backup media (the storage policy).  These tapes or CDs contain valuable information and must be secured.  Also, it makes no sense to store media in the same room as the server the backup was made from; if something nasty happens to the server, such as theft, vandalism, fire, etc., then you lose your backups too.  A company should store backup media in a secure location, preferably off-site.  A bank safe-deposit box is usually less than $50 a year and is a good location to store backup media.  If on-site storage is desirable, consider a fire-proof safe.  (And keep the door shut all the time!)  Consider remote storage companies, but beware of bandwidth and security issues.

Media Replacement Policy (a.k.a. media rotation policy)

Backup media will not last forever.  Considering how vital the backups might be, it is a false economy to buy cheap tape or reuse the same media over and over.  A reasonable media replacement policy (also known as the media rotation schedule) is to use a new tape a fixed number of times, then toss it.  The rotation schedule/replacement policy can have a major impact on the cost of backups and the speed of recovery.

Before using new media for the first time, test it and give it a unique, permanent label (number).  Annual backup tapes could be duplicated just in case the original fails.

A simple policy is called incremental rotation, which means different things to different folks.  Basically you number the media used for a given cycle, such as D1–D31 for the 31 tapes used for daily backups.  After one complete cycle (with each tape used once), the next cycle uses tapes D2–D32; tape D1 gets re-labeled as M1.  The 12 monthly tapes M1–M12 are each used once, then M2–M13 are used the following year, etc.  The old M1 tape becomes permanently retired (or archived).  Thus a given tape will be used 31 times for daily, 12 times for monthly, and once for yearly backups, a total of 44 uses before you need to replace it.

One of the most popular schemes is called grandfather, father, son (GFS) rotation.  (This term predates political correctness.)  This scheme uses daily (son), weekly (father), and monthly (grandfather) backup sets.  Here’s an illustration (from mckinnonsc.vic.edu.au/vceit/backups/backup_schemes.htm):

 • Monday - daily backup to tape #1
 • Tuesday - daily backup to tape #2
 • Wednesday - daily backup to tape #3
 • Thursday - daily backup to tape #4
 • Friday - weekly backup to tape #5.  This tape is called the Week 1 Backup.

(Of course you can extend this idea to seven day schemes as well.)

The time for a restore depends on whether incremental or differential backups are done for the daily and weekly sets.  In this scheme the monthly backup is usually a full backup, but it doesn’t need to be if you use a 4 or 5 level differential backup scheme.

The next Monday through Thursday you would re-use tapes #1 through #4 for the daily backup set.  But next Friday you do another weekly backup to the Week 2 tape (tape #6).  Week 3 is the same as week 2, using the Week 3 tape (tape #7) on Friday.  Tapes 5–7 form the weekly backup set.

At the end of the fourth week, do a monthly backup to the Month 1 tape (tape #8).  At the end of the fifth week, the Week 1 tape is re-used.

So, the daily tapes are recycled weekly.  The weekly tapes are recycled monthly.  Monthly tapes are recycled annually.  Each year a full annual backup is kept safely stored and never re-used.

Clearly, the daily tapes (tapes 1–4) are used much more often than the weeklies (tapes 5–7) and monthlies (tapes 8–∞).  This means they will suffer more wear and tear and may fail sooner.

The incremental scheme can be used with any rotation policy, such as GFS or Towers of Hanoi.  A tape (e.g. tape 1) will be used as a weekly tape after a month (or two or more) of daily use.  After a year (or two or more) of weekly use, it can be error-checked (in case it’s becoming unreliable) and will be used as a monthly tape.  After 12 (or 24 or more) uses as a monthly tape it could be “retired” as a permanent yearly backup tape.

Most software that automates backup uses the Towers of Hanoi method, which is more complex but does result in a better policy.

Class discussion: Determine backup policies for the YborStudent server.  One possibility:  Full backup (level 0) of /home once per term, level 1 once per month, level 2 each day.  The SLA will specify a recovery time of a maximum of 2 working days.  Backups should be kept for 6 months after the end of the semester.

For security reasons you should completely erase the media before throwing it in the trash.  (This is harder than you think!)  Alternatives are to shred or burn old media, and/or to encrypt backups as they are made.

Backup Media Choices

There are too many choices to count today.  For smaller archives, flash or other removable disks, writable CDs or DVDs (these are WORM media), and old-fashioned DLT, DAT, and DDS tape drives are popular.  (I used a DDS-2 SCSI drive at home.)  Today consider LTO drives.  These are fast (for tape) and have a range of densities; LTO-4 tapes can hold up to 1.6 TB each (compressed).

Tape storage is very cheap, typically less than $20 for 80 gigabytes of storage.  (DDS-2 tapes cost about $7 and hold 4 GB each.  DDS-4 tapes are faster and hold 20 GB each, 40 GB compressed.)  However, tapes and other magnetic media can be affected by strong electrical and magnetic fields, heat, humidity, etc.  Also, the higher density tapes require more expensive drives (some over $1,000).

In 2010, the record for magnetic tape data density was 29.5 Gbit per square inch.  For comparison, a dual-layer Blu-ray disc holds 50 GB per disc.  Magnetic tapes can be hundreds of feet long.  In 2014, Sony announced that it had developed new magnetic tape material that can hold 148 Gbit per square inch.  With this material, a standard backup tape (the size of an old cassette) could store up to 185 TB.  To hold the equivalent amount of data would take 3,700 dual-layer 50 GB Blu-rays (a stack that would be over 14 feet tall). — extremetech.com

An external hard drive (less than $100 for 1TB) connected directly to your PC can use the backup program that comes with your operating system (Backup and Restore Center on Windows, and Time Machine on OS X).  Most backup software can automate backups of all new files or changed ones on a regular basis.  This is a simple option if you only have one PC.

Optical media such as CDs are durable and fairly cheap but take longer to write.  They can be reused less often than magnetic media, and are still susceptible to heat and humidity.  Optical media can scratch if not carefully handled.  Also consider the bulk of the media.  If you must store seven years’ worth of backups, it may be important to minimize the storage requirements and expense.  A CD-ROM can hold about 700 MiB while a dual-layer Blu-ray can hold 50 GB.

A choice becoming popular (since 2008) is on-line storage, e.g., HP Upline, Google GDrive, etc. (for SOHO, Mozy or BackBlaze).  (This market changes rapidly so do research on current companies.)  The companies offer cheap data storage and complete system backups, provided you have a fast Internet connection.  Many colocation facilities (“colos”, usually at network exchange points) provide this service as well to the connected ISPs.  If you go this route, make sure all the data is encrypted using industry standard encryption at your site before transmission across the Internet.  (Never use any company that uses “proprietary” encryption regardless of how secure they claim it is!)

When backing up large transaction database files, the speed of the media transfer is important.  For instance, a 6 Mbps (megabits per second) tape drive will back up 10 gigabytes in about 3 hours and 45 minutes.  (In most cases incremental or differential backups contain much less data!)
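
You can sanity-check such estimates with a quick one-liner (a sketch; it ignores the decimal-versus-binary gigabyte distinction):

awk 'BEGIN { MBps = 6 / 8                     # 6 Mbps = 0.75 megabytes/sec
             secs = 10 * 1024 / MBps          # 10 GB expressed in MB
             printf "%.1f hours\n", secs/3600 }'  # prints 3.8 hours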

For IDE controllers your only choice is a Travan backup drive.  Very slow; don’t use!  For SCSI drives (such as DDS drives from HP) there are two speeds for the SCSI controller, depending on what devices are on it.  A tape drive will slow down the SCSI bus by half, so consider dual SCSI controllers.

For networks, consider a networked backup unit.  This would allow a single backup system to be used with many different computers.  Thus you can buy one high-speed device for about the same money as several lower-speed devices.  Keep in mind however that a network backup can bring a standard Ethernet network to its knees.  (The network only shares 10 Mbps for all users on a SOHO or wireless LAN.)  Even a Fast Ethernet (100 Mbps) LAN might suffer noticeable delays and problems.

An excellent choice for single-system backup is a USB disk.  Also, using SAN/NAS to centralize your storage makes it easy to use a single backup system (a tape robot).

It is a good idea to have a spare media drive (e.g., DLT tape drive), in case the one built into a computer fails when the computer fails.  This is especially true for non-standard backup devices that may not be available from CompUSA on a moment’s notice.  Regularly clean and maintain (and test) your backup drives.

(While I don’t know of any organization that does this, consider copying old data to new hardware once the old drives are no longer supported or available.  If you don’t have a working drive (including drivers), the old backups are useless!)

Consumables Planning (Budgeting)

Suppose a medium to large organization uses 8 backup tapes a day, 6 days a week; that means 48 tapes a week.  If your retention policy is to keep 6 months’ worth of incrementals, that’s 1,248 tapes needed.  High capacity DLT tapes might go for $60 each, so you would need $74,880.  In the second half of the year, you only need new tapes for full backups, an additional 260 tapes (say) for $15,600, or more than $90k for the first year ($7,500 per month).  (Not counting spares or the cost of drive units.)  Changes to the policies can result in expense differences of over $1,000 per month!

As backup technology changes over the years, it is important to keep old drives around to read old backup tapes when needed.  You should keep old drives around long enough to cover your data retention policies.

Try to avoid upgrading your backup technology (drives, tapes, software) every few years, or you’ll end up with many different and incompatible backup tapes.  Note the budget must include amortization of the drive expenses.

Tools for Archives and Backups

Archives are easier to make than backups, so most tools create archives.  A tool cannot make a “backup” without knowing the underlying filesystem intimately, i.e. it must parse the filesystem on disk.  The reason is twofold:

·       Different filesystems exhibit different semantics.  No single tool supports all the semantics of all filesystem types.  You need a different backup tool per FS type.

·       The kernel interface obfuscates information about the layout of the file on disk.  You have to go around the kernel, direct to the device interface, to see all the information about a file that is necessary for recording it correctly.

If you want to store the kernel’s view of files, along with all of the semantics the filesystem provides and none of the non-filesystem objects that might appear to inhabit the filesystem (such as sockets or /proc entries), use the native dump program (and its companion restore) provided by your vendor specifically for that purpose (whatever they name it), for your filesystem type.  (Note for Reiser4 you can just use star.)  dump uses /etc/dumpdates to track dump levels (that is, dump supports differential backups).  Some of the differences between dump (for backups) and tar, cpio, or pax (for archives) are:

1.    dump is not confused by object types that the particular operating system has defined as extensions to the standard filesystem; it also does not attempt to archive objects that do not actually reside on the filesystem, e.g. doors and sockets.  Consider what GNU tar does to UNIX-domain sockets: it archives them as named pipes.  They are not on the filesystem, so really they should not be archived at all.  dump handles this situation correctly.

2.    tar ignores extended attributes (and ACLs, unless you use the --acls option when creating or adding to the archive), while a native dump program will correctly archive them.  (A newer extensible backup format known as pax will archive ACLs, SELinux labels, and other meta-data stored in extended attributes.  A tool called star uses this format.  Find out about star on the web.)

3.    tar cannot reliably detect where holes are.  dump is not confused by files with holes (such as lastlog); it will dump only the allocated blocks, and restore will reconstruct the file with its original layout.

4.    tar uses normal filesystem semantics to read files.  That means it modifies the access times recorded in the filesystem inodes when reading files to create an archive.  This effectively deletes an audit trail which you may require for other purposes.  (Modern GNU tar has extra options, such as --atime-preserve, to handle this correctly.)  dump parses and records the filesystem outside of kernel filesystem semantics, and therefore doesn’t modify the filesystem in the process of copying it.

Not all filesystem types support dump and restore utilities.  When picking a filesystem type, keep in mind your backup requirements.
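
As a concrete example of dump usage, here is a sketch assuming Linux with an ext2/3/4 filesystem and the dump package (the tape device and filesystem are examples):

dump -0u -f /dev/nst0 /home  # full (level 0) dump; -u records it in /etc/dumpdates
dump -1u -f /dev/nst0 /home  # later: dump everything changed since the level 0
restore -i -f /dev/nst0      # interactively browse the dump and restore files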

GNU tar is a popular tool for archiving the user’s view of files.  Another standard (and free) choice is cpio.  Note neither tool is standardized by current POSIX (both were dropped in favor of pax).  A newer standard tool, based on both (and hopefully better than either), is pax.  These, combined with find and some compression program (such as gzip or bzip2), are used to easily make portable archives.

You can ask find to locate all files modified since a certain date, and add them to a compressed tar archive created on a mounted backup tape drive.  A backup shell script can be written so you don’t end up attempting to back up /dev or /proc files.  (See the backup script on the web page.)  (Note!  Unix tar ≠ GNU tar; use the GNU version.  Unix tar doesn’t handle backups that span multiple tapes.)

For either backups or archives, use crontab to schedule backups according to the backup schedule discussed earlier.  (Show ls -d /etc/*cron*.)  If your company prefers to have a human perform backups, remember that root permission will be needed to access the full system.  Often the backup program is controlled by sudo or a similar facility, so the backup administrator doesn’t need the root password.
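
For example, root’s crontab might schedule a weekly full backup and daily incrementals (the script names here are hypothetical):

# root's crontab: full backup Sundays at 2:30 AM, incrementals Mon-Sat
30 2 * * 0    /usr/local/sbin/backup-full
30 2 * * 1-6  /usr/local/sbin/backup-incremental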

The find command can be used to locate which files need to be backed up.  Use “find / -mtime -x” for incrementals and differentials, to find files changed in the last x days (you can record the time of the last backup as a timestamp on a file, for example /etc/last-backup.{full,incremental,differential}).  Use find with tar roughly like this:

mount /dev/removable-media /mnt

find / -xdev -mtime -1 -depth -print0 \
  | tar --acls --null -cf /tmp/$$ -T -  # -xdev keeps find out of /proc, /sys, and /mnt; tar reads the file list from find

gzip /tmp/$$; mv /tmp/$$.gz /mnt/incremental-$(date +%F)

touch /etc/last-backup.incremental

umount /dev/removable-media

(Instead of “-mtime -1” to mean less than 24 hours ago, you can use “-newer x”, where x is some file.)

Commercial software is affordable and several packages are popular for Unix and Linux systems, including “BRU” (www.BRU-Pro.com), VERITAS, Seagate’s BackupEXEC, and “Arkeia” (www.arkeia.com).  (I haven’t used these, I just use tar and find.)

Of course there are free, open source choices as well, such as KDE ark, or amanda (network backups).  One of the best is BackupPC.  Another is Bacula (or Bareos, a fork of the original).  Some of these can create schedules, label tapes, encrypt tapes, follow media rotation schedules, etc.

Be careful of bind mounts and private mounts when performing backups, especially when using home-grown scripts that use find, tar, etc.!  Tools such as Bareos will detect symlinks and bind mounts and not “descend into” them, but not all tools will (or may not by default).  Bareos backs up the symlink itself, not duplicate copies of the files it points to.

In addition, bind mounts are only known to the kernel (in RAM) and are never backed up; you need to list them in /etc/fstab to restore those “views”.

Finally, if you don’t run the backup with root privilege, you may only back up the polyinstantiated part (a per-user “private view”) of a directory that the process can see.

The most important tool is the documentation: the backup strategy, media types and rotation schedule, hardware maintenance schedule, location of media storage (e.g., the address of the bank and box number), and all the other information discussed above.  This information is collectively referred to as the backup policy.  This document should clearly tell users what will be backed up and when, and what to do and who to contact if they need to recover files.

Note:  Whatever tools you use, make sure you test your backup method by attempting to use the recovery procedure.  (I know someone who spent 45 minutes each working day doing backups for years, only to realize none of the backups ever worked the first time he attempted to recover a file!)

(Parts of this section were adapted from netnews (Usenet) postings in the newsgroup “comp.unix.admin” during 5/2001 by Jefferson Ogata.  Other parts were adapted from The Practice of System and Network Administration, by Limoncelli and Hogan, ©Addison-Wesley.)

Backups with Solaris Zones (or other containers)

Solaris zones, Docker containers, and similar technology present a complication for backup: many standard directories are actually mounted from the global zone via LOFS (the loopback filesystem).  These should only be backed up from the global zone.  The only items in a local zone needing backup (usually) are application data and configuration files.  Using an archive tool (such as cpio, tar, or star) works best:

find /export/zone1 -fstype lofs -prune -o -local -print \
| cpio -oc -O /backup/zone1.cpio

Whole zones can be fully or incrementally backed up using ufsdump.  Shut down the zone before running ufsdump, to put the zone in a quiescent state and to avoid backing up shared file systems:

global# zlogin -S zone1 init 0
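
Then a full (level 0) dump of the zone might look like this (the zone path and tape device are examples):

global# ufsdump 0fu /dev/rmt/0 /export/zone1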

Solaris supports filesystem snapshots (like LVM does on Linux), so you don’t have to shut down a zone.  However, it must be quiesced by stopping applications before creating the snapshot.  Then you can restart them and perform the backup on the snapshot.  Create it with:

global# fssnap -o bs=/export /export/home #create snapshot
global# mount -o ro /dev/fssnap/0 /mnt # then mount it.

You should make copies of your non-global zones’ configurations in case you have to recreate the zones at some point in the future. You should create the copy of the zone’s configuration after you have logged into the zone for the first time and responded to the sysidtool questions:

    global# zonecfg -z zone1 export > zone1.config

Adding a backup tape drive

Added a SCSI controller (Adaptec 2940)

Added a SCSI DDS-2 tape drive

On reboot, kudzu detected and configured the SCSI controller and tape device

Verify the devices were found with ‘dmesg’; it should indicate the tape is /dev/st0 and /dev/nst0

Verify SCSI devices with ‘scsi_info’ (which reads /proc/scsi)

Verify the device is working with: mt -f /dev/st0 status

Create a link: ln -s /dev/nst0 /dev/tape

Verify the link: mt status

Note: /dev/st0 causes automatic tape rewind after any operation, /dev/nst0 has no automatic rewind, but most backup software knows to rewind before finishing.  If you plan to put multiple backup files on one tape, you must use /dev/nst0.
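
For example, to put two archives on one tape and later read back the second (the paths are examples):

mt -f /dev/nst0 rewind
tar cf /dev/nst0 /home  # first tape file; nst0 leaves the tape positioned after it
tar cf /dev/nst0 /etc   # second tape file, immediately following
mt -f /dev/nst0 rewind
mt -f /dev/nst0 fsf 1   # skip forward over the first file mark...
tar tf /dev/nst0        # ...then list the second archive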

Common Backup and Archive Tools:

mt  (/dev/mt0, /dev/rmt0)

st  (/dev/st0, /dev/nst0 - use nst for no auto rewind)

mt and rmt (remote tape backups); use like: mt -f /dev/tape command, where command is one of: rewind, status, erase, retention, compression (toggle compression on/off), fsf count (skip forward count files), eod (skip to end of data), eject, ...

dump/restore (These operate on the drive as a collection of disk blocks, below the abstractions of files, links and directories that are created by the file systems.  dump backs up an entire file system at a time.  It is unable to backup only part of a file system or a directory tree that spans more than one file system.)

tar, cpio, dd, star (and pax and spax)

A comparison of these tools:

·       cpio has many more conversion options than tar and supports many formats.

·       cpio can be used as a filter, reading names of files from stdin.  (Gnu tar has this ability too.)

·       On restore, if there is corruption on a tape tar will stop at that point.  cpio will skip over corruption and try to restore the rest of the files.

·       cpio is reported to be faster than tar, and uses less space (tar pads every file header to a 512-byte block; cpio uses only what it needs).

·       tar supports multiple hard links on FSes that have 32-bit inode numbers, but cpio can only handle up to 18 bits in its default format.

·       tar copies a file with multiple hard links once; cpio copies it each time.

·       Gnu tar can support archives that span multiple volumes; cpio can too but is known to have some problems with this.

·       Modern tar (star) supports extended attributes, used for SELinux and ACLs.  cpio doesn’t (currently).

·       pax is POSIX’s answer to tar and cpio shortcomings.  pax attempts to read and write many of the various cpio and tar formats, plus new formats of its own.  Its command set more resembles cpio than tar.  Unfortunately, to use extended attributes and ACLs, the “pax” archive format must be used, and Linux pax doesn’t support this (POSIX required) format as of 6/2013 (bug is being worked on by Red Hat).  Check available formats with “pax -x help”.  Use star or spax instead on Linux.

·       dd is a command that copies and optionally converts data.  It can convert data to different formats, block sizes, byte orders, etc.  It isn’t generally used to create archives, but is often used to copy disks and partitions (to other disks/partitions when the geometry is different), copy large backup files, create remote archives (“tar ...|ssh ... dd ...”), and copy and create image files (see the sketch below).  The command’s odd, non-standard syntax mimics the DD (Data Definition) statement of the ancient IBM mainframe JCL; what the name originally stood for is lost to history.
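
A couple of typical dd uses, as a sketch (the device, path, and host names are examples):

dd if=/dev/sda1 of=/backup/sda1.img bs=64K conv=noerror,sync  # image a partition, skipping read errors
dd if=/dev/sda1 bs=64K | ssh backuphost 'dd of=sda1.img bs=64K'  # image it to a remote host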

libarchive is a portable library for any POSIX system, including Windows, Linux, and Unix, that provides full support for all formats.  Currently it includes two front-end tools built with it: bsdtar and bsdcpio. These will support extended attributes and ACLs, when used with the correct options and with the “pax” (and not the default “ustar”) format.

If you need to back up large (e.g., DB) files, use a larger block size for efficiency.
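
With tar, the blocking factor option (-b, in units of 512 bytes) controls this; a sketch (the tape device and path are examples):

tar -b 2048 -cf /dev/nst0 /var/lib/db  # 2048 x 512 bytes = 1 MiB per tape write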

Many types of systems can use LVM, ZFS, or some equivalent that supports snapshots, for backup without the need to take the filesystem off-line.

NAS (and some SANS) systems are commonly backed up with some tool that supports NDMP (the Network Data Management Protocol), which usually works by doing background backup to tape of a snapshot.  This has a minimal effect on users of the storage system.

To ensure the validity of backups and archives, you should compute and compare checksums.  Here’s one way:

tar cf - dir | tee xyz.tar | md5sum > xyz.tar.md5
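
Note the filename recorded inside xyz.tar.md5 will be “-” (standard input), so verify later by comparing hashes rather than running “md5sum -c”:

md5sum xyz.tar  # compare this hash with the one saved in xyz.tar.md5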

Additional Tools

Jörg Schilling’s star program currently supports archiving of ACLs.  IEEE Std 1003.1-2001 (“POSIX.1”) defined the “pax interchange format” that can handle ACLs and other extended attributes (e.g., SELinux stuff).  Gnu tar supposedly handles pax and star formats.  There is also a spax tool that supports the star extensions.

However the only tool that supposedly easily and correctly backs up ACLs, ext2/3 attributes, and extended attributes (such as for SELinux) is “bsdtar”, a BSD modified version of tar that uses libarchive.so to read/write a variety of formats.  Personally, I’ve only had good luck with spax and star.

amanda (a powerful network backup utility, producing backup schedules automatically but relying on other tools for the actual backups.  Most other tools don’t support schedule creation.)

BackupPC (an “enterprise-grade” utility)

Bacula (works very well and is popular, but has a steeper learning curve than most.  See also Bareos, a fork of Bacula by some of the original developers.)

bru (commercial sw)

Clonezilla (similar to commercial Ghost)

HP Data Protector (commercial sw, used at HCC)

Mondo Rescue

unison (uses the rsync algorithm)

LuckyBackup (similar to Unison)

vranger (commercial sw, designed for VMware backups, from quest.com)

foremost, ddrescue, ...  These are not backup tools, but recovery tools when a filesystem is corrupted and you need to salvage what you can.

Duplicity (Uses rsync to create encrypted tar backups.)

Rsnapshot (A wrapper around rsync.  Rsync is not designed for backup, but can be used for that in some cases.)

rdiff-backup (stores meta-data in a file, so can easily restore files to alternate systems.  Produces smaller backups than Rsnapshot, is easier to use, but is slower.)

Storix (supports AIX & Linux)

s3ql (backs up to Amazon's S3 cloud)

grsync (GUI for rsync).

Using rsync

rsync is a versatile tool that does incremental archives, either locally or across a network.  rsync has about a zillion options but is worth learning.  To understand its options (and diagnose performance problems) you need to understand how rsync works.

First, rsync reads the last-modified timestamp and the length of both the source and destination files.  If they are the same, it does not transfer anything.  If either is different, rsync reads both the source file and the destination file, and performs progressively smaller hashes on them, to determine which parts differ.  After this analysis, rsync copies the similar parts of the destination file to a new file, and then copies the different parts from the source file and inserts them into the proper places in the new file.  Finally, rsync copies the updated new file to the destination file location.  Note the timestamp of the new file is the current time, not the time of the source file; usually you will need to include the “-t” option to update the destination file’s timestamp to match the source’s timestamp.

You can speed up rsync by eliminating the new file and updating the destination file in place (the --inplace option).  However, that is dangerous if your network connection is unreliable.  It does preserve hard links, though.

rsync can use compression (the -z option) to reduce bandwidth use and make the transfer faster.  You can run it over SSH to make the transfer secure.

You can run rsyncd as a daemon on the remote end (port 873).  This makes the transfer go faster (no need to fork rsync each time, and have it calculate file lists each time), more controllable (via a config file), and allows you to push files to (say) a Windows disk that has a different layout and paths.  As a server daemon, rsync acts like a modern, super-charged FTP server (and is often used to provide mirror sites on the Internet).  You don’t get SSH security however.
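
A minimal rsyncd.conf sketch (the module name and path are made-up examples):

# /etc/rsyncd.conf
[docs]
    path = /srv/docs
    read only = yes
    comment = nightly document mirror

Clients can then pull from the module with: rsync -av server::docs/ /backup/docs/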

A serious performance concern arises when copying files between different systems.  The timestamps may not match, causing rsync to unnecessarily copy whole files.  This can happen when two systems have clocks that are slightly off.  It can also happen when using filesystems with different granularity for the timestamps.  For example, most flash drives use FAT, which records timestamps only to within two seconds.  You need to use the rsync --modify-window=1 option in that case, to have rsync treat all timestamps within one second as equal.
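
For example, to update a FAT-formatted flash drive (the paths are examples; -rtv is used rather than -a since FAT cannot store Unix owners or permissions):

rsync -rtv --modify-window=1 ~/docs/ /media/flash/docs/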

The syntax is “rsync options source destination”.  Either the source or destination (but not both) can be to remote hosts.  To specify a remote location, use “[user@]host:path”.  A relative path is relative to the user’s home directory on host.  (Filenames with colons can cause problems, not just with rsync.  Colons after slashes work fine, so use “./fi:le” instead of “fi:le”.)

The archive (“-a”) option is a shorthand for several others.  It means to preserve permissions, owner and group, timestamps, and symlinks.  The “-z” option enables compression.  The “-R” option copies pathnames, not just the filenames.  The option “-u” says don’t copy files if a newer version exists at the destination.

To make a backup of /home to server.gcaw.org with rsync via ssh:

rsync -avre "ssh -p 2222" /home/ server.gcaw.org:/home
rsync -azv me@server.gcaw.org:documents documents
rsync -azv documents me@server.gcaw.org:documents

rsync -HavRuzc /var/www/html/ example.com:/var/www/html/
# or copy ~/public_html to/from me@example.com:public_html/

rsync -r ~/foo ~/bar # -r means recursive
rsync -a ~/foo ~/bar # -a means archive mode
rsync -az ~/foo remote:foo # copies foo into foo
rsync -az ~/foo/ remote:foo # copies foo's contents
rsync -azu ~/foo/ remote:foo #don't overwrite newer files

The meanings of some commonly used options are:

-v = verbose,
-c = use an MD4/MD5 checksum (not just size and last-mod time) to see if the dest file differs from the src,
-a = archive mode = -rlptgoD = preserve almost everything,
-r = recursive, -R = preserve src path at dest, -z = compress when sending,
-b = backup dest file before over-writing/deleting,
-u = don’t over-write newer files,
-l = preserve symlinks, -H = preserve hard links, -p = preserve permissions,
-o = preserve owner, -g = preserve group, -t = preserve timestamps,
-D preserve device files, -S = preserve file “holes”,
--modify-window=X = timestamps match if diff by less than X seconds
  (required on Windows, which only has 2 second time precision)

Modern rsync has many options to control the attributes at the destination.  You can use --chmod, and transfer ACLs and EAs.  You can create rsyncd.conf files to control behavior (and use a special ssh key to run a specific command), and define new arguments via ~/.popt.  But older rsync versions don’t have all those features.  For example, old rsync had no options to set/change the permissions when copying new files from Windows; the umask applies.  (You can use special ssh tricks to work around this, to run a “find ... | xargs chmod ...” command after each use of rsync.)

A good way to duplicate a website is to set a default ACL on each directory in your website.  Then all uploaded files will have the umask over-ridden:

  cd ~/public_html # or wherever your web site is.
  find . -type d -exec setfacl -m d:o:rX {} +

This sets a default ACL on all directories, providing “others” read, plus execute for directories.  (New directories get the default ACL too.)  With this ACL, an uploaded file will get 644 or 755 permissions, rather than 640 or 750.

Other tools can be used to backup (or migrate) data across a network, including tar (pipe through ssh), BitTorrent (can use multiple TCP streams at once), and others.

Using cpio:

cd /someplace/..  # the parent of “someplace”
find someplace -depth \
  | cpio -oV --format=crc >someplace.cpio
 # crc=new SysVr4 format with CRCs

Note that when creating an archive, “-v” means to print filenames as processed; “-V” means to print a dot per file processed.

 # Restore all; -d means to create directories if needed;
 #                  -m means to preserve modification timestamps:
cpio -idm < file.cpio

# Restore; wildcards will match leading dot and slashes:
cpio -idum glob-pattern <file.cpio

Without the -d option, cpio won’t create any directories when restoring.  Without the -u option, cpio won’t over-write existing files.  Add -v to show files as they are extracted (restored).

cpio -tv < file.cpio # table of contents (-i not needed but allowed)

Command to backup all files:
  find . -depth -print | \
    cpio -o --format=crc > /dev/tape

Command to restore complete (full) backup:
 cpio -imd < /dev/tape

Command to get table of contents:
 cpio -tv < /dev/tape  # -v is long listing

Using pax:

find ... | pax -wv > pax.out
pax [-v] < pax.out  # with neither -r nor -w, the default is to list the archive
pax -rv -pe < pax.out # -pe means preserve everything; spax also has -acl option

Avoid using absolute pathnames in archives.  tar strips a leading “/” but cpio and pax do not.

To duplicate some files or a whole directory tree requires more than just copying the files.  Today you need to worry about ACLs, extended attributes (SELinux labels), non-files (e.g., named pipes, or FIFOs), files with holes, symlinks and hard links, etc.  Gnu cp has options for that, but can’t be used to copy files between hosts.

The best way to duplicate a directory tree on the same host is GNU cp -a; if that is not available, use:

    [s]pax -pe -rw olddir newdir

To copy a whole volume to another host, you can use dump, transfer the output, and restore it on the remote system.  Files, backups, and archives can be copied between hosts with scp or rsync.
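
For example, a level 0 dump piped to a remote host (the host and path names are examples):

dump -0f - /home | ssh backuphost 'cat > /backup/home.dump'
# later, restore from the transferred dump file:
restore -rf /backup/home.dump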

Tar or cpio is often used to duplicate a directory tree to the same host if Gnu cp isn’t available.  These tools can also be used to duplicate a directory tree to a different host, via ssh:

tar czf - -C sourcedir files \
| ssh remote_host 'tar xzf - -C destdir'

Use tar with ssh if this is a complete tree transfer.  For extra performance, use different compression (e.g., “-j” for bzip2).  You may need extra options to control what happens with ACLs, links, etc.

Using rsync over ssh often performs better than tar if it is an update (i.e., some random subset of files need to be transferred).  (Show ybsync alias on wpollock.com.)

(Show backup-etc script.)