Backups and Archives

Backups and Archives

Statistic: 94% of companies suffering from a catastrophic data loss do not survive; 43% never reopen and 51% close within two years. (source)

Backups are made for rebuilding a system that is identical to the current one. Backups are thus for (disaster-) recovery, not transferring of data to another system. They do not need to be portable. In this sense, “backup” is used to mean a complete backup of an entire system: not just regular files but all owner, group, date, and permission info for all files, links, /dev entries, some /proc entries, etc.

Backups can be used not only for “bare metal” backup and recovery, but also to install many identical computers. Clonezilla allows you to do just that, similar to the commercial product “Ghost”. Clonezilla saves and restores only used blocks in the hard disk. This increases the clone efficiency. In one example, Clonezilla was used to clone one computer to 41 other computers simultaneously. It took only about 10 minutes to clone a 5.6 GiB system image to all 41 computers via multicasting.

Archives are for transferring data to other systems, operational (day-by-day, file-by-file) recovery, or making copies of files for historical or legal/regulatory purposes. As such, they should be portable so that they may be recovered on new systems when the original systems are no longer available. For example, it should be possible for an archive of the files on a Solaris Unix system to be restored on an AIX Unix, or even a Linux system. (Within limits, this portability should extend to Windows and Macintosh systems as well.)

A backup has a drawback for operational backup-and-restore needs. While backups enable rapid recovery and minimal downtime in the event of a disaster, they also backup any malware-infected or otherwise corrupted files. You generally do not have more than one backup version of storage. Archives however, are made often and many previous versions are available.

Most of the time, the two terms are used interchangeably. (In fact, the above definitions are not universally agreed upon!) In the rest of this document, the term “backup” will be used to mean either a backup or an archive as defined above. Most real-world situations call for archives, since the other objects (such as /dev entries) rarely if ever change on a production server once the system is installed and configured. A single “backup” is usually sufficient. For home users, the original system CDs often serve as the only backup need; all other backups are of modified files only and hence are “archives”.

Using RAID is not a replacement for regular backups! (Imagine a disaster such as a fire on your site, an accidentally deleted file, or corrupted data.)

Service Level Agreement (SLA)

Creating backup policies (includes several sub-policies, discussed below) can be difficult. Keep in mind the operational backup requirements of the organization, often specified in an SLA or service level agreement. Make sure users/customers are aware of what gets backed up and what doesn’t, how to request a restore, and how long it might take to restore different data from various problems (very old data, a fire destroys the hardware, a DB record accidentally deleted from yesterday, ...).

Statistics from EMC.com:

· 80% of restore requests are made within 48 hours of the data loss.

· Around 15% of a storage administrator’s time is spent on backup and recovery.

· Between 5% and 20% of backup/restore jobs fail.

· In 2004, backup and recovery costs were about $6,000 per TB of data, per year.

(Show example SLA from Technion University.)

You should know these related definitions, not just for backups but in general:

· SLI — A Service Level Indicator is a measurable metric, whose values can be classified as either good or bad. An example SLI for a web service: 99% of all requests in a calendar year should have a response time of under 200 ms.

· SLA — A service level agreement is a list of SLIs that define the required behavior of some service. This should include all failure modes. Examples include what happens if the data center loses power, or if available network bandwidth is exhausted.

· SLO — A Service Level Objective is also a list of SLIs, but instead of listing guarantees it lists the level of service you are aiming for. For an SLA with the sample SLI from above, the SLO might be “99.99% of all requests complete in 200 ms or less”. (Often, SLAs are composed of several SLOs, rather than specific SLIs.) See Wikipedia for a good discussion of these terms.

The business world often uses many acronyms with overlapping meaning. Besides SLA and SLO, you may see references to RPO and RTO when considering backups and recovery.

Recovery Point Objective (RPO) refers to the point in time in the past to which you will recover. It is the last point in time you did a backup.

Recovery Time Objective (RTO) refers to the point in time in the future at which you will be up and running again. It is the time to restore.

More frequent backups lead to smaller RPOs. Some technologies such as disk-to-disk backups lead to smaller RTOs. Both of these need to be specified in an SLA.

Most people underestimate how slow a restore operation can be. It is often 10-20 times longer to restore a file than to back one up. (One reason: operating systems are usually optimized for read operations, not write operations.)

An example SLA: Customers should be able to recover any file version (with a granularity of a day) from the past 6 months, and any file version (with a granularity of a month) from the past 3 years. Disk failure should result in no more than 4 hours of down-time, with at worst 2 business days of data lost. Archives will be full backups each quarter and kept forever, with old data copied to new medium when the older technology is no longer supported. Critical data will be kept on a separate system with hourly snapshots between 7:00 AM and 7:00 PM, with a midnight snapshot made and kept for a week. Users have access to these snapshots. Database and financial data have different SLAs for compliance reasons.

Granularity refers to how often backups are made. For example, suppose backups are made each night at midnight. If some user edits a file six times in the last two days, they can’t get back any version; only the copy taken at midnight. In some cases, you want finer granularity (e.g., versions for each hour, or every single version) and in other cases, coarser granularity is fine (versions every month).

SLAs vary considerably. For example, your online sales system may require recovery granularity of one transaction, and recovery time in seconds. Emails may require per-email granularity for a day (or more) with recovery time in minutes, and 24-hour granularity for older email. On the other hand, old business records (such as stock-holder meeting minutes) can have granularity of days, and (except when regulated) recovery time in days.

You must also worry about security of your backups. Have a clear policy on who is allowed to request a restore and how to do so, or else one user might request a restore of other’s files. In some cases, this may be allowed by a manager or auditor. (In a small organization where everyone knows everyone, this is not likely to be a problem.)

The backup process and backup server must be made as secure as possible. Think about it: the backup server needs remote root access to all your production servers! If you automate backups, there is not even a password to protect the process!

To secure the backup server, do not use the server for anything else. Strip off unnecessary services and turn off the ones you cannot remove. Harden that server, and limit (incoming) access to SSH from a few internal IP addresses. Create a user (and group) just for the backup process to use.

To keep the process secure, have the backup process connect to the remote machines using an SSH key. (The key cannot be password protected if you wish to automate (have unattended) backups.) On the hosts to be backed-up, add the backup user with SSH key authentication only. Next, use SSH features to prevent that user from running any program except the backup software. That backup software will need root access. The best way (if possible) is to have that software run sudo to run the piece that actually needs to read the protected files. Finally, configure sudo to allow the backup user only to run (as root) that one command.

Types and Strategies of Backups and Archives

It is possible to backup only a portion of the files (and other objects in the case of a backup) on your systems. In fact, there are three types of backups (or archives):

1. Full (also known as “epoch” or “complete”) - everything gets backed-up.

2. Incremental - backup everything that has been added or modified since the last backup of any type (either incremental or full).

3. Differential - backup everything that has been added or modified since the last full backup. Differentials can be assigned levels: level 0 is a full backup and level n is everything that has changed since the last level n-1 backup. (Differential is sometimes called cumulative.)

(Many people don’t bother to distinguish between incremental and differential.) A system administrator must choose a backup strategy (a combination of types) based on several factors. The factors to consider are safety, required down time, cost, convenience, and speed of backups and recovery. These factors vary in importance for different situations. Common strategies include full backups with incremental backups in-between, and full backups with differential backups in-between (a two-level differential). Sometimes a three-level differential is used, but rarely more levels. (You rarely use both incremental and differential backups as part of a single strategy.) The strategy of using only full backups is rarely used.

What with modern backup software, the differences between the strategies mentioned above aren’t that large. Incrementals take less time to backup and more time to restore (since several different backup media may be needed) compared with differential backups (where at most two media, the last differential and the last full backup media, are used to recover a file). Full backups take a huge amount of time to make but recovery is very fast. (Example: disk corruption on the 25th of the month: recovery is last full then last differential, or last full then 24 daily incrementals.)

Most commercial software keeps a special file that is reset for each full backup and keeps track of which incremental tape (or disk or whatever) holds which files. This file is read from the last incremental tape during a restore, to determine exactly which tape to use to recover some file. Such information is known as backup metadata, or the backup catalog.

Backup Metadata contains information about what has been backed up, such as file names, time of backup, size, permissions, ownership, and most importantly, tracking information for rapid location and restore. It also indicates where it has been stored, for example, which tape.

When devising a backup strategy, it is critical to understand the nature of the data and the nature of changes to the data. Granularity levels depend on several considerations. What is the aggregate weekly data change rate? If the change rate were close to or greater than 100% (daily change about 20%), it makes little sense to use an incremental backup, because the overhead for deciding which files need to be backed up may be longer than the time it takes fora full backup.

Another consideration is file size. Some applications use larger files than others. An environment with such applications tends to have a larger data change rate, because even a small change to the data results in the whole file being changed. The larger the average file size, the greater the percentage of the data set. Other applications, like software development, use many smaller files. The rate of change in these environments can be much lower. In such environments, the more mature the data set, the lower the change rate.

Another factor to consider is the properties of the files in your backup set. For instance, are they natively compressible? if not, the negative impact compression has on performance makes it less desirable.

The Backup Schedule

The frequency of backups (the backup schedule) is another part of the policy. In some cases, it is reasonable to have full backups daily and incremental backups several times a day. In other cases, a full backup once a year with weekly or monthly incremental backups could be appropriate. A common strategy suitable for most corporate environments would be monthly full backups and daily differential backups. (Another example might be quarterly full (differential level 0) backups, with monthly level 1 differentials, and daily level 2 differentials.) However more frequent full backups may save tapes (as the incremental backups near the end of the cycle may be too large for a single tape).

Note that in some cases there will be legal requirements for backups at certain intervals (e.g., the SEC for financial industries, the FBI for defense industries, or regulations for medical/personal data). Depending on your backup software, it may be required to bring the system partially or completely off-line during the backup process. Thus, there is a trade-off between convenience versus cost, versus the safety of more frequent backups.

In a large organization, it may not be possible to perform a full backup on all systems on the same weekend; there is usually too much data to backup in the time window provided. A staggered schedule is needed, where (say) 1/4 of the servers get backed up on the first Sunday of the month, 1/4 the second Sunday, and so on. Each server is still being backed up monthly but not all on the same day of the month.

Be aware that small changes to the schedule can result in dramatic changes in the amount of backup media needed. For example, suppose you have 4GB to backup within this SLA: full backup every 4 weeks (28 days) and differential backups between. Now assume the differential backup grows 5% per day for the first 20 days (80% has changed) and stays the same size thereafter. Some math reveals that doing full backups each week (which still meets the SLA) will use a third the amount of the tape of a 28-day cycle, in this case.

Good schedules require a lot of complex calculation to work out (and still meet the SLAs). Modern backup software (such as Amanda) allows one to specify the SLA and will create a schedule automatically. A dynamic schedule will be adjusted automatically depending on how much data is actually copied for each backup. Such software will simply inform the SA when to change the tapes in a jukebox.

On a busy (e.g., database) server downtime will be the most critical factor. In such cases consider using LVM snapshots, which very quickly makes a read-only copy of some logical volume using very little extra disk space. You can then backup the snapshot while the rest of the system remains up.

Another strategy is called disk-to-disk-to-tape, in which the data to be backed up is quickly copied to another disk and then written to the slower backup medium later.

Sometimes different applications require independent backup of their data for various reasons (such as different security, retention, or SLA requirements). Examples include servers versus PCs, mobile devices (laptops and smart phones), different databases, email, log data, and so on. Even the backup procedures may be different. For example, an LVM snapshot won’t correctly backup a database that was in the middle of some operation or had data cached in memory at the time.

Policy questions that should be answered by SLA:

· What are the restore requirements – granularity (what points in time can be requested), and time for various types of recovery?

· Where and when will the restores occur?

· What are the most frequent restore requests?

· Which data needs to be backed up?

· How frequently should data be backed up (hourly, daily, weekly, monthly), and with what strategy (full, differential, incremental)?

· How long will it take to backup?

· How many copies to create?

· How long to retain backup copies?

Other Policies

Deciding what to backup is part of your policy too. Are you responsible for backing up the servers only? Boss’ workstation? All workstations? (Users need to know!) Network devices (e.g., routers and switches)? It may be appropriate to use a different backup strategy for user workstations than for servers, for different servers, or even different partitions/directories/files of servers.

An often-overlooked item is the MBR/GPT. Make a copy of it with:

dd if=/dev/sda of=/tmp/MBR.bak bs=$SECTOR_SIZE count=1

Another part of your backup policy is determining how long to keep old backups around. This is called the backup retention policy. In many cases, it is appropriate to retain the full backups indefinitely. In some cases, backups should be kept for 7 to 15 years (in case of legal action or an IRS audit). In some cases, you must not keep certain data for too long or you may face legal penalties.

Such records are often useful for more than disaster recovery. You may discover your system was compromised months after the break-in. You may need to examine old files when investigating an employee. You may need to recover an older version of your company’s software. Such records can help if legal action (either by your company for by someone else suing your company) occurs.

Since Enron scandal (2001) and Microsoft scandals (when corporate officers had emails subpoenaed by DoJ), a common new policy is “if it doesn’t exist it can’t be subpoenaed!” These events may have led to this revision of the FRCP:

FRCP — the Federal Rules of Civil Procedure

These include rules for handling of ESI (Electronically Stored Information) when legal action (e.g. lawsuits) is immanent or already underway. You must suspend normal data destruction activities (such as reusing backup media), possibly make “snapshot” backups of various workstations, log files, and other ESI, classify the ESI as easy or hard to produce, and the cost to produce the hard ESI (which the other party must pay), work out a “discovery” (of evidence) plan, and actually produce the ESI in a legally acceptable manner. An SA should consult the corporate lawyers well in advance to work out the procedures.

It is important to decide where store the backup media (storage policy/strategy). These tapes or CDs contain valuable information and must be secured. Also, it makes no sense to store media in the same room as the server the backup was made from; if something nasty happens to the server, such as theft, vandalism, fire, etc., then you lose your backups too. A company should store backup media in a secure location, preferably off-site. A bank safe-deposit box is usually less than $50 a year and is a good location to store backup media. If on-site storage is desirable, consider a fire-proof safe. (And keep the door shut all the time!) Consider remote storage companies but beware of bandwidth and security issues.

Different storage methods offer different levels of accessibility, security and cost for your backup data:

· Online storage: Sometimes called secondary storage, online storage is typically the most accessible type of data storage. A good example would be a large disk array. This type of storage is very convenient and speedy, but is relatively expensive and vulnerable to being deleted or overwritten, either by accident, or in the wake of a data-deleting virus payload. (HCC has an online storage site in Lakeland.)

· Near-line storage: Sometimes called tertiary storage, near-line storage is typically less accessible and less expensive than online storage. A good example would be an automatic tape library. Near-line storage is used for archival of rarely accessed information, since it is much slower than secondary storage.

· Offline storage: An example of offline storage is a computer storage system which must be driven by a human operator before a computer can access the information stored on the medium. For example, a media library system which uses off-line storage media, as opposed to near-line storage, where the handling of media is automatic.

· Off-site vault: To protect against a disaster or other site-specific problem, many people choose to send backup media to an off-site vault. The vault can be as simple as the system administrator’s home office or as sophisticated as a disaster hardened, temperature controlled, high security bunker that has facilities for backup media storage.

In most cases, a mix of storage methods can be the most effective storage strategy.

Media Replacement Policy (a.k.a. Media Rotation Policy)

Backup media will not last forever. Considering how vital the backups might be, it is a false economy to buy cheap tape or reuse the same media over and over. A reasonable media replacement policy (also known as the media rotation schedule) is to use a new tape a fixed number of times, then toss it. The rotation schedule/replacement policy can have a major impact on the cost of backups and the speed of recovery.

Before using new media for the first time, test it and give it a unique, permanent label (number). Annual backup tapes could be duplicated just in case the original fails.

There are many possible schemes for media rotation. A simple policy is called incremental rotation which means different things to different folks. Basically, you should number the media used for a given cycle, such as D1–D31 for the 31 tapes used for daily backups. After one complete cycle (with each tape used once), the next cycle uses tapes D2–D32; tape D1 gets re-labeled as M1. The 12 monthly tapes M1-M12 are each used once, then M2-D13 is used the following year, etc. The old M1 tape becomes permanently retired (or archived). Thus, a given tape will be used 31 times for daily, 12 times for monthly, and once for yearly backups, a total of 43 uses before you need to replace it.

One of the most popular schemes is called grandfather, father, son (GFS) rotation. (This term predates political correctness.) This scheme uses daily (son), weekly (father), and monthly (grandfather) backup sets. In each set, the oldest tape is used for the next backup. Here’s an illustration (from mckinnonsc.vic.edu.au/vceit/backups/backup_schemes.htm):

• Monday - daily backup to tape #1
• Tuesday - daily backup to tape #2
• Wednesday - daily backup to tape #3
• Thursday - daily backup to tape #4
• Friday - weekly to tape #5. This tape is called Week 1 Backup.

(Of course, you can extend this idea to seven-day schemes as well.)

The next Monday through Thursday you would re-use tapes #1 through #4 for the daily backup set. But next Friday you do another weekly backup to Week 2 backup (tape #6). Week 3 is the same as week 2, using Week 3 backup (tape #7) on Friday. Tapes 5–7 form the weekly backup set.

At the end of the fourth week do a monthly to Month 1 backup tape (tape #8). At the end of the fifth week, Week 1 tape is re-used.

So, the daily tapes are recycled weekly. The weekly tapes are recycled monthly. Monthly tapes are recycled annually. Each year a full annual backup is kept safely stored and never re-used.

Clearly, the daily tapes (tapes 1–4) are used much more often than the weeklies (tapes 5–7) and monthlies (tapes 8–∞). This will mean they will suffer more wear and tear and may fail more readily.

The incremental scheme can be used with any rotation policy, such as GFS or Towers of Hanoi. A tape (e.g. tape 1) will be used as a weekly tape after a month (or two or more) of daily use. After a year (or two or more) of weekly use, it can be error-checked (in case it’s becoming unreliable) and will be used as a monthly tape. After 12 (or 24 or more) uses as a monthly tape it could be “retired” as a permanent yearly backup tape.

Most software that automates backup uses the tower of Hanoi method, which is more complex but does result in a better policy.

Class discussion: Determine backup policies for YborStudent server. One possibility: Full backup (level 0) of /home one per term, level 1 once per month, level 2 each day. The SLA will specify a recovery time of a maximum of 2 working days. Backups should be kept for 6 months after the end of the semester.

For security reasons, you should completely erase the media before throwing the media in the trash. (This is harder than you think!) An alternative is to shred or burn old media, and/or encrypting backups as they are made.

The time for a restore depends if incremental or differential backups are done for daily and weekly. In this scheme, the monthly backup is usually a full backup, but doesn’t need to be if you use a 4 or 5 level differential backup scheme.

Backup Media Choices

There are too many choices to count today. For smaller archives, flash or other removable disks, writable CD-ROMs or DVDs, (These are WORM media) and old fashioned DLT, DAT, DDS-{2,4,8,16} tape drives were popular. (I used a DDS-2 SCSI drive at home.) While using tape for backups is no longer popular for consumers, it is more popular than ever for the enterprise, especially those with enormous amounts of data to backup.

Consider LTO (linear tape open) drives. These are fast (for tape: they max out at 800 MiB/S) and have a very low cost per byte. LTO tapes have a range of densities; LTO4 tapes can hold up to 1.6 TB each; the current (2019) standard, LTO7, holds 6 TB per tape, with a shelf life of 30-50 years, and costs around $50 to $75 each. An auto-loader LTO7 drive can cost around $6,000; a cheap drive is still around $2,700 (2020).

LTO formats change every few years, and an LTP-n drive can only work with LTO-n, n-1, or n-2 tapes. So every 5-8 years, old data must be migrated to the newest format. This gets expensive and can be time-consuming as well.

Tape storage is very cheap, typically less than $20 for 80 gigabytes of storage. (DDS-2 tapes cost about $7 and hold 4 GB each. DDS-4 tapes are fast backups and hold ~100GB each.) However, tapes and other magnetic media can be affected by strong electrical and magnetic fields, heat, humidity, etc. Also, the higher density tapes require more expensive drives (some over $1,000). LTO tape delivers a 2x - 4x savings in operational costs over SSD backup.

In 2010, the record for how much data magnetic tape could store was 29.5GB per square inch. To compare, a quad-layer Blu-ray disc can hold 50GB per disk. Magnetic tapes can be hundreds of feet long. In 2014, Sony announced that it developed new magnetic tape material that can hold 148GB per square inch. With this material, a standard backup tape (the size of an old cassette) could store up to 185TB. To hold the equivalent amount of data would take 3,700 dual-layer 50GB Blu-rays (a stack that would be over 14 feet tall). — extremetech.com

Today (2018), the density of tape storage continues to grow, doubling about every two years.

Compared to hard disks, tape is more reliable (reportedly 1,000 times or more fewer errors), takes zero power to store, and when off-line (unmounted), it is extremely safe from hackers or errors. Modern tape backup units can write data faster than disks can, and can hold much more data. And of course, tape is cheap. Recently (2018), IBM announced a new tape prototype than can hold over 300 TiB on a single tape! The only downside is that to recover the data from tape takes much longer (seconds/minutes) compared to disk (microseconds/milliseconds).

Large data centers such as Google and Microsoft Azure cloud use both disk and tape backups: disk backup at different locations for quick recovery of some errors, and tape backup for safety. (For example, in 2011 Google lost thousands of emails from all disks due to a software bug, but was able to eventually restore all the lost data from their tape backups.)

An external hard drive (less than $100 for 1TB) connected directly to your PC can use the backup program that comes with your operating system (Backup and Restore Center on Windows, and Time Machine on OS X). Most backup software can automate backups of all new files or changed ones on a regular basis. This is a simple option if you only have one PC.

Optical media such as CDs are durable and fairly cheap but take much longer to write. They can be reused less often than magnetic media and are still susceptible to heat and humidity. Optical media can scratch if not carefully handled. Also consider the bulk of the media. If you must store seven years’ worth of backups, it may be important to minimize the storage requirements and expense. A CD-ROM can hold about 700 MiB while a dual-layer Blu-ray can hold 50 GiB. However, if stored and handled correctly, such media can hold data without any maintenance for many decades at least.

A choice becoming popular (since 2008) is on-line storage, e.g., HP Upline, Google GDrive, etc. (for SOHO, Mozy or BackBlaze). (This market changes rapidly so do research on current companies.) The companies offer cheap data storage and complete system backups, provided you have a fast Internet connection. Many colocation facilities (“colos”, usually at network exchange points) provide this service as well to the connected ISPs. If you go this route, make sure all the data is encrypted using industry standard encryption at your site before transmission across the Internet. (Never use any company that uses “proprietary” encryption regardless of how secure they claim it is!) The major danger to this method is the company may not follow best practices to keep data safe and intact, or may simply go out of business without notice.

When backing up large transaction database files, the speed of the media transfer is important. For instance, a 6 Mbps (Megabits per second) tape drive unit will backup 10 gigabytes in about 3 hours and 45 minutes. (In most cases incremental or differential backups contain much less data!)

For legacy IDE controllers, you only choice is a TRAVAN backup drive. Very slow, don’t use! For SCSI drives (such as DDS drives from HP) there are two speeds for the SCSI controller, depending on what devices are on it. A tape drive will slow down the SCSI bus by half, so consider dual SCSI controllers.

For networks, consider a networked backup unit. This would allow a single backup system to be used with many different computers. Thus, you can buy one high-speed device for about the same money as several lower-speed devices. Keep in mind however that a network backup can bring a standard Ethernet network to its knees. (The network only shares 10 Mbps for all users on a SOHO or wireless LAN.) Even a Fast Ethernet (100 Mbps) LAN might suffer noticeable delays and problems.

An excellent choice for single-system backup is a USB disk. Also using SAN/NAS to centralize your storage makes it easy to use a single backup system (robot tapes).

It is a good idea to have a spare media drive (e.g., DLT tape drive), in case the one built into a computer fails when the computer fails. This is especially true for non-standard backup devices that may not be available from CompUSA on a moment’s notice. Regularly clean and maintain (and test) your backup drives.

(While I don’t know of any organization that does this, consider copying old data to new hardware once the old drives are no longer supported or available. If you don’t have a working drive (including drivers), the old backups are useless!)

In the end, most backups still use tape as the most cost-effective solution. Keep in mind that restore from tape can be a very slow, manual process; it may require mounting several tapes to recovery a single file! And if tapes are stored off-site, it may take a day or more just to ship them back to you.

Backup to disk is becoming more popular as the costs of disks go down.

As you can see, many factors must be considered when designing a backup system and its policies. In addition to the ones mentioned earlier, you need to consider the total amount of data to be backed up, total amount to retain, removing old and/or unneeded data, and sanitizing/blinding/anonymizing data that is to be used for reports, research, or training.

Keeping management reports can help answer these questions. Suppose you guessed to keep email backups for 3 months. That’s expensive, but is that too long or too short? An internal study at EMC.com in 2007 determined that for them, over 25% of restore requests were for the same day’s data, and fully 100% of email restore requests were made in two weeks or less after losing email. Without regulatory/legal requirements in their case, they changed the policy and saved a considerable amount of money and effort.

Archival Storage

Magnetic media is not a good choice for very long term, or archival backups. The reason is, over time such media is subject to bit rot: loss of data just from sitting around, even unpowered. How can this happen?

Hard drives use magnetism to store bits of data in disk sectors. These bits can “flip” over time for many reasons, which can lead to data corruption. To mitigate this, hard drives have error-correcting code (ECC) that can sometimes correct flipped bits. However, if enough bits flip then corruption will occur. This will take time; some hard drives have the potential to last with their data intact for decades even if powered down.

Solid-state drives don’t have any moving parts like hard drives. These drives use an insulating layer to trap charged electrons inside microscopic transistors to differentiate between 1s and 0s, in groups (“rows”). Over time, the insulating layer degrades and the charged electrons leak out, thus corrupting data. Powered down, an SSD will see bit rot occur within a few years.

If using such devices for archival backup, it’s a good idea to power them up periodically and let them run fsck. For a hard drive, you should power it up at least once every year or two to prevent the mechanical parts of the drive from seizing up. You should also “refresh” the data by recopying it or use a third-party tool like DiskFresh (which reads then writes each block, checking for bad blocks as it goes). SSDs are a little simpler since they just need to maintain their charge. You can power them up for a few minutes about twice a year, but I still recommend using some tool to check for bad rows.

Even if you kept your backups powered up all the time, bit rot can occur!

Your best choice for archival storage is likely LTO tape if you can afford it. Otherwise, consider optical media for this.

Enterprise Backup Systems

Most enterprise-wide backup systems are designed with a client-server architecture. Each host holding data to be backed up runs a backup client, known as a backup agent. This agent communicates with the (dedicated) backup server. Each backup agent can send messages to, or answer queries from, the backup server.

The server maintains the backup metadata (such as the catalog of what is backed up onto which tapes). In addition, the server sends commands to the agent to gather data and return it, when performing an actual backup operation.

Typically, the backup server is also connected with one or more storage nodes. which is the hardware/software responsible to reading and writing to/from the backup media.

If your organization does not use a SAN for the data, the clients communicate with the server through your network. Network backup can be dangerous, as the network capacity may not be sufficient, or may cause timeouts and excessive latency for other applications using the network at the same time. It’s even worse when the storage nodes are not directly attached to the backup server. Proper backup systems will limit their network utilization and use a staggered schedule.

If you do have a SAN, you attach storage nodes to it and run the server there; no backup data needs to travel on the network (but backup metadata will). If the storage node isn’t directly connected to the SAN unit, it connects through the dedicated storage network (e.g., FibreChannel), which should not cause any utilization or latency problems.

In a smaller setup, direct attached backup can be used to back up data from a single client. The agent and server are just one application, running on the client, and the storage node is also directly attached to the client.

Since tape storage nodes are slow to read and especially slow to write, today many systems put a disk between the backup server and the slower storage node.

Consumables Planning (Budgeting)

Suppose a medium to large organization uses 8 backup tapes a day, 6 days a week, means 48 tapes. If your retention policy is to keep 6 months’ worth of incrementals, that’s 1,248 tapes needed. High capacity DLT tapes might go for $60, so you would need $74,880.00. In the second part of the year, you only need new tapes for full backups, an additional 260 tapes (say) for $15,600, or more than $90k for the first year ($7,500 per month). (Not counting spares or the cost of drive units.) Changes to the policies can result in expense differences of over $1000 per month!

As backup technology changes over the years, it is important to keep old drives around to read old backup tapes when needed. You should keep old drives around long enough to cover your data retention policies.

Try to avoid upgrading your backup technology (drives, tapes, software) every few years, or you’ll end up with many different and incompatible backup tapes. Note the budget must include amortization of the drive expenses.

Tools for Archives and Backups

Archives are easier to make than backups, so most tools create archives. A tool cannot make a “backup” without knowing the underlying filesystem intimately, i.e. it must parse the filesystem on disk. The reason is twofold:

· Different filesystems exhibit different semantics. No single tool supports all the semantics of all filesystem types. You need a different backup tool per FS type.

· The kernel interface obfuscates information about the layout of the file on disk. You have to go around the kernel, direct to the device interface, to see all the information about a file that is necessary for recording it correctly.

If you want to store the kernel’s view of files along with all of the semantics the filesystem provides and none of the non-filesystem objects that might appear to inhabit the filesystem (such as sockets or /proc entries), use the native dump program (and restore) provided by your vendor specifically for that purpose (whatever they name it), for your filesystem type (note for Reiser4Fs you can just use star). dump uses /etc/dumpdates to track dump levels (that is, dump supports differential backups). Some of the differences between dump (for backups) and tar, cpio, or pax for archives are:

1. dump is not confused by object types that the particular operating system has defined as extensions to the standard filesystem; it also does not attempt to archive objects that do not actually reside on the filesystem, e.g. doors and sockets. Consider what GNU tar does to UNIX-domain sockets: it archives them as named pipes. They are not on the filesystem, so they should not be archived at all. dump handles this situation correctly.

2. tar ignores extended attributes (and ACLs unless you use the --acls or the --xattrs option when creating, adding to, or extracting from, the archive), while a native dump program will correctly archive them. (A new extensible backup format known as pax will archive ACLs, SELinux labels, and other meta-data stored in extended attributes. A tool called star uses a similar format. Find out about star on the web.)

3. tar cannot detect reliably where holes are. dump is not confused by files with holes (such as utmp); it will dump only the allocated blocks and restore will reconstruct the file with its original layout.

4. tar uses normal filesystem semantics to read files. That means it modifies the access times recorded in the filesystem inodes, when extracting files. This effectively deletes an audit trail which you may require for other purposes. (Modern Gnu tar has extra options to handle this correctly.) dump parses and records the filesystem outside of kernel filesystem semantics, and therefore doesn’t modify the filesystem in the process of copying it.

Not all filesystem types support dump and restore utilities. When picking a filesystem type, keep in mind your backup requirements.

GNU tar is a popular tool for archiving the user’s view of files. Another standard choice is cpio, rarely used anymore. Note neither tool is standardized by POSIX. A new standard tool, based on both (and hopefully better than either) is pax. These, combined with find and some compression program (such as gzip or bzip2) are used to make portable archives.

You can ask find to locate all files modified since a certain date and add them to a compressed tar archived created on a mounted backup tape drive. A backup shell script can be written, so you don’t end up attempting to backup /dev or /proc files. (See backup script on web page.) (Note! Unix tar ≠ GNU tar; use the GNU version. Unix tar doesn’t handle backups that require multiple tapes.)

For either backups or archives, use crontab to schedule backups according to the backup schedule discussed earlier. (Show ls -d /etc/*cron*.) If your company prefers to have a human perform backups, remember that root permission will be needed to access the full system. Often the backup program is controlled by sudo or a similar facility, so the backup administrator doesn’t need the root password.

The find command can be used to locate which files need to be backed-up. Use “find / -mtime -x” for incrementals and differentials to find files changed since x (you can store x as a timestamp on files, for example /etc/last-backup.{full,incremental,differential} ). Use find with Gnu tar roughly like this:

mount /dev/removable-media /mnt

find / -mtime -1 -depth |xargs tar -rf --xattrs \
/tmp/$$

gzip /tmp/$$; mv /tmp/$$.gz /mnt/incremental-6-20-01

touch /etc/last-backup.incremental

umount /dev/removable-media

(Instead of “-mtime -1” to mean less than 24 hours ago, you can use “-newer x”, where x is some file.)

Commercial software is affordable and several packages are popular for Unix and Linux systems, including “BRU” (www.BRU-Pro.com), VERITAS, Seagate’s BackupEXEC, and “Arkeia” (www.arkeia.com). (I haven’t used these; I just use tar and find.)

Of course, there are free, open source choices as well, such as KDE ark, or amanda (network backups). One of the best is BackupPC. Another is Bacula (or Bareos, a fork of the original). Some of these can create schedules, label tapes, encrypt tapes, follow media rotation schedules, etc.

Be careful of bind mounts and private mounts when performing backups, especially when using home-grown scripts that use find, tar, etc.! Tools such as Bareos will detect symlinks and bind mounts and not “descend into” those, but not all tools will (or may not by default). Bareos backs up the symlink, not duplicates of the files.

In addition, bind mounts are only known to the kernel, in RAM, and are never backed up; you need to list those in fstab to restore those “views”.

Finally, if you don’t run the backup with root privilege, you may only back up the polyinstantiated (a per-user “private view”) part of a directory that the process can see.

The most important tool is the documentation: the backup strategy, media types and rotation schedule, hardware maintenance schedule, location of media storage (e.g., the address of the bank and box number), and all the other information discussed above. This information is collectively referred to as the backup policy. This document should clearly say to users what will be backed up and when, and what to do and who to contact if you need to recover files.

Note: Whatever tools you use, make sure you test your backup method by attempting to use the recovery procedure. (I know someone who spent 45 minutes each working day doing backups for years, only to realize none of the backups ever worked the first time he attempted to recover a file!)

(Parts of this section were adopted from Netnews (Usenet) postings in the newsgroup “comp.unix.admin” during 5/2001 by Jefferson Ogata. Other parts were adopted from The Practice of System and Network Administration, by Limoncelli and Hogan, ©Addison-Wesley.)

Backups with Solaris Zones and Other Containers

Solaris zones, Docker containers, and similar technology contain a complication for backup: many standard directories are actually mounted from the global zone via LOFS (loopback filesystem). These should only be backed up from the global zone. The only items in a local zone needing backup (usually) are application data and configuration files. Using an archive tool (such as cpio, tar, or star) will work best:

find export/zone1 -fstype lofs -prune -o -local \
| cpio -oc -O /backup/zone1.cpio

Whole zones can be fully or incrementally backed up using ufsdump. Shut down the zone before using the ufsdump command to put the zone in a quiescent state and avoid backing up shared file systems, with:

global# zlogin -S zone1 init 0

Solaris supports filesystem snapshots (like LVM does on Linux) so you don’t have to shut off a zone. However, it must be quiesced by turning off applications before creating the snapshot. Then you can turn them back on and perform the backup on the snapshot: Create it with:

global# fssnap -o bs=/export /export/home #create snapshot
global# mount -o ro /dev/fssnap/0 /mnt # then mount it.

You should make copies of your non-global zones’ configurations in case you have to recreate the zones at some point in the future. You should create the copy of the zone’s configuration after you have logged into the zone for the first time and responded to the sysidtool questions:

global# zonecfg -z zone1 export > zone1.config

Adding a backup tape drive

Added SCSI controller (ADAPTEC 2940)

Added SCSI DDS2 Tape drive

On reboot kudzu detected and configured SCSI controller and tape device

Verify devices found with ‘dmesg’: indicate tape is /dev/st0 and /dev/nst0

Verify SCSI devices with ‘scsi_info’ (/proc/scsi)

Verify device working with: mt -f /dev/st0 status

Create link: ln -s /dev/nst0 /dev/tape

Verify link: mt status

Note: /dev/st0 causes automatic tape rewind after any operation, /dev/nst0 has no automatic rewind, but most backup software knows to rewind before finishing. If you plan to put multiple backup files on one tape, you must use /dev/nst0.

Common Backup and Archive Tools:

mt (/dev/mt0, /dev/rmt0)

st (/dev/st0, /dev/nst0 - use nst for no auto rewind)

mt and rmt (remote tape backups); use like: mt -f /dev/tape command, where command is one of: rewind, status, erase, retention, compression, (toggle compression on/off), fsf count (skip forward count files), eod (skip to end of data), eject, ...

dump/restore (These operate on the drive as a collection of disk blocks, below the abstractions of files, links and directories that are created by the file systems. dump backs up an entire file system at a time. It is unable to backup only part of a file system or a directory tree that spans more than one file system.)

tar, cpio, dd, star (and pax and spax)

A comparison of these tools (Note Gnu tar has stolen most of the good ideas from the other tools, accounting for its popularity):

· cpio has many more conversion options than tar and supports many formats, but is a legacy tool rarely used today. Gnu cpio does not support the pax format. (Use “-H ustar”.) The default format used has many problems on modern filesystems, such as crashing with large inode numbers

· cpio can be used as a filter, reading names of files from stdin. (Gnu tar has this ability too.)

· On restore, if there is corruption on a tape tar will stop at that point. cpio will skip over corruption and try to restore the rest of the files.

· cpio is reportedly faster than tar and uses less space (because tar uses 512-byte blocks for every file header, cpio just uses whatever it needs only).

· tar supports multiple hard links on FSes that have 32-bit inode numbers, but cpio can only hand up to 18 bits in the default format (which can be changed with “-H”).

· tar copies a file with multiple hard links once, cpio each time.

· Gnu tar can support archives that span multiple volumes; cpio can too but is known to have some problems with this.

· Modern tar (star) supports extended attributes, used for SE Linux and ACLs. cpio doesn’t in the default format.

· pax is POSIX’s answer to tar and cpio shortcomings. pax attempts to read and write many of the various cpio and tar formats, plus new formats of its own. Its command set more resembles cpio than tar; with find and piping, this makes for a nicer interface. To use extended attributes (including SE Linux labels) and ACLs, the “pax” archive format (or equivalent) must be used and not all systems support this (POSIX required) format. Check available formats with “pax -x help”. Use star or spax instead, if necessary. (Note the LSB requires both pax and star.)

· dd is an old command that copies and optionally converts data efficiently. It can convert data to different formats, block sizes, byte orders, etc. It isn’t generally used to create archives, but is often used to copy whole disks or partitions (to other disks/partitions when the geometry is different), copy large backup files, to create remote archives (“tar ...|ssh ... dd ...”), and to copy and create image files. (The command was part of the ancient IBM mainframe JCL utility set (and has a non-standard syntax as a result); no one knows anymore what the name originally meant.)

There are two caveats to using dd to copy disks (or disk images): If the source and destination drives have different sector sizes (e.g. 512 and 4096 bytes per sector), using dd won’t work well because partition tables (MBR or GPT) contain positions and sizes in sector counts; you’d need to manually overcome that somehow. Secondly, unless the source and destination drives have the exact same size, the GPT backup partition table won’t be copied to the right location (the very end) on the destination drive; the extra space after that won’t be usable. Using tools such as gparted can account for such issues.

libarchive is a portable library for any POSIX system, including Windows, Linux, and Unix, that provides full support for all formats. Currently it includes two front-end tools built with it: bsdtar and bsdcpio. These will support extended attributes and ACLs, when used with the correct options and with the “pax” (and not the default “ustar”) format.

There are two variations of standard dd worth mentioning. Gnu ddrescue is designed to rebuild files from multiple passes through a corrupted filesystem and from other sources. dcfldd is designed to make verifiable copies, suitable for use as evidence. (The name comes from the Department of Defense Computer Forensics Lab, where the tool was invented.)

Some examples of cpio and pax will be shown below. dd and tar were discussed in a previous course. Google will show many tutorials for all these commands, if needed; the man pages are only good for reference.

Additional Points

If you need to backup large (e.g., DB) files, use a larger blocksize for efficiency.

Many types of systems can use LVM, ZFS or some equivalent that supports snapshots for backup without the need to taking the filesystem off-line.

NAS (and some SANS) systems are commonly backed up with some tool that supports NDMP (the Network Data Management Protocol), which usually works by doing background backup to tape of a snapshot. This has a minimal effect on users of the storage system.

If you need to copy file hierarchies (e.g., your home directory and all subdirectories), one popular (and good) way is to use tar, like so:

tar --xattrs -cf - some_directory | \
ssh remote_host 'cd dir && tar --xattrs -xpf -'

You can do this with pax as easily:

cd dir; pax -w -x pax . | \
ssh user@host 'cd /path/to/directory && pax -r -pe'

To ensure the validity of backups and archives, you should compute and compare checksums. Here’s one way when using tar:

tar -cf - dir |tee xyz.tar |md5sum >xyz.tar.md5

Many archiving tools ignore extended attributes (and hence, ACLs and SE Linux labels). Backup and restore by saving ACLs (or all extended attributes) to a file, then applying the file after a restore, as follows:

Backup: cd dir; getfattr -R --skip-base . > backup.attrs
                    use normal backup tool here
Restore: cd dir; use normal backup tool here
                    setfattr --restore=backup.attrs
         rm backup.attrs

Or use an archiving tool that supports extended attributes, ext* attributes, and ACLs: star H=exustar -xattr -c path >archive.tar
and to restore use: star -xattr -x <archive.tar

(The POSIX standard tool pax can support this, but only when using the non-default archive type “pax”. Run “pax -x help” to see if the pax archive type is available on your system. Modern Gnu tar has “--xattrs” option to use with the -c or -x options; using this forces the pax archive type.)

Additional Tools

Jörg Schilling’s star program currently supports archiving of ACLs and extended attributes. IEEE Std 1003.1-2001 (“POSIX.1”) defined the “pax interchange format” that can handle ACLs and other extended attributes (e.g., SELinux stuff). Gnu tar handles pax and star formats. There is also a spax tool that supports the star extensions.

Another tool that supposedly easily and correctly backs up ACLs, ext2/3 attributes, and extended attributes (such as for SELinux) is “bsdtar”, a BSD modified version of tar that uses libarchive.so to read/write a variety of formats.

Amanda (a powerful network backup utility, producing backup schedules automatically but relying on other tools for the actual backups. Most other tools don’t support schedule creation.)

BackupPC (an “enterprise-grade” utility)

Bacula (works very well and is popular but has a steeper learning curve than most. See also Bareos, a fork of Bacula by some of the original developers.)

bru (commercial sw)

Clonezilla (similar to commercial Ghost)

HP Data Protector (commercial sw, used at HCC)

Mondo Rescue

unison (uses rsync)

LuckyBackup (similar to Unison)

vranger (commercial sw, designed for VMware backups, from quest.com)

foremost, ddrescue, ... These are not backup tools, but recovery tools when a filesystem is corrupted and you need to salvage what you can.

Duplicity (Uses rsync to create encrypted tar backups.)

Rsnapshot (A wrapper around rsync. Rsync is not designed for backup, but can be used for that in some cases.) This tool makes a copy of any files modified since the last snapshot (which takes a lot of disk space), and makes a hard link to any unmodified files.

rdiff-backup (stores meta-data in a file, so can easily restore files to alternate systems. Usually produces smaller backups than Rsnapshot, is easier to use, but is slower.) This tool stores the diff between the newest version and the previous version. Thus, restoring a very old version can take a long time. This tool also can provide useful statistics.

Storix (supports AIX & Linux).

s3ql (backs up to Amazon's S3 cloud).

grsync (GUI for rsync).

Other tools can be used to backup (or migrate) data across a network, including tar pipped through SSH, BitTorrent (can use multiple TCP streams at once), and others.

cpio examples:

cd /someplace/.. # the parent of “someplace”
find someplace -depth \
| cpio -oV --format=crc >someplace.cpio
# crc=new SysVr4 format with CRCs

Note that when creating an archive, “-v” means to print filenames as processed; “-V” means to print a dot per file processed.

# Restore all; -d means to create directories if needed;
# -m means to preserve modification timestamps:
cpio -idm < file.cpio

# Restore; wildcards will match leading dot and slashes:
cpio -idum glob-pattern <file.cpio

Without the -d option, cpio won’t create any directories when restoring. Without the -u option, cpio won’t over-write existing files. Add -v to show files as they are extracted (restored).

cpio -tv < file.cpio # table of contents (-i not needed but allowed)

Command to backup all files:
find . -depth -print | \
cpio -o --format=crc > /dev/tape

Command to restore complete (full) backup:
cpio -imd < /dev/tape

Command to get table of contents:
cpio -tv < /dev/tape # -v is long listing

pax Examples (Note you need “-pe” when writing and reading):

# List contents [matching patterns]:
pax -f [-v] files.pax [pattern...]

# Create archives:
find -depth ... | pax -w -pe -x pax > pax.out
pax -w -pe -x pax -f files.pax path... # recursive

# Extract from archive:
pax -r [-v] [pattern...] < files.pax
pax -r [-v] -pe < pax.out # -pe means preserve everything
(spax also has -acl option.)

Avoid using absolute pathnames in archives. tar strips out a leading “/” but cpio and pax do not!

Some versions of pax on Linux do not fully support “-x pax” (such as on Fedora currently); use “-x exustar” instead.

To duplicate some files or a whole directory tree requires more than just copying the files. Today you need to worry about ACLs, extended attributes (SELinux labels), non-files (e.g., named pipes, or FIFOs), files with holes, symlinks and hard links, etc. Gnu cp has options for that but cannot be used to copy files between hosts.

The best way to duplicate a directory tree on the same host is Gnu cp -a, or if that is not available use pax:

pax -pe -rw -x pax olddir newdir

To copy a whole volume to another host you can use dump and then transfer that, then restore it on the remote system.

Tar or pax is often used to duplicate a directory tree to the same host if Gnu cp isn’t available. These tools can also be used to duplicate a directory tree to a different host, via ssh:

tar czf - -C sourcedir files \
| ssh remote_host 'tar xzf - -C destdir'

Use tar with ssh if this is a complete tree transfer. For extra performance, use different compression (e.g., “-j” for bzip2). You may need extra options to control what happens with ACLs, links, etc.

Using rsync over ssh often performs better than tar if it is an update (i.e., some random subset of files need to be transferred). (Show ybsync alias on wpollock.com.)

(Show backup-etc script.)

Files or backups and archives can be copied between hosts with scp or rsync.

Using scp

scp is a simple way to copy files securely between hosts, using SSH. The syntax is simple:

scp file... user@host:path

For example, to copy the file foo to the home directory of ua00 on YborStudent, use the command “scp foo ua00@yborstudent.hccfl.edu:.” (Note that a relative path is relative to the remote user’s home directory.)

You can copy in either direction, and scp has many options to control the attributes of the copied file; see the man page for details.

Using rsync

rsync is a versatile tool that does incremental archives, either locally or across a network. It can also be used to copy files across a network. rsync has about a zillion options but is worth learning. To understand its options (and diagnose performance problems) you need to understand how rsync works.

First, rsync reads last modified timestamp and the length of both the source and destination files. If they are the same, it does not transfer anything. If either is different, rsync reads both the source file and the destination file, and performs progressively smaller hashes on them, to determine which parts differ. After this analysis, rsync copies the similar parts of the destination file to a new file, and then copies the different parts from the source file and inserts them into the proper places in the new file. Finally, rsync copies the updated new file to the destination file location. Note the timestamp of the new file is the current time, not the time of the source file; usually you will need to include the “-t” option to update the destination file’s timestamp to match the source’s timestamp.

You can speed up rsync by eliminating the new file and updating the destination file in-place. However, that is dangerous if your network connection is unreliable. It does preserve hard links though.

rsync uses compression to reduce bandwidth use and make the transfer faster. You can use SSH to make the transfer secure.

You can run rsyncd as a daemon on the remote end (port 873). This makes the transfer go faster (no need to fork rsync each time, and have it calculate file lists each time), more controllable (via a config file), and allows you to push files to (say) a Windows disk that has a different layout and paths. As a server daemon, rsync acts like a modern, super-charged FTP server (and is often used to provide mirror sites on the Internet). You don’t get SSH security however.

A serious performance concern is when copying files from different systems. The timestamps may not match, causing rsync to unnecessarily copy the whole file. This can happen when two systems have clocks that are slightly off. It can also happen when using filesystems with different granularity for the timestamps. For example, most Flash drives use FAT, which records timestamps only to within two seconds. You need to use the rsync --modify-window=1 option in that case, to have rsync treat all timestamps within one second as equal.

The syntax is “rsync options source destination”. Either the source or destination (but not both) can be to remote hosts. To specify a remote location, use “[user@]host:path”. A relative path is relative to the user’s home directory on host. (Filenames with colons can cause problems, not just with rsync. Colons after slashes work fine, so use “./fi:le” instead of “fi:le”.)

Note! If source is a directory, a trailing slash affect how rsync behaves:

rsync -a source host:destination

will copy the directory source into the directory destination, so source/foo becomes destination/source/foo. On the other hand:

rsync -a source/ host:destination

copies the contents of source into the directory destination, so source/foo becomes destination/foo.

One of the reasons to use rsync over tar is the control it provides over what to copy. You can include or exclude files and directories. For example, to exclude all files or directories (and their contents) named .git, add “--exclude=.git”. (If you have files with that name and only want to skip the directories, add a trailing slash.) To exclude just one specific directory, you must specify an absolute pathname. Similarly, use “--exclude=*.o” to skip all files ending in “.o”. You can use glob patterns with some extensions. This can be complex; see the rsync man page or this summary. Note that unlike shell globs, there is a different meaning between patterns with trailing “*”, “**”, and “***”.

If your “--exclude” options (you can specify this multiple times) skip too much, you can add “--include” options to re-include them.

The archive (“-a”) option is a shorthand for several others. It means to preserve permissions, owner and group, timestamps, and symlinks. The “-z” option enables compression. The “-R” option copies pathnames, not just the filenames. The option “-u” says don’t copy files if a newer version exists at the destination.

If you want destination to be an exact copy of source, you also want to delete any files on destination that weren’t present in source; add the --delete option for this.

To make a backup of /home to server.gcaw.org with rsync via ssh:

rsync -avre "ssh -p 2222" /home/ server.gcaw.org:/home
rsync -azv me@server.gcaw.org:documents documents
rsync -azv documents me@server.gcaw.org:documents

rsync -HavRuzc /var/www/html/ example.com:/var/www/html/
# or copy ~/public_html to/from me@example.com:public_html/

rsync -r ~/foo ~/bar # -r means recursive
rsync -a ~/foo ~/bar # -a means archive mode
rsync -az ~/foo remote:foo # copies foo into foo
rsync -az ~/foo/ remote:foo # copies foo's contents
rsync -azu ~/foo/ remote:foo #don't overwrite newer files

The meanings of some commonly used options are:

-v = verbose,
-c = use MD4 (not just size and last-mod time) to see if dest file different than src,
-a = archive mode = -rlptgoD = preserve almost everything,
-r = recursive, -R = preserve src path at dest, -z = compress when sending,
-b = backup dest file before over-writing/deleting,
-u = don’t over-write newer files,
-l = preserve symlinks, -H = preserve hard links, -p = preserve permissions,
-o = preserve owner, -g = preserve group, -t = preserve timestamps,
-D = preserve device files,
-S = preserve file “holes”,
--modify-window=X = timestamps match if diff by less than X seconds
(required with FAT, which only has 2 second time precision)
--delete = remove files from destination not present in source,
-z = compress data at sending end, and decompress at destination

Some other options include using checksums to validate the transfer, renaming existing files at destination (with a trailing “`”) rather than overwrite them, and a bandwidth limiting option so your backup doesn’t overwhelm a LAN.

Modern rsync has many options to control the attributes at the destination. You can use --chmod, xfer ACLs and EAs. You can create rsyncd.conf files, to control behavior (and use a special SSH key to run a specific command), and define new arguments via ~/.popt. But older rsync versions don’t have all those features. For example, old rsync had no options to set/change the permissions when coping new files from Windows; umask applies. (You can use special ssh tricks to work around this, to run a “find ... |xargs chmod...” command after each use of rsync.)

A good way to duplicate a website is to set a default ACL on each directory in your website. Then all uploaded files will have the umask over-ridden:

cd ~/public_html # or wherever your web site is.
find . -type d -exec xargs setfacl -m d:o:rX {} +

This ACL says to set a default ACL on all directories, to provide “others” read, plus execute if a directory. (New directories get default ACL too.) With this ACL, uploading a file will have 644 or 755 permissions, rather than 640 or 750.