Disaster Recovery - Policy and Procedures Outline

A disaster recovery plan is a document that defines the policies and procedures for dealing with various types of disasters that can affect an organization, especially the organization's IT (Information Technology) infrastructure. A disaster is any event that has a significant impact on an enterprise's ability to conduct normal business. This plan includes the information and procedures needed to resume an organization's operation after some sort of disaster. Sometimes the plan is split into several plans, one to address recoverable disasters (e.g., loss of a server) and a more comprehensive business continuity plan for use in total loss situations (e.g., a hurricane takes out New Orleans and businesses must relocate to resume).

A disaster recovery plan may be known by different names such as “business continuity planning” (BCP), “business resumption planning”, “crisis management”, “emergency response planning”, “contingency planning”, “business continuity management”, “security and risk analysis”, and others. Generally speaking, a DRP deals with disasters from an IT perspective, while BCP deals with disasters from a business point of view (e.g., issues such as relocating an office or key personnel replacements).

DRPs should be developed by both management and senior system administrators working together.

Statistic: 94% of companies suffering from a catastrophic data loss do not survive; 43% never reopen and 51% close within two years. (Source: Critical Incident Protocol—A Public and Private Partnership PDF, A report from Michigan State University and the DoJ) Similar numbers are found from other studies and surveys.

The following list of reasons for having a DRP was adapted from an article posted at informit.com, Legal Requirements for Disaster Recovery Planning: Common Facts and Misconceptions, by Leo Wrobel. Dated: Aug 3, 2007.

Many organizations question the value (in a business sense) of the possibly high cost of developing and maintaining such plans. Even a small business or organization, if dependent on an IT infrastructure, should develop some sort of disaster recovery plan.

In many cases there are laws and regulations that will require an organization to have such a plan. In addition, some other laws and regulations require an organization to be able to provide certain information. This implies that the information must be kept safe in case of a disaster, and thus that the organization will need a disaster recovery plan. Examples include (but are by no means limited to):

The Sarbanes-Oxley act of 2002 (otherwise known as SOX 2002) is often cited as a reason for disaster recovery planning. This does not require a disaster recovery plan explicitly but does require that certain corporate data be available when needed.
Banks and financial institutions must have a plan. The Federal Financial Institutions Examination Council is a formal U.S. interagency (FRB, FDIC, etc.) body empowered to prescribe uniform principles, standards, and report forms for the federal examination of financial institutions. In March of 2003, the council released its Business Continuity Planning handbook, used by examiners when evaluating disaster recovery plans.
Stockbrokers must have a plan. The National Association of Securities Dealers (NASD) has adopted rules that require all its members to have business continuity plans. As of June 14, 2004, the rules apply to all NASD member firms. (The requirements are specified in Rule 3510.)
Power Utilities must have a DRP. The Federal Energy Regulatory Commission (FERC) adopted Title XII of the Energy Policy act of 2005 (16 USC 824o). This authorizes the FERC to create an Electric Reliability Organization (ERO), which will have the capability to adopt and enforce reliability standards for “all users, owners, and operators of the bulk power system” in the United States.
Currently (2014), around the world, accountability and reporting standards require disaster recovery plans for electric and other utilities. In North America for example, North American Electric Reliability Corporation (NERC) is the regulatory agency that requires “CIP compliance: CIP-009 Ensures that recovery plan(s) are put in place for Critical Cyber Assets and that ... plans follow established business continuity and disaster recovery techniques and practices. ...”. (Although NERC has been active since 1968, it was a different organization before 2006, when it became the FERC's ERO.)
Telecommunications utilities (telecoms) should have disaster recovery plans, but under current law may not. Telecommunications utilities are governed on the federal level by the Federal Communications Commission (FCC) for interstate services and by state Public Utility Commissions (PUCs) for services within the state. The FCC has created the Network Reliability and Interoperability Council (NRIC), which develops recommendations for the FCC and telecoms to “insure [sic] optimal reliability, security, interoperability and interconnectivity of, and accessibility to, public communications networks and the Internet”. Unfortunately, there doesn't seem to be any provision that requires telecommunications carriers to have a disaster recovery plan. Worse, most telecoms (and Internet Service Providers) waive most liability.
Health care providers must have a DRP. HIPAA is an acronym for the Health Insurance Portability and Accountability act of 1996 (Public Law 104-191, also known as the Kennedy-Kassebaum act), which amended the Internal Revenue Code of 1986. The legislation called upon the Department of Health and Human Services (HHS) to publish new rules that will ensure security standards protecting the confidentiality and integrity of “individually identifiable health information,” past, present, or future. This requires covered entities to ensure the confidentiality, integrity, and availability of all electronic protected health information (ePHI) that the covered entity creates, receives, maintains, or transmits. It also requires entities to protect against any reasonably anticipated threats or hazards to the security or integrity of ePHI, protect against any reasonably anticipated uses or disclosures of such information that are not permitted or required by the Privacy Rule, and ensure compliance by their workforce. Required safeguards include application of appropriate policies and procedures, safeguarding physical access to ePHI, and ensuring that technical security measures are in place to protect networks, computers and other electronic devices. You can read more about this from the NIST HIPAA Guide.
In 2009, HIPAA grew more teeth with the passage of ARRA (The American Recovery and Reinvestment act), specifically the HITECH act. HITECH Subtitle D details many changes to privacy, breach notification, required audits, and increased penalties.
The Department of Labor has adopted numerous rules and regulations in regard to workplace safety of companies with more than 10 employees as part of the Occupational Safety and Health act (OSHA). This law (29 USC 654) is usually construed to mean such companies must have disaster recovery policies, not so much to protect the IT infrastructure or to limit legal liability, but to protect employees' health and safety especially in the event of a disaster.

A few other laws and regulations that may mandate having a disaster recovery plan (or indirectly require one) include:

The Foreign Corrupt Practices act of 1977.
Internal Revenue Service (IRS) Law for Protecting Taxpayer Information.
Food and Drug Administration (FDA) mandated requirements.
Homeland Security and terrorist prevention.
Pandemic (Bird Flu) prevention.
ISO 9000 certification.
Requirements for radio and TV broadcasters.
Contractual obligations to customers, business partners.
Document protection and retention laws, including the FRCP (the Federal Rules of Civil Procedure).
Personal identity theft laws and insurance.
GLB (Gramm-Leach-Bliley act).
DMCA (Digital Millennium Copyright act).
FMA (Financial Management and Accountability act).
FERPA (Family Educational Rights and Privacy act).
CFAA (Computer Fraud and Abuse act).
The US Securities and Exchange Commission (SEC) Office of Compliance Inspections and Examinations (OCIE) Cybersecurity Initiative (PDF).
Local (city, county, state) and industry specific laws and regulations.
European Union Data Protection Directive, Singapore MAS regulations, and other International (and foreign country) laws and regulations.

You as system administrator are at least partially responsible for making your organization compliant with applicable laws and regulations that affect the organization's IT infrastructure. You should have a talk with the organization's legal representative and make sure they find out what requirements pertain to your situation. Note that even for a lawyer, there is no way to easily or accurately keep up with all the changes in the laws and regulations. Every so often a reassessment should be made to make sure your plans are in compliance. (I.e., when the laws or regulations change, when your contracts change, when the nature of your business changes or expands.) In some cases you may be required to review your DRP annually or every few years. Typically the DRP should be reviewed once every one to three years.

There are numerous standards, laws, and regulations that will require compliance, and several countries and industries will require periodic audits for compliance. Some international standards to consider for this include ISO/IEC 27001, 27002, and 27005.

Disaster recovery is all about risk management. The cost of ignoring disasters can be very high, including total collapse of the enterprise. As noted above, in most cases there are legal penalties for not having a proper plan. Building a good DRP requires an understanding of your business, your IT infrastructure, and your legal, regulatory, and fiduciary responsibilities.

The first step is to understand the risks your enterprise faces. This is often called a Business Impact Analysis. You need to answer the following questions:

What are the most critical functions or systems for your organization?
What would be the impact if these were severely interrupted?

(For example, for a college you could ask “Could the system being down impact instruction, research, grants, or external funding?”)

Before working out disaster recovery policy or procedure documents, you must know what the budget will be. The budget is the annualized cost of doing nothing. This in turn requires a risk analysis, which requires determining the risks for the systems identified in the business impact analysis. Doing a risk analysis is difficult and is not needed frequently, so the best advice may be to use a consultant who specializes in that area. The budget should be documented in your DRP. You can roughly determine a budget for each type of disaster that has a significant probability of occurrence with this formula:

budget_disaster = cost_disaster × probability_disaster
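
As a rough illustration of this formula, here is a minimal Python sketch. The function name annual_budget is made up for this example, and the figures are the ones used in the flood and earthquake examples in this section. The fractions module is used only to keep the arithmetic exact.

```python
from fractions import Fraction


def annual_budget(cost_disaster, probability_disaster):
    """Annualized cost of doing nothing: cost times the yearly probability."""
    return cost_disaster * probability_disaster


# Flood: $10 million loss, 1 chance in a million per year.
flood = annual_budget(10_000_000, Fraction(1, 1_000_000))
# Earthquake: $50 million loss, 1 chance in 5,000 per year.
quake = annual_budget(50_000_000, Fraction(1, 5_000))

print(f"flood budget: ${float(flood):,.2f} per year")
print(f"earthquake budget: ${float(quake):,.2f} per year")
```

A real risk analysis would of course use probabilities and costs estimated for your own site, not these round numbers.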

Implementing a DRP includes implementing various mitigation measures.

Mitigation measures are techniques, policies, and procedures that either reduce the impact (expense) of a disaster, reduce the time to restore vital services, reduce the probability of the disaster occurring, or some combination of these. Mitigation measures are also known as mitigation strategies, measures, preventative measures, or simply mitigations.

Mitigation measures can include various types of disaster insurance (some insurance companies offer specialized IT insurance policies for this). Often several mitigation measures can be used together, with one amplifying the effect (or reducing the cost to implement) of the others. When performing the risk analysis the various possibilities can be compared by applying this formula for each possible group of mitigations options, summing the dollar amounts for each type of disaster:

( cost_disaster – ( savings_mitigation – cost_mitigation ) ) × probability_disaster

(Where “probability_disaster” is the probability of the disaster occurring in one year, after applying mitigations.) Some mitigation measures can be very effective and inexpensive, while others can be very expensive but effective. (Some measures are not effective at all, in some situations.) Also remember that some mitigation measures are likely required for any organization regardless of cost.

Some examples: If your site is in an area where the risk of a serious flood occurring in the next 12 months is 1 chance in a million, and the cost of the flood would likely be $10 million, then the cost of doing nothing is expected to be $10 per year. (This is the budget amount.) In this case it wouldn't be cost effective to implement any mitigations that cost money, not even buying flood insurance. However, you may have legal requirements to have a flood evacuation plan for employee safety, or to implement some other minimal mitigations. You could also implement low cost mitigations, such as putting critical IT servers on the second floor instead of the ground floor (when moving in; changing this later will have a cost).

On the other hand, if you are located in an area where the chance of an earthquake costing a probable $50 million to recover is 1 chance in 5,000 per year, the budget is expected to be $10,000. You should consider various combinations of mitigations that give the most savings after expenses, and that include all required mitigations. (I.e., you choose the group of mitigations with the best benefit to cost ratio.)
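
The comparison of mitigation groups can be sketched in Python too. This only illustrates the formula as given above. The two groups, the savings and mitigation costs, and all names in the code are hypothetical; only the $50 million earthquake with a 1 in 5,000 yearly chance comes from the example.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass
class Disaster:
    """One type of disaster under one group of mitigation options."""
    cost_disaster: int
    savings_mitigation: int
    cost_mitigation: int
    probability_disaster: Fraction  # chance per year, after applying mitigations


def expected_cost(d):
    # (cost_disaster - (savings_mitigation - cost_mitigation)) x probability_disaster
    return (d.cost_disaster - (d.savings_mitigation - d.cost_mitigation)) * d.probability_disaster


def group_total(disasters):
    # Sum the dollar amounts over every type of disaster in the group.
    return sum(expected_cost(d) for d in disasters)


# Hypothetical groups of mitigation options for a single earthquake risk.
groups = {
    "do nothing": [Disaster(50_000_000, 0, 0, Fraction(1, 5_000))],
    "off-site standby": [Disaster(50_000_000, 30_000_000, 2_000_000, Fraction(1, 5_000))],
}

totals = {name: group_total(ds) for name, ds in groups.items()}
best = min(totals, key=totals.get)
for name, total in totals.items():
    print(f"{name}: ${float(total):,.2f} per year")
print("lowest expected cost:", best)
```

Any mitigations that are legally required would need to be added to every group before comparing, since they are not optional.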

Various common mitigations will be discussed below, when discussing avoiding disasters. For more information about risk analysis, see www.security-risk-analysis.com/introduction.htm.

A DRP is composed of several related documents. There are two different types of documents to consider: policy and procedure documents. The policy documents say what to do but should not mention any means for doing so. Policy documents also define expected behaviors of employees. On the other hand procedure documents say how to perform specific tasks, in order to fulfill some policy. The two types of documents should be cross-referenced, but if small enough (e.g., for a small home / small office, or SOHO), the two documents could be combined into a single document.

Often the corporate leaders decide overall policy, and it is up to others (such as management and senior system administrators) to design the specific, detailed policies and the procedures that implement them. Owing to different applicable laws in different locations, each site usually must design and implement the specific policies and procedures independently.

Since creating sensible IT policy documents is difficult (often the obvious policy is not the wisest policy), in many cases IT staff are involved in setting policies, not just the procedures. If not, and unrealistic policies are “handed down from on-high”, a system administrator should try to find a way to suggest changes that won't embarrass management (which would not be a good way either to effect changes or to get a raise). As noted earlier an organization often hires a specialist consultant to develop these documents, hopefully one well versed in local laws and applicable regulations. (A google.com search for Disaster Recovery Planning will turn up many.) There are many policies related to DRP, including security and backup policies. Often it can save money to have a consultant help with all related documents at the same time.

DRP documents are very critical and highly confidential. You should never place a real one on a web server for example, unless you are sure that those web pages cannot be accessed by non-employees. At the same time it is important that all the people involved can get copies of the current policies and procedures documents both from their office and from an off-site location. (Use the security features of an Internet web server, or a separate intranet web server not accessible by outsiders.) Copies must be available (especially) during a disaster, even during a power loss, and from a remote location (such as an evacuation shelter).

The policies and procedures must be very detailed. Vague directions won't be followed! For example, in the event of a server being attacked and wiped out by a hacker a procedure that simply says “notify the police” is not likely to work. Have you called the police to see if they handle this sort of problem? If so, who (which department) should be called and what information will they need? It is likely local police won't handle this sort of problem and the correct procedure might be to notify a different law enforcement group; or instead the various senior managers, the public relations office, and/or maybe the company's insurance company (to file a claim) should be notified. Of course the technical procedures must be spelled out too (e.g., the procedure for restoring the server from a backup, or activating a standby server).

Any DRP should clearly address these issues at a minimum:

Who will be in charge (and the chain of command)
Who will be the PR contact (i.e., who will handle the phone)
Who must be informed
What must be done regularly (and when)
What must be done when a disaster is imminent
What must be done during a disaster
What must be done after a disaster has struck

The disaster recovery policy can be thought of as a contract between an organization and those providing services (either in-house system administrators or outside contractors). Viewed this way, the DRP provides what is known as a service level agreement, or SLA. The SLA states policies such as what services are provided and a time frame for recovery from different types of disasters. Your policy should have a clear SLA so others know what to expect for recovery times of various services.
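
One way to make an SLA concrete is to record a recovery time target for each service and type of disaster, then compare actual recovery times against it. The sketch below is only an illustration: the services, disaster types, and targets are made up and do not come from any real policy.

```python
from datetime import timedelta

# Hypothetical recovery time targets: (service, type of disaster) -> maximum downtime.
SLA_TARGETS = {
    ("web server", "hardware failure"): timedelta(hours=4),
    ("web server", "loss of site"): timedelta(days=3),
    ("email", "hardware failure"): timedelta(hours=8),
}


def within_sla(service, disaster, downtime):
    """Return True if the actual downtime met the documented recovery target."""
    return downtime <= SLA_TARGETS[(service, disaster)]


print(within_sla("web server", "hardware failure", timedelta(hours=3)))
print(within_sla("email", "hardware failure", timedelta(hours=12)))
```

A lookup for a service that has no documented target raises a KeyError here, which is a useful reminder that the SLA has a gap.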

Contact information includes persons and organizations that should be informed of various types of disasters (the company president and/or board of trustees, campus deans, major customers, the local media including radio, TV, ...), insurance agents, etc.

Contact information includes service provider contact information (e.g., the electrician, the plumber, security company, police, ISP, gas, water, etc.). Often the organization's webmaster must be notified in order to post updated information on a web server or to switch to some backup website, so include that information too. Service contact information should include names, titles, phone numbers (work, home, and cell), email addresses (not your local email!), and account numbers. This must be kept handy in hard copy. An off-site copy must be maintained too.

Note that in any policy or procedure document, specific locations and other information may change over time. It is easier if you put this data into an appendix and use generic phrases such as “off-site backup storage facility” rather than specific addresses such as “123 ReelToReel Lane”. The same holds for contact names. Assign tasks to function titles (or “roles”) and only use these title names in the DRP. Note a given role (say “Plumber”) may be filled by more than one person/company, and also a given person/company may serve in several capacities at the same time.

The contact information appendix of a DRP should be sorted by role and generic names, and should list the names of companies, organizations, or people that currently fulfill those functions and the specific locations (and other data) that currently fill those generic names (“off-site storage”, “backup web server”, “Electrician”, “ISP”, ...). The date of last update should be included.

Such contact and related information in this appendix should be regularly maintained. (And this task should be assigned to someone in the DRP!)
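
The role-based appendix described above maps naturally onto a small lookup table. The following Python sketch is one possible layout, not a prescribed format. Every name, phone number, address, and date in it is invented for illustration, and the one year staleness limit is an assumption.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Contact:
    name: str
    phone: str
    email: str
    last_updated: date


# Hypothetical appendix: generic role -> whoever currently fills it.
APPENDIX = {
    "Electrician": [Contact("Example Electric Co.", "555-0100", "service@example.com", date(2014, 6, 1))],
    "ISP": [Contact("Example Networks", "555-0101", "noc@example.net", date(2012, 1, 15))],
    "Plumber": [
        Contact("A. Example", "555-0102", "a@example.org", date(2014, 3, 9)),
        Contact("Example Plumbing", "555-0103", "help@example.org", date(2014, 3, 9)),
    ],
}


def contacts_for(role):
    """Everyone currently filling a role (a role may have several)."""
    return APPENDIX.get(role, [])


def stale_roles(today, max_age_days=365):
    """Roles with at least one entry not updated within max_age_days."""
    return sorted(
        role
        for role, contacts in APPENDIX.items()
        if any((today - c.last_updated).days > max_age_days for c in contacts)
    )


print([c.name for c in contacts_for("Plumber")])
print(stale_roles(date(2014, 12, 31)))
```

Whoever is assigned the maintenance task could run a check like stale_roles on a schedule and then reprint the hard copies.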

One last point: In a disaster it may not be possible to reach some of the key personnel listed. Also if the disaster is protracted, those with families may not be able to stay to handle your organization's disaster. You should make a clear chain of command, so if someone is unavailable everyone knows who will then take charge.
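
A chain of command amounts to an ordered list of roles plus a rule for skipping anyone who cannot be reached. As a minimal sketch, assuming hypothetical role names and that you already know which roles are reachable:

```python
# Hypothetical chain of command, most senior first, using role names only.
CHAIN_OF_COMMAND = ["key contact", "vice president of operations", "facilities manager"]


def who_is_in_charge(available_roles):
    """Return the first role in the chain that can be reached, or None."""
    for role in CHAIN_OF_COMMAND:
        if role in available_roles:
            return role
    return None


print(who_is_in_charge({"facilities manager", "vice president of operations"}))
```

If the function returns None, nobody in the documented chain is reachable, which is exactly the case the general policy guidelines in your DRP need to cover.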

You can't document every type of disaster that might befall an organization's servers and networks. (For example, few companies have a plan on how to handle a swarm of moths shorting out computers.) Make sure your plan includes some general policy guidelines to cover any cases not specifically mentioned. In fact this can help keep your documents much shorter than otherwise.

Some types of disasters you should specifically plan for include:

Physical Break-ins: theft and/or destruction, terrorist attacks
Remote attacks: attempts to steal, destroy, or corrupt data, theft of service, denial of service (DoS), computer viruses
Hardware failures: servers, databases, networks, power outages
Environmental disasters: fire, flood, hurricane, etc. (Generally all these result in power outages too)
Accidents (human error): file loss, DB record loss, data corruption
Other disruption: disgruntled employees, organized criminal activity, strikes, legal actions (e.g., shutdown orders), etc.

Low (or no) cost mitigations should be used whatever the budget. Some of these are discussed below (Avoiding Disasters). Often a group of mitigations can be used very effectively together.

A vital step to take in advance is to determine exactly who is responsible for what. As mentioned previously (Contact and Other Data) the best way to document this is to come up with roles, such as “PR contact”, “network manager”, “key contact (person in charge)”, etc. Then write the DRP using only the role names, clearly indicating the responsibilities of each. In an appendix you can then list the current personnel that are assigned each role, including phone, fax, home phone, email, and any other contact information. Note a given person may be assigned multiple roles. In a small company a single person may have all roles. On the other hand, in a large organization you may need several people to fulfill a single role (such as handling the phones).

The person in charge is usually in upper management. It is a mistake to list an IT person as in charge, even a senior administrator. The person in charge must have the authority to make policy decisions, such as closing the school early or directing the PR contact to make announcements to the media. However the policy should be that the person in charge must consult with some IT personnel before making vital IT related decisions. A foolish decision made without understanding the technical issues involved can cost dearly.

An often overlooked step is to implement DRP training for key (or all) personnel. Without some training it is unlikely your plan will prove effective once a disaster occurs.

Remember to establish a clear chain of command. If some key person is unavailable, without a chain of command nobody will know who is in charge or who reports to whom. You can find good advice on this at www.osha.gov/SLTC/etools/ics/what_is_ics.html.

There are a number of techniques that can be used to reduce or eliminate the probability of some disasters. (Of course you can't completely eliminate the risk of disasters!) These mitigation measures often also reduce the cost or time needed for disaster recovery. You should use as many of these mitigation strategies as makes sense for your DRP:

Store key data off-site. The location and access information must be documented in your DRP. Types of key data and documents to store off-line (and perhaps off-site) include system logs, backups, hardware inventories and configurations, /etc/passwd and /etc/shadow (and other /etc/* files), network maps (showing connections, IP address assignments, DNS data, etc.), serial numbers for all equipment, software keys, licenses and permits, room keys (and combinations for locks), and any other security information (such as the root password for your servers).
Keep paper copies of vital data (including your DRP).
Keep information (contact information, passwords, ...) current.
Use anti-virus and malware removal software.
Use and regularly test UPS, fire and smoke sensors and alarms, anti-theft systems.
Have INFOSEC and compliance (e.g., Sarbanes-Oxley) assessments and evaluations (also known as audits) done at least once after any major IT infrastructure changes.
Test the disaster recovery plan by staging a disaster drill. Do this every 1–3 years, more often if a lot has changed since the last drill (such as key personnel turnovers) or if your personnel need the practice. Tell your people in advance, and also tell fire, police, ISP, and others, that you are staging a drill at a specific time. Since you also should review the DRP every 1–3 years, it makes sense to do this test after the review and any resulting changes. (See the Weathering the Unexpected reference on disaster recovery testing.)
Large data centers may have an issue with power failures. When power is restored, hundreds or thousands of computers, routers, switches, and hard disks all try to start at the same time. Not only is it unlikely that all the systems will come up without error, but the power surge may trigger another power failure! You can test this with a disaster drill, where you actually turn off the power, then turn it back on, and monitor what happens. (You can stagger when machines actually turn back on using various techniques, such as slow start services.)
Maintain systems, including regular inspections (e.g., change A/C filters, examine fire extinguishers, change batteries regularly in smoke detectors and UPSes). Such disaster preventative measures should be clearly documented in your DRP, including who is responsible for doing what.
Have a backup ISP (say, via a cheap ISDN line), backup email, and possibly other backup servers in different geographical locations. (Often a reciprocal agreement can be made between East and West coast companies to host each other's services in case of emergency.)
Conduct training sessions.
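
The staggered start mentioned in the power failure item above can be planned ahead of time. As a minimal sketch, assuming hypothetical host names and that each machine can be told to wait a given number of seconds before powering on, the delays could be assigned in batches like this:

```python
def stagger_schedule(hosts, batch_size, delay_seconds):
    """Map each host to a start delay so only batch_size hosts start at once."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return {host: (i // batch_size) * delay_seconds for i, host in enumerate(hosts)}


# Hosts are listed in the order they should come up (most critical first).
hosts = ["db1", "db2", "web1", "web2", "web3"]
for host, delay in stagger_schedule(hosts, batch_size=2, delay_seconds=30).items():
    print(f"{host}: start after {delay} seconds")
```

How the delay is actually enforced depends on your hardware and is outside this sketch. The resulting schedule belongs in the procedure documents so it can be checked during a drill.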

Once a disaster is imminent, has occurred, or is occurring, you need to activate the relevant DRP procedures. (Of course you are already well prepared!) The first step is to locate and review your copy of the DRP.

You must understand your DRP role. Know who you must notify, especially to protect legal rights and to avoid charges of negligence. Be certain you understand the chain of command; know who you should report to and take direction from, and who should report to you.

You must know your company policy regarding disasters, especially break-ins and other attacks. Some common policies include phoning the corporate attorney, the president or board members, and others in the company, and letting them follow up. A company may fear negative publicity more than the loss from a disaster, so the policy may be not to report the problem to anyone outside the company. Sometimes you report to the person in charge of publicity (marketing) and let them choose.

When planning your DRP you can contact your ISP and local law enforcement to see what procedures they recommend. Often government agencies such as the FBI (www.fbi.gov), police or other local law enforcement, FEMA (www.fema.gov), CERT (www.cert.org), US-CERT (www.us-cert.gov), and others should be notified in the event of an attack (although the FBI won't take action unless the loss is above $5000 or so, and won't give priority unless the loss is much greater).

If the loss affects the customers, you may be required by various laws and regulations to report the disaster even if your company would prefer you not to. You should become familiar with the laws governing your organization's particular situation. Even if you are not required to report the problem, in some cases the policy may be to report it to major customers.

In real life an attorney is consulted early to determine policies and procedures to follow that are required by law or by industry regulations or that are just a good idea to limit your company's legal liability. (For your class project it is OK to make this stuff up; that is you can pretend a lawyer said that you must have daily backups off-site, that you must notify the police, etc.)

As a professional system or network administrator you have responsibility to obey the laws and applicable regulations. If you feel your organization's policy is illegal or unethical you should work to resolve the issue early on. Otherwise you may be required to “whistle-blow” when a disaster strikes; this will not enhance your job prospects!

When a disaster is imminent it is a good time to perform backups, update system journals, contact backup sites (to let them know to get ready), send the documents and backups that need to be off-site, and other preparation steps. These are called proactive measures. This is also a good time to obtain a hardcopy of the current DRP and review it.

The specific steps to take after a disaster has struck are usually known as reactive measures. Some specific measures that should be addressed in the case of a school such as HCC include:

What specifically should be done if a hacker defaces (or completely wipes out) the main web server?
Who is to be phoned in the event of a school closure? (And who has the power to order a school closure?)
Are there backup (web, DNS, email) servers off-site and if so what must be done to activate them?
What is the time frame for a recovery of lost data? Or a server crash?
What should be done if a DoS attack prevents on-line registration the week or two before a new semester starts?

Besides disaster recovery a company may (and should) have other policy and procedure documents. You may be asked to write such documents related to IT. You need to cover such topics as acceptable use of company equipment (e.g., the computers), data (e.g., customer lists), and services (e.g., email), strategic plans to replace desktop computers every so often, and so on. In addition you should inform employees about any privacy policies and related matters (e.g., password policies).

Policies and procedures that employees need to know about should be accessible, including items such as equipment use forms, account request forms, password reset procedure, etc. A good idea is to use a web server for all this and include an index and a search engine if you have a lot of documents.

Determine what mitigations are required for your enterprise.
Perform a business impact analysis.
Perform a cost analysis and determine the budget.
Decide what mitigations, policies, and procedures are reasonable in your situation. Make sure these make sense to upper management, legal department, and senior IT staff.
Conduct now, and schedule regularly, an audit of the DRP. This will ensure compliance, minimize liability, and reduce hazards to employees and others.
Distribute the DRP to personnel.
Implement the pre-disaster mitigations.
Decide on schedules for reviewing (and revising) the DRP. Keep the contact information current.
Decide on a schedule for conducting a DRP drill. (Before a drill is a good time to review and revise the DRP.)

Many Sample DRPs can be seen at www.drj.com.
“Planning”, a chapter of the book “Disaster Recovery Planning: Preparing For The Unthinkable” by Jon Toigo.
www.disasterrecoveryworld.com is a commercial site that also provides excellent resources, and explains the COBRA method of analysis.
www.security-risk-analysis.com.
www.crisis-management-and-disaster-recovery.com.
Weathering the Unexpected (An article in ACM Queue on Google's disaster recovery testing, by Kripa Krishnan, dated September 16, 2012).
www.itil-itsm-world.com/itil-8.htm.
business continuity planning / management (BCM) from wikipedia.org.
www.FindWhitePapers.com/storage/backup-and-recovery.
www.osha.gov/SLTC/etools/ics/what_is_ics.html (advice on how to establish an incident command system, resulting in a clear chain of command).
T. A. Limoncelli et al., The Practice of System and Network Administration, second edition, chapter 10, ©2007 Addison-Wesley. This is my favorite system administration book, used heavily when creating this web page. The authors of this book recommend the following three books:

©2001–2015 by Wayne Pollock, Tampa Florida USA.

Disaster Recovery Planning (DRP)

Why Have a Disaster Recovery Plan (DRP)?

Compliance

Risk Management

Anatomy of a Disaster Recovery Plan

Service Level Agreements (SLA)

Contact and Other Data

Types of Disasters to Plan For

Preparing for Disaster

Avoiding Disasters (Mitigation Measures)

Coping with Disasters: What to do During and After a Disaster Strikes

Other Policies and Procedures

Summary

References