This document may not be copied by any means without the prior consent of the author. (Contact information for the author appears at the end of this document.)
A disaster recovery plan is a document that defines the policies and procedures for dealing with various types of disasters that can affect an organization, especially the organization's IT (Information Technology) infrastructure. A disaster is any event that has a significant impact on an enterprise's ability to conduct normal business. This plan includes the information and procedures needed to resume an organization's operation after some sort of disaster. Sometimes the plan is split into several plans, one to address recoverable disasters (e.g., loss of a server) and a more comprehensive business continuity plan for use in total loss situations (e.g., a hurricane takes out New Orleans and businesses must relocate to resume).
A disaster recovery plan may be known by different names such as “business continuity planning” (BCP), “business resumption planning”, “crisis management”, “emergency response planning”, “contingency planning”, “business continuity management”, “security and risk analysis”, and others. Generally speaking, a DRP deals with disasters from an IT perspective, while BCP deals with disasters from a business point of view (e.g., issues such as relocating an office or key personnel replacements).
DRPs should be developed by both management and senior system administrators working together.
Statistic: 94% of companies suffering from a catastrophic data loss do not survive; 43% never reopen and 51% close within two years. (Source: Critical Incident Protocol—A Public and Private Partnership PDF, A report from Michigan State University and the DoJ) Similar numbers are found from other studies and surveys.
The following list of reasons for having a DRP was adapted from an article posted at informit.com, Legal Requirements for Disaster Recovery Planning: Common Facts and Misconceptions, by Leo Wrobel, dated Aug 3, 2007.
Many organizations question the value (in a business sense) of the possibly high cost of developing and maintaining such plans. Even a small business or organization, if dependent on an IT infrastructure, should develop some sort of disaster recovery plan.
In many cases there are laws and regulations that require an organization to have such a plan. In addition, some other laws and regulations require an organization to be able to provide certain information. This implies that the information must be kept safe in case of a disaster, and thus that the organization needs a disaster recovery plan. Examples include (but are by no means limited to):
Currently (2014), around the world, accountability and reporting standards require disaster recovery plans for electric and other utilities. In North America for example, North American Electric Reliability Corporation (NERC) is the regulatory agency that requires “CIP compliance: CIP-009 Ensures that recovery plan(s) are put in place for Critical Cyber Assets and that ... plans follow established business continuity and disaster recovery techniques and practices. ...”. (Although NERC has been active since 1968, it was a different organization before 2006, when it became the FERC's ERO.)
In 2009, HIPAA grew more teeth with the passage of ARRA (the American Recovery and Reinvestment Act), specifically the HITECH Act. HITECH Subtitle D details many changes to privacy, breach notification, required audits, and increased penalties.
A few other laws and regulations that may mandate having a disaster recovery plan (or indirectly require one) include:
You as system administrator are at least partially responsible for making your organization compliant with applicable laws and regulations that affect the organization's IT infrastructure. You should have a talk with the organization's legal representative and make sure they find out what requirements pertain to your situation. Note that even for a lawyer, there is no way to easily or accurately keep up with all the changes in the laws and regulations. Every so often a reassessment should be made to make sure your plans are in compliance. (I.e., when the laws or regulations change, when your contracts change, when the nature of your business changes or expands.) In some cases you may be required to review your DRP annually or every few years. Typically the DRP should be reviewed once every one to three years.
There are numerous standards, laws, and regulations that will require compliance, and several countries and industries will require periodic audits for compliance. Some international standards to consider for this include ISO/IEC 27001, 27002, and 27005.
Disaster recovery is all about risk management. The cost of ignoring disasters can be very high, including total collapse of the enterprise. As noted above, in most cases there are legal penalties for not having a proper plan. Building a good DRP requires an understanding of your business, your IT infrastructure, and your legal, regulatory, and fiduciary responsibilities.
The first step is to understand the risks your enterprise faces. This is often called a Business Impact Analysis. You need to answer the following questions:
(For example, for a college you could ask “Could the system being down impact instruction, research, grants, or external funding?”)
Before working out disaster recovery policy or procedure documents, you must know what the budget will be. The budget is the annualized cost of doing nothing. This in turn requires a risk analysis, which requires determining the risks for the systems identified in the business impact analysis. Doing a risk analysis is difficult and is not needed frequently, so the best advice may be to use a consultant who specializes in that area. The budget should be documented in your DRP. You can roughly determine a budget for each type of disaster that has a significant probability of occurrence with this formula:
$$\text{budget}_{\text{disaster}} = \text{cost}_{\text{disaster}} \times \text{probability}_{\text{disaster}}$$
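As a minimal sketch of this calculation, the budget formula can be written as a small Python function. The function name `annual_budget` and the disaster types and dollar figures in `risks` are hypothetical, chosen only for illustration; real figures would come from your own risk analysis.

```python
def annual_budget(cost_of_disaster: float, annual_probability: float) -> float:
    """Annualized cost of doing nothing for one type of disaster."""
    if not 0.0 <= annual_probability <= 1.0:
        raise ValueError("annual_probability must be between 0 and 1")
    return cost_of_disaster * annual_probability


# Hypothetical figures, for illustration only:
# disaster type -> (estimated cost in dollars, probability per year)
risks = {
    "server room fire": (2_000_000, 1 / 500),
    "ransomware attack": (750_000, 1 / 20),
}

budgets = {name: annual_budget(cost, p) for name, (cost, p) in risks.items()}
for name, budget in budgets.items():
    print(f"{name}: ${budget:,.2f} per year")
```

Each disaster type gets its own budget figure, which can then be recorded in the DRP next to the assumptions used to estimate its cost and probability.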
Implementing a DRP includes implementing various mitigation measures.
Mitigation measures are techniques, policies, and procedures that either reduce the impact (expense) of a disaster, reduce the time to restore vital services, reduce the probability of the disaster occurring, or some combination of these. Mitigation measures are also known as mitigation strategies, measures, preventative measures, or simply mitigations.
Mitigation measures can include various types of disaster insurance (some insurance companies offer specialized IT insurance policies for this). Often several mitigation measures can be used together, with one amplifying the effect (or reducing the cost to implement) of the others. When performing the risk analysis, the various possibilities can be compared by applying this formula to each possible group of mitigation options, summing the dollar amounts for each type of disaster:
$$\left( \text{cost}_{\text{disaster}} - \left( \text{savings}_{\text{mitigation}} - \text{cost}_{\text{mitigation}} \right) \right) \times \text{probability}_{\text{disaster}}$$
(Where $\text{probability}_{\text{disaster}}$ is the probability of the disaster occurring in one year, after applying mitigations.) Some mitigation measures can be very effective and inexpensive, while others can be very expensive but effective. (Some measures are not effective at all, in some situations.) Also remember that some mitigation measures are likely required for any organization regardless of cost.
Some examples: If your site is in an area where the risk of a serious flood occurring in the next 12 months is 1 chance in a million, and the cost of the flood would likely be $10 million, then the cost of doing nothing is expected to be $10 per year. (This is the budget amount.) In this case it wouldn't be cost effective to implement any mitigations that cost money, not even to buy flood insurance. However, you may have legal requirements to have a flood evacuation plan for employee safety, or to implement some other minimal mitigations. You could also implement low cost mitigations, such as putting critical IT servers on the second floor instead of the ground floor (when moving in; changing this later will have a cost).
On the other hand, if you are located in an area where the chance of an earthquake costing a probable $50 million to recover is 1 chance in 5,000 per year, the budget is expected to be $10,000. You should consider various combinations of mitigations that give the most savings after expenses, and that include all required mitigations. (I.e., you choose the group of mitigations with the best benefit to cost ratio.)
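The comparison of mitigation groups can be sketched in Python as follows. This is only an illustration: the earthquake cost and probability come from the example above, but the `MitigationGroup` class, the function name, and the two mitigation groups with their dollar amounts are all hypothetical. The function applies the mitigation formula given earlier exactly as written.

```python
from dataclasses import dataclass


@dataclass
class MitigationGroup:
    name: str
    cost: float                # cost of the mitigations (hypothetical)
    savings: float             # reduction in disaster cost (hypothetical)
    annual_probability: float  # chance of the disaster per year, after mitigation


def expected_annual_cost(disaster_cost: float, group: MitigationGroup) -> float:
    """(cost_disaster - (savings - cost_mitigation)) * probability_disaster"""
    net_savings = group.savings - group.cost
    return (disaster_cost - net_savings) * group.annual_probability


EARTHQUAKE_COST = 50_000_000  # from the example in the text

# Hypothetical options; "do nothing" reproduces the budget figure.
options = [
    MitigationGroup("do nothing", 0, 0, 1 / 5_000),
    MitigationGroup("seismic retrofit", 200_000, 20_000_000, 1 / 5_000),
]

for group in options:
    print(f"{group.name}: ${expected_annual_cost(EARTHQUAKE_COST, group):,.2f} per year")

best = min(options, key=lambda g: expected_annual_cost(EARTHQUAKE_COST, g))
print(f"Lowest expected annual cost: {best.name}")
```

In a fuller analysis you would repeat this for every disaster type, sum the results per group of mitigations, and exclude any group that omits a legally required mitigation.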
A DRP is composed of several related documents. There are two different types of documents to consider: policy and procedure documents. The policy documents say what to do but should not mention any means for doing so. Policy documents also define expected behaviors of employees. On the other hand procedure documents say how to perform specific tasks, in order to fulfill some policy. The two types of documents should be cross-referenced, but if small enough (e.g., for a small home / small office, or SOHO), the two documents could be combined into a single document.
Often the corporate leaders decide overall policy, and it is up to others (such as management and senior system administrators) to design the specific, detailed policies and the procedures that implement them. Owing to different applicable laws in different locations, each site usually must design and implement the specific policies and procedures independently.
Since creating sensible IT policy documents is difficult (often the obvious policy is not the wisest policy), in many cases IT staff are involved in setting policies, not just the procedures. If not, and unrealistic policies are “handed down from on-high”, a system administrator should try to find a way to suggest changes that won't embarrass management (which would not be a good way either to effect changes or to get a raise). As noted earlier, an organization often hires a specialist consultant to develop these documents, hopefully one well versed in local laws and applicable regulations. (A google.com search for Disaster Recovery Planning will turn up many.) There are many policies related to the DRP, including security and backup policies. Often it can save money to have a consultant help with all related documents at the same time.
DRP documents are critical and highly confidential. You should never place a real one on a web server, for example, unless you are sure that those web pages cannot be accessed by non-employees. At the same time it is important that all the people involved can get copies of the current policy and procedure documents both from their office and from an off-site location. (Use the security features of an Internet web server, or a separate intranet web server not accessible by outsiders.) Copies must be available (especially) during a disaster, even during a power loss, and from a remote location (such as an evacuation shelter).
The policies and procedures must be very detailed. Vague directions won't be followed! For example, in the event of a server being attacked and wiped out by a hacker, a procedure that simply says “notify the police” is not likely to work. Have you called the police to see if they handle this sort of problem? If so, who (which department) should be called and what information will they need? It is likely local police won't handle this sort of problem, and the correct procedure might be to notify a different law enforcement group; or instead the various senior managers, the public relations office, and/or maybe the company's insurance company (to file a claim) should be notified. Of course the technical procedures must be spelled out too (e.g., the procedure for restoring the server from a backup, or activating a standby server).
Any DRP should clearly address these issues at a minimum:
The disaster recovery policy can be thought of as a contract between an organization and those providing services (either in-house system administrators or outside contractors). Viewed this way, the DRP provides what is known as a service level agreement, or SLA. The SLA states policies such as what services are provided and a time frame for recovery from different types of disasters. Your policy should have a clear SLA so others know what to expect for recovery times of various services.
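One way to make the recovery time frames of an SLA concrete is to record them as data. The sketch below is an assumption about how that might look, not a prescribed format: the services, disaster types, time limits, and the names `RECOVERY_TARGETS` and `within_sla` are all hypothetical.

```python
from datetime import timedelta

# Hypothetical recovery targets:
# (service, disaster type) -> maximum time allowed to restore the service.
RECOVERY_TARGETS = {
    ("email", "server failure"): timedelta(hours=4),
    ("email", "site loss"): timedelta(days=3),
    ("public website", "server failure"): timedelta(hours=1),
}


def within_sla(service: str, disaster: str, elapsed: timedelta) -> bool:
    """Return True if the outage so far is still within the stated target.

    Raises KeyError when the SLA does not cover the combination, which
    signals a gap in the policy rather than a pass or a fail.
    """
    return elapsed <= RECOVERY_TARGETS[(service, disaster)]


print(within_sla("email", "server failure", timedelta(hours=3)))
print(within_sla("public website", "server failure", timedelta(hours=2)))
```

Writing the targets down this explicitly makes it obvious which combinations of service and disaster the policy has not yet addressed.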
Contact information includes persons and organizations that should be informed of various types of disasters (the company president and/or board of trustees, campus deans, major customers, the local media including radio, TV, ...), insurance agents, etc.
Contact information includes service provider contact information (e.g., the electrician, the plumber, security company, police, ISP, gas, water, etc.). Often the organization's webmaster must be notified in order to post updated information on a web server or to switch to some backup website, so include that information too. Service contact information should include names, titles, phone numbers (work, home, and cell), email addresses (not your local email!), and account numbers. This must be kept handy in hard copy. An off-site copy must be maintained too.
Note that in any policy or procedure document, specific locations and other information may change over time. It is easier if you put this data into an appendix and use generic phrases such as “off-site backup storage facility” rather than specific addresses such as “123 ReelToReel Lane”. The same holds for contact names. Assign tasks to function titles (or “roles”) and only use these title names in the DRP. Note a given role (say “Plumber”) may be filled by more than one person/company, and also a given person/company may serve in several capacities at the same time.
The contact information appendix of a DRP should be sorted by role and generic names, and should list the names of companies, organizations, or people that currently fulfill those functions and the specific locations (and other data) that currently fill those generic names (“off-site storage”, “backup web server”, “Electrician”, “ISP”, ...). The date of last update should be included.
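The role-based appendix described above could be modeled roughly as follows. This is a sketch only: every company, person, phone number, email address, and the update date is made up, and the names `Contact`, `CONTACTS_BY_ROLE`, `contacts_for`, and `roles_of` are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Contact:
    name: str
    phone: str
    email: str


# Hypothetical appendix data; every name, number, and address is made up.
LAST_UPDATED = date(2014, 1, 15)
ACME = Contact("Acme Facilities Inc.", "555-0100", "dispatch@acme.example")
CONTACTS_BY_ROLE = {
    "Electrician": [ACME],
    "ISP": [Contact("Example Net", "555-0101", "noc@examplenet.example")],
    "Plumber": [ACME, Contact("Pat Doe", "555-0102", "pat@plumbing.example")],
}


def contacts_for(role: str) -> list[Contact]:
    """Return everyone currently filling a role (empty list if unassigned)."""
    return CONTACTS_BY_ROLE.get(role, [])


def roles_of(name: str) -> list[str]:
    """Return, in sorted order, every role a person or company fills."""
    return sorted(
        role
        for role, contacts in CONTACTS_BY_ROLE.items()
        if any(contact.name == name for contact in contacts)
    )


# Print the appendix sorted by role, with the date of last update.
print(f"Contact appendix, last updated {LAST_UPDATED.isoformat()}")
for role in sorted(CONTACTS_BY_ROLE):
    for contact in contacts_for(role):
        print(f"{role}: {contact.name}, {contact.phone}, {contact.email}")
```

Here one role ("Plumber") is filled by two contacts and one company fills two roles, matching the many-to-many relationship between roles and people noted earlier. The rest of the DRP refers only to the role names, so only this appendix changes when a provider is replaced.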
Such contact and related information in this appendix should be regularly maintained. (And this task should be assigned to someone in the DRP!)
One last point: In a disaster it may not be possible to reach some of the key personnel listed. Also if the disaster is protracted, those with families may not be able to stay to handle your organization's disaster. You should make a clear chain of command, so if someone is unavailable everyone knows who will then take charge.
You can't document every type of disaster that might befall an organization's servers and networks. (For example, few companies have a plan on how to handle a swarm of moths shorting out computers.) Make sure your plan includes some general policy guidelines to cover any cases not specifically mentioned. In fact this can help keep your documents much shorter than otherwise.
Some types of disasters you should specifically plan for include:
Low (or no) cost mitigations should be used whatever the budget. Some of these are discussed below (Avoiding Disasters). Often a group of mitigations can be used very effectively together.
A vital step to take in advance is to determine exactly who is responsible for what. As mentioned previously (Contact and Other Data) the best way to document this is to come up with roles, such as “PR contact”, “network manager”, “key contact (person in charge)”, etc. Then write the DRP using only the role names, clearly indicating the responsibilities of each. In an appendix you can then list the current personnel that are assigned each role, including phone, fax, home phone, email, and any other contact information. Note a given person may be assigned multiple roles. In a small company a single person may have all roles. On the other hand, in a large organization you may need several people to fulfill a single role (such as handling the phones).
The person in charge is usually in upper management. It is a mistake to list an IT person as in charge, even a senior administrator. The person in charge must have the authority to make policy decisions, such as closing the school early or directing the PR contact to make announcements to the media. However, the policy should be that the person in charge must consult with some IT personnel before making vital IT related decisions. A foolish decision made without understanding the technical issues involved can cost dearly.
An often overlooked step is to implement DRP training for key (or all) personnel. Without some training it is unlikely your plan will prove effective once a disaster occurs.
Remember to establish a clear chain of command. If some key person is unavailable, without a chain of command nobody will know who is in charge or who reports to whom. You can find good advice on this at www.osha.gov/SLTC/etools/ics/what_is_ics.html.
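A chain of command amounts to an ordered list of roles plus a rule for skipping anyone who cannot be reached. The following sketch illustrates the idea; the role names and the function `person_in_charge` are hypothetical and not taken from any standard.

```python
from typing import Iterable, Optional, Sequence


def person_in_charge(chain: Sequence[str], unavailable: Iterable[str]) -> Optional[str]:
    """Return the first role in the chain that can be reached, or None."""
    missing = set(unavailable)
    for role in chain:
        if role not in missing:
            return role
    return None


# Hypothetical chain of command, highest authority first.
CHAIN_OF_COMMAND = [
    "key contact (person in charge)",
    "deputy key contact",
    "network manager",
]

print(person_in_charge(CHAIN_OF_COMMAND, unavailable=[]))
print(person_in_charge(CHAIN_OF_COMMAND, unavailable=["key contact (person in charge)"]))
```

A result of `None` means the chain is exhausted, which the plan's general policy guidelines should also cover.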
There are a number of techniques that can be used to reduce or eliminate the probability of some disasters. (Of course you can't completely eliminate the risk of disasters!) These mitigation measures often also reduce the cost or time needed for disaster recovery. You should use as many of these mitigation strategies as makes sense for your DRP:
/etc/*files), network maps (showing connections, IP address assignments, DNS data, etc.), serial numbers for all equipment, software keys, licenses and permits, room keys (and combinations for locks), and any other security information (such as the root password for your servers).
Once a disaster is imminent, has occurred, or is occurring, you need to activate the relevant DRP procedures. (Of course you are already well prepared!) The first step is to locate and review your copy of the DRP.
You must understand your DRP role. Know who you must notify, especially to protect legal rights and to avoid charges of negligence. Be certain you understand the chain of command; know who you should report to and take direction from, and who should report to you.
You must know your company policy regarding disasters, especially break-ins and other attacks. Some common policies include phoning the corporate attorney, the president or board members, and others in the company, and letting them follow up. A company may fear negative publicity more than the loss from a disaster, so the policy may be not to report the problem to anyone outside the company. Sometimes you report to the person in charge of publicity (marketing) and let them choose.
When planning your DRP you can contact your ISP and local law enforcement to see what procedures they recommend. Often government agencies such as the FBI (www.fbi.gov), police or other local law enforcement, FEMA (www.fema.gov), CERT (www.cert.org), US-CERT (www.us-cert.gov), and others should be notified in the event of an attack (although the FBI won't take action unless the loss is above $5000 or so, and won't give priority unless the loss is much greater).
If the loss affects customers, various laws and regulations may require you to report the disaster even if your company would prefer that you not. You should become familiar with the laws governing your organization's particular situation. Even when reporting is not required, in some cases the policy may be to report the problem to major customers.
In real life an attorney is consulted early to determine policies and procedures to follow that are required by law or by industry regulations or that are just a good idea to limit your company's legal liability. (For your class project it is OK to make this stuff up; that is you can pretend a lawyer said that you must have daily backups off-site, that you must notify the police, etc.)
As a professional system or network administrator you have a responsibility to obey the laws and applicable regulations. If you feel your organization's policy is illegal or unethical you should work to resolve the issue early on. Otherwise you may be required to “whistle-blow” when a disaster strikes; this will not enhance your job prospects!
When a disaster is imminent it is a good time to perform backups, update system journals, contact backup sites (to let them know to get ready), send the documents and backups that need to be off-site, and other preparation steps. These are called proactive measures. This is also a good time to obtain a hardcopy of the current DRP and review it.
The specific steps to take after disaster has struck are usually known as reactive measures. Some specific measures that should be addressed in the case of a school such as HCC include:
Besides disaster recovery, a company may (and should) have other policy and procedure documents. You may be asked to write such documents related to IT. You need to cover such topics as acceptable use of company equipment (e.g., the computers), data (e.g., customer lists), and services (e.g., email), strategic plans to replace desktop computers every so often, and so on. In addition you should inform employees about any privacy policies and related matters (e.g., password policies).
Policies and procedures that employees need to know about should be accessible, including items such as equipment use forms, account request forms, password reset procedure, etc. A good idea is to use a web server for all this and include an index and a search engine if you have a lot of documents.