How Should the IT Department Prepare for Disaster Recovery?

Preparing for a disaster must start with a comprehensive Disaster Recovery plan that is agreed to by all stakeholders. The plan must start with the business goals, specifically the Business Continuity Plan if one exists.

A challenge I have often encountered, particularly in smaller companies, is that the business has neither plan in place. If that is your situation, I recommend that the IT organization take the lead in defining the business requirements: hold discussions with the various business leaders to understand their needs, and use those needs to drive the plan.

Here are the usual steps to defining a Disaster Recovery Plan:

  1. Define the type of disaster(s) you are planning to survive: a hardware-based disaster, a local building disaster, a citywide disaster, a regional disaster, etc. Some may result in a simple local systems fail-over or repair; others will require a more complex recovery scenario.

  2. Plan for the location of your fail-over site. It is critical that this site is beyond the boundaries of the worst case disaster you are planning for.

  3. Identify the systems that are critical to the continuity of your business. This requires significant interaction with business management. During this phase you will define the Recovery Time Objective (RTO), which is how long it may take from the moment the systems fail to bring the recovery system online, and the Recovery Point Objective (RPO), which is how old the most recent data replicated to the backup site may be. Both numbers will vary widely from system to system, depending on each system's criticality. As an example, an accounting system may have a 24-hour recovery time and a 24-hour recovery point: essentially, bring it back up with last night’s backup data. A trading system may have a recovery time of minutes with a zero Recovery Point, meaning that transactions in flight need to be replicated (a transaction is not committed until the data is written to both the primary and fail-over databases).

  4. Determine the resources required to run the business-critical systems identified above. This means both people and technology.

  5. Determine how replication between the sites will be done. This often varies depending on the application.

  6. Establish private connectivity between the sites.

  7. Determine how changes to network connectivity will be accomplished when a disaster happens and a fail-over event is declared.

  8. Define who has the authority and responsibility to declare a disaster and invoke the Business Continuity Plan and Disaster Recovery Plan. Because this decision impacts the entire business, the authority sometimes rests with a group of senior managers. The communications and decision process must be defined within the plans.

  9. Define where your employees will go when your primary work site is unavailable.  Roles and responsibilities during a disaster event MUST be defined and articulated in the plan.

  10. Establish policies for how employees will connect to the data or network.

  11. Establish a notification system to advise everyone who needs to know when the Business Continuity Plan and associated Disaster Recovery Plan are activated.

  12. Test and drill to identify what did not work well and constantly refine the process.
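
As a rough illustration of steps 3 and 5, the sketch below records an RTO and RPO per system and derives a replication approach from the RPO. The class and function names, the thresholds, and the 5-minute RTO standing in for "minutes" are all assumptions made for the example, not part of any standard or product.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryObjective:
    system: str
    rto: timedelta  # maximum tolerable time to bring the recovery system online
    rpo: timedelta  # maximum tolerable age of the newest replicated data


def replication_approach(obj: RecoveryObjective,
                         backup_interval: timedelta = timedelta(hours=24)) -> str:
    """Suggest a replication approach from the RPO alone (illustrative thresholds)."""
    if obj.rpo == timedelta(0):
        # Zero RPO: a transaction commits only after both sites have written it.
        return "synchronous"
    if obj.rpo < backup_interval:
        # Some loss is tolerable, but less than one backup cycle.
        return "asynchronous"
    return "periodic backup"


# The two examples from step 3. The 5-minute RTO is an assumed value.
objectives = [
    RecoveryObjective("accounting", rto=timedelta(hours=24), rpo=timedelta(hours=24)),
    RecoveryObjective("trading", rto=timedelta(minutes=5), rpo=timedelta(0)),
]

# Recover the systems with the tightest RTO first.
for obj in sorted(objectives, key=lambda o: o.rto):
    print(f"{obj.system}: RTO={obj.rto}, RPO={obj.rpo}, "
          f"replication={replication_approach(obj)}")
```

Keeping the objectives in one place like this makes it easier to review them with the business and to see which systems drive the cost of the plan.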

Defining the plan is an iterative process. The initial expectation will often be instantaneous fail-over of an application with no data loss. Once the business understands the cost of such an implementation, more serious discussions can take place. Some observations from past efforts I have been involved in:

  • The plan must clearly define the criteria for declaring a disaster as well as the governance aspects: who can declare, who gets notified, etc. This plan must be communicated to all stakeholders.

  • Many requirements are driven by regulation, especially in industries such as Financial Services and Healthcare.

  • The DR implementation typically does not support the entire staff, only the critical subset of staff needed to keep the business running until the disaster is addressed. This may be a few people in each department.

  • The DR technical implementation is often smaller than the primary environment and may be a virtualized version set up (loaded) at the point of the disaster. I have often seen a Development or QA environment “repurposed” during a disaster to become the Disaster Recovery platform.

  • Connectivity becomes a major component of a successful plan. There will likely be a disaster site that houses some of the most critical staff (such as traders in a Financial Services firm), plus support for other staff to connect remotely from home or another location.

  • Data replication becomes one of the most important, and often most challenging, parts of the plan. The Recovery Point for a DR fail-over essentially determines how much data can be lost when the systems swing to the DR site. This can range from “no data loss” to substantial loss and will be different for each application.

  • Location of the DR site is of major importance: far enough from the primary site to ensure it is not impacted by the event that causes the disaster, and close enough to the company’s IT staff to allow it to be managed. Use of colo facilities can help manage this particular issue.
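
The location requirement can be sanity-checked with a little arithmetic. The sketch below computes the great-circle distance between two sites with the haversine formula and compares it to a planning radius for each disaster scope. The radii and the coordinates are hypothetical placeholders; real values should come from your own risk assessment.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

# Assumed planning radii per disaster scope, in km (hypothetical values).
SCOPE_RADIUS_KM = {"building": 1.0, "city": 50.0, "regional": 300.0}


def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points via the haversine formula."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def site_is_outside(scope, primary, dr_site):
    """True if the DR site lies beyond the planning radius for this scope."""
    return distance_km(*primary, *dr_site) > SCOPE_RADIUS_KM[scope]


# Hypothetical coordinates (latitude, longitude), one degree of latitude apart.
primary = (40.0, -74.0)
dr_site = (41.0, -74.0)

for scope in SCOPE_RADIUS_KM:
    print(scope, site_is_outside(scope, primary, dr_site))
```

Distance is only a first filter. Shared power grids, flood plains, and network carriers can tie two sites together even when they are far apart, and the site still has to be reachable by the staff who manage it.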

If the company has multiple locations, paired data centers can be considered: in other words, multiple primary data centers that back each other up.
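
A pairing scheme is easy to get subtly wrong as sites are added. As a minimal sketch, assuming the pairing is recorded as a simple site-to-backup mapping (the site names here are hypothetical), the check below flags any site whose backup does not back it up in return.

```python
def pairing_problems(backup_for: dict[str, str]) -> list[str]:
    """Return the sites whose designated backup does not back them up in return."""
    return sorted(
        site
        for site, partner in backup_for.items()
        if partner == site or backup_for.get(partner) != site
    )


# Hypothetical sites: New York and Chicago back each other up, but London
# points at New York without being anyone's backup target.
backup_for = {"new-york": "chicago", "chicago": "new-york", "london": "new-york"}
print(pairing_problems(backup_for))  # prints ['london']
```

This only checks the pairing itself. Whether each site has the capacity to carry its partner's critical systems is a separate question for the plan.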

Don’t forget to consider the fail-back portion of the plan. This is non-trivial and includes both the criteria for fail-back and the methodology for replicating new data back from the DR site to the primary site.

Also, don’t forget the office communications technologies, namely phones and fax machines. Some number of these will need to be redirected to alternate numbers.

Finally: test, test, test! Once the entire environment is set up, there should be an actual DR test at least twice each year. This semi-annual test should include the business folks. When changes are made to the primary environment, those changes should be reflected in the DR implementation and tested (in a separate test from the semi-annual test).
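
The testing policy above can be tracked with something as simple as the sketch below. It assumes a 183-day limit as a stand-in for "twice each year" and a hypothetical `Change` record; neither corresponds to a real ticketing or change-management API.

```python
from dataclasses import dataclass
from datetime import date, timedelta

MAX_TEST_AGE = timedelta(days=183)  # about six months, so two full tests per year


@dataclass
class Change:
    description: str      # hypothetical change record
    applied_to_dr: bool   # has the change been reflected in the DR implementation?
    dr_tested: bool       # has it had its own DR test, separate from the full test?


def dr_test_findings(last_full_test: date, changes: list[Change], today: date) -> list[str]:
    """List the gaps between the testing policy and the current state."""
    findings = []
    if today - last_full_test > MAX_TEST_AGE:
        findings.append("full DR test overdue")
    for change in changes:
        if not change.applied_to_dr:
            findings.append(f"not yet applied to DR: {change.description}")
        elif not change.dr_tested:
            findings.append(f"applied to DR but untested: {change.description}")
    return findings


changes = [
    Change("database upgrade", applied_to_dr=True, dr_tested=False),
    Change("new firewall rule", applied_to_dr=False, dr_tested=False),
]
for finding in dr_test_findings(date(2011, 1, 10), changes, today=date(2011, 9, 1)):
    print(finding)
```

An empty list means the DR environment is current with the policy; anything else is a to-do item before the next disaster, not after it.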

There are some good sample Disaster Recovery plans available on the Internet. These can be used as a framework from which you can build your plan.

I've skimmed at 50,000 feet here but there are plenty of resources available to help you drill into as much detail as you require.

 

Contact NewVista Advisors at:      Sales@NVAdvisors.com

©2006 to 2011 NewVista Advisors, llc - All Rights Reserved - 22 Indian Wells Road; Brewster, New York 10509 845-278-0617
