How Should the IT Department Prepare
for Disaster Recovery?
Preparing for a disaster must start
with a comprehensive Disaster Recovery plan that is
agreed to by all stakeholders. The plan must start with
the business goals, specifically the Business Continuity
Plan if one exists.
A challenge I have often come up
against is that the business sometimes (often in smaller
companies) has neither plan in place. If that is your
situation, I would recommend that the IT organization
take the lead on defining the business requirements,
which means discussions with the various business
leaders to understand their requirements and using those
to drive the plan.
Here are the usual steps in defining
a Disaster Recovery Plan:
-
Define the type of disaster(s) you are planning to
survive: a hardware-based disaster, a local building
disaster, a city-wide disaster, a regional disaster,
etc.
Some may result in a
simple local systems fail-over or repair, others
will require a more complex recovery scenario.
-
Plan for the location of your fail-over site. It is
critical that this site is beyond the boundaries of
the worst case disaster you are planning for.
-
Identify the systems that are critical to continuity
of your business.
This
requires a significant interaction with the business
management.
During this phase you
will define, for each system, the Recovery Time
Objective (RTO): how long after the systems fail it
may take to bring the recovery system online, and the
Recovery Point Objective (RPO): how old the most
recent data replicated to the backup site may be.
Both of these numbers
will vary widely from system to system, depending on
each system’s criticality.
As an example, an
accounting system may have a 24 hour recovery time
and a 24 hour recovery point: essentially, bring it
back up with last night’s backup data.
A trading system may
have a recovery time of minutes with a zero Recovery
Point, meaning that transactions in flight need to
be replicated (a transaction is not committed until
the data is written to both the primary and
fail-over databases).
-
Determine the resources required to run the business
critical systems identified above.
This
means people and technology.
-
Determine how replication between the sites will be
done. This often varies depending on the application.
-
Establish private connectivity between the sites.
-
Determine how changes to network connectivity will
be accomplished when a disaster happens and a
fail-over event is declared.
-
Define who has the authority and responsibility to
declare a disaster and invoke the Business Recovery
Plan and Disaster Recovery Plan.
This impacts the entire
business, so this is sometimes a group of senior
managers.
The communications and
decision process must be defined within the plans.
-
Define where your employees will go when your
primary work site is unavailable.
Roles and
responsibilities during a disaster event MUST be
defined and articulated in the plan.
-
Establish policies of how employees will connect to
the data or network.
-
Establish a notification system to advise everyone
who needs to know when the Business Continuity Plan
and associated Disaster Recovery Plan are activated.
-
Test and drill to identify what did not work well
and constantly refine the process.
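The RTO and RPO numbers from the step above are easier to discuss with the business once they are written down per system. Here is a minimal sketch of such an inventory in Python. The class name, the function, the thresholds and the 5 minute figure for the trading system are my own illustrative assumptions, not part of any standard; only the accounting and trading examples come from the text above.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class SystemRequirement:
    """One business-critical system and its agreed recovery targets."""
    name: str
    rto: timedelta  # Recovery Time Objective: maximum acceptable downtime
    rpo: timedelta  # Recovery Point Objective: maximum acceptable data age


def replication_approach(req: SystemRequirement) -> str:
    """Suggest a replication style from the RPO (illustrative thresholds)."""
    if req.rpo == timedelta(0):
        return "synchronous replication"
    if req.rpo < timedelta(hours=24):
        return "asynchronous replication"
    return "nightly backup shipped to the DR site"


# The two example systems; "minutes" is assumed to mean 5 minutes here.
inventory = [
    SystemRequirement("accounting", rto=timedelta(hours=24), rpo=timedelta(hours=24)),
    SystemRequirement("trading", rto=timedelta(minutes=5), rpo=timedelta(0)),
]

# List the systems with the tightest recovery time first.
for req in sorted(inventory, key=lambda r: r.rto):
    print(f"{req.name}: RTO {req.rto}, RPO {req.rpo} -> {replication_approach(req)}")
```

A table like this also gives you something concrete to revisit when the business revises its expectations.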
Defining the plan is an iterative
process. The initial expectation will often be
instantaneous failover of an application with no data
loss. Once the business understands the cost of such an
implementation, more serious discussions can take place.
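To show where part of that cost comes from, here is a rough sketch of what "no data loss" implies: a write is acknowledged only after both sites have it. The `Site` class and `commit_sync` function are hypothetical in-memory stand-ins of my own, not a real database API.

```python
class Site:
    """In-memory stand-in for the database at one site (hypothetical)."""

    def __init__(self, name: str, available: bool = True):
        self.name = name
        self.available = available
        self.rows = []

    def write(self, record: str) -> bool:
        """Store the record; return False if the site cannot be reached."""
        if not self.available:
            return False
        self.rows.append(record)
        return True


def commit_sync(record: str, primary: Site, failover: Site) -> bool:
    """Commit only if both sites hold the record (zero Recovery Point)."""
    if not primary.write(record):
        return False
    if not failover.write(record):
        primary.rows.pop()  # undo the primary write so the sites never diverge
        return False
    return True


primary, dr = Site("primary"), Site("dr")
print(commit_sync("trade-1", primary, dr))  # both sites reachable

dr.available = False  # simulate losing the link to the DR site
print(commit_sync("trade-2", primary, dr))  # refused rather than risk data loss
```

In this sketch every transaction waits on the DR site, and the primary stops accepting writes when the DR site is unreachable. That trade-off is usually what moves the conversation toward a non-zero Recovery Point for less critical systems.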
Some observations from past efforts I have been involved
in:
-
The plan
must clearly define the criteria for declaring a
disaster as well as the governance aspects: who can
declare, who gets notified, etc. This plan must be
communicated to all stakeholders.
-
Many
requirements are driven by regulation,
especially in industries such as Financial Services
and Healthcare.
-
The DR
implementation typically does not support the entire
staff, only the critical subset of staff that is
needed to keep the business running until the
disaster is addressed. This may be a few people in
each department.
-
The DR
technical implementation is often smaller than the
primary environment and may be a virtualized version
set up (loaded) at the point of the disaster. I have
often seen a Development or QA environment
“repurposed” during a disaster to become the
Disaster Recovery platform.
-
Connectivity
becomes a major component of a successful plan.
There will likely be a disaster site which houses
some number of the most critical staff (such as
traders in a Financial Services firm) and then
support for staff to connect remotely from home or
another location.
-
Data
replication becomes one of the most important and
often challenging parts of the plan. The Recovery
Point for a DR fail-over essentially deals with how
much data can be lost when the systems swing to the
DR site. This can range from no data loss to
substantial loss and will be different for each
application.
-
Location of
the DR site is of major importance: far enough from
the primary site to ensure it is not impacted by the
event that causes the disaster and close enough to
the company’s IT staff to allow it to be managed.
Use of colo facilities can help manage this
particular issue.
If the company has multiple
locations, paired data centers can be considered: in
other words, multiple primary data centers that are used
to back up each other.
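The "far enough but close enough" test for a DR site, or for a pair of data centers, can be checked with simple arithmetic. The sketch below uses the standard haversine great-circle formula. The coordinates (roughly New York and Philadelphia), the 50 km disaster radius and the 300 km management limit are made-up example values; your own thresholds come from the disaster types you chose to plan for.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def distance_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points, via the haversine formula."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def site_is_acceptable(dist_km: float, disaster_radius_km: float,
                       max_manageable_km: float) -> bool:
    """True if the site is outside the disaster zone yet still manageable."""
    return disaster_radius_km < dist_km <= max_manageable_km


# Hypothetical primary and DR sites (approximate city coordinates).
d = distance_km(40.7128, -74.0060, 39.9526, -75.1652)
print(f"Distance between sites: {d:.0f} km")
print("Acceptable:", site_is_acceptable(d, disaster_radius_km=50, max_manageable_km=300))
```

Straight-line distance is only a first filter. Two sites can be far apart and still share a power grid, a flood plain or a network carrier.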
Don’t forget to consider the
fail-back portion of the plan. This is non-trivial and
includes both the criteria for fail-back as well as the
methodology for replicating new data back from the DR
site to the primary site.
Also, don’t forget the office
communications technologies, namely phones and fax
machines. Some number of these will need to be
redirected to alternate numbers.
Finally: test, test, test! Once the
entire environment is set up there should be an actual
DR test at least twice each year. This semi-annual test
should include the business folks. When changes are made
to the primary environment, those changes should be
reflected in the DR implementation and tested (as a
separate test from the semi-annual test).
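The two testing rules above are simple enough to automate as a reminder. This is a minimal sketch: the function name, the status strings and the 183 day interval (my approximation of "twice each year") are assumptions for illustration.

```python
from datetime import date, timedelta

TEST_INTERVAL = timedelta(days=183)  # assumed meaning of "twice each year"


def dr_test_status(last_test: date, last_primary_change: date, today: date) -> str:
    """Report which DR test, if any, is currently owed."""
    if last_primary_change > last_test:
        # The primary environment changed after the last test.
        return "change test needed"
    if today - last_test > TEST_INTERVAL:
        return "semi-annual test overdue"
    return "up to date"


# Example dates are invented for illustration.
print(dr_test_status(date(2024, 1, 15), date(2023, 12, 1), date(2024, 9, 1)))
print(dr_test_status(date(2024, 6, 1), date(2024, 7, 10), date(2024, 9, 1)))
```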
There are some good sample Disaster
Recovery plans available on the Internet. These can be
used as a framework from which you can build your plan.
I've skimmed at 50,000 feet here but there are plenty of
resources available to help you drill into as much
detail as you require.