SteelEye LifeKeeper

Leading High Availability and Disaster Recovery Solution

Disaster Recovery Solutions

The disaster recovery solution uses remote standby servers to take over an application when a server or application fails. Data is replicated in real time to the remote site using either data replication software or storage devices.

LifeKeeper runs on every server in the cluster, monitoring the health of the systems and the active applications. When a failure is detected, recovery procedures start automatically. Two disaster recovery configurations are currently available:

  • 2-node: one server at the local site and one server at the recovery site.
  • 3-node: two servers at the local site and one server at the recovery site.

A basic 2-node solution provides off-site disaster recovery and real-time backups. The more sophisticated 3-node solution adds high availability and local failover for those not-quite-disastrous occasions.

The 3-node solution is preferable where budget permits: it allows easier recovery from failures that are not disasters, keeps the application highly available if the WAN fails, and is therefore a cleaner overall solution.
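The failure detection mentioned above can be illustrated with a heartbeat timeout. This is a minimal sketch and not LifeKeeper's actual mechanism or API: the class name, the 5-second interval and the limit of three missed heartbeats are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatMonitor:
    """Tracks heartbeats from a peer server (hypothetical illustration)."""
    interval_s: float = 5.0   # assumed heartbeat period, not a LifeKeeper default
    missed_limit: int = 3     # assumed number of missed heartbeats tolerated
    last_seen_s: float = 0.0  # timestamp of the most recent heartbeat

    def record_heartbeat(self, now_s: float) -> None:
        """Remember when the peer was last heard from."""
        self.last_seen_s = now_s

    def peer_failed(self, now_s: float) -> bool:
        """True once the silence exceeds the tolerated number of intervals."""
        return (now_s - self.last_seen_s) > self.interval_s * self.missed_limit


monitor = HeartbeatMonitor()
monitor.record_heartbeat(now_s=100.0)
print(monitor.peer_failed(now_s=110.0))  # 10 s of silence, within the 15 s tolerance
print(monitor.peer_failed(now_s=120.0))  # 20 s of silence, recovery would start
```

Tolerating a few missed heartbeats before declaring a failure is a common way to avoid failing over on a brief network glitch.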

2-node disaster recovery

Data replication is used to maintain separate identical local copies of the application data on the two servers. With the application active on the primary server, all updates to the application data are automatically replicated to the standby server.

When a failure occurs, the application is automatically started on the standby server, where it continues operating on the mirrored copy of the data. When the primary server is returned to service, the direction of data replication can be reversed. After an initial resynchronization that brings the primary server up to date with any data changes made while it was unavailable, the primary server can be returned to front-line service.

There is no need to copy the entire disk when a server is returned to service: only changed data is replicated, and LifeKeeper Data Replication does not need to replicate white space. This allows multi-gigabyte data stores to operate over relatively low-speed connections.
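Replicating only changed data is commonly done by tracking which blocks were written while the peer was unavailable. The sketch below shows that general idea under stated assumptions; it is not LifeKeeper Data Replication's implementation, and `MirroredVolume`, `BLOCK_SIZE` and the method names are hypothetical.

```python
BLOCK_SIZE = 4  # bytes per block; deliberately tiny so the example is easy to follow


class MirroredVolume:
    """A toy block device that remembers which blocks its peer has missed."""

    def __init__(self, size_blocks):
        self.data = bytearray(size_blocks * BLOCK_SIZE)
        self.dirty = set()  # indices of blocks written while the peer was unreachable

    def write(self, block, payload, peer=None):
        """Write one block; mirror it to the peer if one is reachable."""
        if len(payload) != BLOCK_SIZE:
            raise ValueError("payload must be exactly one block")
        start = block * BLOCK_SIZE
        self.data[start:start + BLOCK_SIZE] = payload
        if peer is not None:
            peer.data[start:start + BLOCK_SIZE] = payload  # normal replication
        else:
            self.dirty.add(block)  # remember the block for resynchronization

    def resync(self, peer):
        """Copy only the dirty blocks to the peer; return how many were sent."""
        sent = 0
        for block in sorted(self.dirty):
            start = block * BLOCK_SIZE
            peer.data[start:start + BLOCK_SIZE] = self.data[start:start + BLOCK_SIZE]
            sent += 1
        self.dirty.clear()
        return sent


primary = MirroredVolume(1000)
standby = MirroredVolume(1000)
standby.write(7, b"abcd")   # primary is down, so the change is only tracked
standby.write(42, b"wxyz")
print(standby.resync(primary), "of 1000 blocks copied on return to service")
```

Because the amount of data sent depends on how much changed rather than on the size of the disk, this style of resynchronization suits slow links.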

Failover does not affect clients

When the application migrates to the standby server, LifeKeeper also migrates IP addresses and hostnames. Clients normally do not even notice the failover; at worst they simply need to reconnect to the server to re-establish their session.

Advantages of the 2-node approach

  • Minimises hardware and licensing costs.
  • Ensures off-site, real-time backups of all critical data.

3-node Disaster Recovery Solution - combining local recovery and disaster recovery

Because data can be replicated to more than one server at a time, it can be sent to both a local server and a remote server. When the active application or server fails, the local standby server takes over the application with minimum disruption to users and clients. The remote standby server is held in reserve for a site-wide disaster: a failure of the whole local site causes the server at the remote site to run the application.

The application is active on the primary server, and a second local server is configured as a local standby. Application data is also replicated to the remote system. The result is a 3-node cluster consisting of two local systems and a third, remote system that receives data updates via data replication over a wide area network (WAN).

When the local application or server fails, the local standby server takes over the application.

When the application is unable to run on either local server, it runs on the off-site server.
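The order of preference described above (local primary, then local standby, then the remote server) can be expressed as a simple priority list. This is an illustrative sketch only: the node names and the `choose_active_node` function are hypothetical and do not reflect how LifeKeeper is configured.

```python
from typing import Dict, List, Optional

# Hypothetical node names, listed in order of preference.
PRIORITY: List[str] = ["local-primary", "local-standby", "remote-standby"]


def choose_active_node(healthy: Dict[str, bool],
                       priority: List[str] = PRIORITY) -> Optional[str]:
    """Return the first healthy node in priority order, or None if all are down."""
    for node in priority:
        if healthy.get(node, False):
            return node
    return None


# Primary failure: the local standby takes over.
print(choose_active_node({"local-primary": False,
                          "local-standby": True,
                          "remote-standby": True}))

# Site failure: both local servers are down, so the remote server runs the application.
print(choose_active_node({"local-primary": False,
                          "local-standby": False,
                          "remote-standby": True}))
```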

When a site failure occurs

The application data is already on the remote server so the application is able to migrate to the remote site with little or no disruption to the users. This migration can be automatic or manual.

Return to service

When a local server is returned to service, the direction of data replication is reversed automatically, so that data flows from whichever server is currently active (local or remote) to the returning server. The main server is switched back to being active as soon as data replication is complete.

Advantages of this approach

  • Little or no disruption is experienced by users when a server or application failure occurs.
  • Local failover can be used for administration purposes such as upgrades and maintenance.
  • Local failover is in place for ‘everyday’ disruptions such as network glitches.
  • The remote server is only used for site disasters.
  • WAN replication can be stopped during peak traffic and restarted as required.
  • If the WAN link is unavailable, the local copy of the data is still available and protected.


Original Source: www.openminds.co.uk