Disaster Recovery

In our digital age a company’s data is often its most valuable asset.  It not only contains potentially proprietary or sensitive information, but is the result of countless hours of work.  The loss of a company’s data can be devastating and planning for disaster recovery is crucial.

The ability to restore IT systems and the data they contain provides protection against a myriad of issues, such as accidental deletion, crypto-locker attacks, catastrophic system failures, fires, natural disasters, theft, file corruption and malicious acts.

In this blog we will discuss business continuity planning, which includes system backups, off-site disaster recovery (DR) sites, disaster recovery planning, hardware and software support contracts, documentation and recovery testing.

The starting point for business continuity planning is driven by an organization’s tolerance for downtime.  In a disaster, some businesses may be able to tolerate a week or more of downtime, while others cannot.  There is always a tradeoff between resiliency and cost.  Understanding an organization’s resiliency needs will drive the business continuity design.

Business continuity planning and fault tolerance can also include moving data to cloud services such as Microsoft SharePoint, Google Drive, and QuickBooks Online.  This can reduce or eliminate reliance on onsite systems. See our previous blog “SharePoint Migration to Enable Co-working”.

The first basic rule of system backups is the “3-2-1 rule”: keep at least three copies of your data, on two different types of media, with at least one copy stored offsite.  This approach provides fault tolerance and greatly improves the odds that you can restore your data.  Offsite backup is critical protection against crypto-locker attacks and against disasters that affect your IT systems, such as fires, floods, earthquakes or theft.  The 3-2-1 rule is a starting point for system backups; additional layers of protection are recommended.
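As a rough illustration, the 3-2-1 rule can be expressed as a simple check over an inventory of backup copies.  This is only a sketch: the BackupCopy type, its field names and the sample inventory are hypothetical and not taken from any particular backup product.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BackupCopy:
    name: str      # hypothetical label for this copy
    media: str     # type of media, e.g. "nas", "lto_tape", "cloud"
    offsite: bool  # True if stored away from the primary location


def satisfies_3_2_1(copies: List[BackupCopy]) -> bool:
    """Check for at least 3 copies, 2 media types and 1 offsite copy."""
    enough_copies = len(copies) >= 3
    enough_media = len({c.media for c in copies}) >= 2
    has_offsite = any(c.offsite for c in copies)
    return enough_copies and enough_media and has_offsite


# Hypothetical inventory: two local copies plus one cloud replica.
copies = [
    BackupCopy("nightly-nas", "nas", offsite=False),
    BackupCopy("weekly-tape", "lto_tape", offsite=False),
    BackupCopy("cloud-replica", "cloud", offsite=True),
]
print(satisfies_3_2_1(copies))  # True
```

Dropping the cloud replica from the list makes the check fail twice over: only two copies remain, and none of them is offsite.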

When designing any automated backup process, it’s important to keep in mind that “no news” doesn’t necessarily mean “good news.”  Monitoring both the successes and the failures of your system backups is critical.  If you only receive alerts when a backup fails, a backup process that has stopped running entirely will send nothing at all, and you may not learn about the problem until you need to restore.
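One way to guard against silent failures is to alert on the absence of a recent success rather than on the presence of a failure.  The sketch below assumes you can obtain the timestamp of the last successful backup from your backup software; the function name and the 26-hour threshold (a nightly job plus some slack) are our own illustrative choices.

```python
from datetime import datetime, timedelta
from typing import Optional


def backup_is_stale(
    last_success: Optional[datetime],
    now: datetime,
    max_age: timedelta = timedelta(hours=26),  # assumed: nightly job plus slack
) -> bool:
    """Return True if no sufficiently recent successful backup is on record."""
    if last_success is None:
        # No success has ever been recorded: treat as a problem, not as "no news".
        return True
    return now - last_success > max_age


# Hypothetical example: the last good backup finished 31 hours ago.
now = datetime(2024, 1, 10, 8, 0)
last_success = datetime(2024, 1, 9, 1, 0)
print(backup_is_stale(last_success, now))  # True
```

Because the check runs independently of the backup job itself, it still raises a flag when the job never starts.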

A wide range of backup targets is available, such as LTO tape drives, Synology NAS drives and backup/DR appliances from vendors like Datto, along with software to manage backups, such as Symantec Backup Exec and Veeam.  Regardless of the specific technology, it’s important to understand the time required to restore your data.  If your backup solution is 100% cloud based, you will be limited by your Internet download speed when recovering data.  For small sets of files this may be adequate, but for large databases, restoration could take days.
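A quick back-of-the-envelope calculation shows why.  The sketch below estimates restore time from data size and download speed; the function name and the 80% efficiency factor (an allowance for protocol overhead and a shared connection) are assumptions for illustration, not measured values.

```python
def estimate_restore_hours(
    data_gb: float,
    download_mbps: float,
    efficiency: float = 0.8,  # assumed fraction of line speed actually achieved
) -> float:
    """Estimate the hours needed to download data_gb over a download_mbps link."""
    if data_gb < 0 or download_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("sizes and speeds must be positive; efficiency in (0, 1]")
    megabits = data_gb * 8 * 1000  # 1 GB = 8,000 megabits (decimal units)
    seconds = megabits / (download_mbps * efficiency)
    return seconds / 3600


# Hypothetical example: a 2 TB database over a 100 Mbps connection.
print(round(estimate_restore_hours(2000, 100), 1))  # 55.6
```

Under these assumptions, a 2 TB restore takes roughly 56 hours, or more than two days, before any time spent rebuilding servers or validating the data.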

If you back up to portable hard drives or LTO tapes, be sure to make a regular practice of rotating sets of media offsite.  For some clients this may be achieved by having a service, such as Iron Mountain, pick up and drop off sets of media on a weekly basis.  For systems that back up to NAS drives, media servers or backup/DR appliances, data is typically backed up locally and then replicated to an offsite cloud repository.

For organizations where minimal downtime is critical to business operations, redundant systems and data can be hosted in a disaster recovery data center, typically in a different geographic area.  These “hot” sites can provide near-instantaneous failover in the event of a system failure at the primary data center, but maintaining the redundant systems can come at a steep cost.

The adage “an ounce of prevention is worth a pound of cure” applies aptly to IT systems disaster recovery planning.  Preventing systems outages is far less expensive and disruptive than any recovery process.  For our clients this means actively monitoring the health of critical systems so that issues can be resolved before they create downtime.  Systems that have redundant components, such as hard drives, fans and power supplies, allow failed components to be replaced before they cause a disruption.  It is therefore important to have systems in place to monitor these components so that corrective action can be taken.

For example, Hewlett Packard Enterprise (HPE) servers can provide email and/or text notifications via their “Integrated Lights-Out” (iLO) platform or HP OpenView.  Meraki and Datto devices can provide notification if they fail to report in to their respective cloud management consoles.  These notifications not only provide insight into issues with a particular device, but also into the loss of Internet services.  Imagine having your IT support provider resolve a loss of Internet service over the weekend, long before the first person shows up at the office on Monday morning.

Maintaining active support contracts for critical hardware and software systems is key to maintaining system up-time.  Support contracts can provide overnight replacement of failed hardware components and vendor help in troubleshooting hardware or software issues.  The ability of IT support to restore system functionality can be severely hampered if replacement hardware isn’t readily available or if the vendor’s technical support resources are out of reach.

The worst time to scramble for the critical information needed to restore services is when disaster strikes and lost productivity and revenue are increasing the pressure on IT support.  As a preventative measure, maintain adequate documentation so that someone else can step into the role of providing IT support services.  They should not be hampered by a lack of network diagrams, vendor account numbers and phone numbers, or administrative passwords.  Creating and maintaining this critical documentation can significantly reduce downtime.

Practice doesn’t make perfect; perfect practice makes perfect.  Good intentions cannot replace recovery testing: actually practicing the recovery of critical IT systems.  Facing a systems failure is the wrong time to research how to perform a systems recovery.  Practicing your disaster recovery plan is the only way to uncover gaps in configuration, faulty assumptions about recovery timelines, and a myriad of other issues.
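Part of a recovery test can be automated.  As a minimal sketch, the script below compares a restored folder against a known-good reference copy by checksum and reports files that are missing or different.  The function names, folder layout and sample files are hypothetical, and it assumes the reference copy has not changed since the backup was taken.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import List


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(source_dir: Path, restored_dir: Path) -> List[str]:
    """List files that are missing from, or differ in, the restored copy."""
    problems = []
    for src in sorted(source_dir.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(source_dir)
        restored = restored_dir / rel
        if not restored.is_file():
            problems.append(f"missing: {rel}")
        elif sha256_of(src) != sha256_of(restored):
            problems.append(f"mismatch: {rel}")
    return problems


# Hypothetical test restore: one file is corrupted and one was never restored.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "source"
    restored = Path(tmp) / "restored"
    source.mkdir()
    restored.mkdir()
    (source / "ledger.db").write_bytes(b"good data")
    (source / "notes.txt").write_bytes(b"notes")
    (restored / "ledger.db").write_bytes(b"corrupted")
    problems = verify_restore(source, restored)

print(problems)  # ['mismatch: ledger.db', 'missing: notes.txt']
```

A file-level check like this does not prove that an application or database will start cleanly after a restore, so it complements a full recovery exercise rather than replacing it.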

The feedback from practicing will help you adjust current methods and improve resiliency.  Sometimes what’s needed can be as simple as a small wallet-sized card with critical phone numbers, or updating contact information with vendors when there is staff turnover.  Often an organization’s internal IT resources have limited experience performing systems recovery.  Maintaining a relationship with an experienced IT support organization can be what’s needed to significantly reduce downtime.

The best time to plant a tree is yesterday.  The second best time is today.  With the insights from this blog, today is the perfect time to begin disaster recovery planning for your organization.  Please reach out to Sound Network Integration if we can help you with this work.