Hindsight is nearly always perfect. It is a condition in which the full consequences of our actions become clear. In hindsight, variables to which we assigned little significance morph into giants. In this rear-view mirror we sometimes disparage ourselves for flawed decisions that make the present more challenging.
One doesn’t need to be blessed with foresight to know that the world around us is becoming reliant on IT and computing resources. Technology allows us to stay tethered to our businesses in real-time when we’re on the go, but the complexity that powers such things as up-to-the-minute inventory availability queries for 20 warehouses is staggering.
Functionality has also played an important role in our heightened dependence on information management systems. Today it’s all about the applications that companies use to drive their competitive advantage.
How these applications are built has also changed. Rather than being hosted on a single central server, applications are now composed of a network of heterogeneous, interconnected systems. If any component fails, there's a good chance your customer won't be able to complete a transaction, leading to lost orders, missing or delayed projects, and ultimately a great deal of frustration for the user and the business.
In sum, technology has made us more efficient and responsive, and in doing so we have become dependent. Without an Internet connection or access to email or inventory, we're helpless. You don't have to be a bank or an airline to know how costly it can be to be offline. Today, small and medium-sized businesses are dependent on information management technology. Not having a smart high availability and disaster recovery strategy in place is something you could definitely come to regret.
Inadequate downtime prevention results in missed opportunities, unclosed sales, shipments not delivered within a contractual period of time, incorrect or incomplete information for the company's management to make critical strategic decisions, and ultimately poor customer satisfaction. While many of the circumstances that lead to downtime cannot be controlled, there are several ways to proactively reduce or eliminate system outages. But first, one needs to consider the range of potential causes.
The roots of downtime
Servers can be taken offline by unplanned events, or by circumstances that require technicians to take them offline in the normal course of business. It is important to note that the latter causes the lion's share of downtime.
Unplanned outages are usually caused by:
- Failure of system capabilities and capacities
- Failure of the CPU or other non-resilient system components
- Internal or external storage disk failure
- Software failure resulting in application downtime
- Human error causing disruption of the application or system
- Natural disasters such as fires, earthquakes, tornadoes, or power outages
Planned outages include:
- System backups in which all users are required to sign off from the system
- Operating system upgrades
- Hardware upgrades or migrations
- File reorganisation processes that purge old and deleted records
- Batch jobs that require dedicated systems resources
Tape and External Drives rule (sort of)
The overwhelming reliance on basic backup methods is somewhat perplexing since these offer very limited protection against downtime. There’s no question that tape and external drive backups can meet a requirement around data recovery, but recovery windows are often protracted.
These methods also don't fully protect your business from data loss as they are based on copying data at specific points in time. However, the biggest problem is that recovery can be a dicey process in which even a veteran technician can find it difficult to restore a backup. All this adds up to missed opportunities and critical data loss.
Here’s a short list of common problems that can be associated with relying on external drives or tape.
- Relying on an external drive – Since external drives often reside in the same environment with production systems they are subject to the same hazards. A power surge or fire could destroy your production and backup data in a single stroke.
- Inadequate testing of backup media – Testing backups is often an entirely manual endeavour. When the workload in the IT department swells, one task that often gets shelved is backup testing. Unfortunately, a backup volume has no value if it cannot be restored.
- Ignoring the fact that you have a single point of failure – External drive or tape backups are usually the only recovery path available after a production server failure. Since most IT shops rely exclusively on these backup methods and don’t adequately evaluate their vulnerabilities, they have nothing to fall back on when their backups fail to restore.
- Not performing full backups – Backups that consist of just application data complete more quickly and consume less storage than full system backups; hence, technicians most often focus on data. If a server fails and a full recovery needs to be performed quickly, you'll need an up-to-date copy of the operating system and all of your production applications at current release levels. It's far more difficult and time consuming to rebuild a system manually than to restore from a full system backup taken in advance.
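The backup-testing pitfall above can be reduced with a little automation. Here is a minimal sketch, in Python, of one way to verify that a backup actually restores: extract the archive into a scratch directory and confirm each file's checksum against a manifest recorded at backup time. The function names and the tar-archive format are illustrative assumptions, not any particular vendor's tooling.

```python
import hashlib
import tarfile
import tempfile
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Return the SHA-256 digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def verify_backup(archive: Path, manifest: dict) -> bool:
    """Restore an archive into a scratch directory and confirm that every
    file in the manifest is present with a matching checksum."""
    with tempfile.TemporaryDirectory() as scratch:
        with tarfile.open(archive) as tar:
            tar.extractall(scratch)
        for relpath, expected in manifest.items():
            restored = Path(scratch) / relpath
            if not restored.is_file() or sha256_of(restored) != expected:
                return False
    return True

# Demo: build a tiny sample backup, then prove it restores cleanly.
with tempfile.TemporaryDirectory() as src:
    data = Path(src) / "orders.csv"
    data.write_text("id,total\n1,99.50\n")
    manifest = {"orders.csv": sha256_of(data)}
    archive = Path(src) / "backup.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(data, arcname="orders.csv")
    print(verify_backup(archive, manifest))  # True
```

Scheduling a check like this to run after every backup job turns "shelved" testing into a routine, unattended task.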
HA is a sound investment
One way to position yourself for a full recovery is through infrastructure redundancy and high availability (HA). HA involves protecting your critical applications against problems in the first place. HA software essentially captures disk writes at the host's file system layer, while all applications (including virtual machines writing to virtual disks) operate above that layer.
This allows the solution to protect data by replicating it to a target repository server running another copy of the HA solution. Advanced HA systems use replication and failover capabilities that continuously capture byte-level changes as they occur and replicate those changes to another server, either locally or over a WAN connection to a remote location.
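To make the idea of byte-level change replication concrete, here is a toy Python sketch that ships only the blocks of a source file that differ from its replica. This is an after-the-fact scan for illustration only; a real HA product intercepts writes continuously at the file system layer rather than comparing files. All names here are hypothetical.

```python
BLOCK = 4096  # compare and ship data in fixed-size blocks

def replicate_changes(source: str, replica: str) -> int:
    """Copy only the blocks of `source` that differ from `replica`.
    Returns the number of bytes shipped across the (simulated) link."""
    shipped = 0
    with open(source, "rb") as src, open(replica, "r+b") as dst:
        while True:
            offset = src.tell()
            s_block = src.read(BLOCK)
            if not s_block:
                break
            dst.seek(offset)
            d_block = dst.read(BLOCK)
            if s_block != d_block:
                dst.seek(offset)
                dst.write(s_block)       # ship only the dirty block
                shipped += len(s_block)
        dst.truncate(src.tell())          # keep the replica the same length
    return shipped

# Demo: seed identical files, dirty one block at the source, resync.
import os, tempfile
d = tempfile.mkdtemp()
prod, copy = os.path.join(d, "prod.db"), os.path.join(d, "copy.db")
for name in (prod, copy):
    with open(name, "wb") as f:
        f.write(b"a" * BLOCK * 3)
with open(prod, "r+b") as f:
    f.seek(BLOCK)
    f.write(b"b" * BLOCK)                 # change only the middle block
print(replicate_changes(prod, copy))      # 4096: one changed block shipped
```

The payoff of shipping only changed bytes is bandwidth: the replica stays current over a WAN link without retransmitting whole files.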
A sound HA strategy can help you:
- Lower the risk of significant business costs such as lost revenue, lost productivity, legal penalties, and brand damage caused by unplanned downtime.
- Protect business relationships with customers, partners, and suppliers by ensuring that applications and data will be available to satisfy their needs and unique schedules.
- Ensure service levels will be met by maintaining predictable Recovery Time Objectives and Recovery Point Objectives in the event of an IT outage.
- Enhance return on investment on existing resources, by keeping them available to generate revenue and support business processes.
- Ensure compliance with any applicable government and trade regulations by securing email and record retention requirements and protecting the availability of business data and reporting processes.
By using real-time replication to a single set of repository servers, HA systems allow you to minimise the amount of hardware you need to dedicate to the backup process for your organisation. This limited number of repository servers can host both an up-to-date copy of all protected data and the ability to recover to any point in time, all within the same set of physical or virtual machines. Tape backup equipment may still be used, but is now limited to backing up data from the repository servers as a long-term archive.
By implementing a solution that allows you to ensure availability through intervals of planned maintenance and unplanned disasters, you build a solid foundation for ongoing business activities, revenue streams and customer satisfaction. As an adjunct benefit, and as you reduce your annual ‘unavailability’ sum to near-zero, it’s unlikely that you’ll ever have to endure the humbling requirement for hindsight.