Introduction
This module focuses on the importance of business continuity (BC), the factors that can affect information availability, and the consequences of information unavailability. It also details the BC planning process and BC technology solutions, with a specific focus on eliminating single points of failure.
Introduction
This lesson covers the importance of business continuity to an organization, the factors that can affect information availability, and the consequences of information unavailability. It also covers the information availability metrics, namely mean time between failure (MTBF) and mean time to repair (MTTR).
Explanation
In today’s world, continuous access to information is a must for the smooth functioning of business operations. The cost of unavailability of information is greater than ever, and outages in key industries cost millions of dollars per hour. There are many threats to information availability, such as natural disasters, unplanned occurrences, and planned occurrences that could result in the inaccessibility of information. Therefore it is critical for businesses to define appropriate strategies that can help them to overcome these crises. Business continuity is an important process to define and implement these strategies.
Business continuity is a process that prepares for, responds to, and recovers from a system outage that can adversely affect business operations.
Business continuity (BC) is an integrated and enterprise-wide process that includes all activities (internal and external to IT) that a business must perform to mitigate the impact of planned and unplanned downtime. BC entails preparing for, responding to, and recovering from a system outage that adversely affects business operations. It involves proactive measures, such as business impact analysis, risk assessments, BC technology solutions deployment (backup and replication), and reactive measures, such as disaster recovery and restart, to be invoked in the event of a failure. The goal of a BC solution is to ensure the “information availability” required to conduct vital business operations.
In a virtualized environment, BC technology solutions need to protect both physical and virtualized resources. Virtualization considerably simplifies the implementation of BC strategy and solutions.
Information availability (IA) is the ability of an IT infrastructure to function according to business expectations during its specified time of operation.
IA ensures that people (employees, customers, suppliers, and partners) can access information whenever they need it. IA can be defined in terms of the accessibility, reliability, and timeliness of the information.
Various planned and unplanned incidents result in information unavailability. Planned outages include installation/integration/maintenance of new hardware, software upgrades or patches, taking backups, application and data restores, facility operations (renovation and construction), and refresh/migration of the testing to the production environment. Unplanned outages include failure caused by human errors, database corruption, and failure of physical and virtual components.
Natural or man-made disasters, such as floods, fires, and earthquakes, are another type of incident that may cause data unavailability.
As illustrated in the figure, the majority of outages are planned. Planned outages are expected and scheduled but still cause data to be unavailable. Statistically, unforeseen disasters account for less than 1 percent of information unavailability.
Information unavailability or downtime results in loss of productivity, loss of revenue, poor financial performance, and damage to reputation. Loss of productivity includes reduced output per unit of labor, equipment, and capital. Loss of revenue includes direct loss, compensatory payments, future revenue loss, billing loss, and investment loss. Poor financial performance affects revenue recognition, cash flow, discounts, payment guarantees, credit rating, and stock price. Damage to reputation may result in a loss of confidence or credibility with customers, suppliers, financial markets, banks, and business partners. Other possible consequences of downtime include the cost of additional equipment rental, overtime, and extra shipping.
The business impact of downtime is the sum of all losses sustained as a result of a given disruption. An important metric, average cost of downtime per hour, provides a key estimate in determining the appropriate BC solutions. It is calculated as follows:
Average cost of downtime per hour = average productivity loss per hour + average revenue loss per hour
Where:
Average productivity loss per hour = (total salaries and benefits of all employees per week) / (average number of working hours per week)
Average revenue loss per hour = (total revenue of an organization per week) / (average number of hours per week that an organization is open for business)
The average downtime cost per hour may also include estimates of projected revenue loss due to other consequences, such as damaged reputations, and the additional cost of repairing the system.
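The calculation above can be sketched in a few lines of Python. This is only an illustration of the formula as stated; the function name, parameter names, and the weekly figures are hypothetical and are not taken from any real organization.

```python
def average_cost_of_downtime_per_hour(
    weekly_salaries_and_benefits: float,
    weekly_working_hours: float,
    weekly_revenue: float,
    weekly_business_hours: float,
) -> float:
    """Average productivity loss per hour plus average revenue loss per hour."""
    productivity_loss_per_hour = weekly_salaries_and_benefits / weekly_working_hours
    revenue_loss_per_hour = weekly_revenue / weekly_business_hours
    return productivity_loss_per_hour + revenue_loss_per_hour


# Hypothetical inputs: 200,000 in weekly salaries and benefits over a
# 40-hour working week, and 1,000,000 in weekly revenue over 50 open hours.
cost = average_cost_of_downtime_per_hour(200_000, 40, 1_000_000, 50)
print(f"Average cost of downtime per hour: {cost:,.0f}")
```

With these made-up inputs the productivity loss is 5,000 per hour and the revenue loss is 20,000 per hour. Projected losses from damaged reputation or repair costs, if estimated, would be added on top of this figure.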
Information availability relies on the availability of both physical and virtual components of a data center. Failure of these components might disrupt information availability. A failure is the termination of a component’s ability to perform a required function. The component’s ability can be restored by performing an external corrective action, such as a manual reboot, a repair, or replacement of the failed component(s). Proactive risk analysis, performed as part of the BC planning process, considers the component failure rate and average repair time, which are measured by mean time between failure (MTBF) and mean time to repair (MTTR):
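MTBF is the average time available for a system or component to perform its normal operations between failures. MTBF is calculated as: Total uptime / Number of failures
MTTR is the average time required to repair a failed component. MTTR is calculated as: Total downtime / Number of failures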
Information Availability (IA) can be expressed in terms of system uptime and downtime and measured as the amount or percentage of system uptime:
IA = system uptime / (system uptime + system downtime)
Where system uptime is the period of time during which the system is in an accessible state; when it is not accessible, it is termed as system downtime.
In terms of MTBF and MTTR, IA could also be expressed as: IA = MTBF / (MTBF + MTTR)
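As a rough check that the two expressions for IA agree, the sketch below computes both from the same observations. It assumes the usual definition of MTBF as total uptime divided by the number of failures; the function names and the sample figures (8,750 hours up, 10 hours down, 5 failures) are hypothetical.

```python
import math


def mtbf(total_uptime_hours: float, number_of_failures: int) -> float:
    """Mean time between failure (assumed here: total uptime / failures)."""
    return total_uptime_hours / number_of_failures


def mttr(total_downtime_hours: float, number_of_failures: int) -> float:
    """Mean time to repair: total downtime / failures."""
    return total_downtime_hours / number_of_failures


def availability_from_metrics(mtbf_hours: float, mttr_hours: float) -> float:
    """IA = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def availability_from_times(uptime_hours: float, downtime_hours: float) -> float:
    """IA = system uptime / (system uptime + system downtime)."""
    return uptime_hours / (uptime_hours + downtime_hours)


# Hypothetical observation period.
uptime, downtime, failures = 8750.0, 10.0, 5

ia_metrics = availability_from_metrics(mtbf(uptime, failures), mttr(downtime, failures))
ia_times = availability_from_times(uptime, downtime)

print(f"IA from MTBF/MTTR:       {ia_metrics:.6f}")
print(f"IA from uptime/downtime: {ia_times:.6f}")
print("Expressions agree:", math.isclose(ia_metrics, ia_times))
```

The two results match because dividing both uptime and downtime by the same number of failures leaves the ratio unchanged.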
Uptime per year is based on the exact timeliness requirements of the service. This calculation leads to the number of “9s” representation for availability metrics. The table on the slide lists the approximate amount of downtime allowed for a service to achieve certain levels of 9s availability.
For example, a service that is said to be “five 9s available” is available for 99.999 percent of the scheduled time in a year (24 × 365).
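The downtime allowed at each level of 9s can be approximated with a short calculation. The sketch below assumes a service scheduled around the clock (24 × 365 hours per year); the function name and the list of levels are chosen here for illustration and are not taken from the table on the slide.

```python
HOURS_PER_YEAR = 24 * 365  # scheduled time for an always-on service


def allowed_downtime_hours_per_year(availability_percent: float) -> float:
    """Hours of downtime per year permitted at a given availability level."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


# Illustrative availability levels, from two 9s to five 9s.
for level in (99.0, 99.9, 99.99, 99.999):
    hours = allowed_downtime_hours_per_year(level)
    print(f"{level}% available: about {hours:.2f} hours "
          f"({hours * 60:.2f} minutes) of downtime per year")
```

For five 9s this works out to roughly 5.26 minutes of downtime per year. A service with a shorter scheduled window would use its own scheduled hours in place of `HOURS_PER_YEAR`.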
Introduction
This lesson covers various BC terminologies and BC planning. This lesson also focuses on eliminating single points of failure and multipathing software.
During this lesson the following topics are covered:
Explanation
Disaster recovery: This is the coordinated process of restoring systems, data, and the infrastructure required to support ongoing business operations after a disaster occurs. It is the process of restoring a previous copy of the data and applying logs or other necessary processes to that copy to bring it to a known point of consistency. After all recovery efforts are completed, the data is validated to ensure that it is correct.
Disaster restart: This is the process of restarting business operations with mirrored consistent copies of data and applications.
For example:
BC planning must follow a disciplined approach like any other planning process. Organizations today dedicate specialized resources to develop and maintain BC plans. From the conceptualization to the realization of the BC plan, a lifecycle of activities can be defined for the BC process. The BC planning lifecycle includes five stages:
A business impact analysis (BIA) identifies which business units, operations, and processes are essential to the survival of the business. It evaluates the financial, operational, and service impacts of a disruption to essential business processes. Selected functional areas are evaluated to determine the resilience of the infrastructure to support information availability. The BIA process leads to a report detailing the incidents and their impact on business functions. The impact may be specified in terms of money or in terms of time. Based on the potential impacts associated with downtime, businesses can prioritize and implement countermeasures to mitigate the likelihood of such disruptions. These are detailed in the BC plan. A BIA includes the following set of tasks:
After analyzing the business impact of an outage, designing the appropriate solutions to recover from a failure is the next important activity. The following are the solutions and supporting technologies that enable business continuity and uninterrupted data availability:
Note: Backup and Replication will be discussed in forthcoming modules.
A single point of failure refers to the failure of a component of a system that can terminate the availability of the entire system or IT service.
The figure depicts a system setup in which an application, running on a VM, provides an interface to the client and performs I/O operations. The client is connected to the server through an IP network, and the server is connected to the storage array through an FC connection. In this setup, each component must function as required to ensure data availability; the failure of a single physical or virtual component makes the application unavailable, which disrupts business operations. For example, failure of a hypervisor can affect all the running VMs and the virtual network hosted on it. In the figure on the slide, several single points of failure can be identified. A VM, a hypervisor, an HBA/NIC on the server, the physical server itself, the IP network, the FC switch, the storage array port, or even the storage array could be a potential single point of failure.
To mitigate single points of failure, systems are designed with redundancy, such that the system fails only if all the components in the redundancy group fail. This ensures that the failure of a single component does not affect data availability. Data centers follow stringent guidelines to implement fault tolerance for uninterrupted information availability. Careful analysis is performed to eliminate every single point of failure.
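The effect of a redundancy group on availability can be estimated with a simple probability model. The sketch below assumes that component failures are independent, which real deployments only approximate; the function names and the 0.99 availability figure are hypothetical.

```python
from math import prod
from typing import Iterable


def chain_availability(availabilities: Iterable[float]) -> float:
    """Components that must all work: the chain fails if any one fails."""
    return prod(availabilities)


def redundancy_group_availability(availabilities: Iterable[float]) -> float:
    """Redundant components: the group fails only if all of them fail."""
    return 1 - prod(1 - a for a in availabilities)


# Hypothetical figure: a single HBA that is available 99% of the time.
single_hba = 0.99
dual_hba = redundancy_group_availability([single_hba, single_hba])

print(f"Single HBA:         {single_hba:.4f}")
print(f"Two redundant HBAs: {dual_hba:.4f}")
print(f"Two HBAs in series: {chain_availability([single_hba, single_hba]):.4f}")
```

Under this model, two redundant 99 percent components yield about 99.99 percent, whereas two such components that must both work yield about 98.01 percent. This is why every component in the I/O chain is examined for redundancy.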
The example in the figure shows the enhancements in the infrastructure that mitigate single points of failure:
Configuration of multiple paths increases the data availability through path failover. If servers are configured with one I/O path to the data, there will be no access to the data if that path fails. Redundant paths to the data eliminate the possibility of the path becoming a single point of failure. Multiple paths to data also improve I/O performance through load balancing among the paths and maximize server, storage, and data path utilization.
In practice, merely configuring multiple paths does not serve the purpose. Even with multiple paths, if one path fails, I/O does not reroute unless the system recognizes that it has an alternative path. Multipathing software provides the functionality to recognize and utilize alternative I/O paths to data. Multipathing software also manages the load balancing by distributing I/Os to all available, active paths.
Multipathing software intelligently manages the paths to a device by sending I/O down the optimal path based on the load balancing and failover policy setting for the device. It also takes into account path usage and availability before deciding the path through which to send the I/O. If a path to the device fails, it automatically reroutes the I/O to an alternative path.
In a virtual environment, multipathing is enabled either by using the hypervisor’s built-in capability or by running a third-party software module, added to the hypervisor.
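The path selection and failover behavior described above can be illustrated with a toy model. This is a minimal sketch and not the algorithm of any particular multipathing product: the `Path` and `MultipathDevice` classes, the path names, and the "fewest outstanding I/Os" policy are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Path:
    name: str
    alive: bool = True
    outstanding_ios: int = 0


class MultipathDevice:
    """Toy model: send each I/O down the least-loaded path that is alive."""

    def __init__(self, paths):
        self.paths = list(paths)

    def select_path(self) -> Path:
        candidates = [p for p in self.paths if p.alive]
        if not candidates:
            raise RuntimeError("all paths to the device have failed")
        # Load balancing: prefer the path with the fewest outstanding I/Os.
        return min(candidates, key=lambda p: p.outstanding_ios)

    def send_io(self) -> str:
        path = self.select_path()
        path.outstanding_ios += 1
        return path.name

    def fail_path(self, name: str) -> None:
        # Failover: a failed path is excluded from later selections.
        for p in self.paths:
            if p.name == name:
                p.alive = False


# Hypothetical host with two HBAs providing two paths to the same device.
device = MultipathDevice([Path("hba0"), Path("hba1")])
print(device.send_io())   # first I/O goes to hba0 (both idle, first wins)
print(device.send_io())   # second I/O goes to hba1 (hba0 is busier)
device.fail_path("hba0")
print(device.send_io())   # after the failure, I/O is rerouted to hba1
```

The model never marks I/Os as complete, so the load counters only grow; a fuller sketch would decrement `outstanding_ios` on completion and periodically retest failed paths.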
EMC PowerPath
EMC PowerPath is host-based multipathing software. Every I/O from the host to the array must pass through the PowerPath software, which allows PowerPath to provide intelligent I/O path management. PowerPath provides path failover and dynamic load balancing. PowerPath/VE software allows optimizing virtual environments with PowerPath multipathing features.
Summary
This module covered the importance of business continuity, the impact of information unavailability, and information availability metrics. It also focused on business continuity planning and business impact analysis. Further, it detailed single points of failure and multipathing software.
BC entails preparing for, responding to, and recovering from a system outage that adversely affects business operations. Information unavailability or downtime results in loss of productivity, loss of revenue, poor financial performance, and damage to reputation. Information availability metrics are MTBF and MTTR. MTBF defines the average time available for a system or component to perform its normal operations between failures. MTTR defines the average time required to repair a failed component. A business impact analysis identifies which business units, operations, and processes are essential to the survival of the business. A single point of failure refers to the failure of a component that can terminate the availability of the entire system or IT service. Multipathing software provides the functionality to recognize and utilize alternate I/O paths to data. Multipathing software also manages the load balancing by distributing I/Os to all available, active paths.