OVERVIEW

With the cost of application downtime spiraling out of control, IT managers and staff are looking for ways to provide higher application availability without drastically increasing hardware or, more importantly, staffing costs. This document provides information on using VERITAS Cluster Server to increase application availability in a cost-effective way. It gives an overview of the history of clustering solutions in the open systems market and details how VERITAS clustering differs from other solutions in its control of larger, more cost-effective clusters.
A BRIEF HISTORY OF HIGH AVAILABILITY IN THE IT ENVIRONMENT

EARLY METHODS USED TO INCREASE AVAILABILITY

As open systems proliferated, IT managers became concerned with the impact of system outages on various business units. These concerns led to the development of tools to increase overall system availability. The easiest place to start was with the components that had the highest failure rates. Hardware vendors began providing systems with built-in redundant components such as power supplies and fans. These high-failure items were now protected by in-system spares that would assume duty when the primary component failed. Disk drives were another frequent point of failure, and hardware and software vendors responded with disk mirroring and RAID devices.

As individual components became more reliable, managers looked to reduce their exposure to the loss of an entire system. Just as with a spare power supply or disk drive, IT managers wanted a spare system that could take over on a system failure. Early configurations were just that: a spare system. The goal was to reduce Mean Time to Repair through better serviceability, since switching to the ready spare system required less time than repairing the original system. On a system failure, external storage containing application data would be disconnected from the failed system and connected to the spare system, and the spare would then be brought into service. This action can be called “failover”. In a properly designed scenario, the client systems would require no change to recognize the spare system. This is accomplished by having the newly promoted spare system take over the network identity of its original peer.

As storage systems evolved, the ability to connect more than one host to a storage array was developed. By “dual-hosting” a given storage array, the spare system could be brought online more quickly in the event of a failure, because it no longer had to be manually cabled to the storage when the failure occurred.
Having a system ready to use the application data led to the development of scripts to assist the spare server in functioning as a “takeover” server. In the event of a failure, the proper scripts could be run to effectively change the personality of the spare to mirror that of the original failed server. As the technology matured, networking advanced as well and began providing the capability to create a “virtual address”. Rather than changing the network identity of the takeover server to match the original server, clients are configured to communicate with a specific IP address/hostname that is brought up by whichever server is currently running the application in question. On a failure, this virtual address is brought up on the takeover server.
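
The virtual address idea can be illustrated with a small in-memory model. This is only a sketch under stated assumptions, not how any particular product implements failover: the `Server` and `FailoverPair` classes, the node names, and the address `192.0.2.10` (taken from a range reserved for documentation) are all hypothetical, and no real network interface is touched. It shows only that clients keep using one address while the server holding that address changes.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    """Hypothetical server: a name, a health flag, and the addresses it has up."""
    name: str
    healthy: bool = True
    addresses: set = field(default_factory=set)


class FailoverPair:
    """Hypothetical two-node model: an original server and a takeover server."""

    def __init__(self, primary, takeover, virtual_address):
        self.primary = primary
        self.takeover = takeover
        self.virtual_address = virtual_address
        # The virtual address starts on the server running the application.
        primary.addresses.add(virtual_address)

    def active_server(self):
        """Return whichever server currently has the virtual address up."""
        for server in (self.primary, self.takeover):
            if self.virtual_address in server.addresses:
                return server
        return None

    def fail_over(self):
        """Simulate a primary failure: move the virtual address to the takeover server."""
        self.primary.healthy = False
        self.primary.addresses.discard(self.virtual_address)
        self.takeover.addresses.add(self.virtual_address)


def client_request(pair):
    """A client only knows the virtual address, never a specific server."""
    server = pair.active_server()
    if server is None or not server.healthy:
        raise ConnectionError(f"no server answering on {pair.virtual_address}")
    return f"served by {server.name}"


# Assumed example values; the address is from a documentation-only range.
pair = FailoverPair(Server("node-a"), Server("node-b"), "192.0.2.10")
before = client_request(pair)   # answered by the original server
pair.fail_over()
after = client_request(pair)    # same address, now answered by the takeover server
```

In this model the client code is identical before and after the failure, which is the point of the virtual address: the takeover server does not have to assume the original server's identity.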