System Availability: Ensuring Up-Time and Resolving Outages


Availability consists of proactive methods of ensuring service up-time and resolving system outages. This functionality requires monitoring and planning for all components of the systems and creating a meaningful overview of success for future process refinement. Components of availability management include:

Data Management

Data Management supports the accessibility of data and protection of that data as a key corporate resource. It includes the functions of:

  1. Backup/Restore/Archiving
  2. Storage Management
  3. Database Management

Accessibility of the data is ensured through detection of fault conditions, avoidance of space problems, review of file system structures, monitoring of file system usage, and definition of disk and tape resources by sizing storage components.
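As a rough illustration of monitoring file system usage and avoiding space problems, the following Python sketch reports the percentage of space used on a path and flags it when a threshold is crossed. The 85 percent threshold and the names `percent_used` and `check_filesystem` are illustrative assumptions, not part of any particular monitoring product.

```python
import shutil
from dataclasses import dataclass


@dataclass
class UsageReport:
    path: str
    percent_used: float
    over_threshold: bool


def percent_used(total: int, used: int) -> float:
    """Return used space as a percentage of total (0.0 when total is 0)."""
    if total <= 0:
        return 0.0
    return 100.0 * used / total


def check_filesystem(path: str, threshold_pct: float = 85.0) -> UsageReport:
    """Measure the file system holding `path` and flag it if nearly full.

    The default threshold is an assumed value; choose one per site policy.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    pct = percent_used(usage.total, usage.used)
    return UsageReport(path, pct, pct >= threshold_pct)


if __name__ == "__main__":
    report = check_filesystem(".")
    status = "ALERT" if report.over_threshold else "ok"
    print(f"{report.path}: {report.percent_used:.1f}% used [{status}]")
```

Run on a schedule, a check of this kind could feed the fault detection described above, so that space problems surface before a file system actually fills.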

Backup/Restore/Archiving

Backup/restore and archiving procedures are central to the data management process. Backup procedures require that critical data be saved securely on a regular basis. Once saved, data must be readily available through defined restore procedures that allow the user to request retrieval of any critical data.
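A minimal sketch of a backup step paired with a defined restore step might look like the following. It assumes single files on local disk and uses a SHA-256 comparison to confirm that the saved copy matches the source; the function names `backup_file` and `restore_file` are hypothetical, and real backup products handle scheduling, media and catalogs that this sketch ignores.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def backup_file(source: Path, backup_dir: Path) -> Path:
    """Copy `source` into `backup_dir` and verify the copy (hypothetical helper)."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / source.name
    shutil.copy2(source, target)  # copies contents and file metadata
    if sha256_of(source) != sha256_of(target):
        raise IOError(f"backup verification failed for {source}")
    return target


def restore_file(backup: Path, destination: Path) -> Path:
    """Copy a saved file back to `destination` (hypothetical helper)."""
    destination.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(backup, destination)
    return destination
```

Verifying the copy at backup time is one way to gain confidence that a later restore request can actually be satisfied.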

Archiving may seem redundant with backup procedures; however, the primary goal of archiving is to store critical data efficiently for the long term. Whereas backup data may be readily available on disk, archived data may be saved on tape and stored off-site.

 

Storage Management

Availability of data relies on effective storage management. Storage management supports any form of storage media, including tape, disk and CD. Its function is to ensure that storage media is available when needed and utilized to its fullest. Continual review of stored data and resources is required to balance the necessary availability against cost constraints.

 

Database Management

Database Management includes all aspects of managing the database environment from initial product selections to daily operational monitoring. Effective database management has become critical as more data is being stored in a distributed fashion across the enterprise.

To control this distributed environment, both tools that support centralized monitoring and sound data architectures must be implemented. The tools provide detection of data errors, conflicts and potential resource issues. The data architecture must balance the need for readily accessible data with the goal of minimizing redundancy.

Network Management

Network Management consists of monitoring network events within the environment so that problems can be detected and resolved before they have a major impact on business processing. Successful management relies on the deployment of agents throughout the enterprise to monitor the health of various network elements, watching for conditions including:

  • Breaching of performance thresholds
  • File system utilization status
  • Application conditions
  • Login attempts
  • Problems detected with hardware or systems software.

Network availability monitoring tools, diagnostic tools and other network management devices should access a single operational repository. Integrated tools to diagnose and self-administer repairs are needed to reduce dependence on operator knowledge of specific products.

Network Planning/Architecture

Creating a robust network environment relies on proper planning and a well-defined network architecture. Today's network must support an increased number of systems and access points, and increased availability has become critical in today's distributed environment. Not only must the architecture support increased user demands, but systems management itself adds to the strain through sophisticated monitoring tools, remote backup/recovery, remote operations, and automated software distribution.

 

Network Operations

Operation of the network in today's environment has become extremely complex. To ensure successful management, monitoring tools that provide centralized, exception-based alerts are needed, as is automated error recovery to minimize user intervention and overall downtime.
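Exception-based alerting can be sketched as a filter that passes along only the readings that breach a threshold, so operators see problems rather than a stream of normal values. The metric names, threshold values and the `Reading` type below are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Reading:
    host: str
    metric: str
    value: float


# Assumed thresholds for illustration; real values depend on the site.
THRESHOLDS: Dict[str, float] = {
    "cpu_pct": 90.0,
    "fs_used_pct": 85.0,
    "failed_logins": 5.0,
}


def exceptions_only(
    readings: Iterable[Reading],
    thresholds: Dict[str, float] = THRESHOLDS,
) -> List[Reading]:
    """Return only readings at or above their threshold.

    Readings for metrics without a configured threshold are ignored.
    """
    return [
        r
        for r in readings
        if r.metric in thresholds and r.value >= thresholds[r.metric]
    ]
```

A central console would then display or forward only the returned readings, which is the sense in which the alerts are exception-based.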

 

Network Availability

As availability of the network has become a critical issue in today's environment, all points within the network must be monitored to detect any outages. The architecture should support overall availability by providing adequate bandwidth and needed bypasses in case of network failure.
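The value of bypasses can be illustrated with standard availability arithmetic. The sketch below assumes that components fail independently, which real networks only approximate: components in series multiply their availabilities, while redundant paths are unavailable only when every path is down. The function names and the 99 percent figures in the usage note are illustrative.

```python
from math import prod
from typing import Iterable


def series_availability(components: Iterable[float]) -> float:
    """Availability of a chain in which every component must be up."""
    return prod(components)


def parallel_availability(paths: Iterable[float]) -> float:
    """Availability of redundant paths in which any one path suffices."""
    return 1.0 - prod(1.0 - a for a in paths)
```

Under these assumptions, two 99 percent links in series give about 98.01 percent availability, while the same two links used as alternate paths give about 99.99 percent.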

 

Application Management

Application management within the environment has become more complicated as sophisticated services come on-line on a diverse range of platforms. Management of applications has also been influenced by the need to provide centralized management of this environment. The tools utilized for application management must support this new environment by providing seamless integration between the platforms in the enterprise.


The diversity of systems requires centralized management and continuous communication between the application development and operations groups. Job flows must be well defined and tested. Rerun/restart procedures must be implemented to ensure proper action in the event of an ABEND (abnormal end) or other job failure.
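One way to picture a rerun/restart procedure is a job flow that records which steps have completed, so that a restart after a failure resumes at the failed step instead of repeating earlier work. The sketch below is an assumption-laden illustration: `run_job_flow`, the in-memory `completed` set and the retry count are hypothetical, and a production scheduler would persist this state.

```python
from typing import Callable, Iterable, Set, Tuple


def run_job_flow(
    steps: Iterable[Tuple[str, Callable[[], None]]],
    completed: Set[str],
    max_attempts: int = 2,
) -> Set[str]:
    """Run named steps in order, skipping any already in `completed`.

    A failing step is rerun up to `max_attempts` times. If it still
    fails, the exception propagates and `completed` holds the restart point.
    """
    for name, action in steps:
        if name in completed:
            continue  # restart case: this step finished in an earlier run
        for attempt in range(1, max_attempts + 1):
            try:
                action()
            except Exception:
                if attempt == max_attempts:
                    raise  # leave the flow for operator attention
            else:
                completed.add(name)
                break
    return completed
```

Calling `run_job_flow` again with the same `completed` set after the fault is corrected restarts the flow at the step that failed.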

Capacity Planning

Capacity Planning provides a mechanism for proactively determining the system capacity required to successfully support an application given its initial utilization and the projected growth in usage. Two factors determine the capacity requirements of an existing application: the number of users and the function provided. Projected changes to either of these factors will require a review of the capacity requirements for the hardware and networks upon which the applications reside.

Capacity Management is the set of processes by which currently installed platforms are monitored for changes in capacity utilization, trending information is collected and analyzed, strategies to alleviate capacity bottlenecks are recommended, and pending capacity issues are resolved.

Capacity has several parameters that must be considered: file system size and structure, memory size and allocation, processor speed and features, network topology and available bandwidth, and the degree of contention with other applications for shared resources.
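The relationship between user growth and capacity can be sketched with simple arithmetic. The example below assumes compound annual growth in users and a fixed resource demand per user; the function names, the 80 percent utilization limit and the ten-year horizon are illustrative assumptions rather than recommended values.

```python
from typing import Optional


def projected_users(current_users: float, annual_growth: float, years: int) -> float:
    """Project the user count assuming compound annual growth."""
    return current_users * (1.0 + annual_growth) ** years


def projected_utilization(users: float, demand_per_user: float, capacity: float) -> float:
    """Fraction of capacity consumed by the given number of users."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return users * demand_per_user / capacity


def years_until_review(
    current_users: float,
    annual_growth: float,
    demand_per_user: float,
    capacity: float,
    limit: float = 0.8,   # assumed utilization ceiling
    horizon: int = 10,    # assumed planning horizon in years
) -> Optional[int]:
    """Return the first year in which utilization exceeds `limit`, if any."""
    for year in range(horizon + 1):
        users = projected_users(current_users, annual_growth, year)
        if projected_utilization(users, demand_per_user, capacity) > limit:
            return year
    return None
```

A change to either factor, the number of users or the demand each one places on the system, shifts the year in which a capacity review becomes necessary.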

Modeling

Modeling is the process of determining the capacity needs of a system by running simulations or by performing simple calculations. The key to modeling is to recreate the expected production load accurately and run it through a series of modeling algorithms. The results are used to determine whether proposed configurations will meet demand.

Unlike forecasting, modeling is primarily used to determine the needs of a new application, defining its impact on existing systems and any hardware it requires.
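As an example of the "simple calculations" end of modeling, the sketch below uses the textbook single-server (M/M/1) queue. This model is a choice made here for illustration, not one the text prescribes, and it assumes random (Poisson) arrivals and exponentially distributed service times. Utilization is the arrival rate multiplied by the mean service time, and mean response time is the service time divided by one minus utilization.

```python
def mm1_utilization(arrival_rate: float, service_time: float) -> float:
    """Utilization of a single server: requests per second times seconds per request."""
    return arrival_rate * service_time


def mm1_response_time(arrival_rate: float, service_time: float) -> float:
    """Mean response time (queueing plus service) for an M/M/1 queue."""
    utilization = mm1_utilization(arrival_rate, service_time)
    if utilization >= 1.0:
        raise ValueError("offered load meets or exceeds capacity")
    return service_time / (1.0 - utilization)
```

For instance, at 8 requests per second and 0.1 seconds of service per request, utilization is 0.8 and the modeled mean response time is about 0.5 seconds, five times the bare service time. This is why a proposed configuration that looks adequate on raw throughput may still fail to meet demand.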

 

Forecasting

Forecasting is the proactive process of projecting expected capacity requirements. In many companies, forecasts feed the annual capacity plan. The resulting capacity plan defines the mainframe, midrange and network requirements associated with the forecasted total. Projections are also made several years in advance to determine whether the technology in place will keep pace with requirements or whether newer technology is required.
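A forecast of this kind is often built from historical utilization. The sketch below fits a straight line to equally spaced past measurements by ordinary least squares and extends it forward. Linear growth is an assumption that may not hold for a given workload, and the function names and sample figures are illustrative only.

```python
from typing import Sequence, Tuple


def linear_trend(values: Sequence[float]) -> Tuple[float, float]:
    """Least-squares slope and intercept for values observed at periods 0, 1, 2, ..."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean_x = (n - 1) / 2.0
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def forecast(values: Sequence[float], periods_ahead: int) -> float:
    """Project the fitted trend `periods_ahead` periods past the last observation."""
    slope, intercept = linear_trend(values)
    return intercept + slope * (len(values) - 1 + periods_ahead)
```

With quarterly storage figures of 100, 110, 120 and 130 units, `forecast(values, 2)` projects 150 units two quarters after the last measurement.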

Both forecasting and modeling are made more difficult with the continued implementation of new and more numerous platforms. Additionally, proper sizing of the network will become more critical as applications and overall system management strategies become reliant on available bandwidth.

 

Acquisitions

Although acquisitions may seem out of place within capacity planning, acquiring hardware and software in a timely manner is critical to supporting overall capacity needs. The acquisitions process must define procedures from start to finish, with checkpoints to ensure that the necessary hardware and software are on-site to support the existing and future capacity needs of the enterprise.

 

Management Reporting


An effective systems management methodology provides management with the information necessary to make informed decisions. Key information is extracted from the systems environment and presented in a manner that is both value-added and exception-based.

There are four key areas where reporting is essential:

  • Metrics on overall performance and service levels
  • Security Violations
  • Problems
  • Changes

Metrics

Overall performance metrics are essential to successful management reporting. Metrics should be available for every systems management discipline, providing value-added information on trends, volumes, service levels and similar measures.
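A service level figure for such a report can be derived from recorded downtime. The sketch below computes measured availability for a reporting period and compares it with a target. The 99.9 percent target in the usage note and the function names are illustrative assumptions.

```python
def availability_pct(period_minutes: float, downtime_minutes: float) -> float:
    """Measured availability over a reporting period, as a percentage."""
    if period_minutes <= 0:
        raise ValueError("period must be positive")
    if not 0 <= downtime_minutes <= period_minutes:
        raise ValueError("downtime must lie within the period")
    return 100.0 * (period_minutes - downtime_minutes) / period_minutes


def meets_target(measured_pct: float, target_pct: float) -> bool:
    """True when measured availability is at or above the agreed target."""
    return measured_pct >= target_pct
```

Over a 30-day month (43,200 minutes), 60 minutes of downtime gives roughly 99.86 percent availability, which would miss an assumed 99.9 percent target and so appear on an exception-based report.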

 

Security Violations

The security manager, as well as managers responsible for secured data or restricted access areas, requires information regarding possible security violations. Although the overall security infrastructure should minimize such instances, it is vital that management be informed when security violations occur.

 

Problems

Effective problem management requires reporting of problems to each responsible area. Problem reporting makes management aware of any issues that may affect their area or that their area may be responsible for correcting.

 

Changes

As with problems, management must be made aware of changes occurring in their areas of interest and of changes for which their department is responsible.

 

Copyright © JJ Kuhl 2002