System Reliability: Procedures for Consistently High Performance

Reliability relates to the establishment, implementation and management of procedures which create a systems infrastructure that delivers acceptable and consistent levels of performance. The functional components of reliability management include the following disciplines:

Performance and Tuning
Workload Balancing
Security
Disaster Recovery and Business Resumption

Performance and Tuning

While performance planning ensures that any required system performance service level can be met, performance management and tuning ensure that an active system is properly monitored and that developing problem areas are identified and remedied before service levels degrade. Three specific areas must be managed to ensure that service levels are met:

Resource utilization
Performance monitoring
System optimization.
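
One way to identify a developing problem before a service level degrades is to project a utilization trend forward. The following is a minimal sketch, assuming equally spaced utilization samples (in percent) and a simple least-squares line. The function name `periods_until_threshold` and any threshold passed to it are illustrative assumptions, not something this document prescribes.

```python
from typing import Optional, Sequence


def periods_until_threshold(samples: Sequence[float],
                            threshold: float) -> Optional[float]:
    """Estimate how many sampling periods remain before a rising trend
    reaches `threshold`.

    Fits a least-squares line to equally spaced samples. Returns None when
    there are fewer than two samples or the trend is flat or falling, and
    0.0 when the latest sample already meets the threshold.
    """
    n = len(samples)
    if n < 2:
        return None
    if samples[-1] >= threshold:
        return 0.0

    # Least-squares slope and intercept with x = 0, 1, ..., n - 1.
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    slope = sxy / sxx
    if slope <= 0:
        return None

    intercept = mean_y - slope * mean_x
    crossing_x = (threshold - intercept) / slope
    # Express the crossing point relative to the most recent sample.
    return max(crossing_x - (n - 1), 0.0)
```

For example, weekly CPU readings of 50, 55, 60, 65 and 70 percent with an assumed limit of 85 percent give an estimate of three more weeks before the limit is reached. A straight-line fit is only a rough guide; real workloads are often seasonal or bursty.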

Resource Utilization

Resources are the components that make up a system architecture, such as networks, CPUs, memory, databases, applications, disks and tape drives. Although a successful capacity plan should identify the required resources, those resources should also be monitored consistently for over- and underutilization.
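
As a rough illustration of monitoring for over- and underutilization, the sketch below averages utilization samples per resource and labels each one. The 20% and 80% watermarks, the labels and the function name `classify_utilization` are assumptions chosen for the example; appropriate limits depend on the resource and the service level in question.

```python
from statistics import mean
from typing import Dict, List

# Assumed watermarks, in percent of capacity (illustrative only).
LOW_WATERMARK = 20.0
HIGH_WATERMARK = 80.0


def classify_utilization(samples: Dict[str, List[float]],
                         low: float = LOW_WATERMARK,
                         high: float = HIGH_WATERMARK) -> Dict[str, str]:
    """Label each resource as 'over', 'under', 'ok' or 'no data',
    based on the average of its utilization samples."""
    report: Dict[str, str] = {}
    for resource, values in samples.items():
        if not values:
            report[resource] = "no data"
            continue
        average = mean(values)
        if average > high:
            report[resource] = "over"
        elif average < low:
            report[resource] = "under"
        else:
            report[resource] = "ok"
    return report
```

A resource labelled "under" may be a candidate for consolidation, while one labelled "over" may need tuning or additional capacity.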

Performance Monitoring

Although resource utilization shows whether resources are available, additional monitoring is required to ensure that each resource is performing as expected. Performance monitoring relates to system performance as a whole. It includes:

Network monitoring
Hardware monitoring
Software monitoring
Database monitoring

System Optimization

Proactive measures must be taken to ensure that all systems perform at peak efficiency. System optimization takes overall system requirements into account and creates architectures that most efficiently support the environment's needs.

Workload Balancing

Workload balancing is the process of monitoring the performance of distributed processors and their peripheral devices and adjusting the load of programs running on those systems, with the ultimate goal of attaining peak performance. Achieving this goal depends on implementing each of the following:

Scalable applications
Scalable configurations
Job scheduling procedures
Automation tools.

Scalable Applications

Although the majority of systems management disciplines reside within Operations, Applications Development plays a vital role by creating scalable applications. With annual growth rates averaging 20-30%, applications must be built to manage increased workload without relying solely on increased hardware capacity to support growth.

Scalable Configurations

An effective capacity plan indicates the workloads expected on a system for years to come. Since it is ineffective to build an architecture sized for the ultimate workload up front, a scalable configuration that allows for growth is required when configuring networks, CPUs, memory, databases, applications, disks and tape drives.

Job Scheduling Procedures

Job scheduling is the process of executing batch programs at predefined times and dates, as determined by the application developers, business partners and technical support personnel. It is essential to workload balancing because a number of system maintenance routines, such as system backups, use significant resources. Successful job scheduling keeps conflicts between mission-critical applications, system maintenance routines and online users to a minimum. Job scheduling also supports workload balancing by allowing batch applications to be distributed across systems within the enterprise.

Automation Tools

Automation tools are applications that support unassisted systems management. These tools provide exception-based alerts and automated error recovery. They are of great importance in today's n-tier distributed environment, where the number of systems requiring management is continually increasing.
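
To make the 20-30% annual growth figure cited under Scalable Applications concrete, the sketch below compounds a workload year over year and counts how long it takes to exceed a given capacity. It assumes steady compound growth, which real workloads rarely follow exactly; the function names and the units are hypothetical.

```python
def projected_workload(current: float, annual_growth: float, years: int) -> float:
    """Workload after `years` of compound growth.

    `annual_growth` is a fraction, e.g. 0.25 for 25% per year.
    """
    return current * (1 + annual_growth) ** years


def years_to_exceed(current: float, capacity: float, annual_growth: float) -> int:
    """Number of whole years until the workload first exceeds `capacity`."""
    if annual_growth <= 0:
        raise ValueError("annual_growth must be positive")
    years = 0
    load = current
    while load <= capacity:
        load *= 1 + annual_growth
        years += 1
    return years
```

Under these assumptions, a system running at half of its installed capacity exceeds that capacity in four years at 20% growth and in three years at 30% growth, which is why both the application and the configuration need room to scale.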

Security

Securing corporate data is essential. This covers the physical security of system hardware and networks as well as logical access to applications and data. Seven definable areas, when implemented, provide the required security infrastructure:

System integrity - build systems that are inherently resistant to viruses and other intrusion methods, without the need for virus protection or other software programmed to detect particular problems.
Identification and authentication - provide an authentication server so that users can log on to a platform and have access credentials for all necessary elements of the enterprise returned to them.
Authorization - limit users' access to information and restrict their forms of access on the system to only what is appropriate to them.
Data integrity - treat data as a corporate asset and protect it through the file data integrity features available in operating systems or through supplemental security means.
Data privacy - ensure that information is accessible only as authorized and cannot be acquired by unauthorized personnel or via unauthorized means.
Ability to audit - record and report events occurring in the system that have security significance, in order to support accountability policies.
Administration and organization - systematically manage and control the security mechanisms and services available on the various network-attached platforms, so that effective and efficient information security is in place. Mechanisms are needed for information security reporting on applications. Quick-response exception reports must be available to security managers (reports should not be routed to the functional areas from which a penetration attempt originates).

As with the other areas of system management, ensuring the above is made more difficult by the number of systems requiring control. As a result, policies, methodologies and tools, and audit procedures must be in place before any system is implemented within the enterprise. These elements are defined below.

Policies

Policies are the guidelines that define each of the seven areas of the security infrastructure. They tell users and security administrators the following:

What is and is not available within the system's infrastructure
How access to these systems is granted
Who has access to the systems
When access is available.

The policies are enabled and enforced by appropriate methodologies, tools and audit procedures.
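
As one illustration of how a tool might enforce such a policy, the sketch below models the "what", "who" and "when" questions as a small data structure and a check function. It does not model how access is granted, and it assumes an access window that does not cross midnight. The class name, field names and role strings are all hypothetical.

```python
from dataclasses import dataclass
from datetime import time
from typing import FrozenSet


@dataclass(frozen=True)
class AccessPolicy:
    resource: str                  # what is available
    allowed_roles: FrozenSet[str]  # who has access
    start: time                    # when access opens (same day)
    end: time                      # when access closes (same day)


def is_access_allowed(policy: AccessPolicy, resource: str,
                      role: str, at: time) -> bool:
    """Return True only if the resource, role and time of day all
    satisfy the policy."""
    if resource != policy.resource:
        return False
    if role not in policy.allowed_roles:
        return False
    return policy.start <= at <= policy.end
```

Keeping the policy as data, separate from the check, is one way to let the same rule be both enforced by a tool and reviewed during an audit.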

Methodologies and Tools

Methodologies and tools relate to the mechanisms used to implement and maintain security. Methodologies refer to the support structure and implementation of the actual security policies. The tools are required to ensure effective and efficient enforcement of the policies by providing for automated and centralized security administration of the operations environment.

Audit Procedures

To ensure that policies are properly managed, audit procedures must be implemented. These procedures ensure that each of the seven areas that make up the security architecture is kept in place and changed as required. Audits are supported by the automated security tools that have been implemented, and are enforced by individuals and formal audit procedures.

Disaster Recovery/Business Resumption

Disaster recovery is the capability of recovering mission critical workloads in the event of a disastrous outage. Business resumption is the capability of recovering an application in the event of an extended unscheduled outage. Both disaster recovery and business resumption rely on similar methodologies, and are therefore described together.

To ensure successful disaster recovery and subsequent business resumption, the following must be in place:

Disaster Recovery/Business Resumption Plan
Facilities
Testing procedures
Compliance enforcement.

Disaster Recovery/Business Resumption Plan

The first step in establishing a disaster recovery/business resumption strategy is to document a detailed plan. The plan must incorporate all business needs, must be maintained and updated as business requirements change, and must be tested periodically to ensure that it can be executed successfully when needed.

Facilities

Once the plan is in place, facilities must be secured that allow critical systems to be restored as quickly as possible. The facilities must include adequate preventive measures as well as off-site storage for data, backup systems and physical security mechanisms.

Testing Procedures

Detailed testing must be implemented to ensure that all recovery and resumption procedures are effective and remain up to date. These testing procedures should simulate both local failures, such as a downed CPU, and more severe outages, such as those caused by a natural disaster.

Compliance Enforcement

Eliminating or minimizing the downtime that results from a small-scale outage or a more widespread disaster relies on strict compliance with the disaster recovery/business resumption plan. Enforcement, as with security, is the result of individual enforcement, automated management tools and periodic audits.
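
As a small example of how an automated tool could support periodic testing and compliance, the sketch below lists recovery procedures whose last test is missing or older than an allowed age. The 180-day default, the function name `overdue_tests` and the procedure names in the usage note are assumptions made for illustration; the plan itself should define the actual test interval.

```python
from datetime import date, timedelta
from typing import Dict, List, Optional


def overdue_tests(last_tested: Dict[str, Optional[date]],
                  today: date,
                  max_age_days: int = 180) -> List[str]:
    """Return, in alphabetical order, the procedures that have never been
    tested or were last tested more than `max_age_days` before `today`."""
    cutoff = today - timedelta(days=max_age_days)
    return sorted(
        name
        for name, tested in last_tested.items()
        if tested is None or tested < cutoff
    )
```

Given a record in which a hypothetical "site-wide outage" procedure was last tested over a year ago and an "off-site restore" procedure has never been tested, both would be reported, while a procedure tested within the interval would not.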

Copyright © JJ Kuhl 2002