System Reliability: Procedures for Consistently High Performance
Reliability relates to the establishment, implementation and management of procedures that create a systems infrastructure delivering acceptable and consistent levels of performance. The functional components of reliability management include the following disciplines:
- Performance and Tuning
- Workload Balancing
- Security
- Disaster Recovery and Business Resumption
Performance and Tuning
While performance planning ensures that any systems performance service level can be met, performance management and tuning ensure that an active system is properly monitored and that developing problem areas are identified and remedied before service levels degrade. Three specific areas must be managed to ensure that service levels are met:
- Resource utilization
- Performance monitoring
- System optimization
Resource Utilization
Resources are the components that make up a system architecture. They include such items as network, CPU, memory, database, applications, disk and tape drives. Although a successful capacity plan should identify the required resources, the resources should also be consistently monitored for over- and under-utilization.
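As a minimal sketch of what monitoring for over- and under-utilization can look like, the Python snippet below classifies utilization samples against two thresholds. The resource names, the sample values and the 20%/85% thresholds are illustrative assumptions, not figures from this article; real thresholds would come from the capacity plan.

```python
# Hypothetical thresholds (percent busy); tune these to the capacity plan.
UNDER_UTILIZED = 20.0
OVER_UTILIZED = 85.0


def classify_utilization(percent_used: float) -> str:
    """Label a utilization sample as 'under', 'normal' or 'over'."""
    if percent_used < UNDER_UTILIZED:
        return "under"
    if percent_used > OVER_UTILIZED:
        return "over"
    return "normal"


def utilization_report(samples: dict[str, float]) -> dict[str, str]:
    """Map each resource name to its utilization label."""
    return {name: classify_utilization(value) for name, value in samples.items()}


# Illustrative samples only; in practice these come from a monitoring agent.
samples = {"cpu": 92.5, "memory": 61.0, "disk": 12.3}
report = utilization_report(samples)
for name, label in report.items():
    print(f"{name}: {label}")
```

Both ends of the range matter: an over-utilized resource threatens service levels, while an under-utilized one may indicate capacity that could be reassigned.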
Performance Monitoring
Although resource utilization identifies whether resources are available, additional monitoring is required to ensure each resource is performing as expected. Performance monitoring relates to system performance as a whole. It includes:
- Network monitoring
- Hardware monitoring
- Software monitoring
- Database monitoring
System Optimization
Proactive measures must be taken to keep all systems performing at peak efficiency. System optimization takes into account overall system requirements and creates architectures that most efficiently support the environment's needs.
Workload Balancing
Workload management is the process of monitoring the performance of distributed processors and their peripheral devices and adjusting the load of programs running on those systems, with the ultimate goal of attaining peak performance. Achieving this goal depends on implementing each of the following:
- Scalable applications
- Scalable configurations
- Job scheduling procedures
- Automation tools
Scalable Applications
Although the majority of systems management disciplines reside within Operations, Applications Development plays a vital role by creating scalable applications. With annual growth rates of 20-30% on average, applications must be designed to handle increased workload without relying solely on added hardware capacity to support growth.
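To make the growth figure concrete, the sketch below compounds a workload at the 20% and 30% annual rates quoted above. It assumes the growth rate applies directly to workload; the starting figure of 1,000 transactions per hour and the three-year horizon are chosen purely for illustration.

```python
def projected_workload(current: float, annual_growth: float, years: int) -> float:
    """Compound `current` workload by `annual_growth` (0.20 = 20%) for `years`."""
    return current * (1.0 + annual_growth) ** years


# Assumed baseline: 1,000 transactions per hour today.
baseline = 1000.0
for rate in (0.20, 0.30):
    after_three_years = projected_workload(baseline, rate, 3)
    print(f"{rate:.0%} growth: {after_three_years:.0f} per hour after 3 years")
```

Under these assumptions the workload grows by roughly 73% to 120% within three years, which illustrates why added hardware alone is a hard way to keep pace.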
Scalable Configurations
An effective capacity plan will indicate the workloads expected on a system for years to come. Since it is ineffective to build, up front, an architecture that will handle the ultimate workload, a scalable configuration that allows for growth is required when configuring networks, CPUs, memory, databases, applications, disks and tape drives.
Job Scheduling Procedures
Job scheduling is the process of executing batch programs at predefined times and dates, as determined by the application developers, business partners and technical support personnel. This is essential to workload balancing, since a number of system maintenance routines, such as system backups, use significant resources. Successful job scheduling ensures that conflicts between mission-critical applications, system maintenance routines and online users are kept to a minimum. Job scheduling also supports workload balancing by allowing batch applications to be distributed across systems within the enterprise.
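One way to picture conflict avoidance is to check scheduled batch windows for overlap on the same system. The sketch below is an illustration under stated assumptions: the `Job` record, the job and system names and the times are all hypothetical, and real schedulers track far more (dependencies, calendars, priorities).

```python
from dataclasses import dataclass
from datetime import datetime
from itertools import combinations


@dataclass(frozen=True)
class Job:
    """Hypothetical batch job with a fixed execution window on one system."""
    name: str
    system: str
    start: datetime
    end: datetime


def find_conflicts(jobs: list[Job]) -> list[tuple[str, str]]:
    """Return pairs of job names whose windows overlap on the same system."""
    conflicts = []
    for a, b in combinations(jobs, 2):
        same_system = a.system == b.system
        overlaps = a.start < b.end and b.start < a.end
        if same_system and overlaps:
            conflicts.append((a.name, b.name))
    return conflicts


# Illustrative schedule; names and times are made up.
jobs = [
    Job("nightly_backup", "db01", datetime(2002, 1, 7, 1, 0), datetime(2002, 1, 7, 3, 0)),
    Job("billing_run", "db01", datetime(2002, 1, 7, 2, 30), datetime(2002, 1, 7, 4, 0)),
    Job("report_extract", "app02", datetime(2002, 1, 7, 2, 0), datetime(2002, 1, 7, 2, 45)),
]
conflicts = find_conflicts(jobs)
print(conflicts)
```

In this made-up schedule, moving `billing_run` to a later window or to another system would remove the conflict with the backup, which corresponds to distributing batch work across systems.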
Automation Tools
Automation tools are applications that support unassisted systems management. These tools provide exception-based alerts and automated error recovery. They take on great importance in today's n-tier distributed environment, where the number of systems requiring management is continually increasing.
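The idea of exception-based alerting with automated error recovery can be sketched as follows. The check and recovery functions here are hypothetical stand-ins supplied as lambdas; no particular automation product or API is implied.

```python
from typing import Callable


def run_check(
    name: str,
    check: Callable[[], bool],
    recover: Callable[[], bool],
    alerts: list[str],
) -> str:
    """Run one health check; stay silent on success, recover or alert on failure."""
    if check():
        return "ok"  # Exception-based: healthy systems generate no alert.
    if recover():
        return "recovered"  # Automated recovery worked; no operator needed.
    alerts.append(f"ALERT: {name} failed and automated recovery did not succeed")
    return "alerted"


# Hypothetical checks standing in for real probes and restart scripts.
alerts: list[str] = []
results = {
    "web01": run_check("web01", lambda: True, lambda: True, alerts),
    "queue01": run_check("queue01", lambda: False, lambda: True, alerts),
    "db01": run_check("db01", lambda: False, lambda: False, alerts),
}
print(results)
print(alerts)
```

Only the system that fails and cannot be recovered automatically reaches an operator, which is what keeps the approach workable as the number of managed systems grows.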
Security
Ensuring the security of corporate data, including the physical security of system hardware and networks as well as logical access to applications and data, is essential. There are seven definable areas which, when implemented, provide the required security infrastructure. These include:
- System integrity - build systems that are inherently resistant to viruses and other intrusion methods without the need for virus protection or other software programmed to detect particular problems.
- Identification and authentication - provide an authentication server so that users can log on to a platform and have access credentials to all necessary elements of the enterprise returned to them.
- Authorization - limit users' access to information and restrict their forms of access on the system to only what is appropriate to them.
- Data integrity - treat data as a corporate asset and protect it through file data integrity available in operating systems or through supplemental security means.
- Data privacy - assure that information is accessible only as authorized and that it cannot be acquired by unauthorized personnel or via unauthorized means.
- Ability to audit - record and report events occurring in the system that have a security significance, in order to support accountability policies.
- Administration and organization - the systematic management and control of the security mechanisms and services that are available on the various network-attached platforms in a manner that helps ensure that effective and efficient information security is in place. Mechanisms are needed for information security reporting on applications. Quick-response exception reports for security managers need to be available (no reporting should go to functional areas attempting penetration).
As with the other areas of system management, ensuring the above is made more difficult by the number of systems requiring control. As a result, policies, methodologies and tools, and audit procedures must be in place before any system is implemented within the enterprise. These required elements are defined below.
Policies
Policies refer to the guidelines which define each of the seven areas of the security infrastructure. The policies provide guidelines for users and security administrators regarding the following:
- What is and is not available within the system's infrastructure
- How access to these systems is granted
- Who has access to the systems
- When access is available
The policies are enabled and enforced by appropriate methodologies, tools and audit procedures.
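The what, how, who and when questions above can be captured in a simple policy record. The following sketch is illustrative only: the `AccessPolicy` fields, the resource and user names and the hours are assumptions, not part of any specific security product.

```python
from dataclasses import dataclass
from datetime import time


@dataclass(frozen=True)
class AccessPolicy:
    """Hypothetical policy: who may use a resource, and during which hours."""
    resource: str             # what is available
    granted_by: str           # how access is granted (e.g. an approving role)
    allowed_users: frozenset  # who has access
    opens: time               # when access begins
    closes: time              # when access ends


def is_access_allowed(policy: AccessPolicy, user: str, at: time) -> bool:
    """Allow access only for a listed user inside the policy's time window."""
    return user in policy.allowed_users and policy.opens <= at < policy.closes


# Illustrative policy; names and hours are made up.
payroll = AccessPolicy(
    resource="payroll_db",
    granted_by="security_administrator",
    allowed_users=frozenset({"alice", "bob"}),
    opens=time(8, 0),
    closes=time(18, 0),
)
print(is_access_allowed(payroll, "alice", time(9, 30)))
print(is_access_allowed(payroll, "alice", time(22, 0)))
print(is_access_allowed(payroll, "carol", time(9, 30)))
```

Writing the policy down as data in this way is one possible route to the automated, centralized enforcement that tools are expected to provide.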
Methodologies and Tools
Methodologies and tools are the mechanisms used to implement and maintain security. Methodologies refer to the support structure and implementation of the actual security policies. Tools are required to ensure effective and efficient enforcement of the policies by providing automated and centralized security administration of the operations environment.
Audit Procedures
To ensure policies are properly managed, audit procedures must be implemented. These audit procedures ensure that each of the seven areas that make up a security architecture is kept in place and changed as required. Audits are supported by the automated security tools that have been implemented, and are enforced by individuals and formal audit procedures.
Disaster Recovery/Business Resumption
Disaster recovery is the capability of recovering mission-critical workloads in the event of a disastrous outage. Business resumption is the capability of recovering an application in the event of an extended unscheduled outage. Both disaster recovery and business resumption rely on similar methodologies and are therefore described together.
To ensure successful disaster recovery and subsequent business resumption, the following must be in place:
- Disaster Recovery/Business Resumption Plan
- Facilities
- Testing procedures
- Compliance enforcement
Disaster Recovery/Business Resumption Plan
The first step in establishing a disaster recovery/business resumption strategy is to document a detailed plan. The resulting plan must incorporate all business needs, must be maintained and changed to support changing business requirements, and must be periodically tested to ensure it can be executed successfully if needed.
Facilities
Once the plan is in place, facilities must be secured that allow critical systems to be restored as quickly as possible. The facilities must include adequate preventative measures as well as off-site storage for data, back-up systems and physical security mechanisms.
Testing Procedures
Detailed tests must be implemented to ensure all recovery and resumption procedures are effective and remain up to date. These testing procedures should simulate both local failures, such as a downed CPU, and more severe outages, such as those caused by a natural disaster.
Compliance Enforcement
Eliminating or minimizing the downtime that may result from a small-scale outage or a more widespread disaster relies on strict compliance with the disaster recovery/business resumption plan. As with security, enforcement is achieved through a combination of individual enforcement, automated management tools and periodic audits.
Copyright © JJ Kuhl 2002