System Availability: Ensuring Up-Time and Resolving Outages
Availability management consists of proactive methods for ensuring
service up-time and resolving system outages. It requires monitoring
and planning for all components of the systems, and producing a
meaningful summary of results to guide future process refinement.
Components of availability management include:
Data Management
Data Management supports the accessibility of data and protection of
that data as a key corporate resource. It includes the functions of:
- Backup/Restore/Archiving
- Storage management
- Database management.
Accessibility of the data is ensured through detection of fault
conditions, avoidance of space problems, review of file system
structures, monitoring of file system usage, and definition of disk
and tape resources by sizing storage components.
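Monitoring file system usage to avoid space problems can be sketched
as follows. This is a minimal illustration, not a prescribed tool: the
function name `check_usage` and the 90 percent default threshold are
assumptions chosen for the example.

```python
import shutil
from typing import NamedTuple


class UsageReport(NamedTuple):
    path: str
    percent_used: float
    over_threshold: bool


def check_usage(path: str, threshold_percent: float = 90.0) -> UsageReport:
    """Report how full the file system holding `path` is.

    The threshold is a hypothetical default; a real site would set it
    per file system from its own sizing decisions.
    """
    usage = shutil.disk_usage(path)
    # Guard against a file system that reports zero total space.
    percent_used = (usage.used / usage.total * 100) if usage.total else 0.0
    return UsageReport(path, percent_used, percent_used >= threshold_percent)


if __name__ == "__main__":
    report = check_usage(".")
    status = "ALERT" if report.over_threshold else "ok"
    print(f"{report.path}: {report.percent_used:.1f}% used ({status})")
```

Run on a schedule, a check like this flags a space problem before the
file system fills rather than after an outage.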
Backup/Restore/Archiving
Backup/restore and archiving procedures are central to the data
management process. Backup procedures require that critical data be
securely saved on a regular schedule. Once saved, data must be
readily available through defined restore procedures that allow
users to request retrieval of any critical data.
Archiving may seem redundant with backup procedures; however, the
primary goal of archiving is to store critical data efficiently for
the long term. Whereas backup data may be readily available on disk,
archived data may be saved on tape and stored off-site.
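A backup is only useful if the saved copy can be restored intact. The
sketch below shows one way to save a file, verify the copy, and
restore it on request. The helper names (`sha256_of`, `backup_file`,
`restore_file`) and the use of a SHA-256 checksum are assumptions made
for illustration, not part of any specific backup product.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_file(source: Path, backup_dir: Path) -> Path:
    """Copy `source` into `backup_dir` and verify the copy by checksum."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / source.name
    shutil.copy2(source, target)  # copy2 also preserves timestamps
    if sha256_of(source) != sha256_of(target):
        raise OSError(f"backup of {source} failed verification")
    return target


def restore_file(backup: Path, destination: Path) -> Path:
    """Copy a saved file back to `destination`."""
    destination.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(backup, destination)
    return destination
```

An archive step could reuse the same verification before moving data
to slower, cheaper media.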
Storage Management
Availability of data relies on effective storage management. Storage
management supports any form of storage media, including tape, disk
and CD. Its function is to ensure that storage media are available
as needed and utilized to their fullest. Continual review of stored
data and resources is required to balance the required availability
against cost constraints.
Database Management
Database Management includes all aspects of managing the database
environment, from initial product selection to daily operational
monitoring. Effective database management has become critical as
more data is stored in a distributed fashion across the enterprise.
To control this distributed environment, tools that support
centralized monitoring must be implemented, along with sound data
architectures. The tools should detect data errors, conflicts and
potential resource issues. The data architecture must balance the
need for readily accessible data with the goal of minimized
redundancy.
Network Management
Network Management consists of monitoring network events within the
environment so problems can be detected and resolved before they have
a major impact on business processing. Successful management relies
on the deployment of agents throughout the enterprise to monitor the
health of various network elements, watching for conditions including:
- Breaching of performance thresholds
- File system utilization status
- Application conditions
- Login attempts
- Problems detected with hardware or systems software.
Network availability monitoring tools, diagnostic tools and other
network management devices should access a single operational
repository. Integrated tools that diagnose problems and carry out
repairs automatically are needed to reduce dependence on operator
knowledge of specific products.
Network Planning/Architecture
Creating a robust network environment relies on proper planning and
a well-defined network architecture. Today's network must support an
increased number of systems and access points. Additionally,
increased availability has become critical in today's distributed
environment. Not only must the architecture support increased user
demands, but systems management itself adds to the strain through
sophisticated monitoring tools, remote backup/recovery, remote
operations, and automated software distribution.
Network Operations
Operation of the network in today's environment has become extremely
complex. To ensure successful management, monitoring tools that
provide centralized, exception-based alerts are needed, as is
automated error recovery to minimize user intervention and overall
downtime.
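Exception-based alerting means the central console shows only the
readings that breach a threshold, not every reading collected. A
minimal sketch follows; the `Reading` record and its field names are
hypothetical and stand in for whatever the monitoring agents actually
report.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Reading:
    element: str      # hypothetical name of the monitored network element
    metric: str       # what was measured, e.g. link utilization
    value: float      # the measured value
    threshold: float  # the limit agreed for this metric


def exceptions_only(readings: Iterable[Reading]) -> List[Reading]:
    """Keep only the readings whose value exceeds the threshold."""
    return [r for r in readings if r.value > r.threshold]


if __name__ == "__main__":
    collected = [
        Reading("router-a", "link_utilization_pct", 42.0, 80.0),
        Reading("router-b", "link_utilization_pct", 93.5, 80.0),
    ]
    for alert in exceptions_only(collected):
        print(f"ALERT {alert.element}: {alert.metric}={alert.value}")
```

Filtering at the collection point keeps operators focused on the
elements that need attention.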
Network Availability
As availability of the network has become a critical issue in today's
environment, all points within the network must be monitored to
detect any outages. The architecture should support overall
availability by providing adequate bandwidth and the bypass routes
needed in case of network failure.
Application Management
Application management within the environment has become more
complicated as sophisticated services come on-line on a diverse range
of platforms. Management of applications has also been influenced by
the need to provide centralized management of this environment. The
tools utilized for application management must support this new
environment by providing seamless integration between the platforms
in the enterprise.
The diversity of systems requires centralized management and
continuous communication between the application development and
operations groups. Job flows must be well defined and tested.
Rerun/restart procedures must be implemented to ensure proper action
in the event of an ABEND (abnormal end) or job failure.
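One common shape for a rerun/restart procedure is to record which
steps of a job flow have completed, so that a restart resumes at the
failed step instead of repeating finished work. The sketch below
illustrates that idea only; `run_job_flow` and the in-memory
`completed` set are assumptions, and a real scheduler would persist
the checkpoint.

```python
from typing import Callable, List, Set, Tuple

# A step is a (name, action) pair; the names here are hypothetical.
Step = Tuple[str, Callable[[], None]]


def run_job_flow(steps: List[Step], completed: Set[str]) -> bool:
    """Run steps in order, skipping any already recorded as completed.

    Returns True when every step has finished. On a failure the flow
    stops, `completed` keeps the finished steps, and calling the
    function again restarts from the step that failed.
    """
    for name, action in steps:
        if name in completed:
            continue  # finished in an earlier run; do not rerun
        try:
            action()
        except Exception as error:
            print(f"step {name!r} failed: {error}")
            return False
        completed.add(name)
    return True
```

Passing the same `completed` set to a second call is the restart;
passing an empty set is a full rerun.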
Capacity Planning
Capacity Planning provides a mechanism for proactively determining
the system capacity required to successfully support an application
given its initial utilization and the projected growth in usage. Two
factors determine the capacity requirements of an existing
application: the number of users and the function provided. Projected
changes to either of these factors will require a review of the
capacity requirements for the hardware and networks upon which the
applications reside.
Capacity Management is the set of processes by which currently
installed platforms are monitored for changes in capacity
utilization, trending information is collected and analyzed,
strategies to alleviate capacity bottlenecks are recommended, and
pending capacity issues are resolved.
Capacity has several parameters that must be considered: file system
size and structure, memory size and allocation, processor speed and
features, network topology and available bandwidth, and the degree of
contention with other applications for shared resources.
Modeling
Modeling is the process of determining the capacity needs of a system
by running simulations or by performing simple calculations. The key
to modeling is to accurately recreate the expected production load
and run it through a series of modeling algorithms. The results are
used to determine whether proposed configurations will meet demand.
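As an example of the "simple calculations" end of modeling, the
utilization law from operational analysis states that utilization
equals throughput multiplied by service time per request. The sketch
below applies it to a proposed configuration. The function names, the
even spread of work across processors, and the 70 percent ceiling are
all assumptions for illustration, not recommended values.

```python
def projected_utilization(
    transactions_per_second: float,
    service_seconds_per_transaction: float,
    processors: int = 1,
) -> float:
    """Utilization law: throughput times service time, per processor.

    Assumes work is spread evenly across identical processors.
    """
    if processors < 1:
        raise ValueError("processors must be at least 1")
    busy = transactions_per_second * service_seconds_per_transaction
    return busy / processors


def meets_demand(utilization: float, ceiling: float = 0.70) -> bool:
    """Check a projected utilization against a hypothetical ceiling."""
    return utilization <= ceiling


if __name__ == "__main__":
    for cpus in (1, 2):
        u = projected_utilization(80.0, 0.01, processors=cpus)
        print(f"{cpus} processor(s): {u:.0%} busy, ok={meets_demand(u)}")
```

A calculation like this is a first screen; a simulation would be
needed to capture queueing delays and contention.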
Unlike forecasting, modeling is primarily used to determine the needs
of a new application, defining its impact on existing systems and any
required hardware.
Forecasting
Forecasting is the proactive process of projecting expected capacity
requirements. In many companies, forecasts feed the annual capacity
plan. The resultant capacity plan defines the mainframe, midrange and
network requirements associated with the forecasted total.
Projections are also made several years in advance to determine
whether the technology in place will keep pace with requirements or
whether newer technology is required.
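One simple way to project capacity requirements is to fit a straight
line to past utilization and extend it forward. The sketch below uses
an ordinary least-squares fit over equally spaced periods. The name
`linear_forecast` is hypothetical, and a linear trend is itself an
assumption: real growth may be seasonal or stepwise.

```python
from typing import Sequence


def linear_forecast(history: Sequence[float], periods_ahead: int) -> float:
    """Project a value `periods_ahead` periods past the last observation.

    `history` holds one measurement per period, oldest first.
    """
    n = len(history)
    if n < 2:
        raise ValueError("need at least two observations to fit a trend")
    mean_x = (n - 1) / 2  # mean of the period indexes 0..n-1
    mean_y = sum(history) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(history))
    variance = sum((x - mean_x) ** 2 for x in range(n))
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return intercept + slope * (n - 1 + periods_ahead)


if __name__ == "__main__":
    # Hypothetical quarterly disk usage in gigabytes.
    usage_gb = [410.0, 455.0, 490.0, 540.0]
    print(f"projected in 4 quarters: {linear_forecast(usage_gb, 4):.0f} GB")
```

The projected figure can then be compared with installed capacity to
decide whether the annual plan needs an upgrade.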
Both forecasting and modeling are made more difficult by the
continued implementation of new and more numerous platforms.
Additionally, proper sizing of the network will become more critical
as applications and overall system management strategies become
reliant on available bandwidth.
Acquisitions
Although acquisitions may seem out of place within capacity planning,
the need to acquire hardware and software in a timely manner to
support overall capacity needs is critical. The overall acquisitions
process must define procedures from start to finish, with checkpoints
to ensure that the required hardware and software are on-site to
support the existing and future capacity needs of the enterprise.
Management Reporting
An effective systems management methodology provides management with
the information necessary to make informed decisions. Key information
is extracted from the systems environment and presented in a manner
that is both value-added and exception-based.
There are four key areas where reporting is essential:
- Metrics on overall performance and service levels
- Security violations
- Problems
- Changes
Metrics
Overall performance metrics are essential to successful management
reporting. Metrics should be available for every systems management
discipline, providing value-added information regarding trends,
volumes, service level numbers, etc.
Security Violations
The security manager, as well as managers responsible for secured
data or restricted access areas, requires information regarding
possible security violations. Although the overall security
infrastructure should minimize such instances, it is vital to give
management visibility when security violations occur.
Problems
Effective problem management requires reporting of problems to each
responsible area. Problem reporting exposes management to any issues
that may affect their area or that their area may be responsible for
correcting.
Changes
As with problems, management must be made aware of changes occurring
in areas of interest or where their department is responsible for the
change.
Copyright © JJ Kuhl 2002