COMPUTER ORGANIZATION AND ARCHITECTURE DESIGNING FOR PERFORMANCE NINTH EDITION

Operating System Design Issues
Full exploitation of a cluster hardware configuration requires some enhancements
to a single-system operating system.

FAILURE MANAGEMENT

How failures are managed by a cluster depends on the
clustering method used (Table 17.2). In general, two approaches can be taken to
dealing with failures: highly available clusters and fault-tolerant clusters. A highly
available cluster offers a high probability that all resources will be in service. If a failure
occurs, such as a system going down or a disk volume being lost, then the queries in progress
are lost. Any lost query, if retried, will be serviced by a different computer in the
cluster. However, the cluster operating system makes no guarantee about the state of
partially executed transactions; these must be handled at the application level.
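The retry behavior of a highly available cluster can be illustrated with a minimal sketch. Everything named here (`Node`, `NodeDown`, `submit_with_retry`) is hypothetical and invented for illustration; it is not an interface of any real cluster operating system.

```python
class NodeDown(Exception):
    """Raised when a node cannot service a query (hypothetical failure signal)."""


class Node:
    """A stand-in for one computer in the cluster."""

    def __init__(self, name, up=True):
        self.name = name
        self.up = up

    def run(self, query):
        if not self.up:
            raise NodeDown(self.name)
        return f"{self.name} handled {query!r}"


def submit_with_retry(nodes, query):
    """Try each node in turn; a query lost on one node is retried on the next."""
    for node in nodes:
        try:
            return node.run(query)
        except NodeDown:
            continue  # the in-progress query is lost; retry on another node
    raise RuntimeError("no node in the cluster could service the query")


cluster = [Node("node-a", up=False), Node("node-b")]
result = submit_with_retry(cluster, "SELECT 1")
```

The sketch resubmits the whole query and knows nothing about work the failed node may have partially completed, which is why the application itself would have to make retried operations safe to repeat.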

A fault-tolerant cluster ensures that all resources are always available. This
is achieved by the use of redundant shared disks and mechanisms for backing out
uncommitted transactions and committing completed transactions.

The function of switching applications and data resources over from a failed
system to an alternative system in the cluster is referred to as failover. A related
function is the restoration of applications and data resources to the original system
once it has been fixed; this is referred to as failback. Failback can be automated, but
this is desirable only if the problem is truly fixed and unlikely to recur. If not,
automatic failback can cause subsequently failed resources to bounce back and forth
between computers, resulting in performance and recovery problems.
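As a rough sketch of failover and guarded failback, the following assumes a hypothetical `ResourceGroup` object and a made-up stability rule: failback is allowed only after the original system has passed a number of consecutive health probes. The threshold of three probes is an arbitrary illustrative value, not a recommendation from the text.

```python
class ResourceGroup:
    """An application plus its data, owned by one system at a time."""

    def __init__(self, name, home, standby):
        self.name = name
        self.home = home          # original system
        self.standby = standby    # alternative system in the cluster
        self.owner = home
        self.moves = 0            # how often the resources changed systems

    def failover(self):
        """Switch the resources from the failed home system to the standby."""
        if self.owner == self.home:
            self.owner = self.standby
            self.moves += 1

    def failback(self, healthy_checks, required_checks=3):
        """Restore to the home system only after it has passed
        `required_checks` consecutive probes (hypothetical stability rule)."""
        if self.owner == self.standby and healthy_checks >= required_checks:
            self.owner = self.home
            self.moves += 1
            return True
        return False


group = ResourceGroup("orders-db", home="node-a", standby="node-b")
group.failover()                            # node-a fails
early = group.failback(healthy_checks=1)    # too soon: stay on node-b
late = group.failback(healthy_checks=3)     # looks stable: restore to node-a
```

Requiring a run of healthy probes before failback is one simple way to keep resources from bouncing between computers when the original fault has not really been fixed.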

LOAD BALANCING

A cluster requires an effective capability for balancing the
load among available computers. This includes the requirement that the cluster
be incrementally scalable. When a new computer is added to the cluster, the
load-balancing facility should automatically include this computer in scheduling
applications. Middleware mechanisms need to recognize that services can appear
on different members of the cluster and may migrate from one member to another.
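One way to picture these requirements is a least-loaded scheduler that accepts new members at any time and tracks where each service currently lives. The `LoadBalancer` class and its method names are assumptions made for this sketch; real cluster middleware would use richer load metrics than a simple application count.

```python
class LoadBalancer:
    """Assigns each application to the member currently running the fewest."""

    def __init__(self):
        self.load = {}        # member name -> number of assigned applications
        self.placement = {}   # application name -> member name

    def add_member(self, name):
        """A newly added computer becomes eligible for scheduling at once."""
        self.load.setdefault(name, 0)

    def schedule(self, app):
        if not self.load:
            raise RuntimeError("cluster has no members")
        # Sorting the names first makes tie-breaking deterministic.
        member = min(sorted(self.load), key=self.load.get)
        self.load[member] += 1
        self.placement[app] = member
        return member

    def migrate(self, app, target):
        """Move a service; clients find it via `placement`, not a fixed host."""
        source = self.placement[app]
        self.load[source] -= 1
        self.load[target] += 1
        self.placement[app] = target


lb = LoadBalancer()
lb.add_member("node-a")
first = lb.schedule("app-1")     # only node-a exists
lb.add_member("node-b")          # the cluster grows incrementally
second = lb.schedule("app-2")    # node-b is now the least loaded
```

Keeping the `placement` table separate from the load counts reflects the point that services may appear on different members and migrate, so lookups should never assume a fixed host.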

PARALLELIZING COMPUTATION

In some cases, effective use of a cluster requires
executing software from a single application in parallel. [KAPP00] lists three general
approaches to the problem:
• Parallelizing compiler: A parallelizing compiler determines, at compile time,
which parts of an application can be executed in parallel. These are then split
off to be assigned to different computers in the cluster. Performance depends
on the nature of the problem and how well the compiler is designed. In general,
such compilers are difficult to develop.
• Parallelized application: In this approach, the programmer writes the application
from the outset to run on a cluster, and uses message passing to move data,
as required, between cluster nodes. This places a high burden on the programmer
but may be the best approach for exploiting clusters for some applications.
• Parametric computing: This approach can be used if the essence of the application
is an algorithm or program that must be executed a large number
of times, each time with a different set of starting conditions or parameters.
A good example is a simulation model, which will run a large number of different
scenarios and then develop statistical summaries of the results. For this
approach to be effective, parametric processing tools are needed to organize,
run, and manage the jobs in an effective manner.
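The parametric computing approach can be sketched on a single machine as follows. The `simulate` function is a made-up toy model, and a local thread pool merely stands in for the cluster nodes; an actual deployment would rely on a job-management tool to dispatch the runs to separate computers.

```python
import random
import statistics
from concurrent.futures import ThreadPoolExecutor


def simulate(params):
    """Hypothetical simulation: mean of `samples` draws from a seeded Gaussian."""
    seed, mean, samples = params
    rng = random.Random(seed)          # per-job seed keeps each run reproducible
    return statistics.fmean(rng.gauss(mean, 1.0) for _ in range(samples))


def run_sweep(scenarios, workers=4):
    """Run every scenario independently, then summarize the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(simulate, scenarios))   # order matches input
    return {
        "runs": len(results),
        "mean": statistics.fmean(results),
        "stdev": statistics.stdev(results),
    }


# Same program, different starting conditions for each job.
scenarios = [(seed, 10.0, 200) for seed in range(8)]
summary = run_sweep(scenarios)
```

Because each job depends only on its own parameters, the runs need no communication with one another, which is what makes this style comparatively easy to spread across a cluster.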