
CON5 is a successor to CON4, with improvements, and CON4, in turn, is
based on CON3.  In order to understand CON5, the motivation for writing
it, and what problems it solves, you must first know something about
CON3.


CON3
====

CON3 is a test program designed to stress disk controllers and their
device drivers, and test that the system functions correctly under
heavy load.  CON3 accomplishes this by running many concurrent
instances of a basic test which copies and compares directories, and
reports trouble if any of the copies fails or does not compare as
expected.

From the bottom up, the hierarchy of tests goes as follows:

The most basic building block of CON3 is CON3single.sh.  CON3single.sh
uses find | cpio -p to copy an entire directory tree from a given
source directory to a given target directory.  Then it runs diff -r to
compare the source and target.  Any discrepancy is an error and causes
a failure to be reported.

CON3double.sh drives CON3single.sh.  It runs CON3single.sh once, at
first, to get another copy to participate in comparisons.  Then it runs
CON3single.sh 4 times, concurrently, with different combinations of
source and target directories.

CON3concur.sh runs CON3double.sh for a given number of concurrent
instances.

Finally, CON3repeat is the overall controlling program.  It repeats
CON3double.sh a given number of times, sequentially.


CON4
====

See the CON4 README file.


Problems remaining with CON4
============================

 1) CON3 and CON4 do not autoconfigure the number of instances of CON4double.

 2) There are two important classes of failures, for the purposes
    of deciding how to proceed:

    a) There are failures due to running out of resources,
       such as disk space, virtual memory or allocation of child processes.
       These failures do not indicate anything wrong with an HBA
       or its device driver or, for that matter, anything else about
       the system.  They just mean that you ran out of a limited resource
       and need to run fewer instances.
    b) All other failures are likely to indicate a genuine problem
       somewhere in the hardware or in the implementation of Unix,
       whether in a device driver, in filesystem code, or in something
       unrelated that just happens to clobber a kernel buffer.

    CON3 and CON4 are unable to distinguish between these two classes
    of failures.


CON5
====

The main reason for writing CON5 is to solve problem #2.

CON5 can be given a number of concurrent instances of CON5double, just
as it is done in CON3 and CON4.  However, this number is optional, and
when it is given, it is used as an upper bound.

CON5 calculates a budget for the maximum number of instances that can
be supported by the available disk space on the filesystem of the
target directory.  It also calculates a budget for the number of
instances that can be supported by the size of the system's process
table, and by how many process slots are still available when CON5
starts.  To start, CON5 uses the smallest of the calculated budgets and
the given number of instances.
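
The starting count described above can be sketched in C as follows.
This is only an illustration, not the actual CON5 code: the function
names, the per-instance cost parameters, and the convention that a
requested count of 0 means "no number given" are all assumptions.  The
free space could be obtained with statvfs(); how the free process slots
are counted is system-specific and is left to the caller.

```c
#include <stdint.h>

/* How many instances a resource can support.
 * A cost of 0 is treated as "unknown" and yields a budget of 0. */
static uint64_t budget(uint64_t available, uint64_t cost_per_instance)
{
    return cost_per_instance ? available / cost_per_instance : 0;
}

/* Hypothetical helper: the smallest of the disk budget, the process
 * budget, and the requested count (requested == 0: no count given). */
uint64_t initial_instances(uint64_t free_bytes, uint64_t bytes_per_instance,
                           uint64_t free_proc_slots, uint64_t procs_per_instance,
                           uint64_t requested)
{
    uint64_t n = budget(free_bytes, bytes_per_instance);
    uint64_t p = budget(free_proc_slots, procs_per_instance);

    if (p < n)
        n = p;
    if (requested != 0 && requested < n)
        n = requested;          /* the given number is only an upper bound */
    return n;
}
```

For example, with room on disk for 10 instances and in the process
table for 5, the sketch starts 5 instances unless a smaller number was
requested.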

Once CON5 gets under way, it is capable of adjusting the number of
instances downward, in case there are problems of resource exhaustion.
CON5 does all of its own low-level management of fork() and exec()
system calls.

fork
----
If a fork() system call fails and the value of errno is EAGAIN, that
indicates that there are too many processes running at the moment, and
that you might succeed if you try again later.  CON5 uses a common
procedure to do a fork() system call and test for this condition.  If
it encounters this error, it backs off, sleeps and retries, and gives
up once a retry limit is reached.  If any CON5 process gives up on a
fork() system call, it exits with a special out-of-band status.  That
is, it exits with a status that is never produced by any of the Unix
commands that CON5 runs.


exec
----
If an exec() system call fails and the value of errno is EAGAIN, that
indicates that there is not sufficient virtual memory.  If any CON5
process gets this error, it exits with an out-of-band status.

CON5 propagates the out-of-band exit status values all the way up the
chain of parent processes, until CON5double receives them.
At that time CON5double does 3 things:
  1) destroys all files left behind by that instance
     and pretends that it never happened;
  2) permanently reduces the number of instances;
  3) records in the log the fact that the number of instances has changed.


CON5 IMPLEMENTATION
===================

It is not feasible to implement CON5 in any of the commonly available
shell programming languages.  Of the readily available languages, Perl
and C could do what is required to implement CON5.  C was chosen,
because not that many people in the QA department are familiar or
comfortable with Perl.

-- Guy Shaw

