Characteristics of an Operations Interface
Copyright (C) 1996-1997
JXH Consulting
The original email message about this told a long story of a complicated
process of debugging a software subsystem running in production, only to
arrive at the conclusion that the subsystem was operating perfectly, but
the operations staff (UNIX sysadmins) could not easily establish this.
Doing so entailed looking in log files and hand-decoding highly non-obvious
information using external documentation, including comments in source
code. From an operational standpoint, it was unacceptable that operations
staff be required to learn that much about such systems to be able to
support them.
Out of that we evolved the concept of an operations interface document,
having the following general characteristics. Some of them sound very
complicated, but simple systems can have very simple interfaces. There's a
sample document for Sendmail.
- 1. Uniform Terminology
Systems should be designed and documented, as
much as possible, to conform to a uniform set of terms that describe
features of the operations interface. Synonyms should be avoided.
For instance, "fault" specifically describes a condition of the
system that is not as it should be; it should not be called "error" or
"exception" or another synonym, as these may have other defined meanings.
Above all, all systems in the same operational administration should
use the same term for a given thing.
- 2. No Source
The operations interface must not require that operations staff have
access to, or read or understand, the source files for the product itself,
including shell scripts or other interpreted code. The product should be
a black box to the extent possible.
- 3. No Mods
Ops staff should not be required to make any permanent change
to any system. If it used to work, and it breaks, they'll put it back.
Anything beyond that is engineering or a software release, and must be
done through the proper release procedures.
- 4. Start/Stop
Ops must be able to stop, start, and restart the product
exclusively by executing scripts in /etc/init.d with only the "stop",
"start", and "restart" arguments, per System V standards. Every piece of
the product which can foreseeably need to be restarted separately should
have a separate script. For processes that fork permanent children,
and try (and sometimes fail) to reap them when the parent is stopped,
the stop script must find and kill them. It is also necessary, but not
sufficient, to document the identifying names of all relevant children,
in case they must be identified and killed manually.
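As an illustration of the start/stop/restart contract only, here is a minimal
sketch in Python rather than a real /etc/init.d shell script. The Service
class, the control() function, and the name "exampled" are hypothetical; a
real script would launch and kill actual processes where the comments say so.

```python
import sys

USAGE = "usage: {name} start|stop|restart"


class Service:
    """Hypothetical stand-in for one separately restartable piece of a product."""

    def __init__(self, name):
        self.name = name
        self.calls = []  # record of actions taken, so the sketch is observable

    def start(self):
        # A real script would launch the daemon here.
        self.calls.append("start")

    def stop(self):
        # A real script would stop the parent here, then find and kill any
        # permanent children the parent failed to reap.
        self.calls.append("stop")


def control(service, args):
    """Run one init-style action; return a shell exit status (0 = success)."""
    if len(args) != 1 or args[0] not in ("start", "stop", "restart"):
        print(USAGE.format(name=service.name), file=sys.stderr)
        return 2
    action = args[0]
    if action in ("stop", "restart"):
        service.stop()
    if action in ("start", "restart"):
        service.start()
    return 0
```

Calling control(Service("exampled"), ["restart"]) performs a stop followed by
a start; any other argument list prints the usage line and returns a non-zero
status, so ops staff never need more than the three standard arguments.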
- 5. Fault Status
Ops staff must be able, by a written procedure, to
determine whether the product has detected any kind of internal fault.
Steps to clear it, or to escalate based on the type of fault, should
be documented.
- 6. Self-Test
Ops staff must be able, by a written procedure, to trigger
a self-test of the product and see the result. The go/no-go verdict
should be as obvious as possible from the output; further
interpretation can be directed by written procedures. This should be a
"local", internal unit test, depending as little as possible on outside
influences.
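One way to make the go/no-go result obvious is to end the self-test output
with a single verdict line. The following Python sketch assumes a
hypothetical run_self_test() helper and check names invented for
illustration; the output format is an assumption, not a standard.

```python
def run_self_test(checks):
    """Run local checks given as (name, callable) pairs.

    Returns (verdict, lines), where verdict is "GO" or "NO-GO" and the
    last line repeats the verdict so it is the first thing ops staff see
    when reading from the bottom of the output.
    """
    lines = []
    failed = 0
    for name, check in checks:
        try:
            ok = bool(check())
            detail = ""
        except Exception as exc:
            # A check that raises counts as a failure, with the reason shown.
            ok = False
            detail = f" ({type(exc).__name__}: {exc})"
        if not ok:
            failed += 1
        lines.append(f"{'PASS' if ok else 'FAIL'}  {name}{detail}")
    verdict = "GO" if failed == 0 else "NO-GO"
    passed = len(checks) - failed
    lines.append(f"SELF-TEST: {verdict} ({passed}/{len(checks)} checks passed)")
    return verdict, lines
```

Each check should examine only local state (configuration parses, spool
directory is writable, and so on), in keeping with the "local" unit-test
character described above; anything that depends on other systems belongs
in the end-to-end test instead.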
- 7. Verify/Request
Ops staff must be able, by a written procedure, to
trigger an end-to-end system test, by making a real request or otherwise
proving that the entire system correctly performs its primary function.
This will typically happen at the outermost interface of the system
or subsystem.
- 8. Error Messages
Errors and faults must produce meaningful and helpful
messages, written to a standardized log (e.g. syslog). They must indicate
(1) the time, in GMT if possible; (2) the identity of the process or
subsystem making the report, with PID and argv[0] command name; (3)
the general nature of the operation that was happening when the error
occurred, e.g. "opening connection to subsystem Z"; (4) the system
error code as returned by UNIX, if any, via perror(); (5) the specific
arguments to a system call or other interface, e.g. full pathnames of
files; and most important, (6) a recommendation or clue about what to
_do_ about the condition. Messages can occupy more than one line or
log file entry, but should be concise yet complete. They may refer
to external documentation for more detailed recommendations about what
action to take, but must capture all necessary details. In short, error
messages are supposed to hand you _solutions_, not problems.
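The six elements above can be sketched as a single formatting function. This
is a minimal Python illustration, not a prescribed format: format_fault(),
its parameter names, the field layout, and the example path are all
assumptions made for the sketch.

```python
import errno
import os
import sys
import time


def format_fault(operation, call_args, advice, err=None, now=None, pid=None, prog=None):
    """Build one log message carrying the six elements listed above."""
    when = time.time() if now is None else now
    # (1) the time, in GMT
    stamp = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(when))
    # (2) who is reporting: command name and PID
    if pid is None:
        pid = os.getpid()
    if prog is None:
        prog = os.path.basename(sys.argv[0]) or "python"
    # (3) what was being attempted
    parts = [f"{stamp} {prog}[{pid}]: while {operation}"]
    # (4) the system error code, if any, with its standard text
    if err is not None:
        name = errno.errorcode.get(err, "unknown")
        parts.append(f"errno {err} ({name}): {os.strerror(err)}")
    # (5) the specific arguments, e.g. full pathnames
    parts.append("args: " + ", ".join(f"{k}={v!r}" for k, v in call_args.items()))
    # (6) what to do about it
    parts.append(f"action: {advice}")
    return "; ".join(parts)
```

For example, format_fault("opening spool file", {"path":
"/var/spool/example/q1"}, "check that the spool directory exists",
err=errno.ENOENT) yields one line naming the time, the process, the
operation, the errno, the pathname, and the suggested action. On UNIX the
resulting string could then be handed to syslog, for instance through
Python's standard syslog module.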
- 9. Trace/Debug
Ops staff must be able to enable trace output or debug
output for a given operation, or for all system operations, even if they
are not completely trained in how to interpret it. Ops staff should
be trained to capture the relevant output, and take the _first steps_
to interpret it, before calling the engineers. The principle is that
ops staff have brains, and can use them beyond the edge of the written
procedures, given _some_ information.
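One common way to let ops staff switch trace output on and off without
restarting a process is a signal handler that flips the log level. The
Python sketch below assumes a hypothetical logger name "exampled" and
assumes SIGUSR1 as the toggle; neither comes from any particular product.

```python
import logging
import signal

log = logging.getLogger("exampled")  # hypothetical subsystem name
log.setLevel(logging.INFO)


def toggle_trace(signum=None, frame=None):
    """Flip between INFO and DEBUG; return True if tracing is now on."""
    tracing = log.level != logging.DEBUG
    log.setLevel(logging.DEBUG if tracing else logging.INFO)
    log.info("trace output %s", "enabled" if tracing else "disabled")
    return tracing


def install_trace_toggle():
    """Register the toggle on SIGUSR1 (UNIX only; call from the main thread)."""
    if hasattr(signal, "SIGUSR1"):
        signal.signal(signal.SIGUSR1, toggle_trace)
        return True
    return False
```

With the handler installed, the written procedure can be as short as "send
SIGUSR1 to the process, reproduce the problem, capture the log, send SIGUSR1
again", which leaves the interpretation of the output as a separate step.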
- 10. Crash Dump
Ops staff must be able to cause the system or
subsystem to write a crash dump for later analysis by the engineers.
Written procedures must say how to cause it, where the dump is written,
how to preserve it for analysis, and how to take the _first steps_ to
analyze it (see Trace/Debug).
- 11. Command Line Interface
The system must have a command-line
interface. If that interface is anything more elaborate than the Bourne
shell, it must be documented. It must be usable from a single shell; no graphics.
If a system-specific interface exists, it should be designed to resemble
closely something with which the ops staff are already familiar, for
instance the shell or a Cisco router. It need not duplicate something
else; it obviously can't. But it shouldn't be utterly unlike anything
that's ever been seen before.
- 12. Safety
Any system-specific interface, if used for engineering
debugging as well as operations, must have a "safety catch" of some
kind, so that ops staff do not unintentionally enter commands that only
engineers should. For instance, Cisco routers have "enable". The safety
catch may or may not be protected by a password or other authentication,
provided the system as a whole is adequately protected from unauthorized use.
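A safety catch in the style of "enable" can be sketched as a small command
interpreter that refuses engineering commands until the catch is released.
In this Python sketch the OpsShell class and every command name are
hypothetical, and no authentication is shown.

```python
class OpsShell:
    """Toy command interpreter with an 'enable'-style safety catch."""

    OPS_COMMANDS = {"status", "selftest"}            # hypothetical ops commands
    ENG_COMMANDS = {"dump-core", "set-trace-mask"}   # hypothetical engineering commands

    def __init__(self):
        self.enabled = False  # the safety catch starts engaged

    def execute(self, line):
        """Interpret one command line and return the response text."""
        words = line.split()
        if not words:
            return ""
        cmd = words[0]
        if cmd == "enable":
            self.enabled = True
            return "engineering commands enabled"
        if cmd == "disable":
            self.enabled = False
            return "engineering commands disabled"
        if cmd in self.ENG_COMMANDS and not self.enabled:
            return f"refused: '{cmd}' is an engineering command; type 'enable' first"
        if cmd in self.OPS_COMMANDS or cmd in self.ENG_COMMANDS:
            return f"ok: {cmd}"
        return f"unknown command: {cmd}"
```

The point of the design is that the refusal is deliberate and explains
itself: an operator who types an engineering command by accident is told
what happened and how to proceed, rather than silently succeeding or
failing.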
- 13. Documentation/Training
Any system put into production must be
accompanied by written operations procedures, showing how all the above
requirements are met; and by an overall system document or diagram
showing how the parts interrelate _from a UNIX perspective_. That is,
what processes are created and why; what files and network connections
they open; and the general flow of information. Product groups must
prepare training materials, and do the initial training of ops staff, for
any new product. Product groups must keep the training materials
correct and complete with each product change released to production.