WATCHDOG(8) WATCHDOG(8)
NAME
watchdog - a software watchdog daemon
SYNOPSIS
watchdog [ -f | --force ] [ -c filename | --config-file filename ] [ -v
| --verbose ] [ -s | --sync ] [ -b | --softboot ] [ -q | --no-action ]
DESCRIPTION
Watchdog is a daemon that checks if your system is still working. If
programs in user space are no longer executed it will hard reset the
system.
The kernel provides /dev/watchdog, which when open must be written to
within a minute or the machine will reboot. Each write delays the
reboot time another minute. After a minute without a write the watchdog
hardware will cause the reset. In the case of the software watchdog the
ability to reboot will depend on the state of the machine and
interrupts.
Watchdog can be stopped without causing a reboot if the device
/dev/watchdog is closed correctly, unless of course your kernel is
compiled with the CONFIG_WATCHDOG_NOWAYOUT option enabled.
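The write-then-close behaviour described above can be sketched as
follows. This is a minimal illustration, not the daemon's code. The
function name keep_alive and its parameters are hypothetical, and the
'V' written before closing assumes a driver that implements the
"magic close" convention; drivers without it, or kernels built with
CONFIG_WATCHDOG_NOWAYOUT, behave differently.

```python
import os
import time


def keep_alive(device_path, writes, interval_s, sleep=time.sleep):
    """Write to the watchdog device `writes` times, then close it cleanly.

    Hypothetical helper: each write postpones the reboot, and the final
    'V' asks the driver to disarm the timer before the close.
    """
    fd = os.open(device_path, os.O_WRONLY)
    try:
        for _ in range(writes):
            os.write(fd, b"\0")   # any write delays the reboot
            sleep(interval_s)     # must stay well below one minute
        # Assumption: the driver honours "magic close" ('V' before close).
        os.write(fd, b"V")
    finally:
        os.close(fd)
```

On a real system device_path would be /dev/watchdog. Opening that
device arms the timer, so try the sketch against an ordinary file
first.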
TESTS
Watchdog itself does several additional tests to check the system sta-
tus:
Check whether the process table is full.
Check whether there is enough free memory available.
Check whether some given files are accessible.
Check whether some given files change in a given interval.
Check whether the average work load exceeds a predefined maximal value.
Check whether a file table overflow occurred.
Check whether a given process (specified by a pid file) is still
running.
Check whether some given IP addresses answer to a ping message.
Check whether some given network interfaces received some traffic.
Check the temperature (if available).
Execute a user defined binary to do arbitrary tests.
If any of these checks fails watchdog will cause a shutdown. Should any
of these tests except the user defined binary last longer than one
minute the machine will be rebooted, too.
OPTIONS
Available command line options are the following:
-v | --verbose
Set verbose mode. Only implemented if compiled with the SYSLOG
feature. This mode logs several pieces of information to LOG_DAEMON
with priority LOG_INFO. This is useful if you want to see exactly
what happened until watchdog rebooted the system. Currently it
logs the temperature (if available), the load average, the
change date of the files it checks and how often it went to
sleep.
-s | --sync
Try to sync the filesystem every time the process is awake. Be
aware that the system is rebooted if for any reason syncing
lasts longer than a minute.
-b | --softboot
Soft-boot the system if an error occurs during the main loop,
e.g. if the file given with option -n is not accessible via the
stat call. Note that this does not apply to the open calls to
/dev/watchdog and /proc/loadavg which are opened before the main
loop starts.
-f | --force
Force the usage of the interval given or the maximal load
average given in the config file.
-c <config file> | --config-file <config file>
Use <config file> as config file instead of the default
/etc/watchdog.conf.
-q | --no-action
Do not reboot or halt the machine. This is for testing purposes.
All checks are executed and the results are logged as usual, but
no action is taken. Also your hardware card or the kernel
software watchdog driver is not enabled. Note that temperature
checking is also disabled since this triggers the hardware
watchdog on some cards.
FUNCTION
Watchdog starts, puts itself into the background and then tries all
checks specified in its config file in turn. Between each two tests it
will trigger the kernel device. After finishing all tests watchdog goes
to sleep for some time. The kernel driver expects a write to the
watchdog device every minute. Otherwise the system will be rebooted. By
default watchdog will sleep for only 10 seconds so it triggers the
device early enough.
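The cycle described above (run each test, trigger the device between
tests, then sleep) can be sketched as below. This is only an
illustration of the control flow. The names run_cycle and main_loop,
the (name, callable) shape of a check and the cycles parameter are
assumptions made for the example, not part of watchdog.

```python
import time


def run_cycle(checks, trigger):
    """Run every check in turn and trigger the device after each one.

    `checks` is a list of (name, callable) pairs; a callable returns
    True when its test passed. Returns the name of the first failing
    check, or None if all passed.
    """
    for name, check in checks:
        if not check():
            return name
        trigger()
    return None


def main_loop(checks, trigger, interval_s=10, cycles=None, sleep=time.sleep):
    """Repeat run_cycle, sleeping interval_s seconds between cycles.

    `cycles` limits the number of iterations (None means forever) and
    exists only to keep the sketch testable.
    """
    done = 0
    while cycles is None or done < cycles:
        failed = run_cycle(checks, trigger)
        if failed is not None:
            return failed     # the caller decides on repair or reboot
        sleep(interval_s)
        done += 1
    return None
```

The default of 10 seconds mirrors the default sleep time given above,
which leaves a wide margin before the one minute deadline.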
Under high system load watchdog might be swapped out of memory and may
fail to make it back in in time. Under these circumstances the Linux
kernel will hard reset the machine. To avoid unnecessary reboots, set
the variable 'realtime' to yes in the config file watchdog.conf. This
adds real time support to watchdog: it will lock itself into memory and
there should be no problem even under the highest of loads.
You can also specify a maximal allowed load average. Once this load
average is reached the system is rebooted. You may specify maximal load
averages for 1 minute, 5 minutes or 15 minutes. By default this test
is disabled. Be careful not to set this parameter too low. To
set a value less than the predefined minimal value of 2, you have to
use the -f option.
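A load check along these lines might look like the following sketch.
The function names are hypothetical, a limit of None stands for a
disabled test, and what watchdog actually does with a limit below 2
when -f is missing is not stated here, so limit_allowed only reports
whether the value would be acceptable.

```python
MIN_MAX_LOAD = 2  # the predefined minimal value named in the text


def parse_loadavg(text):
    """Return the 1, 5 and 15 minute averages from /proc/loadavg content."""
    fields = text.split()
    if len(fields) < 3:
        raise ValueError("loadavg contains no (or not enough) data")
    return tuple(float(f) for f in fields[:3])


def limit_allowed(limit, force=False):
    """A limit below MIN_MAX_LOAD is only acceptable with force (-f)."""
    return force or limit >= MIN_MAX_LOAD


def load_exceeded(loads, limits):
    """Compare (1, 5, 15) minute loads with their limits.

    A limit of None disables that comparison, which is the default.
    """
    return any(lim is not None and load > lim
               for load, lim in zip(loads, limits))
```

For example, parse_loadavg("0.50 1.20 3.75 2/345 6789") yields the
three averages, and load_exceeded reports a failure as soon as one of
them is above its configured limit.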
You can also specify a minimal amount of virtual memory you want to
have available as free. As soon as more virtual memory is used, action
is taken by watchdog. Note, however, that watchdog does not distinguish
between different types of memory usage. It just checks for free
virtual memory.
If you have a watchdog card with a temperature sensor you can specify
the maximal allowed temperature. Once this temperature is reached the
system is halted. The default value is 120. There is no unit
conversion, so make sure you use the same unit as your hardware.
Watchdog will issue warnings once the temperature reaches 90%, 95% and
98% of this temperature.
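The thresholds above amount to simple arithmetic, sketched here with a
hypothetical function temperature_status. The returned strings are
made up for the example; only the 90%, 95% and 98% levels and the
default of 120 come from the text.

```python
WARN_PERCENTS = (90, 95, 98)  # warning levels named in the text


def temperature_status(reading, max_temp=120):
    """Classify a sensor reading against the configured maximum.

    Both values must use the same unit; no conversion is done.
    Integer arithmetic avoids rounding surprises at the thresholds.
    """
    if reading >= max_temp:
        return "halt"
    reached = [p for p in WARN_PERCENTS if reading * 100 >= p * max_temp]
    return "warn %d%%" % max(reached) if reached else "ok"
```

With the default maximum of 120 the warnings start at 108, 114 and
117.6 in whatever unit the hardware reports.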
When using file mode watchdog will try to stat the given files. Errors
returned by stat will not cause a reboot. For a reboot the stat call
has to last at least one minute. This may happen if the file is
located on an NFS mounted filesystem. If your system relies on an NFS
mounted filesystem you might try this option. However, in such a case
the sync option may not work if the NFS server is not answering.
If you give watchdog a pidfile it will read the pid from this file and
call kill(pid,0) to see whether the process still exists. If not, action
is taken by watchdog. So you can for instance restart the server from
your repair-binary.
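The kill(pid,0) probe can be sketched as below. The function name
process_alive is hypothetical, and treating "permission denied" as
"the process exists" is an assumption of this example rather than
something the text specifies.

```python
import os


def process_alive(pidfile_path):
    """Read a pid from pidfile_path and probe it with signal 0."""
    with open(pidfile_path) as f:
        pid = int(f.read().strip())
    try:
        os.kill(pid, 0)           # signal 0: error checking only, nothing sent
    except ProcessLookupError:
        return False              # no such process
    except PermissionError:
        return True               # exists, but owned by another user
    return True
```

Signal 0 never reaches the target process, so the probe is safe to
repeat on every cycle.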
Watchdog will try periodically to fork itself to see whether the
process table is full. This process will leave a zombie process until
watchdog wakes up again and catches it.
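A fork probe of this kind could look like the following sketch for
POSIX systems. The names try_fork and reap are hypothetical. The child
exits at once and remains a zombie until the parent collects it, which
matches the behaviour described above.

```python
import os


def try_fork():
    """Return the child's pid if fork() worked, or None if it failed."""
    try:
        pid = os.fork()
    except OSError:
        return None               # e.g. no free slot in the process table
    if pid == 0:
        os._exit(0)               # child: exit immediately
    return pid


def reap(pid):
    """Collect the zombie left by try_fork() and return its exit status."""
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status)
```

A caller would run try_fork() in one cycle and reap() the returned pid
after the next wake-up.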
In ping mode watchdog tries to ping the given addresses. These
addresses do not have to be a single machine. It is possible to ping
a broadcast address instead to see if at least one machine in a subnet
is still alive.
Do not use this broadcast ping unless your MIS person a) knows about it
and b) has given you explicit permission to use it!
Watchdog will send out three ping packets and wait up to <interval>
seconds for the reply, with <interval> being the time it goes to sleep
between two times triggering the watchdog device. Thus an unreachable
network will not cause a hard reset but a soft reboot.
You can also test passively for an unreachable network by just
monitoring a given interface for traffic. If no traffic arrives the
network is considered unreachable, causing a soft reboot or action from
the repair binary.
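One way to watch an interface passively is to compare its receive
counter between two samples. The sketch below assumes the counters are
read from /proc/net/dev, which the text above does not name; the
function names are hypothetical as well.

```python
def rx_bytes(proc_net_dev_text, iface):
    """Return the received-bytes counter for iface, or None if absent.

    Expects the content of /proc/net/dev, where each interface line
    looks like "eth0: <rx bytes> <rx packets> ...".
    """
    for line in proc_net_dev_text.splitlines():
        name, sep, counters = line.partition(":")
        if sep and name.strip() == iface:
            return int(counters.split()[0])
    return None


def traffic_arrived(previous, current):
    """True if the counter moved between two samples."""
    return (previous is not None and current is not None
            and current != previous)
```

An unchanged counter across a whole interval is what the text treats
as an unreachable network.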
Using an external check binary watchdog can run user defined
tests. This may last longer than the time slice defined for the kernel
device without a problem. However, note that in this case error
messages are generated into the syslog facility. If you have enabled
softboot on error the machine will be rebooted if the binary doesn't
exit in half the time watchdog sleeps between two tries triggering the
kernel device.
If you specify a repair binary it will be started instead of shutting
down the system. If this binary is not able to fix the problem watchdog
will still cause a reboot afterwards.
If eventually the machine is halted an email is sent to notify a human
that the machine is going down. Starting with version 4.4 watchdog will
also notify the human in charge if the machine is rebooted.
SOFT REBOOT
A soft reboot (i.e. controlled shutdown and reboot) is initiated for
every error that is found. Since there might be no more processes
available, watchdog does it all by itself. That means:
1) Kill all processes with SIGTERM.
2) After a short pause kill all remaining processes with SIGKILL.
3) Record a shutdown entry in wtmp.
4) Save the random seed from /dev/urandom. If the device is
non-existent or the filename to save to is empty this step is skipped.
5) Turn off accounting.
6) Turn off quota and swap.
7) Unmount all partitions except the root partition.
8) Remount the root partition read-only.
9) Shut down all network interfaces.
10) Finally reboot.
CHECK BINARY
If the return code of the check binary is not zero watchdog will assume
an error and reboot the system. Be careful with this if you are using
the real-time properties of watchdog since watchdog will wait for the
return of this binary before proceeding. A positive exit code is
interpreted as a system error code (see errno.h for details). Negative
values are special to watchdog:
-1 reboot the system. This is not exactly an error message but a
command to watchdog. If the return code is -1 watchdog will not try
to run a shutdown script instead.
-2 reset the system. This is not exactly an error message but a
command to watchdog. If the return code is -2 watchdog will simply
refuse to write the kernel device again.
-3 max load average exceeded.
-4 the temperature inside is too high.
-5 /proc/loadavg contains no (or not enough) data.
-6 Given file was not changed in the given interval.
-7 /proc/meminfo contains invalid data.
-8 free for personal use
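The code table above can be turned into a small lookup, sketched here
with a hypothetical function describe_code. The descriptions are
copied from the list. How a negative value travels through a process
exit status is not covered by this sketch.

```python
import os

# Special (negative) codes as listed in the text.
SPECIAL_CODES = {
    -1: "reboot the system",
    -2: "reset the system",
    -3: "max load average exceeded",
    -4: "the temperature inside is too high",
    -5: "/proc/loadavg contains no (or not enough) data",
    -6: "given file was not changed in the given interval",
    -7: "/proc/meminfo contains invalid data",
    -8: "free for personal use",
}


def describe_code(code):
    """Translate a check binary return code into a readable message."""
    if code == 0:
        return "ok"
    if code > 0:
        # Positive values are system error codes (see errno.h).
        return "system error: " + os.strerror(code)
    return SPECIAL_CODES.get(code, "unknown watchdog code")
```

Such a lookup is handy when logging why a check binary failed before
any repair or reboot action is taken.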
REPAIR BINARY
The repair binary is started with one parameter: the error
number that caused watchdog to initiate the boot process. After
trying to repair the system the binary should exit with 0 if the
system was successfully repaired and thus there is no need to
boot anymore. A return value not equal to 0 tells watchdog to
reboot. The return code of the repair binary should be the error
number of the error causing watchdog to reboot. Be careful with
this if you are using the real-time properties of watchdog since
watchdog will wait for the return of this binary before
proceeding.
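The contract of a repair binary (one argument in, 0 or the original
error number out) can be sketched as follows. The function repair and
the handlers mapping are hypothetical; a real repair program would
read the error number from its first command line argument and exit
with the value returned here.

```python
def repair(error_number, handlers):
    """Try to fix the problem identified by error_number.

    `handlers` maps an error number to a callable that returns True
    when the repair worked. Returns 0 on success, otherwise the
    original error number so that watchdog goes on to reboot.
    """
    handler = handlers.get(error_number)
    if handler is not None and handler():
        return 0
    return error_number
```

Keep handlers short: watchdog waits for the repair binary to return
before it proceeds.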
BUGS
None known so far.
AUTHORS
The original code is an example written by Alan Cox
<alan@lxorguk.ukuu.org.uk>, the author of the kernel driver. All addi-
tions were written by Michael Meskes <meskes@debian.org>. Johnie Ingram
<johnie@netgod.net> had the idea of testing the load average. He also
took over the Debian specific work. Dave Cinege <dcinege@psychosis.com>
brought up some hardware watchdog issues and helped testing this stuff.
FILES
/dev/watchdog The watchdog device
/var/run/watchdog.pid The PID of the running watchdog
SEE ALSO
watchdog.conf(5)
4th Berkeley Distribution February 1996 WATCHDOG(8)