WATCHDOG(8)                                                        WATCHDOG(8)



NAME
       watchdog - a software watchdog daemon

SYNOPSIS
       watchdog [ -f | --force ] [ -c filename | --config-file filename ] [ -v
       | --verbose ] [ -s | --sync ] [ -b | --softboot ] [ -q | --no-action ]

DESCRIPTION
       Watchdog is a daemon that checks whether your system is still working.
       If programs in user space are no longer executed, it will hard reset
       the system.

       The kernel provides /dev/watchdog, which, once opened, must be written
       to within a minute or the machine will reboot. Each write delays the
       reboot time by another minute. If a minute passes without a write, the
       watchdog hardware will cause the reset. In the case of the software
       watchdog, the ability to reboot will depend on the state of the
       machine and its interrupts.

       Watchdog can be stopped without causing a reboot if the device
       /dev/watchdog is closed correctly, unless of course your kernel is
       compiled with the CONFIG_WATCHDOG_NOWAYOUT option enabled.
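
       The write-then-close behaviour described above can be sketched as
       follows. This is only an illustration, not the daemon's code. The
       function names, the 10 second pause and the number of rounds are
       assumptions made for the example. Writing the character 'V' just
       before closing is the "magic close" convention of the Linux watchdog
       driver interface; whether a given driver honours it is
       driver-dependent.

```python
import os
import time


def pet(fd: int) -> None:
    """Write one byte; per the text above, each write delays the reboot."""
    os.write(fd, b"\0")


def close_cleanly(fd: int) -> None:
    """Send the 'V' magic character, then close the device."""
    os.write(fd, b"V")
    os.close(fd)


def run(path: str = "/dev/watchdog", rounds: int = 3, pause: float = 10.0) -> None:
    """Open the device, write `rounds` times, then close it cleanly.

    `rounds` and `pause` are illustrative values, not daemon settings.
    """
    fd = os.open(path, os.O_WRONLY)
    try:
        for _ in range(rounds):
            pet(fd)
            time.sleep(pause)
    finally:
        close_cleanly(fd)
```

       Calling run() against the real device arms the watchdog and normally
       requires root, so try it on an ordinary file first, for example
       run("/tmp/fake-watchdog", rounds=2, pause=0) on a file that already
       exists.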


TESTS
       Watchdog itself does several additional tests to check the system  sta-
       tus:

       Check whether the process table is full.

       Check whether there is enough free memory available.

       Check whether some given files are accessible.

       Check whether some given files change in a given interval.

       Check whether the average work load exceeds a predefined maximal value.

       Check whether a file table overflow occurred.

       Check whether a given process (specified by a pid file) is still
       running.

       Check whether some given IP addresses answer to a ping message.

       Check whether some given network interfaces received some traffic.

       Check the temperature (if available).

       Execute a user defined binary to do arbitrary tests.

       If any of these checks fails, watchdog will cause a shutdown. Should
       any of these tests, except the user-defined binary, last longer than
       one minute, the machine will be rebooted, too.


OPTIONS
       Available command line options are the following:

       -v | --verbose
              Set verbose mode. Only implemented if compiled with the SYSLOG
              feature. This mode logs several pieces of information to
              LOG_DAEMON with priority LOG_INFO. This is useful if you want
              to see exactly what happened before watchdog rebooted the
              system. Currently it logs the temperature (if available), the
              load average, the change date of the files it checks and how
              often it went to sleep.

       -s | --sync
              Try to sync the filesystem every time the process is awake. Be
              aware that the system is rebooted if for any reason syncing
              lasts longer than a minute.

       -b | --softboot
              Soft-boot the system if an error occurs during the main loop,
              e.g. if the file given with option -n is not accessible via the
              stat call. Note that this does not apply to the open calls to
              /dev/watchdog and /proc/loadavg, which are opened before the
              main loop starts.

       -f | --force
              Force the usage of the interval given or the maximal load
              average given in the config file.

       -c <config file> | --config-file <config file>
              Use <config file> as config file instead of the default
              /etc/watchdog.conf.

       -q | --no-action
              Do not reboot or halt the machine. This is for testing purposes.
              All checks are executed and the results are logged as usual, but
              no action is taken. Also your hardware card or the kernel
              software watchdog driver is not enabled. Note that temperature
              checking is also disabled since this triggers the hardware
              watchdog on some cards.


FUNCTION
       Watchdog starts, puts itself into the background and then tries all
       checks specified in its config file in turn. Between each two tests it
       will trigger the kernel device. After finishing all tests watchdog
       goes to sleep for some time. The kernel driver expects a write to the
       watchdog device every minute. Otherwise the system will be rebooted.
       By default watchdog will sleep for only 10 seconds, so it triggers the
       device early enough.

       Under high system load watchdog might be swapped out of memory and may
       fail to make it back in in time. Under these circumstances the Linux
       kernel will hard reset the machine. To avoid unnecessary reboots, make
       sure you have the variable 'realtime' set to yes in the config file
       watchdog.conf. It adds real-time support to watchdog. Thus it will
       lock itself into memory and there should be no problem even under the
       highest of loads.

       You can also specify a maximal allowed load average. Once this load
       average is reached the system is rebooted. You may specify maximal load
       averages for 1 minute, 5 minutes or 15 minutes. The default is to
       disable this test. Be careful not to set this parameter too low. To
       set a value less than the predefined minimal value of 2, you have to
       use the -f option.
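
       The load test and the -f rule above can be sketched as follows. This
       is a sketch under stated assumptions, not the daemon's code: the
       function names are hypothetical, a limit of None stands for "test
       disabled", and "reached" is read as greater than or equal.

```python
MIN_MAX_LOAD = 2  # predefined minimal value named in the text above


def validate_limit(limit, force=False):
    """Reject limits below the minimum unless forced (the -f option)."""
    if limit is not None and limit < MIN_MAX_LOAD and not force:
        raise ValueError(f"max load {limit} is below {MIN_MAX_LOAD}; use -f")
    return limit


def load_exceeded(load, limits):
    """Return True if any average has reached its limit.

    `load` and `limits` are (1 min, 5 min, 15 min) triples; a limit of
    None disables that test (an assumption of this sketch).
    """
    return any(
        limit is not None and value >= limit
        for value, limit in zip(load, limits)
    )
```

       On Linux the current averages could be supplied with os.getloadavg(),
       for example load_exceeded(os.getloadavg(), (24, None, None)); the
       value 24 is only an example.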

       You can also specify a minimal amount of virtual memory you want to
       have available as free. As soon as more virtual memory is used, action
       is taken by watchdog. Note, however, that watchdog does not distinguish
       between different types of memory usage. It just checks for free
       virtual memory.

       If you have a watchdog card with a temperature sensor you can specify
       the maximal allowed temperature. Once this temperature is reached the
       system is halted. The default value is 120. There is no unit
       conversion, so make sure you use the same unit as your hardware.
       Watchdog will issue warnings once the temperature reaches 90%, 95%
       and 98% of this temperature.
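
       The warning levels above work out as follows for the default maximum
       of 120: warnings at 108, 114 and 117.6, and a halt at 120. A minimal
       sketch of that arithmetic, with a hypothetical function name and
       return strings chosen only for illustration:

```python
WARN_PERCENTAGES = (90, 95, 98)  # warning levels named in the text above


def temperature_status(reading, maximum=120):
    """Classify a sensor reading against the maximal allowed temperature.

    Both values are in whatever unit the hardware reports; there is no
    unit conversion. Comparisons are scaled by 100 to avoid rounding.
    """
    if reading >= maximum:
        return "halt"
    reached = [p for p in WARN_PERCENTAGES if reading * 100 >= p * maximum]
    if reached:
        return f"warn {max(reached)}%"
    return "ok"
```

       For example, temperature_status(114) reports the 95% warning and
       temperature_status(120) reports a halt.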

       When using file mode watchdog will try to stat the given files. Errors
       returned by stat will not cause a reboot. For a reboot the stat call
       has to last at least one minute. This may happen if the file is
       located on an NFS mounted filesystem. If your system relies on an NFS
       mounted filesystem you might try this option. However, in such a case
       the sync option may not work if the NFS server is not answering.

       If you give watchdog a pidfile it will read the pid from this file and
       call kill(pid, 0) to see whether the process still exists. If not,
       action is taken by watchdog. So you can for instance restart the
       server from your repair binary.
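
       The kill(pid, 0) probe mentioned above sends no signal; it only asks
       the kernel whether the process exists. A minimal sketch of the same
       check, with a hypothetical function name. Treating "permission
       denied" as "still running" is a choice made for this example, not
       something the text above specifies.

```python
import os


def process_alive(pidfile: str) -> bool:
    """Read a pid from `pidfile` and probe it with signal 0."""
    with open(pidfile) as fh:
        pid = int(fh.read().strip())
    try:
        os.kill(pid, 0)  # signal 0: existence check only, nothing is sent
    except ProcessLookupError:
        return False  # no such process
    except PermissionError:
        return True  # process exists but belongs to another user
    return True
```

       For example, process_alive("/var/run/watchdog.pid") would report
       whether the daemon recorded in that file is still running.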

       Watchdog will try periodically to fork itself to see whether the
       process table is full. This process will leave a zombie process until
       watchdog wakes up again and catches it.

       In ping mode watchdog tries to ping the given addresses. These
       addresses do not have to be a single machine. It is possible to ping
       a broadcast address instead to see if at least one machine in a subnet
       is still alive.

       Do not use this broadcast ping unless your MIS person a) knows about it
       and b) has given you explicit permission to use it!

       Watchdog will send out three ping packets and wait up to <interval>
       seconds for the reply, with <interval> being the time it goes to sleep
       between two times triggering the watchdog device. Thus an unreachable
       network will not cause a hard reset but a soft reboot.

       You can also test passively for an unreachable network by just
       monitoring a given interface for traffic. If no traffic arrives the
       network is considered unreachable, causing a soft reboot or action
       from the repair binary.
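
       One way to picture the passive test above is to compare an
       interface's received-byte counter between two samples. The sketch
       below assumes the counters come from the text of /proc/net/dev, where
       the first number after the interface name is the received byte
       count. The text above does not say which source watchdog reads, and
       the function names are hypothetical.

```python
def rx_bytes(proc_net_dev_text: str, iface: str) -> int:
    """Return the received-byte counter for `iface`."""
    for line in proc_net_dev_text.splitlines():
        name, sep, rest = line.partition(":")
        # Header lines contain no colon and are skipped.
        if sep and name.strip() == iface:
            return int(rest.split()[0])
    raise KeyError(iface)


def traffic_arrived(before: str, after: str, iface: str) -> bool:
    """True if the counter changed between the two samples."""
    return rx_bytes(after, iface) != rx_bytes(before, iface)
```

       The two samples would be taken one sleep interval apart, for example
       by reading /proc/net/dev before and after the pause.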

       By using an external check binary watchdog can run user-defined
       tests. These may last longer than the time slice defined for the kernel
       device without a problem. However, note that in this case error
       messages are generated into the syslog facility. If you have enabled
       softboot on error the machine will be rebooted if the binary doesn't
       exit in half the time watchdog sleeps between two tries triggering the
       kernel device.

       If you specify a repair binary it will be started instead of shutting
       down the system. If this binary is not able to fix the problem watchdog
       will still cause a reboot afterwards.

       If eventually the machine is halted an email is sent to notify a human
       that the machine is going down. Starting with version 4.4 watchdog will
       also notify the human in charge if the machine is rebooted.


SOFT REBOOT
       A soft reboot (i.e. controlled shutdown and reboot) is initiated for
       every error that is found. Since there might be no more processes
       available, watchdog does it all by itself. That means:

       1) Kill all processes with SIGTERM.

       2) After a short pause kill all remaining processes with SIGKILL.

       3) Record a shutdown entry in wtmp.

       4) Save the random seed from /dev/urandom. If the device is
          non-existent or the filename to save to is empty this step is
          skipped.

       5) Turn off accounting.

       6) Turn off quota and swap.

       7) Unmount all partitions except the root partition.

       8) Remount the root partition read-only.

       9) Shut down all network interfaces.

       10) Finally reboot.


CHECK BINARY
       If the return code of the check binary is not zero watchdog will assume
       an error and reboot the system. Be careful with this if you are using
       the real-time properties of watchdog since watchdog will wait for the
       return of this binary before proceeding. A positive exit code is
       interpreted as a system error code (see errno.h for details). Negative
       values are special to watchdog:

       -1 Reboot the system. This is not exactly an error message but a
          command to watchdog. If the return code is -1 watchdog will not
          try to run a shutdown script instead.

       -2 Reset the system. This is not exactly an error message but a
          command to watchdog. If the return code is -2 watchdog will simply
          refuse to write the kernel device again.

       -3 Max load average exceeded.

       -4 The temperature inside is too high.

       -5 /proc/loadavg contains no (or not enough) data.

       -6 Given file was not changed in the given interval.

       -7 /proc/meminfo contains invalid data.

       -8 free for personal use
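
       The codes listed above can be collected in a small lookup table, as
       sketched below. The names are hypothetical. One point to keep in
       mind: on POSIX systems a parent process only sees the low 8 bits of
       a child's exit status, so a binary that exits with -1 is observed as
       255. How this version of watchdog maps such values back to the
       negative codes is not stated here.

```python
import os

# Special return codes as listed in the text above.
SPECIAL_CODES = {
    -1: "reboot the system",
    -2: "reset the system",
    -3: "max load average exceeded",
    -4: "the temperature inside is too high",
    -5: "/proc/loadavg contains no (or not enough) data",
    -6: "given file was not changed in the given interval",
    -7: "/proc/meminfo contains invalid data",
    -8: "free for personal use",
}


def describe(code: int) -> str:
    """Explain a check binary return code."""
    if code == 0:
        return "ok"
    if code in SPECIAL_CODES:
        return SPECIAL_CODES[code]
    if code > 0:
        return os.strerror(code)  # positive codes are errno values
    return "unknown"


def as_exit_status(code: int) -> int:
    """Exit status the parent observes: only the low 8 bits survive."""
    return code & 0xFF
```

       For example, describe(-3) names the load average condition, and
       as_exit_status(-1) is 255.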



REPAIR BINARY
       The repair binary is started with one parameter: the error number
       that caused watchdog to initiate the boot process. After trying to
       repair the system the binary should exit with 0 if the system was
       successfully repaired and thus there is no need to boot anymore. A
       return value not equal to 0 tells watchdog to reboot. The return code
       of the repair binary should be the error number of the error causing
       watchdog to reboot. Be careful with this if you are using the
       real-time properties of watchdog since watchdog will wait for the
       return of this binary before proceeding.
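
       The contract described above (one argument in, 0 on success, the
       error number otherwise) can be sketched as follows. The try_repair
       function is a hypothetical placeholder for site-specific repair
       logic, and returning 1 for a malformed invocation is an assumption
       of this example.

```python
import sys


def try_repair(error_number: int) -> bool:
    """Hypothetical placeholder: attempt a fix, report whether it worked."""
    return False


def main(argv) -> int:
    """Return the exit code a repair binary would hand back to watchdog."""
    if len(argv) != 2:
        return 1  # assumption: any non-zero value still requests a reboot
    try:
        error_number = int(argv[1])
    except ValueError:
        return 1
    if try_repair(error_number):
        return 0  # repaired, no reboot needed
    # Hand the original error number back; keep it non-zero within 8 bits.
    return (error_number & 0xFF) or 1


# A real script would end with: sys.exit(main(sys.argv))
```

       Since watchdog waits for the repair binary to return, whatever
       try_repair does should finish quickly.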

BUGS
       None known so far.


AUTHORS
       The original code is an example written by Alan Cox
       <alan@lxorguk.ukuu.org.uk>, the author of the kernel driver. All
       additions were written by Michael Meskes <meskes@debian.org>. Johnie
       Ingram <johnie@netgod.net> had the idea of testing the load average.
       He also took over the Debian specific work. Dave Cinege
       <dcinege@psychosis.com> brought up some hardware watchdog issues and
       helped test this stuff.


FILES
       /dev/watchdog  The watchdog device
       /var/run/watchdog.pid The PID of the running watchdog

SEE ALSO
       watchdog.conf(5)



4th Berkeley Distribution        February 1996                     WATCHDOG(8)
