Hi,
This is a proposal for DaemonWatchdog improvements, based on the bug
http://tracker.ceph.com/issues/11314. I am sending it to ceph-devel to
get suggestions.
Current Functionality
---------------------
DaemonWatchdog watches the Ceph daemons for failures. If an extended
failure is detected (i.e. one that is not intentional), the watchdog
unmounts the file systems and sends SIGTERM to all daemons. The duration
of an extended failure is configurable with watchdog_daemon_timeout: the
number of seconds (default value: 300) a daemon is allowed to remain
failed before the watchdog barks (unmounting the mounts and killing all
the daemons).
DaemonWatchdog was originally written to watch the mds (and mon)
daemons for failures. It unmounts the mounted filesystems and kills the
mds (and mon) daemons.
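To make the timeout behaviour concrete, here is a minimal sketch of how
a failure could be tracked against watchdog_daemon_timeout. It is not
the code in mds_thrash.py: the FailureTracker class, its update()
method and the injectable clock are hypothetical names used only for
illustration.

```python
import time

WATCHDOG_DAEMON_TIMEOUT = 300  # seconds; the default value quoted above


class FailureTracker:
    """Hypothetical helper: remembers when each daemon was first seen failed."""

    def __init__(self, timeout=WATCHDOG_DAEMON_TIMEOUT, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock          # injectable so the logic can be tested
        self.first_failed = {}      # daemon name -> time first observed failed

    def update(self, name, is_failed):
        """Record one observation; return True if the watchdog should bark."""
        now = self.clock()
        if not is_failed:
            # The daemon is healthy again, so forget any earlier failure.
            self.first_failed.pop(name, None)
            return False
        # Keep the earliest failure time across consecutive failed observations.
        started = self.first_failed.setdefault(name, now)
        return now - started > self.timeout
```

The watch loop would call update() once per daemon on every pass and
bark as soon as any call returns True.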
Proposed Improvement
--------------------
As per John's suggestion here:
http://tracker.ceph.com/issues/11314#note-1, it would be better to
extend this functionality to watch the other daemons too (osd, mon, rgw
and mgr) and to take the necessary action or log (bark) when those
daemons crash. We need to make those improvements in the watch() and
bark() functions, so that if a daemon crashes unexpectedly, we detect
it immediately rather than waiting a long time for a timeout of some
kind. The bark() function should have different cases to handle
different daemons crashing. The procedure to be executed for the ‘mds’
case is already present in the bark() function, but we need to decide
the procedures for the ‘osd’, ‘mon’, ‘rgw’ and ‘mgr’ cases. I think
killing the daemons and throwing/logging errors, or maybe just throwing
an error, would be sufficient.
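One possible shape for the per-daemon cases in bark() is a lookup table
keyed by daemon type. The sketch below only plans the actions and
returns them as strings, so it does not touch any real daemons. The
names BARK_PLANS and plan_bark() are hypothetical, and the entries for
osd, mon, rgw and mgr are an assumption (kill and log), since those
procedures are still to be decided.

```python
# Hypothetical mapping from daemon type to the actions bark() would take.
BARK_PLANS = {
    # Existing behaviour described above for the mds case.
    "mds": ["unmount_filesystems", "sigterm_all_daemons"],
    # Assumed placeholders: the real procedures are still undecided.
    "osd": ["sigterm_all_daemons", "log_error"],
    "mon": ["sigterm_all_daemons", "log_error"],
    "rgw": ["sigterm_all_daemons", "log_error"],
    "mgr": ["sigterm_all_daemons", "log_error"],
}


def plan_bark(daemon_type):
    """Return the list of actions to run when a daemon of this type crashes."""
    try:
        # Copy the list so callers cannot modify the shared table.
        return list(BARK_PLANS[daemon_type])
    except KeyError:
        raise ValueError(
            "no bark procedure defined for daemon type %r" % (daemon_type,)
        )
```

Keeping the cases in one table would let us change the procedure for a
single daemon type without touching the others.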
* At present the class DaemonWatchdog is defined in mds_thrash.py, as it
is specific to watching mds daemons. It would be better to move it out
of mds_thrash.py into a new, generic file, qa/tasks/daemonwatchdog.py.
* The current code appears to watch the 'client' as well
(https://github.com/ceph/ceph/blob/master/qa/tasks/mds_thrash.py#L87). I
have dropped this statement, as it is difficult to watch what the client
is doing in general.
* There is a suggestion to add the DaemonWatchdog to ceph.py and have it
always run whenever Ceph is "started".
Thanks,
Jos Collin