Proposal – DaemonWatchdog

Jos Collin <jcollin@xxxxxxxxxx> · Wed, 15 May 2019 11:09:57 +0530

Hi,

This is a proposal for DaemonWatchdog improvements based on the bug
http://tracker.ceph.com/issues/11314. I am sending it to ceph-devel for
suggestions.

Current Functionality
---------------------
DaemonWatchdog watches the Ceph daemons for failures. If an extended
failure is detected (i.e. one that is not intentional), the watchdog
unmounts the file systems and sends SIGTERM to all daemons. The duration
of an extended failure is configurable with watchdog_daemon_timeout. The
watchdog_daemon_timeout (default value: 300) is the number of seconds a
daemon is allowed to remain failed before the watchdog barks (unmounting
the mounts and killing all the daemons).
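
As a rough illustration of the timeout behaviour described above, here
is a minimal sketch. It is not the code in mds_thrash.py: FailureTracker,
observe() and the injectable clock are hypothetical names used only for
this example. Only the meaning of watchdog_daemon_timeout (seconds a
daemon may stay failed, default 300) comes from the current behaviour.

```python
import time


class FailureTracker:
    """Hypothetical helper: tracks how long each daemon has been failed."""

    def __init__(self, timeout=300, clock=time.monotonic):
        # 'timeout' plays the role of watchdog_daemon_timeout (seconds).
        self.timeout = timeout
        # The clock is injectable so the logic can be tested without sleeping.
        self.clock = clock
        # Maps a daemon name to the time its failure was first observed.
        self.failed_since = {}

    def observe(self, name, is_failed):
        """Record one observation; return True if the watchdog should bark."""
        now = self.clock()
        if not is_failed:
            # The daemon is healthy again, so forget any earlier failure.
            self.failed_since.pop(name, None)
            return False
        # Remember only the first time this failure was seen.
        first_seen = self.failed_since.setdefault(name, now)
        return now - first_seen > self.timeout
```

A watch loop would call observe() for every daemon on each pass and bark
as soon as any call returns True.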

DaemonWatchdog was originally written to watch the mds (and mon)
daemons for failures. It unmounts the mounted file systems and kills the
mds (and mon) daemons.

Proposed Improvement
--------------------
As per John's suggestion here:
http://tracker.ceph.com/issues/11314#note-1, it would be better to
extend this functionality to watch the other daemons too (osd, mon, rgw
and mgr) and take the necessary action or log (bark) when one of those
daemons crashes. We need to make these improvements in the watch() and
bark() functions, so that if a daemon crashes unexpectedly, we detect
it immediately rather than waiting a long time for a timeout of some
kind. The bark() function should have a separate case for each type of
daemon that can crash. The procedure for the 'mds' case is already
present in the bark() function, but we need to decide the procedures
for the 'osd', 'mon', 'rgw' and 'mgr' cases. I think killing the
daemons and throwing/logging errors, or maybe just throwing an error,
would be sufficient.
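
To make the "different cases in bark()" idea concrete, here is a rough
sketch of a per-daemon-type dispatch. Everything in it is an assumption
for discussion: the handler names, the BARK_HANDLERS table and the
returned action lists are hypothetical, and the actions chosen for osd,
mon, rgw and mgr are placeholders, since those procedures are still to
be decided. The handlers only return the actions they would take; they
do not unmount or kill anything.

```python
import logging

log = logging.getLogger("daemonwatchdog_sketch")


def bark_mds(name):
    # Stand-in for the existing mds procedure: unmount, then kill daemons.
    return ["unmount", "kill"]


def bark_log_and_kill(name):
    # Placeholder option: log the error and kill the daemons.
    return ["log", "kill"]


def bark_log_only(name):
    # Placeholder option: only report the error.
    return ["log"]


# Hypothetical mapping from daemon type to bark procedure.
BARK_HANDLERS = {
    "mds": bark_mds,
    "osd": bark_log_and_kill,
    "mon": bark_log_and_kill,
    "rgw": bark_log_only,
    "mgr": bark_log_only,
}


def bark(daemon_type, name):
    """Pick the procedure for the crashed daemon's type and run it."""
    handler = BARK_HANDLERS.get(daemon_type, bark_log_only)
    log.error("daemon %s (%s) failed, running %s",
              name, daemon_type, handler.__name__)
    return handler(name)
```

With a table like this, settling the procedure for a daemon type is a
one-line change, and unknown types fall back to logging only.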

* At present the DaemonWatchdog class lives in mds_thrash.py, as it
is specific to watching mds daemons. To make it generic, it would be
better to move it out of mds_thrash.py into a new file,
qa/tasks/daemonwatchdog.py.

* The current code appears to watch the 'client' as well
(https://github.com/ceph/ceph/blob/master/qa/tasks/mds_thrash.py#L87). I
have dropped this statement, as it is difficult to watch what the client
is doing in general.

* There is a suggestion to add the DaemonWatchdog to ceph.py and have it 
always run whenever Ceph is "started".

Thanks,
Jos Collin