Proposal – DaemonWatchdog

Hi,

This is a proposal for DaemonWatchdog improvements based on the bug http://tracker.ceph.com/issues/11314. I am sending it to ceph-devel for suggestions.

Current Functionality
---------------------
DaemonWatchdog watches the Ceph daemons for failures. If an extended failure is detected (i.e. one that is not intentional), the watchdog unmounts the file systems and sends SIGTERM to all daemons. The duration of an extended failure is configurable with watchdog_daemon_timeout (default value: 300), which is the number of seconds a daemon is allowed to remain failed before the watchdog barks (unmounting the mounts and killing all the daemons).
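The timeout rule described above can be sketched roughly as follows. This is only an illustration, not the code in mds_thrash.py: the FailureTracker class, its update() method and the injectable clock are hypothetical names, and the sketch does not model how intentional failures are excluded.

```python
import time


class FailureTracker:
    """Hypothetical helper: tracks how long each daemon has been continuously failed."""

    def __init__(self, timeout=300, clock=time.monotonic):
        self.timeout = timeout      # plays the role of watchdog_daemon_timeout (seconds)
        self.clock = clock          # injectable so the logic can be exercised without waiting
        self.failed_since = {}      # daemon name -> time the failure was first observed

    def update(self, name, is_failed):
        """Record one observation; return True if the watchdog should bark."""
        now = self.clock()
        if not is_failed:
            # The daemon recovered, so forget the earlier failure.
            self.failed_since.pop(name, None)
            return False
        # Remember only the first time this daemon was seen failed.
        first = self.failed_since.setdefault(name, now)
        return now - first > self.timeout
```

A watch loop would call update() for every daemon on each pass and bark as soon as any call returns True.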

DaemonWatchdog was originally written to watch the mds (and mon) daemons for failures. It unmounts the mounted file systems and kills the mds (and mon) daemons.

Proposed Improvement
--------------------
As per John's suggestion here: http://tracker.ceph.com/issues/11314#note-1, it would be better to extend this functionality to watch the other daemons too (osd, mon, rgw and mgr) and to take the necessary action or log the failure (bark) when one of those daemons crashes. This means changing the watch() and bark() functions so that an unexpected daemon crash is detected immediately, rather than only after waiting a long time for a timeout of some kind. The bark() function should have separate cases for the different daemon types. The procedure for the ‘mds’ case is already present in bark(), but we need to decide the procedures for the ‘osd’, ‘mon’, ‘rgw’ and ‘mgr’ cases. I think killing the daemons and throwing/logging errors, or maybe just throwing an error, would be sufficient.
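One possible shape for the per-daemon cases in bark() is a small dispatch table. The sketch below is hypothetical: the handler names, the return values and the log-only default for the undecided daemon types are assumptions made for illustration, and the 'mds' handler only stands in for the existing unmount-and-SIGTERM procedure.

```python
import logging

log = logging.getLogger(__name__)


def bark_mds(name):
    # Placeholder for the existing procedure: unmount the file systems,
    # then send SIGTERM to all daemons.
    log.error("mds daemon %s failed: unmounting and terminating daemons", name)
    return "unmount_and_kill"


def bark_log_only(name):
    # Assumed default for types whose procedure is still undecided.
    log.error("daemon %s failed", name)
    return "log_only"


# Hypothetical table mapping a daemon type to its handler.
BARK_HANDLERS = {"mds": bark_mds}


def bark(daemon_type, name):
    """Run the handler for this daemon type, falling back to logging only."""
    handler = BARK_HANDLERS.get(daemon_type, bark_log_only)
    return handler(name)
```

With this layout, deciding the procedure for 'osd', 'mon', 'rgw' or 'mgr' would only mean adding an entry to the table.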

* At present the DaemonWatchdog class lives in mds_thrash.py, as it is specific to watching mds daemons. To make it generic, it would be better to move it out of mds_thrash.py into a new file, qa/tasks/daemonwatchdog.py.

* The current code appears to watch the 'client' as well (https://github.com/ceph/ceph/blob/master/qa/tasks/mds_thrash.py#L87). I have dropped this statement, as it is difficult to watch what the client is doing in general.

* There is a suggestion to add the DaemonWatchdog to ceph.py and have it always run whenever Ceph is "started".

Thanks,
Jos Collin


