Re: Proposal – DaemonWatchdog

Gregory Farnum <gfarnum@xxxxxxxxxx> · Wed, 15 May 2019 15:47:48 -0700



I never really worked out why the FS tests were susceptible to these
issues but the core RADOS thrashing tasks weren't. If you're planning
to expand the DaemonWatchdog you probably want to work that out. Maybe
just because the ceph_manager thrashing does a lot of restarting
proactively? Or do the MDS thrash tasks just not pay attention to the
daemon state until the end of the test and the rados thrashing watches
more carefully in the normal course of doing business?
-Greg

On Tue, May 14, 2019 at 10:40 PM Jos Collin <jcollin@xxxxxxxxxx> wrote:
>
> Hi,
>
> This is a proposal for DaemonWatchdog improvements based on the bug
> http://tracker.ceph.com/issues/11314. I am sending it to ceph-devel to
> get suggestions.
>
> Current Functionality
> ---------------------
> DaemonWatchdog watches the Ceph daemons for failures. If an extended
> failure is detected (i.e. not intentional), then the watchdog unmounts
> file systems and sends SIGTERM to all daemons. The duration of an
> extended failure is configurable with watchdog_daemon_timeout. The
> watchdog_daemon_timeout (default value: 300) is the number of seconds a
> daemon is allowed to be failed before the watchdog barks (unmounting the
> mounts and killing all the daemons).
>
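The timeout rule described above can be sketched roughly as follows. This is a minimal illustration, not the code in mds_thrash.py: the class name WatchdogSketch, the observe() method, and the injectable clock are all hypothetical, and the strict "greater than" comparison is an assumption.

```python
import time


class WatchdogSketch:
    """Hypothetical stand-in for the watchdog's timeout bookkeeping."""

    def __init__(self, timeout=300, clock=time.monotonic):
        self.timeout = timeout    # seconds; mirrors watchdog_daemon_timeout
        self.clock = clock        # injectable so the logic can be tested
        self.failed_since = {}    # daemon name -> time it was first seen failed

    def observe(self, name, running):
        """Record one poll of a daemon; return True if the watchdog should bark."""
        now = self.clock()
        if running:
            # A daemon that came back is no longer in an extended failure.
            self.failed_since.pop(name, None)
            return False
        first_failed = self.failed_since.setdefault(name, now)
        # Assumption: bark only once the failure has lasted longer than the timeout.
        return now - first_failed > self.timeout
```

A daemon that is restarted within the timeout (an intentional thrash, say) clears its entry, so only a failure that persists past watchdog_daemon_timeout triggers a bark.
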
> DaemonWatchdog was originally written for watching the mds (and mon)
> daemons for failures. It unmounts the mounted filesystems and kills the
> mds (and mon) daemons.
>
> Proposed Improvement
> --------------------
> As per John's suggestion here:
> http://tracker.ceph.com/issues/11314#note-1, it would be better if we
> extended this functionality to watch the other daemons too, such as osd,
> mon, rgw and mgr, and take the necessary action or log (bark) when those
> daemons crash. We need to make those improvements in the watch() and
> bark() functions, so that if a daemon crashes unexpectedly, we detect
> it immediately rather than waiting a long time for a timeout of some
> kind. The bark() function should have different cases to handle
> different daemons crashing. The procedure to be executed for the ‘mds’
> case is present in the bark() function now. But we need to decide the
> procedures for the ‘osd’, ‘mon’, ‘rgw’ and ‘mgr’ cases. I think killing
> the daemons and throwing/logging errors, or maybe just throwing an
> error, would be sufficient.
>
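One possible shape for a per-daemon bark(), as a sketch only. The handler names, the BARK_HANDLERS table, and the placeholder default action are hypothetical; the procedures for osd, mon, rgw and mgr are still undecided, so everything except mds falls through to a log-only default here.

```python
import logging

log = logging.getLogger("daemonwatchdog_sketch")


def bark_mds(name):
    # Mirrors the existing mds behaviour described in the proposal.
    return ["unmount file systems", "send SIGTERM to all daemons"]


def bark_default(name):
    # Placeholder: the real action for other daemon types is not decided yet.
    return ["log error for " + name]


# Hypothetical dispatch table keyed by daemon type.
BARK_HANDLERS = {"mds": bark_mds}


def bark(daemon_name):
    """Pick the handler for a daemon such as 'osd.3' and return its actions."""
    daemon_type = daemon_name.split(".", 1)[0]
    handler = BARK_HANDLERS.get(daemon_type, bark_default)
    actions = handler(daemon_name)
    for action in actions:
        log.error("watchdog bark (%s): %s", daemon_name, action)
    return actions
```

Adding a case for a new daemon type would then only mean registering another handler in the table, leaving watch() unchanged.
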
> * At present the DaemonWatchdog class lives in mds_thrash.py, as it
> is specific to watching mds daemons. To make it generic, it would be
> better to move it out of mds_thrash.py into a new file,
> qa/tasks/daemonwatchdog.py.
>
> * The current code tries to watch the 'client'?
> (https://github.com/ceph/ceph/blob/master/qa/tasks/mds_thrash.py#L87). I
> have dropped this statement, as it is difficult to watch what the client
> is doing in general.
>
> * There is a suggestion to add the DaemonWatchdog to ceph.py and have it
> always run whenever Ceph is "started".
>
> Thanks,
> Jos Collin