On Thu, 24 Mar 2016, Igor.Podoski@xxxxxxxxxxxxxx wrote:
> Ok, so back to the slightly modified Sage idea:
>
> The osd, before abort(), could write its ID (from **argv) to a
> ceph-watchdog named pipe. The only hazard here is the case when all
> osds want to notify the watchdog at the same time. As I wrote before,
> it would not be a 'watchdog' process but a 'process down notify', so
> the question is: do we need a watchdog-like thing for some other
> stuff (in the future), or will process down notify be sufficient?

I was imagining something that works the other way around, where the
watchdog is very simple:

- the osd (or any daemon) opens a unix domain socket and identifies
  itself, e.g. "I am osd.123 at 1.2.3.4:6823"
- if the socket is closed, the watchdog notifies the mon that there
  was a failure
- the osd (or other daemon) can optionally send a message over the
  socket changing its identifier (e.g., if the osd rebinds to a new ip)

This way the watchdog doesn't *do* anything except wait for new
connections or for connections to close. No polling of PIDs or
anything like that. (There is a rough sketch of this loop at the end
of this mail.)

We could figure out where the most common failure points are (e.g., op
thread timeout, or EIO) and notify on those explicitly, but I think in
practice that will be hard--there are lots of places where we assert
that return values are 0. An external watchdog, OTOH, would capture
*all* of those cases, and the bugs too.

The main concern I have is that the model doesn't work well when you
have one daemon per host (e.g., a microserver on an HDD). Well, it
works, but you double the number of monitor sessions. Maybe that's
okay, though--it's just an open TCP connection to a mon.

sage
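
---

A rough sketch of the watchdog loop described above (Python for
brevity; this is not Ceph code -- the socket path and notify_mon()
are made-up placeholders for whatever the real mon notification
would look like):

#!/usr/bin/env python3
# Rough sketch only -- NOT Ceph code.  Daemons connect to a unix
# socket, send an identity line like "I am osd.123 at 1.2.3.4:6823",
# and keep the fd open.  On EOF (daemon died or closed the socket)
# we report the failure.  notify_mon() and SOCK_PATH are made-up
# placeholders.

import os
import socket
import selectors

SOCK_PATH = "/var/run/ceph-watchdog.sock"   # hypothetical path

def notify_mon(identity):
    # Placeholder: would send a failure report to the monitor.
    print("daemon down, would tell mon: %s" % identity)

sel = selectors.DefaultSelector()
identities = {}   # connection -> last identity string seen

def accept(listener):
    conn, _ = listener.accept()
    conn.setblocking(False)
    identities[conn] = "(unidentified)"
    sel.register(conn, selectors.EVENT_READ, handle)

def handle(conn):
    try:
        data = conn.recv(4096)
    except ConnectionError:
        data = b""
    if data:
        # Any later line replaces the identity (the rebind case).
        # A real implementation would buffer partial lines.
        text = data.decode(errors="replace").strip()
        if text:
            identities[conn] = text.splitlines()[-1]
    else:
        # EOF: the daemon is gone -- report it and clean up.
        notify_mon(identities.pop(conn))
        sel.unregister(conn)
        conn.close()

if os.path.exists(SOCK_PATH):
    os.unlink(SOCK_PATH)
listener = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
listener.bind(SOCK_PATH)
listener.listen()
listener.setblocking(False)
sel.register(listener, selectors.EVENT_READ, accept)

while True:
    for key, _ in sel.select():
        key.data(key.fileobj)

The daemon side is trivial: connect, write one identity line, keep
the fd open. If the daemon asserts or segfaults, the kernel closes
the socket for it and the watchdog sees EOF right away; any later
line on the same socket just replaces the stored identity, which
covers the rebind case above.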