Re: Ceph watchdog-like thing to reduce IO block during process goes down by abort()

On Fri, 25 Mar 2016, Milosz Tanski wrote:
> On Thu, Mar 24, 2016 at 4:25 PM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> > On Thu, Mar 24, 2016 at 12:53 PM, Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
> >> On Thu, Mar 24, 2016 at 8:00 AM, Igor.Podoski@xxxxxxxxxxxxxx
> >> <Igor.Podoski@xxxxxxxxxxxxxx> wrote:
> >>> Hi Cephers!
> >>>
> >>> Currently, when we have a disk failure, an assert() and then abort() is triggered and the process is killed (ABRT). Other OSDs will eventually mark the dead one as down, but that depends on heartbeat settings and monitor settings (mon_osd_min_down_reporters/mon_osd_min_down_reports). While an OSD is dead but not yet marked down, you can see blocked IO on writes and reads.
> >>>
> >>> Recently I made https://github.com/ceph/ceph/pull/7740, which sends a MarkMeDown msg to the monitor just before the OSD goes bye-bye. It prevents blocked IO in the above case, and for any other assert that is not on the message sending path, since I need the messenger/pipes/connections to still be working for this. I've run some tests and it looks good: when I pull a drive out of my cluster during rados bench, IO blocks for less than 1 second or not at all, whereas previously it was > 10 sec (with my cluster settings).
> >>>
> >>> Sage pointed out to me that some time ago there was a similar PR, https://github.com/ceph/ceph/pull/6514, and there was a thought about a ceph-watchdog process that could monitor OSDs and send info directly to the monitor when they disappear. That would cover all assert() cases, and other ones like kill -9 or similar.
> >>>
> >>> I have a few ideas for how such functionality could be implemented, so my question is: have any of you already started doing something similar?
> >>>
> >>> Let's have a brainstorm about it!
> >>>
> >>> Ideas for improving the 7740/6514 MarkMeDown internal mechanism:
> >>> - I think I could send a message with the MarkMeDown payload in a raw way, not through the Messenger path (see the sketch after this list). This could be as much a good thing as a bad thing in this case.
> >>> - I could poke a neighboring OSD with a signal, and the neighbor would send a Mark(SignalSender)Down message (this won't work if a whole HDD controller goes down, since all OSDs behind it will die within a narrow time window). So it's like an instant bad-health heartbeat message. It still depends on the Messenger send path of the neighboring OSD.
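
(A minimal standalone sketch of the "raw way": connect a notification socket up front while the process is still healthy, so the abort path only ever calls write(), which is async-signal-safe. The socket path, payload, and names below are made-up illustrations, not anything from PR 7740.)

#include <csignal>
#include <cstdlib>
#include <cstring>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

static int    g_notify_fd  = -1;   // connected at startup, while everything is healthy
static char   g_notify_msg[64];    // pre-formatted "mark me down" payload
static size_t g_notify_len = 0;

// Connect once at startup so the abort path never has to allocate or lock.
void notify_init(const char *path, const char *msg) {
  g_notify_fd = socket(AF_UNIX, SOCK_DGRAM, 0);
  sockaddr_un addr{};
  addr.sun_family = AF_UNIX;
  strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);
  connect(g_notify_fd, reinterpret_cast<sockaddr *>(&addr), sizeof(addr));
  strncpy(g_notify_msg, msg, sizeof(g_notify_msg) - 1);
  g_notify_len = strlen(g_notify_msg);
}

// SIGABRT handler: only write() is used, no Messenger machinery that may
// itself be in a corrupted state.  Restoring the default action and
// re-raising keeps the original core dump / stack trace intact.
void abort_handler(int sig) {
  if (g_notify_fd >= 0)
    write(g_notify_fd, g_notify_msg, g_notify_len);
  signal(sig, SIG_DFL);
  raise(sig);
}

int main() {
  notify_init("/run/ceph-watchdog.sock", "MARK_ME_DOWN osd.0\n");  // path and payload are made up
  signal(SIGABRT, abort_handler);
  abort();   // simulated assert() failure
}

Re-raising the signal rather than swallowing it matters: the notification goes out, but the original stack trace is not masked.
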
> >>
> >>>
> >>> External ceph-watchdog:
> >>> Just like Sage wrote in https://github.com/ceph/ceph/pull/6514#issuecomment-159372845, or similar: each OSD, during startup, passes its own PID to the ceph-watchdog process through shared memory/a socket/a named pipe (whatever). Ceph-watchdog checks whether that PID still exists by watching for changes to the /proc/PID directory or /proc/PID/cmdline file (maybe inotify could handle this). When the file or directory changes (goes missing), it sends MarkThisOsdDown to the monitor and that's all. But this wouldn't be a strict watchdog, rather a process-down notify (roughly sketched below).
> >>>
> >>> Or maybe both ways, PR 7740 + an external watchdog?
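
(For the external "process down notify" flavor, a rough standalone sketch: it polls for the PID instead of using inotify, since procfs generally doesn't deliver inotify events, and it fakes the mon notification by shelling out to "ceph osd down"; a real watchdog would hold its own mon session, as discussed further down. The PID hand-off is assumed to have already happened and is just argv here.)

#include <cerrno>
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <signal.h>
#include <string>
#include <sys/types.h>
#include <thread>

// kill(pid, 0) probes for existence without delivering a signal.
// EPERM still means "the process exists, we just can't signal it".
static bool pid_alive(pid_t pid) {
  return kill(pid, 0) == 0 || errno == EPERM;
}

int main(int argc, char **argv) {
  if (argc != 3) {
    std::fprintf(stderr, "usage: %s <osd-id> <pid>\n", argv[0]);
    return 1;
  }
  const std::string osd_id = argv[1];
  const pid_t pid = static_cast<pid_t>(std::atoi(argv[2]));

  while (pid_alive(pid))
    std::this_thread::sleep_for(std::chrono::milliseconds(200));

  // Stand-in for "send MarkThisOsdDown straight to the monitor": shelling
  // out to the CLI just keeps the sketch self-contained.
  const std::string cmd = "ceph osd down " + osd_id;
  return std::system(cmd.c_str());
}
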
> >>
> >> I'm not involved in any of this, but since you asked for a brain
> >> storm... ;)
> >>
> >> Is it worth bothering with the corrupted data structures case at all?
> >> Trying to handle it from within the aborting ceph-osd process is not a
> >> very easy thing to do ("raw way, not through Messenger", signals, etc)
> >> and if you do it wrong, you'd mask the original stack trace.  An
> >> external ceph-watchdog is yet another entity which has to be set up,
> >> maintained and accounted for.
> >>
> >> Why not just distinguish legitimate/expected errors, which we check for
> >> but currently handle with assert(0), from the actual assert failures?  In
> >> the vast majority of cases that fall into the former bucket all of the
> >> internal data structures, including the messenger, will be in order and
> >> so we can send a MarkMeDown message and fail gracefully.  Implementing
> >> it is just a matter of identifying those sites, but that's not a bad
> >> exercise to do even on its own.
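
(A toy illustration of that split, not Ceph code; the names mark_me_down, fail_expected, and OSD_ASSERT are made up. The point is only that the "expected" path still trusts its own state enough to notify the mon and exit cleanly, while the genuine assert path touches nothing and keeps its stack trace.)

#include <cstdio>
#include <cstdlib>

// Placeholder for "send MarkMeDown through the still-healthy messenger".
static void mark_me_down() {
  std::fprintf(stderr, "notifying mon: mark me down\n");
}

// A legitimate, checked-for error (e.g. EIO from the backing disk):
// internal structures are still consistent, so notify and exit cleanly.
[[noreturn]] static void fail_expected(const char *why) {
  std::fprintf(stderr, "fatal but expected: %s\n", why);
  mark_me_down();
  std::exit(1);
}

// A genuine invariant violation: don't touch possibly-corrupt state,
// just abort() so the original stack trace is preserved.
#define OSD_ASSERT(cond)                                          \
  do {                                                            \
    if (!(cond)) {                                                \
      std::fprintf(stderr, "assert failed: %s\n", #cond);         \
      std::abort();                                               \
    }                                                             \
  } while (0)

int main() {
  const bool disk_read_ok = false;   // pretend the disk returned EIO
  OSD_ASSERT(1 + 1 == 2);            // invariant check: abort() if violated
  if (!disk_read_ok)
    fail_expected("read error on the backing device");
}
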
> >>
> >> The actual assert failures can abort() as they do now.  Any such
> >> failure is a serious bug, and there are hopefully too few of them to be
> >> worth worrying about shrinking the timeout to a minimum, unless there
> >> are hard numbers that prove otherwise, of course.  And if you kill -9
> >> your OSDs, you deserve to wait for the MONs to catch up.  Am I missing
> >> any use cases here?
> >
> > This is something Sam and I have talked about in the past, but
> > apparently Sage didn't like that idea in
> > https://github.com/ceph/ceph/pull/6514 and suggested a daemon watcher
> > instead?
> > Personally I tend towards building that kind of functionality into the
> > daemon, although he's right it will never be quite as good at catching
> > all cases as an external manager. The upside is that we don't have to
> > worry about the failure cases between the two of them. ;)
> > -Greg
> 
> There's no reason the watcher process can't be a child that's kicked
> off when the OSD starts up. If there's a pipe between the two, the
> child will get an EOF on reading from the pipe when the parent goes
> away. On Linux you can also do a cute trick to have the child notified
> when the parent quits, using prctl(PR_SET_PDEATHSIG, SIG???).
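
(A minimal standalone sketch of that parent/child arrangement, with the "tell the mon" step reduced to a line on stderr; everything Ceph-specific is left out.)

#include <csignal>
#include <cstdio>
#include <cstdlib>
#include <sys/prctl.h>
#include <unistd.h>

// Fired via PR_SET_PDEATHSIG when the parent dies; only async-signal-safe
// calls here.
static void on_parent_death(int) {
  const char msg[] = "parent died (PDEATHSIG); would send MarkThisOsdDown\n";
  write(STDERR_FILENO, msg, sizeof(msg) - 1);
  _exit(0);
}

int main() {
  int pipefd[2];
  if (pipe(pipefd) != 0) { perror("pipe"); return 1; }

  if (fork() == 0) {
    // --- watcher child ---
    close(pipefd[1]);                        // keep only the read end
    signal(SIGTERM, on_parent_death);
    prctl(PR_SET_PDEATHSIG, SIGTERM);        // notified even if the pipe trick fails
    char buf;
    while (read(pipefd[0], &buf, 1) > 0)
      ;                                      // parent never writes; read() == 0 means EOF, i.e. parent gone
    std::printf("parent closed the pipe (EOF); would send MarkThisOsdDown\n");
    return 0;
  }

  // --- parent, standing in for the OSD ---
  close(pipefd[0]);                          // keep only the write end
  sleep(2);                                  // pretend to do OSD work
  std::abort();                              // simulated assert() failure
}
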

That does simplify the startup/management piece, but it means one watcher 
per OSD, and since we want the watcher to have an active mon session to 
make the notification quick, it doubles the mon session load.

Honestly I don't think the separate daemon is that much of an issue--it's 
a systemd unit file and a pretty simple watchdog process.  The key 
management and systemd enable/activate bit is the part that will be 
annoying.

sage