> -----Original Message-----
> From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Igor.Podoski@xxxxxxxxxxxxxx
> Sent: Tuesday, March 29, 2016 8:07 AM
> To: Sage Weil; Milosz Tanski
> Cc: Gregory Farnum; Ilya Dryomov; ceph-devel
> Subject: RE: Ceph watchdog-like thing to reduce IO block during process goes down by abort()
>
> > From: Sage Weil [mailto:sweil@xxxxxxxxxx]
> > Sent: Friday, March 25, 2016 4:13 PM
> > To: Milosz Tanski
> > Cc: Gregory Farnum; Ilya Dryomov; Podoski, Igor; ceph-devel
> > Subject: Re: Ceph watchdog-like thing to reduce IO block during process goes down by abort()
> >
> > On Fri, 25 Mar 2016, Milosz Tanski wrote:
> > > On Thu, Mar 24, 2016 at 4:25 PM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> > > > On Thu, Mar 24, 2016 at 12:53 PM, Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
> > > >> On Thu, Mar 24, 2016 at 8:00 AM, Igor.Podoski@xxxxxxxxxxxxxx <Igor.Podoski@xxxxxxxxxxxxxx> wrote:
> > > >>> Hi Cephers!
> > > >>>
> > > >>> Currently when we have a disk failure, assert() and then abort() is triggered and the process is killed (ABRT). Other osds will eventually mark the dead one as down, but how quickly depends on the heartbeat settings and the monitor settings (mon_osd_min_down_reporters/mon_osd_min_down_reports). While an osd is dead but not yet marked down, you can see blocked IO on writes and reads.
> > > >>>
> > > >>> Recently I've made https://github.com/ceph/ceph/pull/7740 which is about sending a MarkMeDown msg to the monitor just before the osd goes bye-bye. It prevents blocked IO in the above case, and for any other assert that is not on the message sending path, so I need messenger/pipes/connections working for this. I've made some tests and it looks good: when I pull a drive out of my cluster during rados bench, IO blocks for less than 1 second or not at all, where previously it was > 10 sec (with my cluster settings).
> > > >>>
> > > >>> Sage pointed out that some time ago there was a similar PR https://github.com/ceph/ceph/pull/6514 and a thought about a ceph-watchdog process that could monitor osds and send info directly to the monitor when they disappear. This would cover all assert() cases, and other ones like kill -9 or similar.
> > > >>>
> > > >>> I have a few ideas how such functionality could be implemented, so my question is - has any of you already started doing something similar?
> > > >>>
> > > >>> Let's have a brain storm about it!
> > > >>>
> > > >>> Ideas about improving the 7740/6514 MarkMeDown internal mechanism:
> > > >>> - I think I could send the message with the MarkMeDown payload in a raw way, not through the Messenger path. This could be as good as it is bad in this case.
> > > >>> - I could poke an osd-neighbor through a signal and the neighbor would send a Mark(SignalSender)Down message (this won't work if a whole hdd controller goes down and all its osds die within a narrow time window). So it's like an instant bad-health heartbeat message. It still depends on the Messenger send path of the osd-neighbor.
> > > >>>
> > > >>> External ceph-watchdog:
> > > >>> Just like Sage wrote in https://github.com/ceph/ceph/pull/6514#issuecomment-159372845 or similar: each osd, during start, passes its own PID to the ceph-watchdog process through shared memory/socket/named pipe (whatever). Ceph-watchdog checks whether that PID still exists, by watching for changes in the /proc/PID or /proc/PID/cmd directory/file (maybe inotify could handle this). When the file or folder is changed (missing) it sends MarkThisOsdDown to the monitor and that's all. But this won't be a strict watchdog, rather a process-down notification.
> > > >>>
> > > >>> Or maybe both ways, PR7740 + external?
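
For the /proc-watching variant above, here is a rough sketch of what the watchdog loop could look like. It is only an illustration: python is used just to keep all the sketches in one language, the osd-id -> pid map is assumed to have been filled in by whatever registration mechanism is chosen, and it simply shells out to the ceph CLI ("ceph osd down <id>") instead of keeping a mon session open (see the librados sketch further down).

  import os
  import subprocess
  import time

  # osd_id -> pid, filled in however the OSDs register themselves at
  # startup (shared memory/socket/named pipe); the values here are made up.
  watched = {0: 12345, 1: 12346}

  def mark_osd_down(osd_id):
      # Shelling out keeps the sketch short; a real daemon would rather
      # hold one open monitor session and use librados.
      subprocess.call(["ceph", "osd", "down", str(osd_id)])

  while watched:
      for osd_id, pid in list(watched.items()):
          if not os.path.exists("/proc/%d" % pid):
              mark_osd_down(osd_id)
              del watched[osd_id]
      # Plain polling; /proc is not a great fit for inotify, so a short
      # sleep is the simplest portable option.
      time.sleep(0.1)
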
> > > >> I'm not involved in any of this, but since you asked for a brain storm... ;)
> > > >>
> > > >> Is it worth bothering with the corrupted data structures case at all? Trying to handle it from within the aborting ceph-osd process is not a very easy thing to do ("raw way, not through Messenger", signals, etc.) and if you do it wrong, you'd mask the original stack trace. An external ceph-watchdog is yet another entity which has to be set up, maintained and accounted for.
> > > >>
> > > >> Why not just distinguish the legitimate/expected errors which we check for but currently handle with assert(0) from the actual assert failures? In the vast majority of cases that fall into the former bucket all of the internal data structures, including the messenger, will be in order, so we can send a MarkMeDown message and fail gracefully. Implementing it is just a matter of identifying those sites, but that's not a bad exercise to do even on its own.
> > > >>
> > > >> The actual assert failures can abort() as they do now. Any such failure is a serious bug and there are hopefully not too many of them to worry about shrinking the timeout to a minimum, unless there are hard numbers that prove otherwise, of course. And if you kill -9 your OSDs, you deserve to wait for the MONs to catch up. Am I missing any use cases here?
> > > >
> > > > This is something Sam and I have talked about in the past, but apparently Sage didn't like that idea in https://github.com/ceph/ceph/pull/6514 and suggested a daemon watcher instead?
> > > > Personally I tend towards building that kind of functionality into the daemon, although he's right it will never be quite as good at catching all cases as an external manager. The upside is that we don't have to worry about the failure cases between the two of them. ;)
> > > > -Greg
> > >
> > > There's no reason the watcher process can't be a child that's kicked off when the OSD starts up. If there's a pipe between the two, when the parent goes away the child will get an EOF when reading from the pipe. On Linux you can also do a cute trick to have the child notified when the parent quits, using prctl(PR_SET_PDEATHSIG, SIG???).
> >
> > That does simplify the startup/management piece, but it means one watcher per OSD, and since we want the watcher to have an active mon session to make the notification quick, it doubles the mon session load.
>
> We could also do it like this:
>
> ceph-watchdog creates a named pipe in /var/lib/ceph
>
> osd before abort will:
> - open the file
> - write its own id 0,1,2...
> - close the file
>
> ceph-watchdog:
> - waits for osd ids on the named pipe
> - issues a mon_command(), e.g. cmd=[{"prefix": "osd down", "ids": ["1"]}], to the monitor just like ceph osd down; this can be done with librados from C/python, I already have a small PoC in python for this and it seems to work. Of course it keeps the connection to the monitor open/active all the time.
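
The watchdog side of the PoC described above could look roughly like this - a minimal sketch assuming the python rados bindings and a made-up fifo path; the only thing taken verbatim from above is the "osd down" command JSON.

  import errno
  import json
  import os
  import rados

  FIFO = "/var/lib/ceph/osd-watchdog.fifo"   # path is an assumption

  # One monitor session, kept open for the lifetime of the watchdog.
  cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
  cluster.connect()

  try:
      os.mkfifo(FIFO)
  except OSError as e:
      if e.errno != errno.EEXIST:
          raise

  while True:
      # open() blocks until at least one writer (a dying OSD) shows up;
      # readline() returns "" once every writer has closed, so we reopen.
      with open(FIFO) as fifo:
          while True:
              line = fifo.readline()
              if not line:
                  break
              osd_id = line.strip()
              if not osd_id:
                  continue
              cmd = json.dumps({"prefix": "osd down", "ids": [osd_id]})
              ret, out, errs = cluster.mon_command(cmd, b"")
              if ret != 0:
                  print("osd down %s failed: %s" % (osd_id, errs))

As noted above, the monitor handles this exactly like "ceph osd down <id>" from the CLI, so nothing new is needed on the mon side.
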
> Thanks to the above we have one watcher per host and no connections from osd -> watchdog, but looking at the downsides:
> - we could hit the open files limit
> - or anything else could go wrong in open/write/close
> - multiple OSDs could write to the pipe at the same time (maybe with small constant-size writes this won't be an issue, currently checking this)
>
> An additional advantage of writing the OSD id to the fifo - we could easily add a backup mechanism in systemd for when open/write/close fails.
>
> Regards,
> Igor.
>
> > Honestly I don't think the separate daemon is that much of an issue--it's a systemd unit file and a pretty simple watchdog process. The key management and systemd enable/activate bit is the part that will be annoying.
> >
> > sage
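
On the concurrent-writers worry in the list above: POSIX guarantees that a single write() of at most PIPE_BUF bytes (4096 on Linux, at least 512 everywhere) to a pipe or fifo is atomic, so short fixed-format records from several dying OSDs cannot get interleaved. Here is a sketch of the writer side - python again only to keep one language (the real OSD side would be C++, but the syscalls are the same), and the path and function name are made up:

  import os

  FIFO = "/var/lib/ceph/osd-watchdog.fifo"   # same made-up path as above

  def notify_watchdog(osd_id):
      # Called just before the OSD aborts; it must never block or throw.
      try:
          # O_NONBLOCK: if no watchdog has the fifo open for reading,
          # open() fails with ENXIO instead of hanging the dying OSD.
          fd = os.open(FIFO, os.O_WRONLY | os.O_NONBLOCK)
      except OSError:
          return   # no watchdog - fall back to normal heartbeat timeouts
      try:
          # A single short write (well under PIPE_BUF) is atomic, so
          # records from concurrently crashing OSDs cannot interleave.
          os.write(fd, ("%d\n" % osd_id).encode())
      except OSError:
          pass
      finally:
          os.close(fd)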