RE: Ceph watchdog-like thing to reduce IO block during process goes down by abort()

> From: Sage Weil [mailto:sweil@xxxxxxxxxx]
> Sent: Friday, March 25, 2016 4:13 PM
> To: Milosz Tanski
> Cc: Gregory Farnum; Ilya Dryomov; Podoski, Igor; ceph-devel
> Subject: Re: Ceph watchdog-like thing to reduce IO block during process goes
> down by abort()
> 
> On Fri, 25 Mar 2016, Milosz Tanski wrote:
> > > On Thu, Mar 24, 2016 at 4:25 PM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> > >> On Thu, Mar 24, 2016 at 12:53 PM, Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
> > >> On Thu, Mar 24, 2016 at 8:00 AM, Igor.Podoski@xxxxxxxxxxxxxx
> > >> <Igor.Podoski@xxxxxxxxxxxxxx> wrote:
> > >>> Hi Cephers!
> > >>>
> > >>> Currently, when we have a disk failure, assert() and then abort() is
> > >>> triggered and the process is killed (ABRT). Other OSDs will eventually
> > >>> mark the dead one as down, but that depends on the heartbeat and monitor
> > >>> settings (mon_osd_min_down_reporters/mon_osd_min_down_reports). While an
> > >>> OSD is dead but not yet marked down, you can see blocked IO on both
> > >>> writes and reads.
> > >>>
> > >>> Recently I've made https://github.com/ceph/ceph/pull/7740 which is
> > >>> about sending a MarkMeDown msg to the monitor just before the osd goes
> > >>> bye-bye. It prevents blocked IO in the above case, and for any other
> > >>> assert that is not on the message sending path, since I need the
> > >>> messenger/pipes/connections working for this. I've made some tests and
> > >>> it looks good: when I pull a drive out of my cluster during rados bench,
> > >>> IO blocks for less than 1 second or not at all; previously it was > 10
> > >>> sec (with my cluster settings).
> > >>>
> > >>> Sage pointed out that some time ago there was a similar PR,
> > >>> https://github.com/ceph/ceph/pull/6514, and there was a thought about a
> > >>> ceph-watchdog process that could monitor osds and send info directly to
> > >>> the monitor when they disappear. This would cover all assert() cases,
> > >>> and other ones like kill -9 or similar.
> > >>>
> > >>> I have a few ideas how such functionality could be implemented, so my
> > >>> question is - has any of you already started doing something similar?
> > >>>
> > >>> Let's have a brain storm about it!
> > >>>
> > >>> Ideas for improving the 7740/6514 MarkMeDown internal mechanism:
> > >>> - I think I could send a message with a MarkMeDown payload, but in a
> > >>> raw way, not through the Messenger path. This could turn out as good
> > >>> as it is bad in this case.
> > >>> - I could poke a neighbor osd with a signal and the neighbor would send
> > >>> a Mark(SignalSender)Down message (this won't work if a whole hdd
> > >>> controller goes down, since all its osds will die within a narrow time
> > >>> window). So it's like an instant bad-health heartbeat message. It still
> > >>> depends on the Messenger send path of the neighbor osd.
> > >>
> > >>>
> > >>> External ceph-watchdog:
> > >>> Just like Sage wrote in
> > >>> https://github.com/ceph/ceph/pull/6514#issuecomment-159372845, or
> > >>> similar: each osd, during startup, passes its own PID to the
> > >>> ceph-watchdog process through shared memory/socket/named pipe
> > >>> (whatever). Ceph-watchdog checks whether that PID still exists, by
> > >>> watching for changes in the /proc/PID or /proc/PID/cmd directory/file
> > >>> (maybe inotify could handle this). When the file or folder changes
> > >>> (goes missing) it sends MarkThisOsdDown to the monitor and that's all.
> > >>> But this wouldn't strictly be a watchdog, rather a process-down notifier.
> > >>>
> > >>> Or maybe both ways: PR 7740 + the external one?
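A side note on the PID check above: if inotify on /proc turns out not to be reliable, plain polling would do; kill(pid, 0) sends no signal but fails with ESRCH once the process is gone. A rough sketch, with the poll interval and the notify step as placeholders:

#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* returns 1 while the process exists, 0 once it has exited */
static int osd_alive(pid_t pid)
{
    return kill(pid, 0) == 0 || errno != ESRCH;
}

static void watch_pid(pid_t pid)
{
    while (osd_alive(pid))
        usleep(200 * 1000);     /* 200 ms poll, just a placeholder */
    fprintf(stderr, "osd pid %d gone, would send MarkThisOsdDown here\n",
            (int)pid);
}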
> > >>
> > >> I'm not involved in any of this, but since you asked for a brain
> > >> storm... ;)
> > >>
> > >> Is it worth bothering with the corrupted data structures case at all?
> > >> Trying to handle it from within the aborting ceph-osd process is
> > >> not a very easy thing to do ("raw way, not through Messenger",
> > >> signals, etc) and if you do it wrong, you'd mask the original stack
> > >> trace.  An external ceph-watchdog is yet another entity which has
> > >> to be set up, maintained and accounted for.
> > >>
> > >> Why not just distinguish legitimate/expected errors which we check
> > >> for but currently handle with assert(0) and the actual assert
> > >> failures?  In the vast majority of cases that fall into the former
> > >> bucket all of the internal data structures, including the
> > >> messenger, will be in order and so we can send a MarkMeDown message
> > >> and fail gracefully.  Implementing it is just a matter of identifying
> > >> those sites, but that's not a bad exercise to do even on its own.
> > >>
> > >> The actual assert failures can abort() as they do now.  Any such
> > >> failure is a serious bug and there's hopefully not too many of them
> > >> to worry about shrinking the timeout to a minimum, unless there are
> > >> hard numbers that prove otherwise, of course.  And if you kill -9
> > >> your OSDs, you deserve to wait for MONs to catch up.  Am I missing
> > >> any use cases here?
> > >
> > > This is something Sam and I have talked about in the past, but
> > > apparently Sage didn't like that idea in
> > > https://github.com/ceph/ceph/pull/6514 and suggested a daemon
> > > watcher instead?
> > > Personally I tend towards building that kind of functionality into
> > > the daemon, although he's right it will never be quite as good at
> > > catching all cases as an external manager. The upside is that we
> > > don't have to worry about the failure cases between the two of them.
> > > ;) -Greg
> >
> > There's no reason the watcher process can't be a child that's kicked
> > off when the OSD starts up. If there's a pipe between the two, when the
> > parent goes away the child will get an EOF on reading from the pipe. On
> > Linux you can also do a cute trick to have the child notified when the
> > parent quits using prctl(PR_SET_PDEATHSIG, SIG???).
> 
> That does simplify the startup/management piece, but it means one watcher
> per OSD, and since we want the watcher to have an active mon session to
> make the notification quick, it doubles the mon session load.
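For reference, the child-watcher trick Milosz describes could look roughly like this. A minimal, untested sketch; the signal choice, the error handling and what the child actually does once the parent is gone are all placeholders:

#include <signal.h>
#include <stdio.h>
#include <sys/prctl.h>
#include <sys/types.h>
#include <unistd.h>

static volatile sig_atomic_t parent_died;

static void on_parent_death(int sig)
{
    (void)sig;
    parent_died = 1;
}

int main(void)
{
    int fds[2];
    if (pipe(fds) < 0) {
        perror("pipe");
        return 1;
    }

    pid_t pid = fork();
    if (pid < 0) {
        perror("fork");
        return 1;
    }

    if (pid == 0) {
        /* child (watcher): ask the kernel to send SIGTERM when the parent dies */
        signal(SIGTERM, on_parent_death);
        prctl(PR_SET_PDEATHSIG, SIGTERM);
        close(fds[1]);                  /* keep only the read end */

        /* belt and braces: read() also returns 0 (EOF) once the parent's
           write end is closed, i.e. the parent has exited */
        char buf[1];
        while (!parent_died && read(fds[0], buf, sizeof(buf)) > 0)
            ;
        /* parent is gone: this is where "tell the mon it's down" would go */
        fprintf(stderr, "parent exited, would notify mon here\n");
        _exit(0);
    }

    /* parent (the OSD): keep the write end open and do its real work */
    close(fds[0]);
    sleep(2);                           /* stand-in for the OSD's lifetime */
    return 0;                           /* exit closes fds[1] -> child sees EOF */
}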

We could also do it like this:

ceph-watchdog creates a named pipe in /var/lib/ceph

The osd, before abort(), will:
- open the file
- write its own id (0, 1, 2, ...)
- close the file

ceph-watchdog:
- waits for osd ids on the named pipe
- issues a mon_command(), e.g. cmd=[{"prefix": "osd down", "ids": ["1"]}], to the monitor, just like ceph osd down. This can be done via librados from C/python; I already have a small PoC in python for this and it seems to work (rough C sketch below).
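A rough C sketch of that loop (the FIFO path and the one-id-per-line format are just my assumptions; rados_mon_command() with the JSON above is the same librados path the python PoC uses):

#include <rados/librados.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>

#define FIFO_PATH "/var/lib/ceph/osd-watchdog.fifo"   /* made-up path */

/* send the same JSON the "ceph osd down <id>" CLI sends to the monitors */
static void mark_osd_down(rados_t cluster, const char *id)
{
    char cmd[128];
    snprintf(cmd, sizeof(cmd),
             "{\"prefix\": \"osd down\", \"ids\": [\"%s\"]}", id);

    const char *cmds[] = { cmd };
    char *outbuf = NULL, *outs = NULL;
    size_t outbuf_len = 0, outs_len = 0;

    int r = rados_mon_command(cluster, cmds, 1, "", 0,
                              &outbuf, &outbuf_len, &outs, &outs_len);
    if (r < 0)
        fprintf(stderr, "osd down %s failed: %d (%s)\n",
                id, r, outs ? outs : "");
    if (outbuf)
        rados_buffer_free(outbuf);
    if (outs)
        rados_buffer_free(outs);
}

int main(void)
{
    rados_t cluster;
    if (rados_create(&cluster, NULL) < 0 ||
        rados_conf_read_file(cluster, NULL) < 0 ||
        rados_connect(cluster) < 0) {
        fprintf(stderr, "cannot connect to the cluster\n");
        return 1;
    }

    mkfifo(FIFO_PATH, 0600);            /* EEXIST is fine on restart */

    for (;;) {
        /* open() blocks here until some OSD opens the FIFO for writing */
        FILE *fifo = fopen(FIFO_PATH, "r");
        if (!fifo)
            break;

        char line[64];
        while (fgets(line, sizeof(line), fifo)) {
            line[strcspn(line, "\n")] = '\0';
            if (line[0] != '\0')
                mark_osd_down(cluster, line);
        }
        fclose(fifo);                   /* all writers gone; reopen and keep waiting */
    }

    rados_shutdown(cluster);
    return 0;
}

(The watchdog would need its own client key with permission to send that command, which is part of the key management bit Sage mentions below.)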

Thanks to the above we have one watcher per host and no standing connections from osd -> watchdog, but looking at the downsides:
- we could hit the open files limit
- or fail on anything else in the open/write/close path itself
- multiple OSDs could write to the pipe at the same time (maybe with small constant-size writes this won't be an issue; I'm currently checking this, see the note below)
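For the last point: writes to a FIFO smaller than PIPE_BUF (at least 512 bytes, 4096 on Linux) are atomic, so a few bytes of id plus a newline can't interleave between OSDs. The write-before-abort side could then be as small as this (path and helper name are made up):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* made-up helper: called from the failing OSD right before abort() */
static void notify_watchdog(int whoami)
{
    char buf[32];
    int len = snprintf(buf, sizeof(buf), "%d\n", whoami);

    int fd = open("/var/lib/ceph/osd-watchdog.fifo", O_WRONLY | O_NONBLOCK);
    if (fd < 0)
        return;                 /* no watchdog listening; just go on and abort() */
    (void)write(fd, buf, len);  /* a few bytes < PIPE_BUF, so a single atomic write */
    close(fd);
}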

Regards,
Igor.

> Honestly I don't think the separate daemon is that much of an issue--it's a
> systemd unit file and a pretty simple watchdog process.  The key
> management and systemd enable/activate bit is the part that will be
> annoying.
> 
> sage
