Re: osd shutdown notification

On Mon, 26 Mar 2012, Tommi Virtanen wrote:
> On Mon, Mar 26, 2012 at 14:55, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> >> Why do you need to know where osd.42 was last seen just to be able to
> >> authoritatively claim osd.42 is 1) down, 2) at a new location?
> >
> > Not "where" but "which".  If a ceph-osd stops, and gets restarted (on same
> > host, different host, whatever) I want to make sure the _old_ instance is
> > marked down (if not already down) in the map and any new instance is not.
> > We can't ignore the possibility of a race, as all kinds of things can make
> > the notification path go faster or slower than a replacement ceph-osd
> > might.  addr is just one possible unique identifier to achieve that; we
> > could also use the up_from epoch or anything else unique to the instance.
> > The problem is that it is unknown until the process has been alive for a
> > while, so a wrapper can't see it easily.
> >
> > Well, the addr is (ip, port, nonce) where nonce is actually the pid, so if
> > we can tell what ip ceph-osd will bind to that might work...
> 
> Okay so if I understand correctly, you're worried that this:
> 
> - osd.42 pid 123 dies
> - small utility sends "osd.42 is dead"
> - osd.42 pid 234 starts up
> 
> is observed as this:
> 
> - osd.42 pid 123 dies
> - osd.42 pid 234 starts up
> - small utility sends "osd.42 is dead"
> 
> ?

Yeah, or in fact the second ordering is what really happens, because the 
pid-234 start is done by someone/something else.
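
A minimal sketch of the disambiguation this implies, assuming the 
hypothetical 'ceph osd stopped' subcommand proposed below and the 
nonce-is-pid detail quoted above:

 #!/bin/sh
 # osd.42 instance pid 123 has died; the wrapper reports exactly
 # which instance it means (id, pid, exit status; 143 = TERM):
 ceph osd stopped 42 123 143
 # Even if this arrives after osd.42 is restarted as pid 234, the
 # monitors can compare the reported nonce (123) against the addr
 # nonce of the currently registered instance (234), mark only the
 # old instance down, and leave the new one up.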

> I'd expect the last two to happen in separate rounds of paxos, or
> something, and thus not be reorderable (the client won't reorder them;
> it runs the notifier first to completion, only then starts the new
> daemon instance).. but if they can be reordered like that, then it
> seems like a valid concern. Perhaps the small utility "osd.42 is dead"
> should be just a hint, "osd.42 might be dead, please check and flag it
> down asap if it doesn't respond"?

The problem is that a remote 'please check' generally involves a timeout, 
and we already do that with the OSD pings.

Maybe I'm being overly paranoid here and the pid as a unique id is 
sufficient.  e.g.,

 #!/bin/sh
 ...annoying arg parsing...
 ceph-osd -i $id ... &
 pid=$!
 # wait on the pid explicitly so $? is the daemon's exit status
 # (a bare "wait" always returns 0)
 wait $pid
 ceph osd stopped $id $pid $? &

?
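
A slightly fuller sketch along the same lines ('ceph osd stopped' is 
still a hypothetical subcommand), with the arg parsing made concrete and 
TERM forwarded so the wrapper survives long enough to report:

 #!/bin/sh
 id="$1"; shift

 ceph-osd -i "$id" "$@" &
 pid=$!

 # If the wrapper itself gets TERMed, pass the signal on to the
 # daemon and keep running so the notification still goes out.
 trap 'kill -TERM "$pid" 2>/dev/null' TERM INT

 wait "$pid"
 status=$?
 # Caveat: if the trap fires while we are blocked in wait, wait
 # returns early with status > 128; a production wrapper would
 # loop on wait until the daemon is really gone.

 # Report the pid (= the addr nonce) so the monitors can match
 # the notification to the instance that actually died.
 ceph osd stopped "$id" "$pid" "$status"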

sage


> 
> >> > execv() in the signal handler, OTOH, is easy.  Is it that offensive?
> >> >
> >> > The other nice thing about that is that the failure notification can be
> >> > informative for free: "osd.42 stopped: got SIGTERM", "osd.42 stopped:
> >> > failed assert at foo.cc:1234", etc.
> >>
> >> I'm worried about all the things that want us to exit(3). Code
> >> coverage, valgrind, what not.
> >>
> >> If you're thinking of shipping the crash reason to monitors, I think
> >> you're once again trying to replace a bunch of a sysadmin's toolkit
> >> with Ceph-internal features. Ones that they can't use with all the
> >> non-Ceph things they run on their storage cluster anyway, like statsd,
> >> sshd, etc. I feel confident in saying Ceph will lose that race, in
> >> stability, functionality, and familiarity to the target audience.
> >
> > I see your point, but I don't know what in the sysadmin's toolkit would
> > actually do this (now or in the future).  For an external agent/wrapper to
> > work it would need to:
> >
> >  1- start ceph-osd
> >  2- query it for its cookie/nonce (or infer what it will be)
> >  3- wait for the process to stop/crash
> >  4- send a notification (using ceph specific tool)
> >
> > Assuming for a moment that we do want a mon notification if ceph-osd, say,
> > fails an assert, what would you expect the process to look like?
> 
> Oh, that was "If you're thinking of shipping the crash reason to
> monitors, ...". As in, if you're thinking of making "ceph -s" or
> something say "osd.42 stopped: Crashed due to SIGSEGV". That part
> people have tools for, and it's in no way ceph specific.
> 
> I did not mean that part of the message to apply to the failure
> discovery optimization.
