On Mon, 26 Mar 2012, Tommi Virtanen wrote:
> On Mon, Mar 26, 2012 at 13:16, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> >> Perhaps a separate executable that sends "osd.42 is now definitely
> >> down" will be good enough? Hopefully you don't have two osd.42's
> >> around, anyway. And if you want that, instead of execing ceph-osd, you
> >> do a fork & exec, wait in the parent, then exec that thing that marks
> >> it down. For upstart (and often for others too), there's a "after the
> >> service exits" hook where we could also plug that in, if we wanted to.
> >
> > ...except that the way to reliably mark down a particular osd.42 requires
> > data that's private to the ceph-osd instance, and unknown until it starts
> > up and joins the cluster. That makes it awkward to implement any kind of
> > wrapper because you have to pass it a cookie using some side-channel.
>
> Why do you need to know where osd.42 was last seen just to be able to
> authoritatively claim osd.42 is 1) down 2) at a new location.

Not "where" but "which". If a ceph-osd stops and gets restarted (on the
same host, a different host, whatever), I want to make sure the _old_
instance is marked down (if not already down) in the map and that any new
instance is not. We can't ignore the possibility of a race, as all kinds
of things can make the notification path go faster or slower than a
replacement ceph-osd might come up.

addr is just one possible unique identifier to achieve that; we could
also use the up_from epoch or anything else unique to the instance. The
problem is that it is unknown until the process has been alive for a
while, so a wrapper can't see it easily. Well, the addr is (ip, port,
nonce) where the nonce is actually the pid, so if we can tell what ip
ceph-osd will bind to, that might work...

> > execv() in the signal handler, OTOH, is easy. Is it that offensive?
> >
> > The other nice thing about that is that the failure notification can be
> > informative for free: "osd.42 stopped: got SIGTERM", "osd.42 stopped:
> > failed assert at foo.cc:1234", etc.
>
> I'm worried about all the things that want us to exit(3). Code
> coverage, valgrind, what not.
>
> If you're thinking of shipping the crash reason to monitors, I think
> you're once again trying to replace a bunch of a sysadmin's toolkit
> with Ceph-internal features. Ones that they can't use with all the
> non-Ceph things they run on their storage cluster anyway, like statsd,
> sshd, etc. I feel confident in saying Ceph will lose that race, in
> stability, functionality and familiarity to target audience.

I see your point, but I don't know what in the sysadmin's toolkit would
actually do this (now or in the future). For an external agent/wrapper
to work, it would need to:

1- start ceph-osd
2- query it for its cookie/nonce (or infer what it will be)
3- wait for the process to stop/crash
4- send a notification (using a ceph-specific tool)

(A rough sketch of what such a wrapper might look like is below.)

Assuming for a moment that we do want a mon notification if ceph-osd,
say, fails an assert, what would you expect the process to look like?

sage
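
For what it's worth, a rough, untested sketch (in C) of the wrapper shape
those four steps imply. The osd id, the ceph-osd flags, and the
"ceph-mark-down" helper are made up for illustration; step 2 is the part
with no good answer today, since the wrapper can only guess at the nonce
and cannot see the addr at all.

#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    /* 1- start ceph-osd (id and flags here are purely illustrative) */
    pid_t pid = fork();
    if (pid < 0) {
        perror("fork");
        return 1;
    }
    if (pid == 0) {
        execlp("ceph-osd", "ceph-osd", "-i", "42", "-f", (char *)NULL);
        perror("execlp ceph-osd");
        _exit(1);
    }

    /* 2- "query it for its cookie/nonce": the best an outside wrapper can
     *    do is guess that the addr nonce will be the child's pid; it still
     *    has no idea which ip/port the osd will bind to, which is exactly
     *    the side-channel problem described above. */
    char nonce[32];
    snprintf(nonce, sizeof(nonce), "%d", (int)pid);

    /* 3- wait for the process to stop/crash */
    int status;
    if (waitpid(pid, &status, 0) < 0) {
        perror("waitpid");
        return 1;
    }

    /* 4- send a notification with a ceph-specific tool; "ceph-mark-down"
     *    is a made-up name standing in for whatever that tool would be. */
    execlp("ceph-mark-down", "ceph-mark-down", "osd.42", nonce, (char *)NULL);
    perror("execlp ceph-mark-down");
    return 1;
}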