On Mon, 26 Mar 2012, Tommi Virtanen wrote:
> On Mon, Mar 26, 2012 at 13:16, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> >> Perhaps a separate executable that sends "osd.42 is now definitely
> >> down" will be good enough? Hopefully you don't have two osd.42's
> >> around, anyway. And if you want that, instead of execing ceph-osd, you
> >> do a fork & exec, wait in the parent, then exec that thing that marks
> >> it down. For upstart (and often for others too), there's a "after the
> >> service exits" hook where we could also plug that in, if we wanted to.
> >
> > ...except that the way to reliably mark down a particular osd.42 requires
> > data that's private to the ceph-osd instance, and unknown until it starts
> > up and joins the cluster. That makes it awkward to implement any kind of
> > wrapper because you have to pass it a cookie using some side-channel.
>
> Why do you need to know where osd.42 was last seen just to be able to
> authoritatively claim osd.42 is 1) down 2) at a new location.

Not "where" but "which". If a ceph-osd stops and gets restarted (on the
same host, a different host, whatever), I want to make sure the _old_
instance is marked down (if not already down) in the map and that any new
instance is not. We can't ignore the possibility of a race, as all kinds
of things can make the notification path go faster or slower than a
replacement ceph-osd might come up.

addr is just one possible unique identifier to achieve that; we could
also use the up_from epoch or anything else unique to the instance. The
problem is that it is unknown until the process has been alive for a
while, so a wrapper can't see it easily. Well, the addr is (ip, port,
nonce) where the nonce is actually the pid, so if we can tell what ip
ceph-osd will bind to, that might work...

> > execv() in the signal handler, OTOH, is easy. Is it that offensive?
> >
> > The other nice thing about that is that the failure notification can be
> > informative for free: "osd.42 stopped: got SIGTERM", "osd.42 stopped:
> > failed assert at foo.cc:1234", etc.
>
> I'm worried about all the things that want us to exit(3). Code
> coverage, valgrind, what not.
>
> If you're thinking of shipping the crash reason to monitors, I think
> you're once again trying to replace a bunch of a sysadmin's toolkit
> with Ceph-internal features. Ones that they can't use with all the
> non-Ceph things they run on their storage cluster anyway, like statsd,
> sshd, etc. I feel confident in saying Ceph will lose that race, in
> stability, functionality and familiarity to target audience.

I see your point, but I don't know what in the sysadmin's toolkit would
actually do this (now or in the future). For an external agent/wrapper
to work, it would need to:

1- start ceph-osd
2- query it for its cookie/nonce (or infer what it will be)
3- wait for the process to stop/crash
4- send a notification (using a ceph-specific tool)

(A rough sketch of what such a wrapper might look like is below.)

Assuming for a moment that we do want a mon notification if ceph-osd,
say, fails an assert, what would you expect the process to look like?

sage
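
For what it's worth, a rough, untested sketch (in C) of the wrapper shape
those four steps imply. The osd id, the ceph-osd flags, and the
"ceph-mark-down" helper are made up for illustration; step 2 is the part
with no good answer today, since the wrapper can only guess at the nonce
and cannot see the addr at all.

#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    /* 1- start ceph-osd (id and flags here are purely illustrative) */
    pid_t pid = fork();
    if (pid < 0) {
        perror("fork");
        return 1;
    }
    if (pid == 0) {
        execlp("ceph-osd", "ceph-osd", "-i", "42", "-f", (char *)NULL);
        perror("execlp ceph-osd");
        _exit(1);
    }

    /* 2- "query it for its cookie/nonce": the best an outside wrapper can
     *    do is guess that the addr nonce will be the child's pid; it still
     *    has no idea which ip/port the osd will bind to, which is exactly
     *    the side-channel problem described above. */
    char nonce[32];
    snprintf(nonce, sizeof(nonce), "%d", (int)pid);

    /* 3- wait for the process to stop/crash */
    int status;
    if (waitpid(pid, &status, 0) < 0) {
        perror("waitpid");
        return 1;
    }

    /* 4- send a notification with a ceph-specific tool; "ceph-mark-down"
     *    is a made-up name standing in for whatever that tool would be. */
    execlp("ceph-mark-down", "ceph-mark-down", "osd.42", nonce, (char *)NULL);
    perror("execlp ceph-mark-down");
    return 1;
}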