On Mon, 26 Mar 2012, Tommi Virtanen wrote:
> On Mon, Mar 26, 2012 at 14:55, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> >> Why do you need to know where osd.42 was last seen just to be able to
> >> authoritatively claim osd.42 is 1) down 2) at a new location.
> >
> > Not "where" but "which".  If a ceph-osd stops, and gets restarted (on same
> > host, different host, whatever) I want to make sure the _old_ instance is
> > marked down (if not already down) in the map and any new instance is not.
> > We can't ignore the possibility of a race, as all kinds of things can make
> > the notification path go faster or slower than a replacement ceph-osd
> > might.  addr is just one possible unique identifier to achieve that; we
> > could also use the up_from epoch or anything else unique to the instance.
> > The problem is it is unknown until the process has been alive for a while,
> > so a wrapper can't see it easily.
> >
> > Well, the addr is (ip, port, nonce) where nonce is actually the pid, so if
> > we can tell what ip ceph-osd will bind to that might work...
>
> Okay so if I understand correctly, you're worried that this:
>
> - osd.42 pid 123 dies
> - small utility sends "osd.42 is dead"
> - osd.42 pid 234 starts up
>
> is observed as this:
>
> - osd.42 pid 123 dies
> - osd.42 pid 234 starts up
> - small utility sends "osd.42 is dead"
>
> ?

Yeah, or in fact the second ordering is what really happens, because the
start of pid 234 is done by someone/something else.

> I'd expect the last two to happen in separate rounds of paxos, or
> something, and thus not be reorderable (the client won't reorder them;
> it runs the notifier first to completion, only then starts the new
> daemon instance).. but if they can be reordered like that, then it
> seems like a valid concern. Perhaps the small utility "osd.42 is dead"
> should be just a hint, "osd.42 might be dead, please check and flag it
> down asap if it doesn't respond"?

The problem is a remote 'please check' generally involves a timeout, and
we already do that with the OSD pings.

Maybe I'm being overly paranoid here and the pid as a unique id is
sufficient.  e.g.,

#!/bin/sh
...annoying arg parsing...
ceph-osd -i $id ... &
pid=$!
wait $pid
ceph osd stopped $id $pid $? &

?
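Slightly more fleshed out (untested; the arg handling here is only
illustration, and "ceph osd stopped" is still just the verb I'm proposing
in this thread, not an existing command), that wrapper might be:

#!/bin/sh
# Hypothetical wrapper: start one ceph-osd instance, remember its pid
# (which is also the nonce in its addr), report which instance died and how.

id="$1"                          # osd id, e.g. 42
shift                            # remaining args are passed through to ceph-osd

ceph-osd -i "$id" "$@" &         # run the daemon in the background
pid=$!                           # pid of *this* instance == nonce in its addr

wait "$pid"                      # block until that instance exits
status=$?                        # exit code (or 128+signal) of the dead instance

# Proposed notification: tell the mons which instance ($pid) stopped, so a
# replacement that already registered a new addr/nonce isn't marked down.
ceph osd stopped "$id" "$pid" "$status" &

The only point of passing $pid along is so the mon can compare it with the
nonce in whatever addr is currently registered for osd.$id and ignore the
message if a newer instance has already taken over.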
sage

>
> >> > execv() in the signal handler, OTOH, is easy.  Is it that offensive?
> >> >
> >> > The other nice thing about that is that the failure notification can be
> >> > informative for free: "osd.42 stopped: got SIGTERM", "osd.42 stopped:
> >> > failed assert at foo.cc:1234", etc.
> >>
> >> I'm worried about all the things that want us to exit(3). Code
> >> coverage, valgrind, what not.
> >>
> >> If you're thinking of shipping the crash reason to monitors, I think
> >> you're once again trying to replace a bunch of a sysadmin's toolkit
> >> with Ceph-internal features. Ones that they can't use with all the
> >> non-Ceph things they run on their storage cluster anyway, like statsd,
> >> sshd, etc. I feel confident in saying Ceph will lose that race, in
> >> stability, functionality and familiarity to target audience.
> >
> > I see your point, but I don't know what in the sysadmin's toolkit would
> > actually do this (now or in the future).  For an external agent/wrapper to
> > work it would need to:
> >
> > 1- start ceph-osd
> > 2- query it for its cookie/nonce (or infer what it will be)
> > 3- wait for the process to stop/crash
> > 4- send a notification (using a ceph-specific tool)
> >
> > Assuming for a moment that we do want a mon notification if ceph-osd, say,
> > fails an assert, what would you expect the process to look like?
>
> Oh, that was "If you're thinking of shipping the crash reason to
> monitors, ...".  As in, if you're thinking of making "ceph -s" or
> something say "osd.42 stopped: Crashed due to SIGSEGV".  That part
> people have tools for, and it's in no way ceph specific.
>
> I did not mean that part of the message to apply to the failure
> discovery optimization.
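PS: to make the race handling above concrete: whatever consumes the
"osd.$id stopped, pid $pid" message has to compare the reported pid with
the nonce of the instance currently registered for that osd, and only mark
it down on a match.  In reality that check belongs in the mon; the shell
sketch below is only an illustration, and the way it pulls the addr out of
'ceph osd dump' is an assumption about the output format:

#!/bin/sh
# Illustrative check only: is the instance we were told about still the one
# in the osdmap?  Usage: osd-stopped-check <id> <pid-that-exited>

id="$1"
deadpid="$2"

# An osd's addr looks like ip:port/nonce, and the nonce is the pid of the
# instance that bound it.  Extracting it from 'ceph osd dump' like this is a
# rough assumption about the dump format, not something to rely on.
addr=$(ceph osd dump | grep "^osd\.$id " | grep -o '[0-9][0-9.]*:[0-9]*/[0-9]*' | head -1)
nonce="${addr##*/}"

if [ -n "$nonce" ] && [ "$nonce" = "$deadpid" ]; then
    # The map still points at the dead instance: safe to flag it down now.
    ceph osd down "$id"
else
    # A replacement instance already registered (different nonce), or we
    # couldn't tell; let the normal heartbeat/ping path sort it out.
    :
fi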