On Mon, Oct 24, 2016 at 3:31 AM, Ranjan Ghosh <ghosh@xxxxxx> wrote:
> Thanks JC & Greg, I've changed the "mon osd min down reporters" to 1. According to this:
>
> http://docs.ceph.com/docs/jewel/rados/configuration/mon-osd-interaction/
>
> the default is already 1, though. I don't remember the value before I changed it everywhere, so I can't say for sure now. But I think it was 2 despite what the docs say. Whatever. It's now 1 everywhere.
>
> Another somewhat weird thing I found was: when I check the values of an OSD(!) with "ceph daemon osd.0 config show | sort | grep mon_osd" I see an entry "mon osd min down reporters". I can even change it. But according to the docs, this is just a setting for monitors. Why does it appear there? Does it influence anything? If not: is there a way to only show the relevant config entries for a daemon?
>
> Then, when checking the doc page mentioned above and reading the descriptions of the multitude of config settings, I wonder: how can I properly estimate the time until my cluster works again? Since I get hung requests until the failed node is finally declared *down*, this time is obviously quite important for me. What exactly is the sequence of events when a node fails (i.e. someone accidentally hits the power off button)? My (possibly totally wrong & dumb) idea:
>
> 1) osd0 fails/doesn't answer.
>
> 2) osd1 pings osd0 every 6 seconds (osd heartbeat interval). Thus, after at most 6 seconds, osd1 notices osd0 *could be* down.
>
> 3) After another 20 seconds (osd heartbeat grace), osd1 decides osd0 is definitely down.

20 seconds *total*, not additional.

> 4) Another 120 seconds might elapse (osd mon report interval max) until osd1 reports the bad news to the monitor.

Nope. I don't remember how this config setting is used, but there's no way it will wait this long. There's a "min" parameter that corresponds to it; I think this 120 seconds is how long it will wait if there are *no changes*, and it relates to the PG stats, not to things like failure reports?

> 5) The monitor gets the information about failed osd0 and, since "mon osd min down reporters" is 1, this single OSD is sufficient for the monitor to believe the bad news that osd0 is unresponsive.
>
> 6) But since "mon osd min down reports" is 3, all the stuff up until now has to happen 3 times in a row before the monitor finally realizes osd0 is *really* unresponsive.

The OSD will report much more frequently once it decides a peer is down; I think every heartbeat interval.

> 7) After another 900 seconds (mon osd report timeout) of waiting in the hope of news that osd0 is still/back alive, the monitor marks osd0 as down.

This is mostly unrelated and not part of the normal process. (If it happens that *all* the OSDs die simultaneously, none of them will report failures to the monitor, and so it won't detect any problems! So we have a very large timeout for that.)

> 8) After another 300 seconds (mon osd down out interval) the monitor marks osd0 as down+out.

This is the time between when a monitor marks an OSD "down" (not currently serving data) and "out" (not considered *responsible* for data by the cluster). IO will resume once the OSD is down (assuming the PG has its minimum number of live replicas); it's just that data will be re-replicated to other nodes once an OSD is marked "out".
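For reference, the settings discussed above would look roughly like this in ceph.conf; the values are just the ones mentioned in this thread, not recommendations for any particular cluster:

  [osd]
  # how often an OSD pings its heartbeat peers
  osd heartbeat interval = 6
  # how long a peer may miss heartbeats before it is reported as failed
  osd heartbeat grace = 20

  [mon]
  # how many distinct OSDs must report a peer as failed before the monitor marks it down
  mon osd min down reporters = 1
  # delay between an OSD being marked "down" and being marked "out"
  mon osd down out interval = 300

They can also be changed at runtime with something like "ceph tell mon.* injectargs '--mon_osd_min_down_reporters 1'", but injected values only live in memory and are lost on a daemon restart unless they are also written to ceph.conf.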
Incidentally, if OSDs being accidentally turned off for extended periods of time is an ongoing concern and you don't have a lot of write IO happening, you probably want to increase that down-out interval by quite a lot. It's not a good situation for a Ceph cluster to be in, though. :/

> So, after my possibly very naive understanding, it takes 3*(6+20+120) + 900 + 300 seconds from the event "someone accidentally hit the power off switch" to "osd0 is marked down+out".
>
> Correct? I expect not. Which config variables did I misunderstand?

As above. Total time until the monitor declares an OSD dead and OSDs start peering for the new state should be around (20 + 2*6) +/- 6 seconds if there's only one peer. This can fluctuate if you have a lot of OSD flapping or reports that turn out to be incorrect, but it doesn't sound like that's happening to you.
-Greg

>
> Thank you
>
> Ranjan
>
>
> On 29.09.2016 at 20:48, LOPEZ Jean-Charles wrote:
>
> mon_osd_min_down_reporters is by default set to 2.
>
> I guess you’ll have to set it to 1 in your case.
>
> JC
>
> On Sep 29, 2016, at 08:16, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
>
> I think the problem is that Ceph requires a certain number of OSDs or a certain number of reports of failure before it marks an OSD down. These thresholds are not tuned for a 2-OSD cluster; you probably want to set them to 1.
> Also keep in mind that the OSDs provide a grace period of 20-30 seconds before they'll report somebody down; this helps prevent spurious recovery but means you will get paused IO on an unclean shutdown.
>
> I can't recall the exact config options off-hand, but it's something like "mon osd min down reports". Search the docs for that. :)
> -Greg
>
> On Thursday, September 29, 2016, Peter Maloney <peter.maloney@xxxxxxxxxxxxxxxxxxxx> wrote:
>>
>> On 09/29/16 14:07, Ranjan Ghosh wrote:
>> > Wow. Amazing. Thanks a lot!!! This works. 2 (hopefully) last questions on this issue:
>> >
>> > 1) When the first node is coming back up, I can just call "ceph osd up 0" and Ceph will start auto-repairing everything, right? That is, if there are e.g. new files that were created during the time the first node was down, they will (sooner or later) get replicated there?
>> Nope, there is no "ceph osd up <id>"; you just start the osd, and it already gets recognized as up. (If you don't like this, you set it out, not just down; and there is a "ceph osd in <id>" to undo that.)
>> >
>> > 2) If I don't call "osd down" manually (perhaps at the weekend when I'm not at the office) when a node dies - did I understand correctly that the "hanging" I experienced is temporary, and that after a few minutes (don't want to try it out now) the node should also go down automatically?
>> I believe so, yes.
>>
>> Also, FYI, RBD images don't seem to have this issue, and work right away on a 3-OSD cluster. Maybe cephfs would also work better with a 3rd osd, even an empty one (weight=0).
>> (and I had an unresolved issue testing the same with cephfs on my virtual test cluster)
>> >
>> > BR,
>> > Ranjan
>> >
>> >
>> > On 29.09.2016 at 13:00, Peter Maloney wrote:
>> >>
>> >> And also you could try:
>> >> ceph osd down <osd id>
>> >
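The commands mentioned in the quoted part of the thread, roughly as you would run them (osd id 0 is just an example):

  # manually mark an OSD down; a running OSD will notice and mark itself up again shortly
  ceph osd down 0

  # mark an OSD out (no longer considered responsible for data) or back in
  ceph osd out 0
  ceph osd in 0

  # show the current up/down and in/out state of all OSDs
  ceph osd tree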