On Mon, Oct 24, 2016 at 3:31 AM, Ranjan Ghosh <ghosh@xxxxxx> wrote:
> Thanks JC & Greg, I've changed the "mon osd min down reporters" to 1. According to this:
>
> http://docs.ceph.com/docs/jewel/rados/configuration/mon-osd-interaction/
>
> the default is already 1, though. I don't remember the value before I changed it everywhere, so I can't say for sure now. But I think it was 2 despite what the docs say. Whatever. It's now 1 everywhere.
>
> Another somewhat weird thing I found was: when I check the values of an OSD(!) with "ceph daemon osd.0 config show | sort | grep mon_osd" I see an entry "mon osd min down reporters". I can even change it. But according to the docs, this is just a setting for monitors. Why does it appear there? Does it influence anything? If not: is there a way to only show the relevant config entries for a daemon?
>
> Then, when checking the doc page mentioned above and reading the descriptions of the multitude of config settings, I wonder: how can I properly estimate the time until my cluster works again? Since I get hung requests until the failed node is finally declared *down*, this time is obviously quite important for me. What exactly is the sequence of events when a node fails (i.e. someone accidentally hits the power off button)? My (possibly totally wrong & dumb) idea:
>
> 1) osd0 fails/doesn't answer.
>
> 2) osd1 pings osd0 every 6 seconds (osd heartbeat interval). Thus, after at most 6 seconds, osd1 notices osd0 *could be* down.
>
> 3) After another 20 seconds (osd heartbeat grace), osd1 decides osd0 is definitely down.

20 seconds *total*, not additional.

> 4) Another 120 seconds might elapse (osd mon report interval max) until osd1 reports the bad news to the monitor.

Nope. I don't remember how this config setting is used, but there's no way it will wait this long. There's a "min" parameter that corresponds to it; I think this 120 seconds is how long it will wait if there are *no changes*, and it relates to the PG stats, not to things like failure reports?

> 5) The monitor gets the information about failed osd0 and, since "mon osd min down reporters" is 1, this single OSD is sufficient for the monitor to believe the bad news that osd0 is unresponsive.
>
> 6) But since "mon osd min down reports" is 3, all the stuff up until now has to happen 3 times in a row before the monitor finally realizes osd0 is *really* unresponsive.

The OSD will report much more frequently once it decides a peer is down; I think every heartbeat interval.

> 7) After another 900 seconds (mon osd report timeout) of waiting in the hope of news that osd0 is still/back alive, the monitor marks osd0 as down.

This is mostly unrelated and not part of the normal process. (If it happens that *all* the OSDs die simultaneously, none of them will report failures to the monitor, and so it won't detect any problems! So we have a very large timeout for that.)

> 8) After another 300 seconds (mon osd down out interval) the monitor marks osd0 as down+out.

This is the time between when a monitor marks an OSD "down" (not currently serving data) and "out" (not considered *responsible* for data by the cluster). IO will resume once the OSD is down (assuming the PG has its minimum number of live replicas); it's just that data will be re-replicated to other nodes once an OSD is marked "out".
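For reference, the settings discussed above would look roughly like this in ceph.conf; the values are just the ones mentioned in this thread, not recommendations for any particular cluster:

  [osd]
  # how often an OSD pings its heartbeat peers
  osd heartbeat interval = 6
  # how long a peer may miss heartbeats before it is reported as failed
  osd heartbeat grace = 20

  [mon]
  # how many distinct OSDs must report a peer as failed before the monitor marks it down
  mon osd min down reporters = 1
  # delay between an OSD being marked "down" and being marked "out"
  mon osd down out interval = 300

They can also be changed at runtime with something like "ceph tell mon.* injectargs '--mon_osd_min_down_reporters 1'", but injected values only live in memory and are lost on a daemon restart unless they are also written to ceph.conf.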
Incidentally, if OSDs being accidentally turned off for extended periods of time is an ongoing concern and you don't have a lot of write IO happening, you probably want to increase that down-out interval by quite a lot. It's not a good situation for a Ceph cluster to be in, though. :/

> So, after my possibly very naive understanding, it takes 3*(6+20+120) + 900 + 300 seconds from the event "someone accidentally hit the power off switch" to "osd0 is marked down+out".
>
> Correct? I expect not. Which config variables did I misunderstand?

As above. Total time until the monitor declares an OSD dead and OSDs start peering for the new state should be around (20 + 2*6) +/- 6 seconds if there's only one peer. This can fluctuate if you have a lot of OSD flapping or reports that turn out to be incorrect, but it doesn't sound like that's happening to you.
-Greg

>
> Thank you
>
> Ranjan
>
>
> On 29.09.2016 at 20:48, LOPEZ Jean-Charles wrote:
>
> mon_osd_min_down_reporters is by default set to 2.
>
> I guess you’ll have to set it to 1 in your case.
>
> JC
>
> On Sep 29, 2016, at 08:16, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
>
> I think the problem is that Ceph requires a certain number of OSDs or a certain number of reports of failure before it marks an OSD down. These thresholds are not tuned for a 2-OSD cluster; you probably want to set them to 1.
> Also keep in mind that the OSDs provide a grace period of 20-30 seconds before they'll report somebody down; this helps prevent spurious recovery but means you will get paused IO on an unclean shutdown.
>
> I can't recall the exact config options off-hand, but it's something like "mon osd min down reports". Search the docs for that. :)
> -Greg
>
> On Thursday, September 29, 2016, Peter Maloney <peter.maloney@xxxxxxxxxxxxxxxxxxxx> wrote:
>>
>> On 09/29/16 14:07, Ranjan Ghosh wrote:
>> > Wow. Amazing. Thanks a lot!!! This works. 2 (hopefully) last questions on this issue:
>> >
>> > 1) When the first node is coming back up, I can just call "ceph osd up 0" and Ceph will start auto-repairing everything, right? That is, if there are e.g. new files that were created during the time the first node was down, they will (sooner or later) get replicated there?
>> Nope, there is no "ceph osd up <id>"; you just start the osd, and it already gets recognized as up. (If you don't like this, you set it out, not just down; and there is a "ceph osd in <id>" to undo that.)
>> >
>> > 2) If I don't call "osd down" manually (perhaps at the weekend when I'm not at the office) when a node dies - did I understand correctly that the "hanging" I experienced is temporary, and that after a few minutes (don't want to try it out now) the node should also go down automatically?
>> I believe so, yes.
>>
>> Also, FYI, RBD images don't seem to have this issue, and work right away on a 3-OSD cluster. Maybe cephfs would also work better with a 3rd osd, even an empty one (weight=0).
>> (and I had an unresolved issue testing the same with cephfs on my virtual test cluster)
>> >
>> > BR,
>> > Ranjan
>> >
>> >
>> > On 29.09.2016 at 13:00, Peter Maloney wrote:
>> >>
>> >> And also you could try:
>> >> ceph osd down <osd id>
>> >
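The commands mentioned in the quoted part of the thread, roughly as you would run them (osd id 0 is just an example):

  # manually mark an OSD down; a running OSD will notice and mark itself up again shortly
  ceph osd down 0

  # mark an OSD out (no longer considered responsible for data) or back in
  ceph osd out 0
  ceph osd in 0

  # show the current up/down and in/out state of all OSDs
  ceph osd tree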