Re: Ceph Very Small Cluster

Thanks JC & Greg, I've changed the "mon osd min down reporters" to 1. According to this:

http://docs.ceph.com/docs/jewel/rados/configuration/mon-osd-interaction/

the default is already 1, though. I don't remember what the value was before I changed it everywhere, so I can't say for sure now. But I think it was 2, despite what the docs say. Whatever. It's now 1 everywhere.
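For the record, this is roughly the ceph.conf fragment involved (a sketch only; the option name is as spelled in the Jewel docs, and the monitors need a restart or injectargs for a change to take effect):

```ini
# ceph.conf sketch: with only 2 OSDs, a single reporting OSD should be
# enough for the monitors to believe a failure report
[mon]
mon osd min down reporters = 1
```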

Another somewhat weird thing I found: when I check the values of an OSD(!) with "ceph daemon osd.0 config show | sort | grep mon_osd", I see an entry "mon osd min down reporters". I can even change it. But according to the docs, this is a monitor-only setting. Why does it appear there? Does it influence anything? If not: is there a way to show only the config entries that are relevant for a daemon?

Then, while reading the doc page mentioned above and the descriptions of the multitude of config settings, I started to wonder: how can I properly estimate the time until my cluster works again? Since I get hung requests until the failed node is finally declared *down*, this time is obviously quite important for me. What exactly is the sequence of events when a node fails (e.g. someone accidentally hits the power-off button)? My (possibly totally wrong & dumb) idea:

1) osd0 fails/doesn't answer

2) osd1 pings osd0 every 6 seconds (osd heartbeat interval). Thus, after at most 6 seconds, osd1 notices osd0 *could be* down.

3) After another 20 seconds (osd heartbeat grace), osd1 decides osd0 is definitely down.

4) Another 120 seconds might elapse (osd mon report interval max) before osd1 reports the bad news to the monitor.

5) The monitor receives the report about failed osd0, and since "mon osd min down reporters" is 1, this single OSD is sufficient for the monitor to believe the bad news that osd0 is unresponsive.

6) But since "mon osd min down reports" is 3, everything up to now has to happen 3 times in a row before the monitor finally accepts that osd0 is *really* unresponsive.

7) After another 900 seconds (mon osd report timeout) of waiting in the hope of news that osd0 is still/back alive, the monitor marks osd0 as down.

8) After another 300 seconds (mon osd down out interval) the monitor marks osd0 as down+out.


So, by my possibly very naive understanding, it takes 3*(6+20+120) + 900 + 300 = 1638 seconds from the event "someone accidentally hits the power-off switch" to "osd0 is marked down+out".
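Spelled out as a calculation (this only adds up my interpretation of the options above; whether Ceph really behaves this way is exactly my question):

```shell
# Naive worst-case timeline from power-off to down+out, using the
# values quoted above. No claim that Ceph actually chains them like this.
osd_heartbeat_interval=6
osd_heartbeat_grace=20
osd_mon_report_interval_max=120
mon_osd_min_down_reports=3
mon_osd_report_timeout=900
mon_osd_down_out_interval=300

total=$(( mon_osd_min_down_reports * (osd_heartbeat_interval + osd_heartbeat_grace + osd_mon_report_interval_max) + mon_osd_report_timeout + mon_osd_down_out_interval ))
echo "${total} seconds"   # 1638 seconds
```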

Correct? I expect not. Which config variables did I misunderstand?


Thank you

Ranjan




On 29.09.2016 at 20:48, LOPEZ Jean-Charles wrote:
mon_osd_min_down_reporters is set to 2 by default.

I guess you’ll have to set it to 1 in your case

JC

On Sep 29, 2016, at 08:16, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:

I think the problem is that Ceph requires a certain number of OSDs or a certain number of reports of failure before it marks an OSD down. These thresholds are not tuned for a 2-OSD cluster; you probably want to set them to 1.
Also keep in mind that the OSDs provide a grace period of 20-30 seconds before they'll report somebody down; this helps prevent spurious recovery but means you will get paused IO on an unclean shutdown.

I can't recall the exact config options off-hand, but it's something like "mon osd min down reports". Search the docs for that. :)
-Greg

On Thursday, September 29, 2016, Peter Maloney <peter.maloney@xxxxxxxxxxxxxxxxxxxx> wrote:
On 09/29/16 14:07, Ranjan Ghosh wrote:
> Wow. Amazing. Thanks a lot!!! This works. 2 (hopefully) last questions
> on this issue:
>
> 1) When the first node is coming back up, I can just call "ceph osd up
> 0" and Ceph will start auto-repairing everything everything, right?
> That is, if there are e.g. new files that were created during the time
> the first node was down, they will (sooner or later) get replicated
> there?
Nope, there is no "ceph osd up <id>"; you just start the osd and it
gets recognized as up automatically. (If you don't like this, set it
out, not just down; and there is a "ceph osd in <id>" to undo that.)
>
> 2) If I don't call "osd down" manually (perhaps at the weekend when
> I'm not at the office) when a node dies - did I understand correctly
> that the "hanging" I experienced is temporary and that after a few
> minutes (don't want to try out now) the node should also go down
> automatically?
I believe so, yes.

Also, FYI, RBD images don't seem to have this issue and work right away
on a 3-OSD cluster. Maybe cephfs would also work better with a 3rd OSD,
even an empty one (weight=0). (And I had an unresolved issue testing the
same with cephfs on my virtual test cluster.)
>
> BR,
> Ranjan
>
>
> On 29.09.2016 at 13:00, Peter Maloney wrote:
>>
>> And also you could try:
>>      ceph osd down <osd id>
>

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
