Re: Ceph Very Small Cluster

Thanks JC & Greg, I've changed the "mon osd min down reporters" to 1. According to this:

http://docs.ceph.com/docs/jewel/rados/configuration/mon-osd-interaction/

the default is already 1, though. I don't remember what the value was before I changed it everywhere, so I can't say for sure now. But I think it was 2, despite what the docs say. Whatever. It's now 1 everywhere.
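For the record, this is roughly the ceph.conf fragment involved (a sketch only; the option name is as spelled in the Jewel docs, and the monitors need a restart or injectargs for a change to take effect):

```ini
# ceph.conf sketch: with only 2 OSDs, a single reporting OSD should be
# enough for the monitors to believe a failure report
[mon]
mon osd min down reporters = 1
```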

Another somewhat weird thing I found: when I check the values of an OSD(!) with "ceph daemon osd.0 config show | sort | grep mon_osd", I see an entry "mon osd min down reporters". I can even change it. But according to the docs, this is a monitor-only setting. Why does it appear there? Does it influence anything? If not: is there a way to show only the config entries that are relevant for a daemon?

Then, while reading the doc page mentioned above and the descriptions of the multitude of config settings, I started to wonder: how can I properly estimate the time until my cluster works again? Since I get hung requests until the failed node is finally declared *down*, this time is obviously quite important for me. What exactly is the sequence of events when a node fails (e.g. someone accidentally hits the power-off button)? My (possibly totally wrong & dumb) idea:

1) osd0 fails/doesn't answer

2) osd1 pings osd0 every 6 seconds (osd heartbeat interval). Thus, after at most 6 seconds, osd1 notices osd0 *could be* down.

3) After another 20 seconds (osd heartbeat grace), osd1 decides osd0 is definitely down.

4) Another 120 seconds might elapse (osd mon report interval max) before osd1 reports the bad news to the monitor.

5) The monitor receives the report about failed osd0, and since "mon osd min down reporters" is 1, this single OSD is sufficient for the monitor to believe the bad news that osd0 is unresponsive.

6) But since "mon osd min down reports" is 3, everything up to now has to happen 3 times in a row before the monitor finally accepts that osd0 is *really* unresponsive.

7) After another 900 seconds (mon osd report timeout) of waiting in the hope of news that osd0 is still/back alive, the monitor marks osd0 as down.

8) After another 300 seconds (mon osd down out interval) the monitor marks osd0 as down+out.


So, by my possibly very naive understanding, it takes 3*(6+20+120) + 900 + 300 = 1638 seconds from the event "someone accidentally hits the power-off switch" to "osd0 is marked down+out".
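Spelled out as a calculation (this only adds up my interpretation of the options above; whether Ceph really behaves this way is exactly my question):

```shell
# Naive worst-case timeline from power-off to down+out, using the
# values quoted above. No claim that Ceph actually chains them like this.
osd_heartbeat_interval=6
osd_heartbeat_grace=20
osd_mon_report_interval_max=120
mon_osd_min_down_reports=3
mon_osd_report_timeout=900
mon_osd_down_out_interval=300

total=$(( mon_osd_min_down_reports * (osd_heartbeat_interval + osd_heartbeat_grace + osd_mon_report_interval_max) + mon_osd_report_timeout + mon_osd_down_out_interval ))
echo "${total} seconds"   # 1638 seconds
```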

Correct? I expect not. Which config variables did I misunderstand?


Thank you

Ranjan




On 29.09.2016 at 20:48, LOPEZ Jean-Charles wrote:
mon_osd_min_down_reporters is set to 2 by default.

I guess you’ll have to set it to 1 in your case

JC

On Sep 29, 2016, at 08:16, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:

I think the problem is that Ceph requires a certain number of OSDs or a certain number of reports of failure before it marks an OSD down. These thresholds are not tuned for a 2-OSD cluster; you probably want to set them to 1.
Also keep in mind that the OSDs provide a grace period of 20-30 seconds before they'll report somebody down; this helps prevent spurious recovery but means you will get paused IO on an unclean shutdown.

I can't recall the exact config options off-hand, but it's something like "mon osd min down reports". Search the docs for that. :)
-Greg

On Thursday, September 29, 2016, Peter Maloney <peter.maloney@xxxxxxxxxxxxxxxxxxxx> wrote:
On 09/29/16 14:07, Ranjan Ghosh wrote:
> Wow. Amazing. Thanks a lot!!! This works. 2 (hopefully) last questions
> on this issue:
>
> 1) When the first node is coming back up, I can just call "ceph osd up
> 0" and Ceph will start auto-repairing everything everything, right?
> That is, if there are e.g. new files that were created during the time
> the first node was down, they will (sooner or later) get replicated
> there?
Nope, there is no "ceph osd up <id>"; you just start the osd and it
gets recognized as up automatically. (If you don't like this, set it
out, not just down; and there is a "ceph osd in <id>" to undo that.)
>
> 2) If I don't call "osd down" manually (perhaps at the weekend when
> I'm not at the office) when a node dies - did I understand correctly
> that the "hanging" I experienced is temporary and that after a few
> minutes (don't want to try out now) the node should also go down
> automatically?
I believe so, yes.

Also, FYI, RBD images don't seem to have this issue and work right away
on a 3-OSD cluster. Maybe cephfs would also work better with a 3rd OSD,
even an empty one (weight=0). (And I had an unresolved issue testing the
same with cephfs on my virtual test cluster.)
>
> BR,
> Ranjan
>
>
> On 29.09.2016 at 13:00, Peter Maloney wrote:
>>
>> And also you could try:
>>      ceph osd down <osd id>
>

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
