Re: Ceph monitors overloaded on large cluster restart

Andras Pataki <apataki@xxxxxxxxxxxxxxxxxxxxx> · Wed, 19 Dec 2018 20:02:32 -0500



    Hi Dan,

    
    'noup' now makes a lot of sense - that's probably the major help
    that our cluster start would have needed.  Essentially this way only
    one map change occurs in the cluster when all the OSDs are marked
    'up' and that gets distributed, vs hundreds or thousands of map
    changes as various OSDs boot at slightly different times.  I have a
    smaller cluster that I can test it with and measure how the network
    traffic changes to the mons, and will plan this in the next
    shutdown/restart/upgrade.
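
To put rough numbers on that difference, here is a back-of-the-envelope sketch in Python. It assumes a deliberately simple model that is not taken from the Ceph code: every OSD has to receive every osdmap epoch created during startup, and without noup each OSD boot creates its own epoch. The function name is made up for illustration, and a real cluster batches changes, so actual counts will differ.

```python
def map_deliveries(num_osds: int, epochs_created: int) -> int:
    """Count epoch deliveries if every OSD must receive every new epoch."""
    return num_osds * epochs_created


num_osds = 3500  # cluster size quoted in the original post

# Simple-model worst case: each OSD boot creates one new epoch.
without_noup = map_deliveries(num_osds, epochs_created=num_osds)

# noup set during boot, then unset once: ideally a single epoch marks all up.
with_noup = map_deliveries(num_osds, epochs_created=1)

print("deliveries without noup:", without_noup)
print("deliveries with noup:   ", with_noup)
print("ratio:", without_noup // with_noup)
```

Under these assumptions the ratio between the two cases is simply the number of OSDs, which is why the saving grows with cluster size.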

    
    Thanks for the quick response and the tip!

    
    Andras
    

    On 12/19/18 6:47 PM, Dan van der Ster wrote:

    
      Hey Andras,
        

        Three mons is possibly too few for such a large cluster.
          We've had lots of good stable experience with 5-mon clusters.
          I've never tried 7, so I can't say if that would lead to other
          problems (e.g. leader/peon sync scalability).
        

        That said, our 10000-osd bigbang tests managed with only 3
          mons, and I assume that outside of this full system reboot
          scenario your 3 cope well enough. You should probably add 2
          more, but I wouldn't expect that alone to solve this problem
          in the future.
        

        Instead, with a slightly tuned procedure and a bit of osd
          log grepping, I think you could've booted this cluster more
          quickly than 4 hours with those mere 3 mons.

        
        As you know, each osd's boot process requires
          downloading all known osdmaps. If all osds are booting
          together, and the mons are saturated, the osds can become
          sluggish when responding to their peers, which could lead to
          the flapping scenario you saw. Flapping leads to new osdmap
          epochs that then need to be distributed, worsening the issue.
          It's good that you used nodown and noout, because without
          these the boot time would've been even longer. Next time also
          set noup and noin to further reduce the osdmap churn.
        

        One other thing: there's a debug_osd level -- 10 or 20, I
          forget exactly -- that you can set to watch the maps sync up
          on each osd. Grep the osd logs for some variations on "map"
          and "epoch".
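
The log grepping suggested above could be sketched in Python roughly as follows. This is only a minimal filter for lines mentioning "map" or "epoch"; the function name is hypothetical, and the sample lines are made-up placeholders, not real ceph-osd output (the actual log format depends on the Ceph version and the debug_osd level).

```python
import re
from typing import Iterable, Iterator

# Case-insensitive match on the two keywords suggested in the thread.
MAP_PATTERN = re.compile(r"map|epoch", re.IGNORECASE)


def map_related_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield only the lines that mention "map" or "epoch"."""
    for line in lines:
        if MAP_PATTERN.search(line):
            yield line.rstrip("\n")


# Placeholder input, only to exercise the filter.
sample = [
    "placeholder line about osdmap epoch 100\n",
    "placeholder line about something else\n",
    "placeholder line with Epoch in capitals\n",
]
matches = list(map_related_lines(sample))
for line in matches:
    print(line)
```

For a real log, pass an open file object instead of the sample list, for example `map_related_lines(open(path))` with `path` pointing at wherever your OSD logs live.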
        

        In short, here's what I would've done:
        

        0. boot the mons, waiting until they have a full quorum.
        1. set nodown, noup, noin, noout   <-- with these, there
          should be zero new osdmaps generated while the osds boot.
        2. start booting osds. set the necessary debug_osd level to
          see the osdmap sync progress in the ceph-osd logs.
        3. if the mons are oversaturated, boot progressively --
          one rack at a time, for example.
        4. once all osds have caught up to the current osdmap,
          unset noup. The osds should then all "boot" (as far as the
          mons are concerned) and be marked up. (This might be sluggish
          on a 3400 osd cluster, perhaps taking a few tens of seconds.)
          The pgs should be active+clean at this point.
        5. unset nodown, noin, noout, which should change nothing
          provided all went well.
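
The flag handling in steps 1, 4 and 5 above can be sketched as a small Python helper. It only builds the `ceph osd set <flag>` and `ceph osd unset <flag>` commands and prints them by default; the helper names and the dry-run switch are assumptions made for illustration, and steps 0, 2 and 3 (quorum check, starting osds, watching the logs) are site-specific and left out.

```python
import subprocess
from typing import List

BOOT_FLAGS = ["nodown", "noup", "noin", "noout"]


def flag_commands(action: str, flags: List[str]) -> List[List[str]]:
    """Build one `ceph osd set|unset <flag>` command per flag."""
    if action not in ("set", "unset"):
        raise ValueError("action must be 'set' or 'unset'")
    return [["ceph", "osd", action, flag] for flag in flags]


def run(commands: List[List[str]], dry_run: bool = True) -> None:
    """Print the commands, or execute them when dry_run is False."""
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


# Step 1: set all four flags before any osd is started.
before_boot = flag_commands("set", BOOT_FLAGS)
# Step 4: once every osd has caught up to the current osdmap.
after_sync = flag_commands("unset", ["noup"])
# Step 5: clear the remaining flags.
cleanup = flag_commands("unset", ["nodown", "noin", "noout"])

for stage in (before_boot, after_sync, cleanup):
    run(stage)  # dry run: only prints the commands
```

Keeping the default as a dry run means the sequence can be reviewed before anything touches the cluster; pass `dry_run=False` on a host with an admin keyring to actually execute it.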
        

        Hope that helps for next time!

          
          Dan
        

        On Wed, Dec 19, 2018 at 11:39 PM Andras Pataki <apataki@xxxxxxxxxxxxxxxxxxxxx> wrote:

        
        Forgot to mention: all nodes are currently on Luminous 12.2.8,
          running on CentOS 7.5.

          
          On 12/19/18 5:34 PM, Andras Pataki wrote:

          > Dear ceph users,
          >
          > We have a large-ish ceph cluster with about 3500 osds.  We run 3 mons
          > on dedicated hosts, and the mons typically use a few percent of a
          > core, and generate about 50Mbits/sec network traffic.  They are
          > connected at 20Gbits/sec (bonded dual 10Gbit) and are running on 2x14
          > core servers.
          >
          > We recently had to shut ceph down completely for maintenance (which we
          > rarely do), and had significant difficulties starting it up.  The
          > symptoms included OSDs hanging on startup, being marked down, flapping
          > and all that bad stuff.  After some investigation we found that the
          > 20Gbit/sec network interfaces of the monitors were completely
          > saturated as the OSDs were starting, while the monitor processes were
          > using about 3 cores (300% CPU).  We ended up having to start the OSDs
          > up super slow to make sure that the monitors could keep up - it took
          > about 4 hours to start 3500 OSDs (at a rate about 4 seconds per OSD).
          > We've tried setting noout and nodown, but that didn't really help
          > either.  A few questions that would be good to understand in order to
          > move to a better configuration.
          >
          > 1. How does the monitor traffic scale with the number of OSDs?
          > Presumably the traffic comes from distributing cluster maps as the
          > cluster changes on OSD starts.  The cluster map is perhaps O(N) for N
          > OSDs, and each OSD needs an update on a cluster change so that would
          > make one change an O(N^2) traffic.  As OSDs start, the cluster changes
          > quite a lot (N times?), so would that make the startup traffic
          > O(N^3)?  If so, that sounds pretty scary for scalability.
          >
          > 2. Would adding more monitors help here?  I.e. presumably each OSD
          > gets its maps from one monitor, so they would share the traffic. Would
          > the inter-monitor communication/elections/etc. be problematic for more
          > monitors (5, 7 or even more)?  Would more monitors be recommended?  If
          > so, how many is practical?
          >
          > 3. Are there any config parameters useful for tuning the traffic
          > (perhaps send mon updates less frequently, or something along those
          > lines)?
          >
          > Any other advice on this topic would also be helpful.
          >
          > Thanks,
          >
          > Andras
          >

          _______________________________________________

          ceph-users mailing list

          ceph-users@xxxxxxxxxxxxxx

          http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

        