OSD flapping during recovery

I had some issues with OSD flapping after 2 days of recovery.  It appears to be related to swapping, even though I have plenty of RAM for the number of OSDs I have.  The cluster was completely unusable, and I ended up rebooting all the nodes.  It's been great ever since, but I'm assuming it will happen again.

Details are below, but I'm wondering if anybody has any idea what happened.




I noticed some lumpy data distribution on my OSDs.  Following the advice on the mailing list, I increased pg_num and pgp_num to the values from the formula.  .rgw.buckets is the only large pool, so I increased pg_num and pgp_num from 128 to 2048 on that one pool.  Cluster status changed to HEALTH_WARN, there were 1920 PGs in state active+remapped+wait_backfill, and 32% of the objects were degraded.
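
For the record, the resize was done with the standard pool set commands; if memory serves, something like this (pg_num first, then pgp_num):

    ceph osd pool set .rgw.buckets pg_num 2048
    ceph osd pool set .rgw.buckets pgp_num 2048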

Recovery was slow, and we were having some performance issues.  I lowered osd_max_backfills from 10 to 2, and osd_recovery_op_priority from 10 to 2.  This didn't slow the recovery down much, but it made my application much more responsive.  My journals are on the OSD disks (no SSDs).  I believe osd_max_backfills was the more important change, but it's much slower to test than the osd_recovery_op_priority change.  Aside from those two, my notes say I changed and reverted osd_disk_threads, osd_op_threads, and osd_recovery_threads.  All changes were pushed out via the admin socket, e.g. ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config set osd_max_backfills 2.
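
To be precise, each setting was applied to every OSD daemon on every node; roughly this loop on each node (the exact form is from memory):

    for sock in /var/run/ceph/ceph-osd.*.asok; do
        ceph --admin-daemon "$sock" config set osd_max_backfills 2
        ceph --admin-daemon "$sock" config set osd_recovery_op_priority 2
    done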


I watched the cluster on and off over the weekend.  Ceph was steadily recovering.  It was down to ~900 PGs in active+remapped+wait_backfill, with 17% of objects degraded.  A few OSDs had been marked down and recovered, so a few tens of PGs were in state active+degraded+remapped+wait_backfill and active+degraded+remapped+backfilling.  While poking around, I noticed kswapd was using between 5% and 30% CPU on all nodes.  It was bursty, peaking at 30% CPU usage for about 5 seconds out of every 30.  Swap usage wasn't increasing, and kswapd appeared to be doing a lot of nothing.  My machines have 8 OSDs and 36GB of RAM.  top said that all machines were caching 30GB of data.  The 8 ceph-osd daemons were each using between 0.5GB and 1.2GB of RAM; I don't have the exact numbers, but I believe it was about 5GB total across all 8 daemons.
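
For reference, the numbers above came from the usual tools; roughly the following on each node, reconstructed from memory:

    top -b -n 1 | head -n 20    # kswapd CPU, cached memory, per-OSD resident size
    free -m                     # swap totals, confirming swap usage wasn't growing
    vmstat 5                    # si/so columns, to watch for actual swap-in/out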


A few hours later, the OSDs really started flapping.  They were being voted unresponsive and marked down faster than they could rejoin.  At one point, a third of the OSDs were marked down.  ceph -w was complaining about hundreds of slow requests older than 900 seconds.  Most RGW accesses were failing with HTTP timeouts.  kswapd was using a consistent 33% CPU on all nodes, with no variance that I could see.  To add insult to injury, the cluster was also running a scrub and a deep scrub.
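
One thing I'm considering for next time (I haven't actually tested this, so treat it as my assumption about the right knobs): disabling scrubbing while a big recovery is in flight, then re-enabling it afterwards:

    ceph osd set noscrub
    ceph osd set nodeep-scrub
    # ... wait for recovery to finish ...
    ceph osd unset noscrub
    ceph osd unset nodeep-scrub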


I eventually rebooted all nodes in the cluster, one at a time.  Once quorum was reestablished, recovery proceeded at the original speed.  The OSDs are responding, and all my RGW requests are returning in a reasonable amount of time.  There are no complaints of slow requests in ceph -w, and kswapd is using 0% CPU.
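
In case the procedure matters, it was a plain rolling reboot; a rough sketch (the noout flag is from memory and may be an assumption on my part):

    ceph osd set noout
    # reboot one node, wait for 'ceph -s' to show all OSDs back up, then repeat for the next node
    ceph osd unset noout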


I'm running Ceph 0.72.2 on Ubuntu 12.04.4, with kernel 3.5.0-37-generic #58~precise1-Ubuntu SMP.

I monitor the running version as well as the installed version, so I know that all daemons were restarted after the 0.72.1 -> 0.72.2 upgrade.  That happened on Jan 22nd.
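
The running vs. installed check is just the admin socket's version against the package version; something like this, with the exact package name being my assumption here:

    ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok version    # what the daemon is actually running
    dpkg-query -W -f='${Version}\n' ceph                         # what's installed on disk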



Any idea what happened?  I'm assuming it will happen again if recovery takes long enough.




--

Craig Lewis
Senior Systems Engineer
Office +1.714.602.1309
Email clewis@xxxxxxxxxxxxxxxxxx


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
