Re: Slow performance during recovery operations

Hi,

On 04/02/15 19:31, Stillwell, Bryan wrote:
All,

Whenever we're doing some kind of recovery operation on our Ceph
clusters (cluster expansion or dealing with a drive failure), there
seems to be a fairly noticeable performance drop while it does the
backfills (the last time I measured it, performance during recovery
was something like 20% of a healthy cluster). I'm wondering if there
are any settings we might be missing which would improve this
situation?

Before doing any kind of expansion operation, I make sure both the
'noscrub' and 'nodeep-scrub' flags are set so that scrubbing isn't
making things worse.
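
In other words, something along these lines before starting, reverting once the cluster is healthy again:

    ceph osd set noscrub
    ceph osd set nodeep-scrub
    # ... expansion / backfilling ...
    ceph osd unset noscrub
    ceph osd unset nodeep-scrub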

Also we have the following options set in our ceph.conf:

[osd]
osd_journal_size = 16384
osd_max_backfills = 1
osd_recovery_max_active = 1
osd_recovery_op_priority = 1
osd_recovery_max_single_start = 1
osd_op_threads = 12
osd_crush_initial_weight = 0
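
For what it's worth, the values a running OSD actually picked up from ceph.conf can be checked via its admin socket on the OSD host, e.g.:

    ceph daemon osd.0 config show | grep -E 'backfill|recovery'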


I'm wondering if there might be a way to use ionice with the CFQ
scheduler to put the recovery traffic in the Idle class so customer
traffic has a higher priority?
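
(For reference, the manual invocation would be something like the following, though applied to a whole OSD process it would deprioritize all of its I/O, not just recovery; <pid> stands for the OSD's process id:)

    # CFQ only: class 3 (idle) I/O is only serviced when the disk is otherwise idle
    ionice -c 3 -p <pid>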

Recovery causes I/O performance drops in our VMs too, but it's manageable. What really hurts us are deep scrubs.
Our current setup is Firefly 0.80.9 with a total of 24 identical OSDs evenly distributed across 4 servers, with the following relevant configuration:

    osd recovery max active         = 2
    osd scrub load threshold        = 3
    osd deep scrub interval         = 1209600 # 14 days
    osd max backfills               = 4
    osd disk thread ioprio class    = idle
    osd disk thread ioprio priority = 7

We managed to add several OSDs at once while deep scrubs were effectively disabled: we simply increased the deep scrub interval from 1 to 2 weeks, which, if I understand correctly, had the effect of disabling them for a week (and indeed there were none while the backfilling went on for several hours).
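
For reference, that kind of change can also be injected into the running OSDs without a restart, e.g. (note the underscores in the injected option name):

    ceph tell osd.* injectargs '--osd_deep_scrub_interval 1209600'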

With these settings and no deep scrubs, the load increased a bit in the VMs doing non-negligible I/O, but this was manageable. Even the disk thread ioprio settings (which are what you want to get the ionice behaviour for deep scrubs) didn't seem to make much of a difference.

Note: I don't believe Ceph tries to scatter the scrubs over the whole period you set with deep scrub interval; its algorithm seems much simpler than that, and it can lead to temporary salvos of successive deep scrubs, which may generate a temporary I/O load that is hard to diagnose (by default scrubs and deep scrubs are logged by the OSDs, so you can correlate them with whatever you use to supervise your cluster).
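
For example, with default logging something like this shows when deep scrubs ran on a given OSD host (log paths assume the defaults):

    grep 'deep-scrub' /var/log/ceph/ceph-osd.*.log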

I've actually considered monitoring Ceph for backfills and automatically running 'ceph osd set nodeep-scrub' while they are in progress, unsetting it when they disappear.
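
A minimal sketch of that idea, meant to run periodically (e.g. from cron); untested, and the grep pattern is an assumption about the exact wording of 'ceph health detail' output:

    #!/bin/sh
    # Set nodeep-scrub while any PG is backfilling, clear it otherwise.
    if ceph health detail | grep -q backfill; then
        ceph osd set nodeep-scrub
    else
        ceph osd unset nodeep-scrub
    fi

Note that as written it would also clear the flag if it had been set manually for some other reason.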

Best regards,

Lionel Bouton
