Re: Slow performance during recovery operations

Hi,

On 04/02/15 19:31, Stillwell, Bryan wrote:
All,

Whenever we're doing some kind of recovery operation on our Ceph
clusters (cluster expansion or dealing with a drive failure), there
seems to be a fairly noticeable performance drop while it does the
backfills (the last time I measured it, performance during recovery
was something like 20% of a healthy cluster). I'm wondering if there
are any settings we might be missing which would improve this
situation?

Before doing any kind of expansion operation, I make sure both the
'noscrub' and 'nodeep-scrub' flags are set so that scrubbing isn't
making things worse.
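
In other words, something along these lines before starting, reverting once the cluster is healthy again:

    ceph osd set noscrub
    ceph osd set nodeep-scrub
    # ... expansion / backfilling ...
    ceph osd unset noscrub
    ceph osd unset nodeep-scrub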

Also we have the following options set in our ceph.conf:

[osd]
osd_journal_size = 16384
osd_max_backfills = 1
osd_recovery_max_active = 1
osd_recovery_op_priority = 1
osd_recovery_max_single_start = 1
osd_op_threads = 12
osd_crush_initial_weight = 0
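
For what it's worth, the values a running OSD actually picked up from ceph.conf can be checked via its admin socket on the OSD host, e.g.:

    ceph daemon osd.0 config show | grep -E 'backfill|recovery'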


I'm wondering if there might be a way to use ionice with the CFQ
scheduler to put the recovery traffic in the Idle class so customer
traffic has a higher priority?
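
(For reference, the manual invocation would be something like the following, though applied to a whole OSD process it would deprioritize all of its I/O, not just recovery; <pid> stands for the OSD's process id:)

    # CFQ only: class 3 (idle) I/O is only serviced when the disk is otherwise idle
    ionice -c 3 -p <pid>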

Recovery causes I/O performance drops in our VMs too, but it's manageable. What really hurts us are deep scrubs.
Our current setup is Firefly 0.80.9 with a total of 24 identical OSDs evenly distributed across 4 servers, with the following relevant configuration:

    osd recovery max active         = 2
    osd scrub load threshold        = 3
    osd deep scrub interval         = 1209600 # 14 days
    osd max backfills               = 4
    osd disk thread ioprio class    = idle
    osd disk thread ioprio priority = 7

We managed to add several OSDs at once while deep scrubs were effectively disabled: we simply increased the deep scrub interval from 1 to 2 weeks, which, if I understand correctly, had the effect of disabling them for a week (and indeed there were none while the backfilling went on for several hours).
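
For reference, that kind of change can also be injected into the running OSDs without a restart, e.g. (note the underscores in the injected option name):

    ceph tell osd.* injectargs '--osd_deep_scrub_interval 1209600'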

With these settings and no deep scrubs, the load increased a bit in the VMs doing non-negligible I/O, but this was manageable. Even the disk thread ioprio settings (which are what you want to get the ionice behaviour for deep scrubs) didn't seem to make much of a difference.

Note: I don't believe Ceph tries to scatter the scrubs over the whole period you set with deep scrub interval; its algorithm seems much simpler than that, and it can lead to temporary salvos of successive deep scrubs, which may generate a temporary I/O load that is hard to diagnose (by default scrubs and deep scrubs are logged by the OSDs, so you can correlate them with whatever you use to supervise your cluster).
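
For example, with default logging something like this shows when deep scrubs ran on a given OSD host (log paths assume the defaults):

    grep 'deep-scrub' /var/log/ceph/ceph-osd.*.log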

I've actually considered monitoring Ceph for backfills and automatically running 'ceph osd set nodeep-scrub' while they are in progress, unsetting it when they disappear.
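
A minimal sketch of that idea, meant to run periodically (e.g. from cron); untested, and the grep pattern is an assumption about the exact wording of 'ceph health detail' output:

    #!/bin/sh
    # Set nodeep-scrub while any PG is backfilling, clear it otherwise.
    if ceph health detail | grep -q backfill; then
        ceph osd set nodeep-scrub
    else
        ceph osd unset nodeep-scrub
    fi

Note that as written it would also clear the flag if it had been set manually for some other reason.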

Best regards,

Lionel Bouton
