Re: Repair/Rebalance slows down

What does iostat show for the drive in question? What you're seeing is the
whole cluster rebalancing at first; at the tail end it's probably just that
single new drive being filled. With backfills per OSD (osd_max_backfills)
set to 2 or so (much more than that doesn't help), I'd expect the fill rate
of the newly added drive to be 25-100 MB/s. Check the disk utilization for
the newly added OSD at the tail end, and you'll probably see it is
IOPS-saturated.
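For reference, a minimal way to check this from the CLI, assuming a release
new enough for the "ceph config" commands (Nautilus or later); sdX below is
just a placeholder for the data device behind the newly added OSD:

    # Watch per-device utilization on the host with the new OSD;
    # %util pinned near 100 with low MB/s means it's IOPS-bound.
    iostat -x 5 sdX

    # Check the current backfill limit and overall recovery progress.
    ceph config get osd osd_max_backfills
    ceph -s

    # Raise backfills modestly if the disk isn't already saturated;
    # on older releases: ceph tell osd.* injectargs '--osd-max-backfills 2'
    ceph config set osd osd_max_backfills 2

If the new OSD's disk is already at 100% utilization, raising backfills
won't help; the tail end is simply bounded by that one drive's IOPS.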

On Thu, Jan 6, 2022 at 8:09 AM Ray Cunningham <ray.cunningham@xxxxxxxxxxxxxx>
wrote:

> Hi Everyone!
>
> I have a 16-node, 640-OSD (5 to 1 SSD) BlueStore cluster that is mainly
> used for RGW services. It has its own backend cluster network for IO,
> separate from the customer network.
>
> Whenever we add or remove an OSD, the rebalance or repair IO starts off
> very fast (4 GB/s+), but it continually slows down over a week and by the
> end it's moving at KB/s. So each 16 TB OSD takes a week or more to repair
> or rebalance! I haven't been able to identify any bottleneck or slow
> point; it just seems to be Ceph taking longer to do its thing.
>
> Are there any settings I can check or change to keep the repair speed
> high through to completion? If we could stay at GB/s speeds, we should be
> able to repair in a couple of days, not a week or more...
>
> Thank you,
> Ray
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx