Hi Everyone!

I have a 16-node, 640-OSD (5 to 1 SSD) BlueStore cluster which is mainly used for RGW services. It has its own backend cluster network for IO, separate from the customer network. Whenever we add or remove an OSD, the rebalance or repair IO starts off very fast (4 GB/s+), but it continually slows down over the course of a week, and by the end it's moving at KB/s. So each 16TB OSD takes a week or more to repair or rebalance! I have not been able to identify any bottleneck or slow point; it just seems to be Ceph taking longer to do its thing.

Are there any settings I can check or change to get the repair speed to maintain a high level through to completion? If we could stay at GB/s speeds, we should be able to repair in a couple of days, not a week or more...

Thank you,
Ray

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
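As a starting point for the settings question above, the recovery/backfill throttles can be inspected and raised with commands like the following. This is a sketch, not a tuning recommendation: the right values depend on the release and hardware, `osd_mclock_profile` only exists on releases using the mClock scheduler (Quincy and later), and on mClock releases the older `osd_max_backfills` / `osd_recovery_*` options are ignored unless the scheduler is overridden.

```shell
# Show current cluster health and recovery progress
ceph -s

# Check the current throttle values as seen by one OSD
# (osd.0 here is just an example daemon)
ceph config show osd.0 | grep -E 'osd_max_backfills|osd_recovery'

# Pre-mClock tunables: raise concurrent backfills and recovery ops per OSD
ceph config set osd osd_max_backfills 4
ceph config set osd osd_recovery_max_active 8

# On Quincy and later, the mClock scheduler governs recovery instead;
# switch its profile to prioritize recovery over client IO
ceph config set osd osd_mclock_profile high_recovery_ops
```

Since the slowdown appears gradually, it is also worth watching whether the number of PGs actively backfilling shrinks over the week (tail-end backfill naturally concentrates on fewer OSDs), which `ceph -s` and `ceph pg dump pgs_brief` can show.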