Most of the time you are better served by simpler settings like osd_recovery_sleep, which has three variants if you have multiple types of OSDs in your cluster (osd_recovery_sleep_hdd, osd_recovery_sleep_ssd, osd_recovery_sleep_hybrid). With those you can throttle a specific type of OSD that is struggling during recovery/backfill while letting the others continue to backfill at regular speed.
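As a sketch (assuming a Nautilus-era cluster with the centralized config database; the sleep values here are illustrative, not recommendations):

```shell
# Slow down recovery only on spinning disks; 0.1 s sleep between
# recovery ops per OSD. SSD and hybrid OSDs keep their defaults.
ceph config set osd osd_recovery_sleep_hdd 0.1

# Check what each variant is currently set to:
ceph config get osd osd_recovery_sleep_hdd
ceph config get osd osd_recovery_sleep_ssd
ceph config get osd osd_recovery_sleep_hybrid

# On releases without "ceph config", the same can be injected at runtime:
# ceph tell osd.* injectargs '--osd_recovery_sleep_hdd 0.1'
```

Changes made with `ceph config set` take effect on running OSDs without a restart, which makes it easy to dial the sleep up during a painful backfill and back down afterwards.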
Additionally, you mentioned reweighting OSDs, but it sounded like you do this manually. The balancer module, especially in upmap mode, can be tuned quite well to minimize client IO impact while balancing. You can specify a window of the day during which it may move data (times are interpreted as UTC; it ignores the local timezone), a misplaced-data threshold above which it will stop queuing more PG moves, the increment by which it changes weights per operation, how many weights it adjusts in each pass, and so on.
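A minimal sketch of enabling and constraining the balancer (option names are the mgr/balancer keys documented for Luminous/Nautilus; exact names and defaults vary by release, so check `ceph config ls | grep balancer` on your version first):

```shell
# Enable the balancer in upmap mode (requires all clients to
# support the upmap feature, i.e. Luminous or newer).
ceph balancer mode upmap
ceph balancer on

# Only move data between 23:00 and 06:00 -- note these are UTC.
ceph config set mgr mgr/balancer/begin_time 2300
ceph config set mgr mgr/balancer/end_time 0600

# Stop queuing new moves once this fraction of PGs is misplaced.
ceph config set mgr target_max_misplaced_ratio 0.05

# Inspect what the balancer is doing / would do:
ceph balancer status
```

Keeping the misplaced ratio low means the balancer works in small increments, which is usually what you want when client latency is the concern.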
On Tue, Oct 22, 2019, 6:07 PM Mark Kirkwood <mark.kirkwood@xxxxxxxxxxxxxxx> wrote:
Thanks - that's a good suggestion!
However I'd still like to know the answers to my 2 questions.
regards
Mark
On 22/10/19 11:22 pm, Paul Emmerich wrote:
> getting rid of filestore solves most latency spike issues during
> recovery because they are often caused by random XFS hangs (splitting
> dirs or just xfs having a bad day)
>
>
> Paul
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com