On Thu, Aug 27, 2020 at 05:56:22PM +0000, DHilsbos@xxxxxxxxxxxxxx wrote:
> 2) Adjust performance settings to allow the data movement to go faster.
> Again, I don't have those settings immediately to hand, but Googling
> something like 'ceph recovery tuning', or searching this list, should
> point you in the right direction. Notice that you only have 6 PGs trying
> to move at a time, with 2 blocked on your near-full OSDs (8 & 19). I
> believe that, by default, each OSD daemon is only involved in 1 data
> movement at a time. The tradeoff here is that user activity suffers if
> you adjust to favor recovery; however, with the cluster in ERROR status,
> I suspect user activity is already suffering.

We've set osd_max_backfills to 16 in the config, and when necessary we manually change the runtime value of osd_recovery_sleep_hdd. It defaults to 0.1 seconds of wait time between objects (I think?).

If you really want fast recovery, try this additional change:

    ceph tell osd.\* config set osd_recovery_sleep_hdd 0

Be warned though, this will seriously affect client performance. Then again, it can bump your recovery speed by multiple orders of magnitude. If you want to go back to how things were, set it back to 0.1 instead of 0. It may take a couple of seconds (maybe a minute) until performance for clients starts to improve. I guess the OSDs are too busy with recovery to instantly accept the changed value.

Florian
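
For convenience, the toggle described above could be wrapped in two small shell functions. This is only a sketch: the function names and the CEPH override variable are made up for illustration and are not part of Ceph, and 0.1 is used as the restore value only because that is the value mentioned in this thread.

```shell
#!/bin/sh
# Sketch only: toggle HDD recovery throttling on all OSDs at runtime.
# CEPH is a hypothetical override (not a Ceph feature) so the commands can
# be previewed with CEPH=echo instead of being run against a live cluster.
CEPH="${CEPH:-ceph}"

# Remove the sleep between recovery operations on every OSD.
# Expect client I/O to suffer while this is in effect.
recovery_fast() {
    $CEPH tell 'osd.*' config set osd_recovery_sleep_hdd 0
}

# Restore the throttle. Defaults to 0.1 (the value mentioned in this
# thread); pass another value as the first argument if yours differs.
recovery_normal() {
    $CEPH tell 'osd.*' config set osd_recovery_sleep_hdd "${1:-0.1}"
}
```

After sourcing the file, run recovery_fast while the cluster is backfilling and recovery_normal once it is healthy again. Setting CEPH=echo first prints the commands without executing them.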