On Thu, Oct 17, 2019 at 12:35 PM huxiaoyu@xxxxxxxxxxxx <huxiaoyu@xxxxxxxxxxxx> wrote:
>
> Hello, Robert,
>
> Thanks for the quick reply. I did test with osd op queue = wpq and osd op queue cut off = high, and:
>
> osd_recovery_op_priority = 1
> osd recovery delay start = 20
> osd recovery max active = 1
> osd recovery max chunk = 1048576
> osd recovery sleep = 1
> osd recovery sleep hdd = 1
> osd recovery sleep ssd = 1
> osd recovery sleep hybrid = 1
> osd recovery priority = 1
> osd max backfills = 1
> osd backfill scan max = 16
> osd backfill scan min = 4
> osd_op_thread_suicide_timeout = 300
>
> But the cluster still showed extremely heavy recovery activity at the beginning of the recovery, and only after roughly 5-10 minutes did the recovery gradually come under control. I guess this is quite similar to what you encountered in Nov. 2015.
>
> It is really annoying. What else can I do to mitigate this weird initial-recovery issue? Any suggestions are much appreciated.

Hmm, on our Luminous cluster we run the defaults for everything other than the op queue and cut off, and bringing in a node has nearly zero impact on client traffic. Those two would need to be set on all OSDs to be completely effective. Maybe go back to the defaults?
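A minimal sketch of what I mean, assuming a systemd-based deployment (ceph-osd.target is the usual unit target, adjust for your setup); as far as I know both options are only read at OSD start, so the OSDs need a restart afterwards:

    # /etc/ceph/ceph.conf on every OSD node
    [osd]
    osd op queue = wpq
    osd op queue cut off = high

    # then restart the OSDs on each node, e.g.
    systemctl restart ceph-osd.target

----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com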