Over the last two days we've experienced a couple of short outages shortly after setting both 'noscrub' and 'nodeep-scrub' on one of our largest Ceph clusters (~2,200 OSDs). This cluster is running Nautilus (14.2.6), and setting/unsetting these flags has been done many times in the past without a problem. One thing I've noticed is that on both days, right after setting 'noscrub' or 'nodeep-scrub', a do_prune message shows up in the monitor logs, followed by a timeout. About 30 seconds later we start seeing OSDs getting marked down:

2020-06-03 08:06:53.914 7fcc3ed57700 0 mon.p3cephmon004@0(leader) e11 handle_command mon_command({"prefix": "osd set", "key": "noscrub"} v 0) v1
2020-06-03 08:06:53.914 7fcc3ed57700 0 log_channel(audit) log [INF] : from='client.5773023471 10.2.128.8:0/523139029' entity='client.admin' cmd=[{"prefix": "osd set", "key": "noscrub"}]: dispatch
2020-06-03 08:06:54.231 7fcc4155c700 1 mon.p3lcephmon004@0(leader).osd e1535232 do_prune osdmap full prune enabled
2020-06-03 08:06:54.318 7fcc3f558700 1 heartbeat_map reset_timeout 'Monitor::cpu_tp thread 0x7fcc3f558700' had timed out after 0
2020-06-03 08:06:54.319 7fcc4055a700 1 heartbeat_map reset_timeout 'Monitor::cpu_tp thread 0x7fcc4055a700' had timed out after 0
2020-06-03 08:06:54.319 7fcc40d5b700 1 heartbeat_map reset_timeout 'Monitor::cpu_tp thread 0x7fcc40d5b700' had timed out after 0
2020-06-03 08:06:54.319 7fcc3fd59700 1 heartbeat_map reset_timeout 'Monitor::cpu_tp thread 0x7fcc3fd59700' had timed out after 0
...
2020-06-03 08:07:16.049 7fcc3ed57700 1 mon.p3cephmon004@0(leader).osd e1535234 prepare_failure osd.736 [v2:10.6.170.130:6816/1294580,v1:10.6.170.130:6817/1294580] from osd.1165 is reporting failure:1
2020-06-03 08:07:16.049 7fcc3ed57700 0 log_channel(cluster) log [DBG] : osd.736 reported failed by osd.1165
2020-06-03 08:07:16.304 7fcc3ed57700 1 mon.p3cephmon004@0(leader).osd e1535234 prepare_failure osd.736 [v2:10.6.170.130:6816/1294580,v1:10.6.170.130:6817/1294580] from osd.127 is reporting failure:1
2020-06-03 08:07:16.304 7fcc3ed57700 0 log_channel(cluster) log [DBG] : osd.736 reported failed by osd.127
2020-06-03 08:07:16.693 7fcc3ed57700 1 mon.p3cephmon004@0(leader).osd e1535234 prepare_failure osd.736 [v2:10.6.170.130:6816/1294580,v1:10.6.170.130:6817/1294580] from osd.1455 is reporting failure:1
2020-06-03 08:07:16.693 7fcc3ed57700 0 log_channel(cluster) log [DBG] : osd.736 reported failed by osd.1455
2020-06-03 08:07:16.695 7fcc3ed57700 1 mon.p3cephmon004@0(leader).osd e1535234 we have enough reporters to mark osd.736 down
2020-06-03 08:07:16.696 7fcc3ed57700 0 log_channel(cluster) log [INF] : osd.736 failed (root=default,rack=S06-06,chassis=S06-06-17,host=p3cephosd386) (3 reporters from different host after 20.389591 >= grace 20.025280)
2020-06-03 08:07:16.696 7fcc3ed57700 1 mon.p3cephmon004@0(leader).osd e1535234 prepare_failure osd.1463 [v2:10.7.208.30:6824/3947672,v1:10.7.208.30:6825/3947672] from osd.1455 is reporting failure:1
2020-06-03 08:07:16.696 7fcc3ed57700 0 log_channel(cluster) log [DBG] : osd.1463 reported failed by osd.1455
2020-06-03 08:07:16.758 7fcc3ed57700 1 mon.p3cephmon004@0(leader).osd e1535234 prepare_failure osd.1463 [v2:10.7.208.30:6824/3947672,v1:10.7.208.30:6825/3947672] from osd.2108 is reporting failure:1
2020-06-03 08:07:16.758 7fcc3ed57700 0 log_channel(cluster) log [DBG] : osd.1463 reported failed by osd.2108
2020-06-03 08:07:16.800 7fcc3ed57700 1 mon.p3cephmon004@0(leader).osd e1535234 prepare_failure osd.1463 [v2:10.7.208.30:6824/3947672,v1:10.7.208.30:6825/3947672] from osd.1166 is reporting failure:1
2020-06-03 08:07:16.800 7fcc3ed57700 0 log_channel(cluster) log [DBG] : osd.1463 reported failed by osd.1166
2020-06-03 08:07:16.835 7fcc4155c700 1 mon.p3cephmon004@0(leader).osd e1535234 do_prune osdmap full prune enabled
...
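For completeness, the flags are set and cleared with the stock CLI commands (which is what the audit log entry above reflects), and below them are the mon options I'm planning to double-check the next time this happens, since they seem to map onto the do_prune and "enough reporters" messages. The mon name is taken from the log, the option names are the standard Nautilus ones as far as I can tell, and this is just a sketch of what I'd look at, not our full config:

# what we run around a maintenance window (matches the "osd set" mon_command above)
ceph osd set noscrub
ceph osd set nodeep-scrub
# ... later, once the work is done ...
ceph osd unset nodeep-scrub
ceph osd unset noscrub

# run on the mon host via the admin socket: osdmap full-prune behaviour
# ("do_prune osdmap full prune enabled")
ceph daemon mon.p3cephmon004 config get mon_osdmap_full_prune_enabled
ceph daemon mon.p3cephmon004 config get mon_osdmap_full_prune_min
ceph daemon mon.p3cephmon004 config get mon_osdmap_full_prune_interval
ceph daemon mon.p3cephmon004 config get mon_osdmap_full_prune_txsize

# down-reporting thresholds behind "3 reporters from different host ... >= grace 20.025280"
ceph daemon mon.p3cephmon004 config get mon_osd_min_down_reporters
ceph daemon mon.p3cephmon004 config get mon_osd_reporter_subtree_level
ceph daemon mon.p3cephmon004 config get osd_heartbeat_grace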
Does anyone know why setting the no-scrub flags would cause such an issue? Or whether this is a known issue with a fix in 14.2.9 or 14.2.10 (when it comes out)?

Thanks,
Bryan

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx