Over the last two days we've experienced a couple of short outages shortly after setting both 'noscrub' and 'nodeep-scrub' on one of our largest Ceph clusters (~2,200 OSDs). This cluster is running Nautilus (14.2.6), and setting/unsetting these flags has been done many times in the past without a problem. One thing I've noticed is that on both days, right after setting 'noscrub' or 'nodeep-scrub', a do_prune message shows up in the monitor logs, followed by a timeout. About 30 seconds later we start seeing OSDs getting marked down:

2020-06-03 08:06:53.914 7fcc3ed57700 0 mon.p3cephmon004@0(leader) e11 handle_command mon_command({"prefix": "osd set", "key": "noscrub"} v 0) v1
2020-06-03 08:06:53.914 7fcc3ed57700 0 log_channel(audit) log [INF] : from='client.5773023471 10.2.128.8:0/523139029' entity='client.admin' cmd=[{"prefix": "osd set", "key": "noscrub"}]: dispatch
2020-06-03 08:06:54.231 7fcc4155c700 1 mon.p3lcephmon004@0(leader).osd e1535232 do_prune osdmap full prune enabled
2020-06-03 08:06:54.318 7fcc3f558700 1 heartbeat_map reset_timeout 'Monitor::cpu_tp thread 0x7fcc3f558700' had timed out after 0
2020-06-03 08:06:54.319 7fcc4055a700 1 heartbeat_map reset_timeout 'Monitor::cpu_tp thread 0x7fcc4055a700' had timed out after 0
2020-06-03 08:06:54.319 7fcc40d5b700 1 heartbeat_map reset_timeout 'Monitor::cpu_tp thread 0x7fcc40d5b700' had timed out after 0
2020-06-03 08:06:54.319 7fcc3fd59700 1 heartbeat_map reset_timeout 'Monitor::cpu_tp thread 0x7fcc3fd59700' had timed out after 0
...
2020-06-03 08:07:16.049 7fcc3ed57700 1 mon.p3cephmon004@0(leader).osd e1535234 prepare_failure osd.736 [v2:10.6.170.130:6816/1294580,v1:10.6.170.130:6817/1294580] from osd.1165 is reporting failure:1
2020-06-03 08:07:16.049 7fcc3ed57700 0 log_channel(cluster) log [DBG] : osd.736 reported failed by osd.1165
2020-06-03 08:07:16.304 7fcc3ed57700 1 mon.p3cephmon004@0(leader).osd e1535234 prepare_failure osd.736 [v2:10.6.170.130:6816/1294580,v1:10.6.170.130:6817/1294580] from osd.127 is reporting failure:1
2020-06-03 08:07:16.304 7fcc3ed57700 0 log_channel(cluster) log [DBG] : osd.736 reported failed by osd.127
2020-06-03 08:07:16.693 7fcc3ed57700 1 mon.p3cephmon004@0(leader).osd e1535234 prepare_failure osd.736 [v2:10.6.170.130:6816/1294580,v1:10.6.170.130:6817/1294580] from osd.1455 is reporting failure:1
2020-06-03 08:07:16.693 7fcc3ed57700 0 log_channel(cluster) log [DBG] : osd.736 reported failed by osd.1455
2020-06-03 08:07:16.695 7fcc3ed57700 1 mon.p3cephmon004@0(leader).osd e1535234 we have enough reporters to mark osd.736 down
2020-06-03 08:07:16.696 7fcc3ed57700 0 log_channel(cluster) log [INF] : osd.736 failed (root=default,rack=S06-06,chassis=S06-06-17,host=p3cephosd386) (3 reporters from different host after 20.389591 >= grace 20.025280)
2020-06-03 08:07:16.696 7fcc3ed57700 1 mon.p3cephmon004@0(leader).osd e1535234 prepare_failure osd.1463 [v2:10.7.208.30:6824/3947672,v1:10.7.208.30:6825/3947672] from osd.1455 is reporting failure:1
2020-06-03 08:07:16.696 7fcc3ed57700 0 log_channel(cluster) log [DBG] : osd.1463 reported failed by osd.1455
2020-06-03 08:07:16.758 7fcc3ed57700 1 mon.p3cephmon004@0(leader).osd e1535234 prepare_failure osd.1463 [v2:10.7.208.30:6824/3947672,v1:10.7.208.30:6825/3947672] from osd.2108 is reporting failure:1
2020-06-03 08:07:16.758 7fcc3ed57700 0 log_channel(cluster) log [DBG] : osd.1463 reported failed by osd.2108
2020-06-03 08:07:16.800 7fcc3ed57700 1 mon.p3cephmon004@0(leader).osd e1535234 prepare_failure osd.1463 [v2:10.7.208.30:6824/3947672,v1:10.7.208.30:6825/3947672] from osd.1166 is reporting failure:1
2020-06-03 08:07:16.800 7fcc3ed57700 0 log_channel(cluster) log [DBG] : osd.1463 reported failed by osd.1166
2020-06-03 08:07:16.835 7fcc4155c700 1 mon.p3cephmon004@0(leader).osd e1535234 do_prune osdmap full prune enabled
...
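For completeness, the flags are set and cleared with the stock CLI commands (which is what the audit log entry above reflects), and below them are the mon options I'm planning to double-check the next time this happens, since they seem to map onto the do_prune and "enough reporters" messages. The mon name is taken from the log, the option names are the standard Nautilus ones as far as I can tell, and this is just a sketch of what I'd look at, not our full config:

# what we run around a maintenance window (matches the "osd set" mon_command above)
ceph osd set noscrub
ceph osd set nodeep-scrub
# ... later, once the work is done ...
ceph osd unset nodeep-scrub
ceph osd unset noscrub

# run on the mon host via the admin socket: osdmap full-prune behaviour
# ("do_prune osdmap full prune enabled")
ceph daemon mon.p3cephmon004 config get mon_osdmap_full_prune_enabled
ceph daemon mon.p3cephmon004 config get mon_osdmap_full_prune_min
ceph daemon mon.p3cephmon004 config get mon_osdmap_full_prune_interval
ceph daemon mon.p3cephmon004 config get mon_osdmap_full_prune_txsize

# down-reporting thresholds behind "3 reporters from different host ... >= grace 20.025280"
ceph daemon mon.p3cephmon004 config get mon_osd_min_down_reporters
ceph daemon mon.p3cephmon004 config get mon_osd_reporter_subtree_level
ceph daemon mon.p3cephmon004 config get osd_heartbeat_grace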
Does anyone know why setting the no-scrub flags would cause such an issue? Or whether this is a known issue with a fix in 14.2.9 or 14.2.10 (when it comes out)?

Thanks,
Bryan

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx