osd fast shutdown provokes slow requests

Manuel Lausch <manuel.lausch@xxxxxxxx> · Thu, 13 Aug 2020 15:46:13 +0200

Hi,

I have been investigating another problem with my Nautilus 14.2.11
cluster (it also occurs with 14.2.10).

If I stop the OSDs on one node (systemctl stop ceph-osd.target, or
shutdown/reboot), it usually takes several seconds until the cluster
detects the OSDs as down, and I run into slow requests.

I tracked this down to the option "osd_fast_shutdown".
If I set it to "false", I immediately see lines like this in ceph.log
on shutdown:
cluster [INF] osd.837 marked itself down

If the option is true (the default setting), the cluster needs some
time until I get hundreds of lines like this in ceph.log:
cluster [DBG] osd.317 reported immediately failed by osd.202

Until all detection and peering is done, I get slow requests.
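
For anyone who wants to compare the two behaviours on their own
cluster, here is a rough Python sketch that counts both kinds of
messages in ceph.log. It only matches the two fragments quoted above;
the rest of each log line (timestamp, mon name) is not parsed, and the
function name summarize() is just something I made up for illustration.

```python
import re
from collections import Counter

# Patterns built only from the two log fragments quoted in this mail.
SELF_DOWN = re.compile(r"cluster \[INF\] (osd\.\d+) marked itself down")
PEER_FAIL = re.compile(
    r"cluster \[DBG\] (osd\.\d+) reported immediately failed by (osd\.\d+)"
)


def summarize(lines):
    """Return (set of OSDs that marked themselves down,
    Counter of peer failure reports per failed OSD)."""
    self_down = set()
    peer_reports = Counter()
    for line in lines:
        match = SELF_DOWN.search(line)
        if match:
            self_down.add(match.group(1))
            continue
        match = PEER_FAIL.search(line)
        if match:
            # group(1) is the OSD reported as failed, group(2) the reporter.
            peer_reports[match.group(1)] += 1
    return self_down, peer_reports
```

It could be called with an open file, e.g. summarize(open(path)), where
path points to the cluster log (often /var/log/ceph/ceph.log, but that
depends on the setup). Many peer reports and an empty self_down set
would match what I see with osd_fast_shutdown set to true.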

On smaller clusters, with only 48 OSDs on 4 nodes, the down detection
works a lot faster. But on the big one I need to set this to false
for it to work as expected.

I wonder if anyone else sees this.
I think it is OK to shorten the shutdown process, but it would be nice
if the OSDs told the mon that they are shutting down.

Manuel
