Hi all,

New Quincy cluster here that I'm running some benchmarks against:

ceph version 17.2.3 (dff484dfc9e19a9819f375586300b3b79d80034d) quincy (stable)
11 nodes of 24x 18TB HDD OSDs, 2x 2.9TB SSD OSDs

I'm seeing a delay of almost exactly 10 minutes from when I remove an OSD/node from the cluster until actual recovery I/O begins. This is very different behaviour from what I was used to in Nautilus, where recovery I/O would commence within seconds. Downed OSDs are reflected in ceph health within a few seconds (as expected), and affected PGs show as undersized a few seconds later (as expected).

I guess this 10-minute delay may even be a feature -- accidentally rebooting a node before setting recovery flags would prevent rebalancing, for example. Just thought it was worth asking in case it's a bug or something worth looking into more deeply.

I've read through the OSD config reference (https://docs.ceph.com/en/latest/rados/configuration/osd-config-ref/) and all of my recovery tuneables look ok, for example:

[ceph: root@ /]# ceph config get osd osd_recovery_delay_start
0.000000
[ceph: root@ /]# ceph config get osd osd_recovery_sleep
0.000000
[ceph: root@ /]# ceph config get osd osd_recovery_sleep_hdd
0.100000
[ceph: root@ /]# ceph config get osd osd_recovery_sleep_ssd
0.000000
[ceph: root@ /]# ceph config get osd osd_recovery_sleep_hybrid
0.025000

Thanks in advance.

Ngā mihi,

Sean Matheny
HPC Cloud Platform DevOps Lead
New Zealand eScience Infrastructure (NeSI)

e: sean.matheny@xxxxxxxxxxx

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx