Requests blocked as cluster is unaware of dead OSDs for quite a long time

Jared H <programmerjared@xxxxxxxxx> · Mon, 26 Mar 2018 17:58:58 -0500

I have three datacenters with three storage hosts in each, and each host runs one OSD and one MON. There are three replicas, one in each datacenter. I want the cluster to survive a nuke dropped on 1 of 3 datacenters, and later, once scaled up, on 2 of 5. I do not need realtime data replication (Ceph is already fast enough), but I do need reasonably realtime fault tolerance, such that requests are blocked for ideally less than 10 seconds.
In testing, I kill networking on 3 hosts and the cluster becomes unresponsive for 1-5 minutes while requests are blocked. The monitors are detected as down within 15-20 seconds, but the OSDs take a long time to change state to 'down'.
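
One way to put numbers on the detection delay is to poll the OSD map from a surviving host and time how long the affected OSDs stay marked up. The following is a minimal Python sketch, not a tested tool. It assumes the `ceph` CLI and a usable keyring are present on the host running it, and that `ceph osd dump --format json` returns an `osds` array whose entries carry `osd` and `up` fields. The helper names (`down_osds`, `fetch_osd_dump`, `seconds_until_down`) are made up for illustration.

```python
import json
import subprocess
import time


def down_osds(dump):
    """Return the set of OSD ids that the given OSD map marks as down."""
    return {o["osd"] for o in dump["osds"] if o["up"] == 0}


def fetch_osd_dump():
    """Ask the monitors for the current OSD map as parsed JSON."""
    out = subprocess.check_output(["ceph", "osd", "dump", "--format", "json"])
    return json.loads(out)


def seconds_until_down(expected, fetch=fetch_osd_dump, poll=1.0, timeout=600.0):
    """Poll until every id in `expected` is marked down.

    Returns the elapsed seconds, or None if `timeout` expires first.
    """
    start = time.monotonic()
    while time.monotonic() - start < timeout:
        if set(expected) <= down_osds(fetch()):
            return time.monotonic() - start
        time.sleep(poll)
    return None
```

Calling `seconds_until_down({6, 7, 8})` (hypothetical OSD ids) right after cutting the network would return the elapsed time, or `None` on timeout. The `ceph` command itself may stall while the monitors re-elect, so the result is closer to an upper bound than an exact figure.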

I have played with these timeout and heartbeat options but they don't seem to have any effect:
[osd]
osd_heartbeat=3
osd_heartbeat_grace=9
osd_mon_heartbeat_interval=3
osd_mon_report_interval_min=3
osd_mon_report_interval_max=9
osd_mon_ack_timeout=9
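
Since the options appear to have no effect, it may be worth confirming that the running daemons actually picked them up. Below is a minimal Python sketch of such a check, under these assumptions: it runs on the host of the daemon being queried, `ceph daemon <name> config get <option>` prints a JSON object keyed by the option name, and the daemon name `osd.0` in the usage note is only a placeholder. `osd_heartbeat` is checked exactly as written above; the documented interval option appears to be `osd_heartbeat_interval`, so an unrecognized name would be expected to show up here as `None`. The helpers `running_value` and `mismatches` are hypothetical names.

```python
import json
import subprocess

# Values copied from the [osd] section above, names unchanged.
EXPECTED = {
    "osd_heartbeat": 3,
    "osd_heartbeat_grace": 9,
    "osd_mon_heartbeat_interval": 3,
    "osd_mon_report_interval_min": 3,
    "osd_mon_report_interval_max": 9,
    "osd_mon_ack_timeout": 9,
}


def running_value(daemon, option):
    """Read one option through the daemon's admin socket; None if unavailable."""
    try:
        out = subprocess.check_output(
            ["ceph", "daemon", daemon, "config", "get", option]
        )
        return json.loads(out).get(option)
    except (subprocess.CalledProcessError, ValueError):
        return None


def mismatches(expected, lookup):
    """Return {option: (wanted, actual)} where the running value differs."""
    diffs = {}
    for option, wanted in expected.items():
        actual = lookup(option)
        try:
            same = float(actual) == float(wanted)
        except (TypeError, ValueError):
            same = False  # missing or non-numeric value counts as a mismatch
        if not same:
            diffs[option] = (wanted, actual)
    return diffs
```

For example, `mismatches(EXPECTED, lambda opt: running_value("osd.0", opt))` would list every option whose running value differs from the file. Running the same check against a monitor (for instance `mon.<hostname>`) could show whether values set only under `[osd]` are visible to the monitors at all.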

Is it the nature of the networking failure? If I pkill ceph-osd to simulate a software failure, the OSDs are detected as down almost instantly.