Requests blocked as cluster is unaware of dead OSDs for quite a long time

I have three datacenters with three storage hosts in each, and each host runs one OSD and one MON. There are three replicas, one in each datacenter. I want the cluster to be able to survive a nuke dropped on one of the three datacenters, and later, as it scales up, on two of five. I do not need realtime data replication (Ceph is already fast enough), but I do need reasonably fast fault tolerance, such that requests are blocked for ideally less than 10 seconds.

In testing, I kill networking on 3 hosts and the cluster becomes unresponsive for 1-5 minutes while requests are blocked. The monitors are detected as down within 15-20 seconds, but the OSDs take a long time to change state to 'down'.

I have played with these timeout and heartbeat options but they don't seem to have any effect:
[osd]
osd_heartbeat=3
osd_heartbeat_grace=9
osd_mon_heartbeat_interval=3
osd_mon_report_interval_min=3
osd_mon_report_interval_max=9
osd_mon_ack_timeout=9

Is it the nature of the networking failure? If I pkill ceph-osd to simulate a software failure, the OSDs are detected as down almost instantly.
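
To put numbers on the delay described above, the time until the OSDs are marked down could be measured with a small script along these lines. This is only a sketch under some assumptions: it is run from a host that can still reach a monitor quorum, the ceph CLI there has admin access, and `ceph osd dump --format json` returns an "osds" list whose entries carry "osd" and "up" fields. The function names and the example OSD ids are made up for illustration.

```python
import json
import subprocess
import time


def down_osds(dump):
    """Return the sorted ids of OSDs whose 'up' flag is 0 in an OSD map dump."""
    return sorted(o["osd"] for o in dump["osds"] if not o["up"])


def fetch_osd_dump():
    """Fetch the current OSD map through the ceph CLI (assumes admin access)."""
    out = subprocess.check_output(["ceph", "osd", "dump", "--format", "json"])
    return json.loads(out)


def seconds_until_down(expected, fetch=fetch_osd_dump, poll=1.0, timeout=600.0):
    """Poll until every OSD id in `expected` is marked down.

    Returns the elapsed seconds, or None if `timeout` expires first.
    """
    start = time.monotonic()
    while time.monotonic() - start < timeout:
        if set(expected) <= set(down_osds(fetch())):
            return time.monotonic() - start
        time.sleep(poll)
    return None
```

Starting `seconds_until_down([6, 7, 8])` (hypothetical ids for the OSDs on the three hosts) just before cutting the network would give a rough figure, accurate to about one polling interval, that can be compared between the network-failure case and the pkill case.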
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
