Hi,
consider the following scenario:
- cluster with public and cluster networks
- three-node cluster
- 5 OSDs per node
- 1 mon per node
- two nodes attached to the same 10Gb switch - cluster network (room A)
- one node attached to another 10Gb switch - cluster network (room B)
- no redundancy between the 10Gb cluster network switches
- redundant public network (1Gb)
Cause:
the 10Gb switch (cluster network) in room A goes down (maintenance, power loss, etc.)
Problem:
only 4 of 5 OSDs are declared down on the second node, and all 5 OSDs on the first node are still declared up.
Client I/O stays stuck until we manually stop the OSDs on the first node.
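To make the per-node count above easy to reproduce, here is a minimal sketch that tallies up/down OSDs per host from the JSON form of the OSD tree. The host names node1 and node2, the OSD ids, and the sample data are hypothetical stand-ins for our cluster, and the sketch assumes the usual layout of that JSON (a "nodes" list whose host entries list their OSDs under "children").

```python
from collections import defaultdict


def osd_status_by_host(tree):
    """Count up/down OSDs per host in a parsed OSD tree dict."""
    nodes_by_id = {node["id"]: node for node in tree["nodes"]}
    counts = defaultdict(lambda: {"up": 0, "down": 0})
    for node in tree["nodes"]:
        if node.get("type") != "host":
            continue
        for child_id in node.get("children", []):
            child = nodes_by_id.get(child_id)
            if child is None or child.get("type") != "osd":
                continue
            # Anything that is not explicitly "up" is counted as down.
            status = "up" if child.get("status") == "up" else "down"
            counts[node["name"]][status] += 1
    return dict(counts)


def _host(host_id, name, osd_ids):
    return {"id": host_id, "name": name, "type": "host", "children": list(osd_ids)}


def _osd(osd_id, status):
    return {"id": osd_id, "name": f"osd.{osd_id}", "type": "osd", "status": status}


# Hypothetical data mirroring the symptom: node1 all up, node2 with 4 of 5 down.
SAMPLE_TREE = {
    "nodes": [
        _host(-2, "node1", range(0, 5)),
        _host(-3, "node2", range(5, 10)),
    ]
    + [_osd(i, "up") for i in range(0, 5)]
    + [_osd(5, "up")]
    + [_osd(i, "down") for i in range(6, 10)]
}

if __name__ == "__main__":
    print(osd_status_by_host(SAMPLE_TREE))
```

On a live cluster the input would presumably come from `ceph osd tree --format json`, parsed with `json.loads`, instead of the sample dict.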
This is our ceph.conf configuration:
...
public network = 10.x.x.x/24
cluster network = 172.x.x.x/24
...
mon osd report timeout = 15
mon osd down out interval = 600
...
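For reference, a small sketch of reading the two monitor timing options back out of such a file, to double-check the values actually in effect on disk. It assumes both options sit in a [global] section (the section header is elided above, so that is a guess) and that both values are in seconds. The function name and the sample text are hypothetical.

```python
import configparser

# Hypothetical fragment containing only the two timing options shown above.
SAMPLE_CONF = """
[global]
mon osd report timeout = 15
mon osd down out interval = 600
"""


def read_mon_timeouts(conf_text):
    """Return the two monitor timing options from ceph.conf text, in seconds."""
    parser = configparser.ConfigParser()
    # Treat "mon_osd_report_timeout" and "mon osd report timeout" as the same key.
    parser.optionxform = lambda key: key.strip().lower().replace("_", " ")
    parser.read_string(conf_text)
    section = parser["global"]
    return {
        "report_timeout_s": section.getint("mon osd report timeout"),
        "down_out_interval_s": section.getint("mon osd down out interval"),
    }


if __name__ == "__main__":
    print(read_mon_timeouts(SAMPLE_CONF))
```

This only parses the file; it does not query the running daemons, which may hold different values if they were changed at runtime.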
The doc says:
If you declare a cluster network, OSDs will route heartbeat, object replication and recovery traffic over the cluster network. This may improve performance compared to using a single network. To configure a cluster network, add the following option to the [global] section of your Ceph configuration file.
So why was Ceph not able to automatically mark the isolated OSDs down?
Lorenzo
--
tel: 0522 3993772 - 335 8416054