One host failure brings down the whole cluster

Hi all,
    I have a two-node Ceph cluster, and both nodes run a monitor and OSDs. When both are up, all OSDs are up and in, and everything is fine... almost:

[root~]# ceph -s

     health HEALTH_WARN 25 pgs degraded; 316 pgs incomplete; 85 pgs stale; 24 pgs stuck degraded; 316 pgs stuck inactive; 85 pgs stuck stale; 343 pgs stuck unclean; 24 pgs stuck undersized; 25 pgs undersized; recovery 11/153 objects degraded (7.190%)
     monmap e1: 2 mons at {server_b=10.???.78:6789/0,server_a=10.???.80:6789/0}, election epoch 14, quorum 0,1 server_b,server_a
     osdmap e116375: 22 osds: 22 up, 22 in
      pgmap v238656: 576 pgs, 2 pools, 224 MB data, 59 objects
            56175 MB used, 63420 GB / 63475 GB avail
            11/153 objects degraded (7.190%)
                  15 active+undersized+degraded
                  75 stale+active+clean
                   2 active+remapped
                 158 active+clean
                  10 stale+active+undersized+degraded
                 316 incomplete


But if I bring down one server, the whole cluster stops functioning:

[root~]# ceph -s
2015-03-31 10:32:43.848125 7f57e4105700  0 -- :/1017540 >> 10.???.78:6789/0 pipe(0x7f57e0027120 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f57e00273b0).fault
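
If it helps, I believe the surviving monitor can still be queried locally through its admin socket even in this state (the default socket path is assumed here, and 'server_a' is the mon id from my monmap):

[root~]# ceph --admin-daemon /var/run/ceph/ceph-mon.server_a.asok mon_status
# (shorter form, if available in this version: ceph daemon mon.server_a mon_status)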

This should not happen... Any thoughts?

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



