Hi all :) ,
I need some help, I'm in a sad situation: I've physically lost 2 Ceph server nodes (out of 5 nodes / 5 monitors initially), so 3 nodes are left: node1, node2, node3.
On my first remaining node (node1), I updated the CRUSH map to remove every OSD that was running on those 2 lost servers:
ceph osd crush remove osd.<id> && ceph auth del osd.<id> && ceph osd rm osd.<id> for each of those OSDs, then ceph osd crush remove <lost-node> for the two lost hosts.
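In other words, something along these lines; the osd id and host names are only placeholders (I'm assuming the lost hosts are node4 and node5, going by the monmap below):

# for each OSD id that lived on one of the 2 lost servers, e.g. osd.12:
ceph osd crush remove osd.12   # take it out of the CRUSH map
ceph auth del osd.12           # delete its cephx key
ceph osd rm osd.12             # remove it from the osdmap
# then drop the 2 dead host buckets themselves from the CRUSH map:
ceph osd crush remove node4
ceph osd crush remove node5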
So the CRUSH map seems to be OK now on node1.
ceph osd tree on node1 shows every OSD hosted on node2 as "down 1", while the OSDs on node3 and node1 are "up 1". However, on node3 every ceph * command just hangs, so I'm not sure the CRUSH map has actually been updated on node2 and node3, and I don't know how to bring the OSDs on node2 up again.
My node2 says it cannot connect to the cluster!
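To sum up how each node behaves right now (commands only, I can't paste the hung/failed output):

# node1: ceph commands answer normally
ceph osd tree   # OSDs hosted on node2 show "down 1", those on node1/node3 show "up 1"
ceph -s         # output pasted below
# node3: any ceph command just hangs, e.g.
ceph osd tree
# node2: ceph commands fail, complaining they cannot connect to the cluster
ceph -s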
ceph -s on node1 gives me (so still 5 monitors):
cluster 45d9195b-365e-491a-8853-34b46553db94
health HEALTH_WARN 10016 pgs degraded; 10016 pgs stuck unclean; recovery 181055/544038 objects degraded (33.280%); 11/33 in osds are down; noout flag(s) set; 2 mons down, quorum 0,1,2 node1,node2,node3; clock skew detected on mon.node2
monmap e1: 5 mons at {node1=172.23.6.11:6789/0,node2=172.23.6.12:6789/0,node3=172.23.6.13:6789/0,node4=172.23.6.14:6789/0,node5=172.23.6.15:6789/0}, election epoch 488, quorum 0,1,2 node1,node2,node3
mdsmap e48: 1/1/1 up {0=node3=up:active}
osdmap e3852: 33 osds: 22 up, 33 in
flags noout
pgmap v8189785: 10016 pgs, 9 pools, 705 GB data, 177 kobjects
2122 GB used, 90051 GB / 92174 GB avail
181055/544038 objects degraded (33.280%)
10016 active+degraded
client io 0 B/s rd, 233 kB/s wr, 22 op/s
Thanks for your help!!
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com