cluster not coming up after reboot

Kenneth Waegeman <kenneth.waegeman@xxxxxxxx> · Wed, 22 Apr 2015 17:17:26 +0200

Hi,

I changed the cluster network parameter in the config files, restarted 
the monitors, and then restarted all the OSDs (I shouldn't have done 
that). Now the OSDs keep crashing and the cluster is unable to 
recover. I eventually rebooted the whole cluster, but the problem 
remains: for a moment all 280 OSDs are up, and then they start crashing 
rapidly until fewer than 100 are left (and eventually 30 or so).

In the log files I see different kinds of messages. Some OSDs have:

2015-04-22 17:09:40.019825 7f74a8f70700  0 -- 10.143.16.11:0/4255 >> 10.141.16.12:6807/2426 pipe(0x54f2000 sd=68 :44692 s=1 pgs=0 cs=0 l=1 c=0x55c8dc0).connect claims to be 10.141.16.12:6807/1004858 not 10.141.16.12:6807/2426 - wrong node!
2015-04-22 17:09:40.019827 7f74a694a700  0 -- 10.143.16.11:0/4255 >> 10.143.16.12:6801/2146 pipe(0x5719000 sd=57 :56935 s=1 pgs=0 cs=0 l=1 c=0x55ce9e0).connect claims to be 10.143.16.12:6801/1005047 not 10.143.16.12:6801/2146 - wrong node!
2015-04-22 17:09:40.019867 7f74a9f80700  0 -- 10.143.16.11:0/4255 >> 10.143.16.12:6803/2228 pipe(0x5722000 sd=60 :36208 s=1 pgs=0 cs=0 l=1 c=0x55cf640).connect claims to be 10.143.16.12:6803/1005739 not 10.143.16.12:6803/2228 - wrong node!
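
A minimal sketch for pulling the mismatched address pairs out of "wrong node" lines like these, so they can be grouped by network. It assumes the addresses have the ip:port/number form seen above; the function names (split_addr, wrong_node_mismatches, count_by_subnet) are hypothetical helpers, not part of Ceph.

```python
import re
from collections import Counter

# Matches the "wrong node" complaint and captures both addresses.
WRONG_NODE = re.compile(
    r"connect claims to be (?P<claimed>\S+) not (?P<expected>\S+) - wrong node!"
)


def split_addr(addr):
    """Split 'ip:port/number' into (ip, port, number)."""
    hostport, number = addr.split("/")
    ip, port = hostport.rsplit(":", 1)
    return ip, int(port), int(number)


def wrong_node_mismatches(log_text):
    """Return a list of (expected, claimed) address tuples found in log_text."""
    # Collapse whitespace so entries wrapped over several lines still match.
    flat = " ".join(log_text.split())
    return [
        (split_addr(m.group("expected")), split_addr(m.group("claimed")))
        for m in WRONG_NODE.finditer(flat)
    ]


def count_by_subnet(mismatches, prefix_octets=2):
    """Count mismatches per leading octets of the expected peer IP."""
    counts = Counter()
    for (ip, _port, _number), _claimed in mismatches:
        counts[".".join(ip.split(".")[:prefix_octets])] += 1
    return counts


# The three entries quoted above, wrapped the same way as in the mail.
SAMPLE = """
2015-04-22 17:09:40.019825 7f74a8f70700  0 -- 10.143.16.11:0/4255 >>
10.141.16.12:6807/2426 pipe(0x54f2000 sd=68 :44692 s=1 pgs=0 cs=0 l=1
c=0x55c8dc0).connect claims to be 10.141.16.12:6807/1004858 not
10.141.16.12:6807/2426 - wrong node!
2015-04-22 17:09:40.019827 7f74a694a700  0 -- 10.143.16.11:0/4255 >>
10.143.16.12:6801/2146 pipe(0x5719000 sd=57 :56935 s=1 pgs=0 cs=0 l=1
c=0x55ce9e0).connect claims to be 10.143.16.12:6801/1005047 not
10.143.16.12:6801/2146 - wrong node!
2015-04-22 17:09:40.019867 7f74a9f80700  0 -- 10.143.16.11:0/4255 >>
10.143.16.12:6803/2228 pipe(0x5722000 sd=60 :36208 s=1 pgs=0 cs=0 l=1
c=0x55cf640).connect claims to be 10.143.16.12:6803/1005739 not
10.143.16.12:6803/2228 - wrong node!
"""

if __name__ == "__main__":
    found = wrong_node_mismatches(SAMPLE)
    for expected, claimed in found:
        print("expected", expected, "but peer claims", claimed)
    print(dict(count_by_subnet(found)))
```

In every pair the IP and port are identical and only the number after the slash differs, and the complaints involve peers on both the 10.141 and the 10.143 network.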

Others have:

2015-04-22 17:04:52.125096 7fe99e84e700  0 -- 10.143.16.11:6824/3871 >> 10.143.16.11:6828/4255 pipe(0x604c800 sd=30 :6824 s=2 pgs=14 cs=1 l=0 c=0x5ae27e0).fault with nothing to send, going to standby
2015-04-22 17:04:52.126353 7fe98c9ed700  0 -- 10.143.16.11:0/3871 >> 10.141.16.11:6829/4255 pipe(0x653d800 sd=28 :0 s=1 pgs=0 cs=0 l=1 c=0x65c2d60).fault
2015-04-22 17:04:52.126363 7fe990225700  0 -- 10.143.16.11:0/3871 >> 10.143.16.11:6829/4255 pipe(0x6325800 sd=21 :0 s=1 pgs=0 cs=0 l=1 c=0x65c3440).fault
2015-04-22 17:04:52.128847 7fe98fb1e700  0 -- 10.143.16.11:6824/3871 >> 10.143.16.17:6840/1004452 pipe(0x6518000 sd=62 :6824 s=2 pgs=67 cs=1 l=0 c=0x65da7e0).fault with nothing to send, going to standby
2015-04-22 17:05:01.610056 7fe98c8ec700  0 -- 10.143.16.11:0/3871 >> 10.141.16.11:6823/3641 pipe(0x6542000 sd=61 :0 s=1 pgs=0 cs=0 l=1 c=0x65c7380).fault
2015-04-22 17:05:01.616051 7fe990a2d700  0 -- 10.143.16.11:0/3871 >> 10.143.16.11:6823/3641 pipe(0x579c800 sd=63 :0 s=1 pgs=0 cs=0 l=1 c=0x65c1a20).fault
2015-04-22 17:05:01.646500 7fe9a515c700  0 log_channel(cluster) log [WRN] : map e1993 wrongly marked me down

I tested the network, and the hosts can reach one another on both networks.
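
A reachability check at the TCP level, on the OSD ports themselves, could be sketched as follows. This is only an illustration: can_connect and check_peers are hypothetical helper names, and the peer addresses in the usage part are simply ones taken from the log lines above.

```python
import socket


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable networks.
        return False


def check_peers(peers, timeout=2.0):
    """Map each (host, port) pair to whether a TCP connect succeeded."""
    return {peer: can_connect(*peer, timeout=timeout) for peer in peers}


if __name__ == "__main__":
    # Addresses copied from the log excerpts; one peer on each network.
    peers = [("10.141.16.12", 6807), ("10.143.16.12", 6801)]
    for (host, port), ok in check_peers(peers).items():
        print(f"{host}:{port} {'reachable' if ok else 'NOT reachable'}")
```

A successful connect only shows that something is listening on that port; it says nothing about whether it is the daemon the cluster map expects there.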

Is this somehow fixable?

Many thanks!
Kenneth
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com