cluster not coming up after reboot

Hi,

I changed the cluster network parameter in the config files, restarted the monitors, and then restarted all the OSDs (which I shouldn't have done). Now the OSDs keep crashing and the cluster is unable to recover. I eventually rebooted the whole cluster, but the problem remains: for a moment all 280 OSDs are up, and then they start crashing rapidly until fewer than 100 are left (and eventually 30 or so).

In the log files I see different kinds of messages. Some OSDs have:

2015-04-22 17:09:40.019825 7f74a8f70700 0 -- 10.143.16.11:0/4255 >> 10.141.16.12:6807/2426 pipe(0x54f2000 sd=68 :44692 s=1 pgs=0 cs=0 l=1 c=0x55c8dc0).connect claims to be 10.141.16.12:6807/1004858 not 10.141.16.12:6807/2426 - wrong node!
2015-04-22 17:09:40.019827 7f74a694a700 0 -- 10.143.16.11:0/4255 >> 10.143.16.12:6801/2146 pipe(0x5719000 sd=57 :56935 s=1 pgs=0 cs=0 l=1 c=0x55ce9e0).connect claims to be 10.143.16.12:6801/1005047 not 10.143.16.12:6801/2146 - wrong node!
2015-04-22 17:09:40.019867 7f74a9f80700 0 -- 10.143.16.11:0/4255 >> 10.143.16.12:6803/2228 pipe(0x5722000 sd=60 :36208 s=1 pgs=0 cs=0 l=1 c=0x55cf640).connect claims to be 10.143.16.12:6803/1005739 not 10.143.16.12:6803/2228 - wrong node!
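
With 280 OSDs there are too many of these lines to read by hand, so something like the following rough Python sketch could tally which networks the "wrong node" connections run between. The function name summarize_wrong_node and the grouping by the first two octets are my own assumptions for illustration (the real netmasks may differ); the sample lines are copied from the log above.

```python
import re
from collections import Counter

# Matches lines of the form:
#   -- <local ip>:<port>/<nonce> >> <peer ip>:<port>/<nonce> pipe(...).connect
#   claims to be <addr>/<nonce> not <addr>/<nonce> - wrong node!
WRONG_NODE = re.compile(
    r"-- (?P<local>\d+\.\d+\.\d+\.\d+):\d+/\d+ >> "
    r"(?P<peer>\d+\.\d+\.\d+\.\d+):\d+/(?P<expected>\d+) "
    r".*?claims to be \S+?/(?P<claimed>\d+) not "
)

# Two lines taken verbatim from the OSD log above.
SAMPLE = (
    "2015-04-22 17:09:40.019825 7f74a8f70700 0 -- 10.143.16.11:0/4255 >> "
    "10.141.16.12:6807/2426 pipe(0x54f2000 sd=68 :44692 s=1 pgs=0 cs=0 l=1 "
    "c=0x55c8dc0).connect claims to be 10.141.16.12:6807/1004858 not "
    "10.141.16.12:6807/2426 - wrong node!\n"
    "2015-04-22 17:09:40.019827 7f74a694a700 0 -- 10.143.16.11:0/4255 >> "
    "10.143.16.12:6801/2146 pipe(0x5719000 sd=57 :56935 s=1 pgs=0 cs=0 l=1 "
    "c=0x55ce9e0).connect claims to be 10.143.16.12:6801/1005047 not "
    "10.143.16.12:6801/2146 - wrong node!\n"
)


def summarize_wrong_node(text):
    """Count (local prefix, peer prefix) pairs among 'wrong node' messages.

    The prefix is the first two octets of the address, which is only an
    assumption about how the two networks are split.
    """
    pairs = Counter()
    for match in WRONG_NODE.finditer(text):
        local_net = ".".join(match.group("local").split(".")[:2])
        peer_net = ".".join(match.group("peer").split(".")[:2])
        pairs[(local_net, peer_net)] += 1
    return pairs


if __name__ == "__main__":
    for (local_net, peer_net), count in summarize_wrong_node(SAMPLE).items():
        print(f"{local_net}.x.x -> {peer_net}.x.x: {count}")
```

Feeding a whole OSD log file into summarize_wrong_node instead of SAMPLE should show whether the failing connections are mostly to 10.141 addresses, mostly to 10.143 addresses, or a mix of both.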

Others have:

2015-04-22 17:04:52.125096 7fe99e84e700 0 -- 10.143.16.11:6824/3871 >> 10.143.16.11:6828/4255 pipe(0x604c800 sd=30 :6824 s=2 pgs=14 cs=1 l=0 c=0x5ae27e0).fault with nothing to send, going to standby
2015-04-22 17:04:52.126353 7fe98c9ed700 0 -- 10.143.16.11:0/3871 >> 10.141.16.11:6829/4255 pipe(0x653d800 sd=28 :0 s=1 pgs=0 cs=0 l=1 c=0x65c2d60).fault
2015-04-22 17:04:52.126363 7fe990225700 0 -- 10.143.16.11:0/3871 >> 10.143.16.11:6829/4255 pipe(0x6325800 sd=21 :0 s=1 pgs=0 cs=0 l=1 c=0x65c3440).fault
2015-04-22 17:04:52.128847 7fe98fb1e700 0 -- 10.143.16.11:6824/3871 >> 10.143.16.17:6840/1004452 pipe(0x6518000 sd=62 :6824 s=2 pgs=67 cs=1 l=0 c=0x65da7e0).fault with nothing to send, going to standby
2015-04-22 17:05:01.610056 7fe98c8ec700 0 -- 10.143.16.11:0/3871 >> 10.141.16.11:6823/3641 pipe(0x6542000 sd=61 :0 s=1 pgs=0 cs=0 l=1 c=0x65c7380).fault
2015-04-22 17:05:01.616051 7fe990a2d700 0 -- 10.143.16.11:0/3871 >> 10.143.16.11:6823/3641 pipe(0x579c800 sd=63 :0 s=1 pgs=0 cs=0 l=1 c=0x65c1a20).fault
2015-04-22 17:05:01.646500 7fe9a515c700 0 log_channel(cluster) log [WRN] : map e1993 wrongly marked me down

I tested the network, and the hosts can reach one another on both networks.

Is this somehow fixable?

Many thanks!
Kenneth
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



