Hello,

Googling for "ceph wrong node" gives us this insightful thread:
https://www.mail-archive.com/ceph-users@xxxxxxxxxxxxxx/msg09960.html
I suggest reading through it, more below:

On Mon, 29 Feb 2016 15:30:41 +0100 Oliver Dzombic wrote:

> Hi,
>
> I'm facing some trouble with the cluster here.
>
> Suddenly, "random" OSDs are getting marked out.
>
> After restarting the OSD on the specific node, it's working again.
>
Matches the scenario mentioned above.

> This usually happens during active scrubbing/deep scrubbing.
>
I guess your cluster is very much overloaded on some level; use atop or
similar tools to find out what needs improvement.

Also, as always, the versions of all SW/kernel, a HW description, the
output of "ceph -s" etc. will help people identify possible problem
spots or correlate this with other things (a short sketch of these
commands is appended below).

Christian

> In the logs I can see:
>
> 2016-02-29 06:08:58.130376 7fd5dae75700 0 -- 10.0.1.2:0/36459 >> 10.0.0.4:6807/9051245 pipe(0x27488000 sd=58 :60473 s=1 pgs=0 cs=0 l=1 c=0x28b39440).connect claims to be 10.0.0.4:6807/12051245 not 10.0.0.4:6807/9051245 - wrong node!
> 2016-02-29 06:08:58.130417 7fd5d9961700 0 -- 10.0.1.2:0/36459 >> 10.0.1.4:6803/6002429 pipe(0x2a6c9000 sd=75 :37736 s=1 pgs=0 cs=0 l=1 c=0x2420be40).connect claims to be 10.0.1.4:6803/10002429 not 10.0.1.4:6803/6002429 - wrong node!
> 2016-02-29 06:08:58.130918 7fd5b1c17700 0 -- 10.0.1.2:0/36459 >> 10.0.0.1:6800/8050402 pipe(0x26834000 sd=74 :37605 s=1 pgs=0 cs=0 l=1 c=0x1f7a9020).connect claims to be 10.0.0.1:6800/9050770 not 10.0.0.1:6800/8050402 - wrong node!
> 2016-02-29 06:08:58.131266 7fd5be141700 0 -- 10.0.1.2:0/36459 >> 10.0.0.3:6806/9059302 pipe(0x27f07000 sd=76 :48347 s=1 pgs=0 cs=0 l=1 c=0x2371adc0).connect claims to be 10.0.0.3:6806/11059302 not 10.0.0.3:6806/9059302 - wrong node!
> 2016-02-29 06:08:58.131299 7fd5c1914700 0 -- 10.0.1.2:0/36459 >> 10.0.1.4:6801/9051245 pipe(0x2d288000 sd=100 :33848 s=1 pgs=0 cs=0 l=1 c=0x28b37760).connect claims to be 10.0.1.4:6801/12051245 not 10.0.1.4:6801/9051245 - wrong node!
>
> and
>
> 2016-02-29 06:08:59.230754 7fd5c5425700 -1 osd.3 14877 heartbeat_check: no reply from osd.0 since back 2016-02-29 05:55:26.351951 front 2016-02-29 05:55:26.351951 (cutoff 2016-02-29 06:08:39.230753)
> 2016-02-29 06:08:59.230761 7fd5c5425700 -1 osd.3 14877 heartbeat_check: no reply from osd.1 since back 2016-02-29 05:41:59.191341 front 2016-02-29 05:41:59.191341 (cutoff 2016-02-29 06:08:39.230753)
> 2016-02-29 06:08:59.230765 7fd5c5425700 -1 osd.3 14877 heartbeat_check: no reply from osd.2 since back 2016-02-29 05:41:59.191341 front 2016-02-29 05:41:59.191341 (cutoff 2016-02-29 06:08:39.230753)
> 2016-02-29 06:08:59.230769 7fd5c5425700 -1 osd.3 14877 heartbeat_check: no reply from osd.4 since back 2016-02-29 05:55:30.452505 front 2016-02-29 05:55:30.452505 (cutoff 2016-02-29 06:08:39.230753)
> 2016-02-29 06:08:59.230773 7fd5c5425700 -1 osd.3 14877 heartbeat_check: no reply from osd.7 since back 2016-02-29 05:41:52.790422 front 2016-02-29 05:41:52.790422 (cutoff 2016-02-29 06:08:39.230753)
>
> Any idea what could be troubling the cluster?
>
> Thank you!

--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
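A minimal sketch of the diagnostics asked for above, assuming a stock
Ceph CLI and an atop installation (exact output fields vary by release);
run them on a monitor node and on the affected OSD nodes:

    ceph -s          # overall cluster health, mon quorum, PG states
    ceph -v          # installed Ceph release on this node
    uname -r         # running kernel
    ceph osd tree    # up/in state of every OSD and the host it lives on
    ceph osd perf    # per-OSD commit/apply latency, highlights slow disks
    atop 5           # live CPU/disk/network utilisation, sampled every 5s

If atop shows the OSD disks saturated during deep scrubs, throttling
scrubbing (for example via the osd_max_scrubs and osd_scrub_sleep
options) is a common mitigation.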