Re: osd suddenly down / connect claims to be / heartbeat_check: no reply

Christian Balzer <chibi@xxxxxxx> · Tue, 1 Mar 2016 11:13:08 +0900

Hello,

Googling for "ceph wrong node" turns up this insightful thread:
https://www.mail-archive.com/ceph-users@xxxxxxxxxxxxxx/msg09960.html

I suggest reading through it; more comments inline below:

On Mon, 29 Feb 2016 15:30:41 +0100 Oliver Dzombic wrote:

> Hi,
> 
> I am having some trouble with the cluster here.
> 
> Suddenly "random" OSDs are getting marked out.
> 
> After restarting the OSD on the specific node, it is working again.
> 
Matches the scenario mentioned above.

> This usually happens while scrubbing/deep scrubbing is active.
>
I guess your cluster is badly overloaded at some level. Use atop or
similar tools to find out what needs improvement.

Also, as always, the versions of all software and the kernel, a hardware
description, the output of "ceph -s", etc. will help people identify
possible problem spots or correlate this with other reports.
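
To make that kind of report easy to paste into a mail, here is a minimal
Python sketch that runs a handful of commands and joins their output.
Apart from "ceph -s", the command list (ceph -v, ceph health detail,
ceph osd tree, uname -r) is my own suggestion of what is usually useful,
and the helper names run_command and build_report are made up for this
example. It assumes Python 3.7 or later.

```python
import subprocess

# Commands whose output is typically worth including in a report.
# Only "ceph -s" is named in the text; the rest are assumed extras.
COMMANDS = [
    ["ceph", "-v"],
    ["ceph", "-s"],
    ["ceph", "health", "detail"],
    ["ceph", "osd", "tree"],
    ["uname", "-r"],
]


def run_command(argv, timeout=30):
    """Run one command and return its output, or a failure note."""
    try:
        done = subprocess.run(
            argv, capture_output=True, text=True, timeout=timeout
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        # Covers a missing binary as well as a hung command.
        return f"<failed: {exc}>"
    return (done.stdout or done.stderr).strip()


def build_report(commands=COMMANDS, runner=run_command):
    """Return one text block with each command followed by its output."""
    sections = []
    for argv in commands:
        sections.append("$ " + " ".join(argv))
        sections.append(runner(argv))
        sections.append("")
    return "\n".join(sections)
```

Calling print(build_report()) on a node that can reach the cluster should
give something that can be pasted as-is; the hardware description still
has to be written by hand.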

Christian

> In the logs I can see:
> 
> 2016-02-29 06:08:58.130376 7fd5dae75700  0 -- 10.0.1.2:0/36459 >>
> 10.0.0.4:6807/9051245 pipe(0x27488000 sd=58 :60473 s=1 pgs=0 cs=0 l=1
> c=0x28b39440).connect claims to be 10.0.0.4:6807/12051245 not
> 10.0.0.4:6807/9051245 - wrong node!
> 2016-02-29 06:08:58.130417 7fd5d9961700  0 -- 10.0.1.2:0/36459 >>
> 10.0.1.4:6803/6002429 pipe(0x2a6c9000 sd=75 :37736 s=1 pgs=0 cs=0 l=1
> c=0x2420be40).connect claims to be 10.0.1.4:6803/10002429 not
> 10.0.1.4:6803/6002429 - wrong node!
> 2016-02-29 06:08:58.130918 7fd5b1c17700  0 -- 10.0.1.2:0/36459 >>
> 10.0.0.1:6800/8050402 pipe(0x26834000 sd=74 :37605 s=1 pgs=0 cs=0 l=1
> c=0x1f7a9020).connect claims to be 10.0.0.1:6800/9050770 not
> 10.0.0.1:6800/8050402 - wrong node!
> 2016-02-29 06:08:58.131266 7fd5be141700  0 -- 10.0.1.2:0/36459 >>
> 10.0.0.3:6806/9059302 pipe(0x27f07000 sd=76 :48347 s=1 pgs=0 cs=0 l=1
> c=0x2371adc0).connect claims to be 10.0.0.3:6806/11059302 not
> 10.0.0.3:6806/9059302 - wrong node!
> 2016-02-29 06:08:58.131299 7fd5c1914700  0 -- 10.0.1.2:0/36459 >>
> 10.0.1.4:6801/9051245 pipe(0x2d288000 sd=100 :33848 s=1 pgs=0 cs=0 l=1
> c=0x28b37760).connect claims to be 10.0.1.4:6801/12051245 not
> 10.0.1.4:6801/9051245 - wrong node!
> 
> and
> 
> 2016-02-29 06:08:59.230754 7fd5c5425700 -1 osd.3 14877 heartbeat_check:
> no reply from osd.0 since back 2016-02-29 05:55:26.351951 front
> 2016-02-29 05:55:26.351951 (cutoff 2016-02-29 06:08:39.230753)
> 2016-02-29 06:08:59.230761 7fd5c5425700 -1 osd.3 14877 heartbeat_check:
> no reply from osd.1 since back 2016-02-29 05:41:59.191341 front
> 2016-02-29 05:41:59.191341 (cutoff 2016-02-29 06:08:39.230753)
> 2016-02-29 06:08:59.230765 7fd5c5425700 -1 osd.3 14877 heartbeat_check:
> no reply from osd.2 since back 2016-02-29 05:41:59.191341 front
> 2016-02-29 05:41:59.191341 (cutoff 2016-02-29 06:08:39.230753)
> 2016-02-29 06:08:59.230769 7fd5c5425700 -1 osd.3 14877 heartbeat_check:
> no reply from osd.4 since back 2016-02-29 05:55:30.452505 front
> 2016-02-29 05:55:30.452505 (cutoff 2016-02-29 06:08:39.230753)
> 2016-02-29 06:08:59.230773 7fd5c5425700 -1 osd.3 14877 heartbeat_check:
> no reply from osd.7 since back 2016-02-29 05:41:52.790422 front
> 2016-02-29 05:41:52.790422 (cutoff 2016-02-29 06:08:39.230753)
> 
> 
> Any idea what could be causing the trouble in the cluster?
> 
> Thank you!
> 
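If you want to tally the "wrong node" messages per peer, a minimal Python
sketch along these lines may help. It pulls the expected and the claimed
address out of lines like the ones quoted above, assuming the
ip:port/number layout seen there (the trailing number is treated as an
opaque instance identifier, called "nonce" in the code). The helper names
unquote, split_addr and wrong_node_events are my own, not anything from
Ceph.

```python
import re

# Matches the tail of the messenger complaint, e.g.
# "connect claims to be A:P/N2 not A:P/N1 - wrong node!"
WRONG_NODE = re.compile(
    r"connect claims to be (?P<claimed>\S+) not (?P<expected>\S+) - wrong node!"
)


def unquote(text):
    """Strip leading '>' quote markers and rejoin mail-wrapped lines."""
    lines = [re.sub(r"^>\s?", "", line) for line in text.splitlines()]
    # Collapse all runs of whitespace so wrapped entries become one line.
    return " ".join(" ".join(lines).split())


def split_addr(addr):
    """Split 'ip:port/nonce' into (ip, port, nonce)."""
    endpoint, nonce = addr.rsplit("/", 1)
    ip, port = endpoint.rsplit(":", 1)
    return ip, int(port), int(nonce)


def wrong_node_events(text):
    """Return one dict per 'wrong node' message found in text."""
    events = []
    for match in WRONG_NODE.finditer(unquote(text)):
        ip, port, expected = split_addr(match.group("expected"))
        _, _, claimed = split_addr(match.group("claimed"))
        events.append(
            {
                "ip": ip,
                "port": port,
                "expected_nonce": expected,
                "claimed_nonce": claimed,
            }
        )
    return events


# First entry from the quoted log, including the mail quoting and wrapping.
sample = """\
> 2016-02-29 06:08:58.130376 7fd5dae75700  0 -- 10.0.1.2:0/36459 >>
> 10.0.0.4:6807/9051245 pipe(0x27488000 sd=58 :60473 s=1 pgs=0 cs=0 l=1
> c=0x28b39440).connect claims to be 10.0.0.4:6807/12051245 not
> 10.0.0.4:6807/9051245 - wrong node!
"""

events = wrong_node_events(sample)
for event in events:
    print(event)
```

Grouping the result by ip and port shows which peers are being contacted
under a stale identity, which is the situation discussed in the thread
linked at the top.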

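The heartbeat_check lines quoted above carry enough timestamps to work
out how long each peer has been silent. The following is a rough Python
sketch for that, assuming the exact line layout shown in the quote; the
names heartbeat_gaps, silent_s and grace_s are made up for this example.
The gap between log time and cutoff comes out at about 20 seconds for
the quoted lines, which would match the default osd_heartbeat_grace,
assuming it has not been changed on this cluster.

```python
import re
from datetime import datetime

TS = r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+"
HEARTBEAT = re.compile(
    rf"(?P<logged>{TS}) \S+ -?\d+ (?P<reporter>osd\.\d+) \d+ heartbeat_check: "
    rf"no reply from (?P<peer>osd\.\d+) since back (?P<back>{TS}) "
    rf"front (?P<front>{TS}) \(cutoff (?P<cutoff>{TS})\)"
)


def unquote(text):
    """Strip leading '>' quote markers and rejoin mail-wrapped lines."""
    lines = [re.sub(r"^>\s?", "", line) for line in text.splitlines()]
    return " ".join(" ".join(lines).split())


def parse_ts(value):
    return datetime.strptime(value, "%Y-%m-%d %H:%M:%S.%f")


def heartbeat_gaps(text):
    """Return one dict per heartbeat_check complaint found in text."""
    rows = []
    for match in HEARTBEAT.finditer(unquote(text)):
        logged = parse_ts(match.group("logged"))
        # Use the older of the back/front replies as the worst case.
        last_reply = min(
            parse_ts(match.group("back")), parse_ts(match.group("front"))
        )
        cutoff = parse_ts(match.group("cutoff"))
        rows.append(
            {
                "reporter": match.group("reporter"),
                "peer": match.group("peer"),
                "silent_s": (logged - last_reply).total_seconds(),
                "grace_s": (logged - cutoff).total_seconds(),
            }
        )
    return rows


# First two entries from the quoted log, with mail quoting and wrapping.
sample = """\
> 2016-02-29 06:08:59.230754 7fd5c5425700 -1 osd.3 14877 heartbeat_check:
> no reply from osd.0 since back 2016-02-29 05:55:26.351951 front
> 2016-02-29 05:55:26.351951 (cutoff 2016-02-29 06:08:39.230753)
> 2016-02-29 06:08:59.230761 7fd5c5425700 -1 osd.3 14877 heartbeat_check:
> no reply from osd.1 since back 2016-02-29 05:41:59.191341 front
> 2016-02-29 05:41:59.191341 (cutoff 2016-02-29 06:08:39.230753)
"""

rows = heartbeat_gaps(sample)
for row in rows:
    print(row)
```

For the first quoted line this gives roughly 813 seconds of silence from
osd.0 as seen by osd.3, far beyond the 20 second window, so these are not
borderline misses.
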
-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/