70+ OSDs are DOWN and not coming up

Hello Cephers, I need your suggestions for troubleshooting.

My cluster is struggling badly: 70+ OSDs are down out of 165.

Problem: OSDs are getting marked down and out of the cluster, and the cluster is degraded. On checking the logs of the failed OSDs, we see weird entries that are being generated continuously.

OSD debug logs: http://pastebin.com/agTKh6zB


2014-05-20 10:19:03.699886 7f2328e237a0  0 osd.158 357532 done with init, starting boot process
2014-05-20 10:19:03.700093 7f22ff621700  0 -- 192.168.1.112:6802/3807 >> 192.168.1.109:6802/910005982 pipe(0x8698500 sd=35 :33500 s=1 pgs=0 cs=0 l=0 c=0x83018c0).connect claims to be 192.168.1.109:6802/63896 not 192.168.1.109:6802/910005982 - wrong node!
2014-05-20 10:19:03.700152 7f22ff621700  0 -- 192.168.1.112:6802/3807 >> 192.168.1.109:6802/910005982 pipe(0x8698500 sd=35 :33500 s=1 pgs=0 cs=0 l=0 c=0x83018c0).fault with nothing to send, going to standby
2014-05-20 10:19:09.551269 7f22fdd12700  0 -- 192.168.1.112:6802/3807 >> 192.168.1.109:6803/1176009454 pipe(0x56aee00 sd=53 :40060 s=1 pgs=0 cs=0 l=0 c=0x533fd20).connect claims to be 192.168.1.109:6803/63896 not 192.168.1.109:6803/1176009454 - wrong node!
2014-05-20 10:19:09.551347 7f22fdd12700  0 -- 192.168.1.112:6802/3807 >> 192.168.1.109:6803/1176009454 pipe(0x56aee00 sd=53 :40060 s=1 pgs=0 cs=0 l=0 c=0x533fd20).fault with nothing to send, going to standby
2014-05-20 10:19:09.703901 7f22fd80d700  0 -- 192.168.1.112:6802/3807 >> 192.168.1.113:6802/13870 pipe(0x56adf00 sd=137 :42889 s=1 pgs=0 cs=0 l=0 c=0x8302aa0).connect claims to be 192.168.1.113:6802/24612 not 192.168.1.113:6802/13870 - wrong node!
2014-05-20 10:19:09.704039 7f22fd80d700  0 -- 192.168.1.112:6802/3807 >> 192.168.1.113:6802/13870 pipe(0x56adf00 sd=137 :42889 s=1 pgs=0 cs=0 l=0 c=0x8302aa0).fault with nothing to send, going to standby
2014-05-20 10:19:10.243139 7f22fd005700  0 -- 192.168.1.112:6802/3807 >> 192.168.1.112:6800/14114 pipe(0x56a8f00 sd=146 :43726 s=1 pgs=0 cs=0 l=0 c=0x8304780).connect claims to be 192.168.1.112:6800/2852 not 192.168.1.112:6800/14114 - wrong node!
2014-05-20 10:19:10.243190 7f22fd005700  0 -- 192.168.1.112:6802/3807 >> 192.168.1.112:6800/14114 pipe(0x56a8f00 sd=146 :43726 s=1 pgs=0 cs=0 l=0 c=0x8304780).fault with nothing to send, going to standby
2014-05-20 10:19:10.349693 7f22fc7fd700  0 -- 192.168.1.112:6802/3807 >> 192.168.1.109:6800/13492 pipe(0x8698c80 sd=156 :0 s=1 pgs=0 cs=0 l=0 c=0x83070c0).fault with nothing to send, going to standby
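For what it's worth, the "wrong node!" lines can be pulled apart: as far as I understand the messenger, the number after the "/" in each address is a per-process nonce, so when the claimed and expected nonces differ, the peer daemon has restarted and this OSD is still dialing the old instance it has on record. A throwaway sketch (the log line is copied from above, truncated to the relevant part) that extracts the two addresses:

```shell
# Hypothetical helper, not a ceph tool: split a "wrong node!" line into the
# address the peer actually claims vs. the one this OSD expected. The nonce
# after the "/" identifies the daemon instance, so a mismatch suggests the
# peer restarted and this OSD holds stale address information.
line='connect claims to be 192.168.1.109:6802/63896 not 192.168.1.109:6802/910005982 - wrong node!'
claimed=$(echo "$line" | sed -n 's/.*claims to be \([^ ]*\) not.*/\1/p')
expected=$(echo "$line" | sed -n 's/.*not \([^ ]*\) - wrong node!.*/\1/p')
echo "peer actually is:  $claimed"
echo "this OSD expected: $expected"
```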


# ceph -v
ceph version 0.80-469-g991f7f1 (991f7f15a6e107b33a24bbef1169f21eb7fcce2c)
# ceph osd stat
osdmap e357073: 165 osds: 91 up, 165 in
flags noout
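As a sanity check on those numbers: since the noout flag is set, down OSDs are not marked out, so the gap between the "in" and "up" counts in the `ceph osd stat` line is the down count. A throwaway sketch parsing the line quoted above:

```shell
# Parse the "ceph osd stat" line quoted above; with noout set, down OSDs
# stay "in", so "in" minus "up" gives the number of down OSDs.
stat='osdmap e357073: 165 osds: 91 up, 165 in'
up=$(echo "$stat" | sed -n 's/.* \([0-9][0-9]*\) up.*/\1/p')
total_in=$(echo "$stat" | sed -n 's/.* \([0-9][0-9]*\) in.*/\1/p')
down=$((total_in - up))
echo "down OSDs: $down"   # prints: down OSDs: 74
```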
What I have tried so far:

1. Restarting the problematic OSDs, but no luck.
2. Restarting the entire host, but no luck; the OSDs are still down and logging the same message:

2014-05-20 10:19:10.243139 7f22fd005700  0 -- 192.168.1.112:6802/3807 >> 192.168.1.112:6800/14114 pipe(0x56a8f00 sd=146 :43726 s=1 pgs=0 cs=0 l=0 c=0x8304780).connect claims to be 192.168.1.112:6800/2852 not 192.168.1.112:6800/14114 - wrong node!
2014-05-20 10:19:10.243190 7f22fd005700  0 -- 192.168.1.112:6802/3807 >> 192.168.1.112:6800/14114 pipe(0x56a8f00 sd=146 :43726 s=1 pgs=0 cs=0 l=0 c=0x8304780).fault with nothing to send, going to standby
2014-05-20 10:19:10.349693 7f22fc7fd700  0 -- 192.168.1.112:6802/3807 >> 192.168.1.109:6800/13492 pipe(0x8698c80 sd=156 :0 s=1 pgs=0 cs=0 l=0 c=0x83070c0).fault with nothing to send, going to standby
2014-05-20 10:22:23.312473 7f2307e61700  0 osd.158 357781 do_command r=0
2014-05-20 10:22:23.326110 7f2307e61700  0 osd.158 357781 do_command r=0 debug_osd=0/5
2014-05-20 10:22:23.326123 7f2307e61700  0 log [INF] : debug_osd=0/5
2014-05-20 10:34:08.161864 7f230224d700  0 -- 192.168.1.112:6802/3807 >> 192.168.1.102:6808/13276 pipe(0x8698280 sd=22 :41078 s=2 pgs=603 cs=1 l=0 c=0x8301600).fault with nothing to send, going to standby

3. The disks have no errors; there is nothing in dmesg or /var/log/messages.

4. There was a similar bug in the past (http://tracker.ceph.com/issues/4006); I don't know whether it has come back in Firefly.

5. No recent activity was performed on the cluster, except creating some pools and keys for Cinder/Glance integration.

6. The nodes have enough free resources for the OSDs.

7. No issues with the network; OSDs are down across all cluster nodes, not just a single node.


****************************************************************
Karan Singh 
Systems Specialist , Storage Platforms
CSC - IT Center for Science,
Keilaranta 14, P. O. Box 405, FIN-02101 Espoo, Finland
mobile: +358 503 812758
tel. +358 9 4572001
fax +358 9 4572302
http://www.csc.fi/
****************************************************************


