On Tue, 20 May 2014, Karan Singh wrote:
> Hello Cephers, need your suggestion for troubleshooting.
>
> My cluster is terribly struggling, 70+ OSDs are down out of 165.
>
> Problem: OSDs are getting marked out of the cluster and are down. The
> cluster is degraded. On checking the logs of a failed OSD we are getting
> weird entries that are continuously getting generated.

Tracking this at http://tracker.ceph.com/issues/8387

The most recent bits you posted in the ticket don't quite make sense: the
OSD is trying to connect to an address for an OSD that is currently marked
down. I suspect this is just timing between when the logs were captured
and when the ceph osd dump was captured. To get a complete picture,
please:

1) add

	debug osd = 20
	debug ms = 1

   in the [osd] section of ceph.conf and restart all osds

2) ceph osd set nodown  (to prevent flapping)

3) find some OSD that is showing these messages

4) capture a 'ceph osd dump' output.

Also happy to debug this interactively over IRC; that will likely be
faster!

Thanks-
sage

>
> Osd Debug logs :: http://pastebin.com/agTKh6zB
>
>   2014-05-20 10:19:03.699886 7f2328e237a0 0 osd.158 357532 done with init, starting boot process
>   2014-05-20 10:19:03.700093 7f22ff621700 0 -- 192.168.1.112:6802/3807 >> 192.168.1.109:6802/910005982 pipe(0x8698500 sd=35 :33500 s=1 pgs=0 cs=0 l=0 c=0x83018c0).connect claims to be 192.168.1.109:6802/63896 not 192.168.1.109:6802/910005982 - wrong node!
>   2014-05-20 10:19:03.700152 7f22ff621700 0 -- 192.168.1.112:6802/3807 >> 192.168.1.109:6802/910005982 pipe(0x8698500 sd=35 :33500 s=1 pgs=0 cs=0 l=0 c=0x83018c0).fault with nothing to send, going to standby
>   2014-05-20 10:19:09.551269 7f22fdd12700 0 -- 192.168.1.112:6802/3807 >> 192.168.1.109:6803/1176009454 pipe(0x56aee00 sd=53 :40060 s=1 pgs=0 cs=0 l=0 c=0x533fd20).connect claims to be 192.168.1.109:6803/63896 not 192.168.1.109:6803/1176009454 - wrong node!
>   2014-05-20 10:19:09.551347 7f22fdd12700 0 -- 192.168.1.112:6802/3807 >> 192.168.1.109:6803/1176009454 pipe(0x56aee00 sd=53 :40060 s=1 pgs=0 cs=0 l=0 c=0x533fd20).fault with nothing to send, going to standby
>   2014-05-20 10:19:09.703901 7f22fd80d700 0 -- 192.168.1.112:6802/3807 >> 192.168.1.113:6802/13870 pipe(0x56adf00 sd=137 :42889 s=1 pgs=0 cs=0 l=0 c=0x8302aa0).connect claims to be 192.168.1.113:6802/24612 not 192.168.1.113:6802/13870 - wrong node!
>   2014-05-20 10:19:09.704039 7f22fd80d700 0 -- 192.168.1.112:6802/3807 >> 192.168.1.113:6802/13870 pipe(0x56adf00 sd=137 :42889 s=1 pgs=0 cs=0 l=0 c=0x8302aa0).fault with nothing to send, going to standby
>   2014-05-20 10:19:10.243139 7f22fd005700 0 -- 192.168.1.112:6802/3807 >> 192.168.1.112:6800/14114 pipe(0x56a8f00 sd=146 :43726 s=1 pgs=0 cs=0 l=0 c=0x8304780).connect claims to be 192.168.1.112:6800/2852 not 192.168.1.112:6800/14114 - wrong node!
>   2014-05-20 10:19:10.243190 7f22fd005700 0 -- 192.168.1.112:6802/3807 >> 192.168.1.112:6800/14114 pipe(0x56a8f00 sd=146 :43726 s=1 pgs=0 cs=0 l=0 c=0x8304780).fault with nothing to send, going to standby
>   2014-05-20 10:19:10.349693 7f22fc7fd700 0 -- 192.168.1.112:6802/3807 >> 192.168.1.109:6800/13492 pipe(0x8698c80 sd=156 :0 s=1 pgs=0 cs=0 l=0 c=0x83070c0).fault with nothing to send, going to standby
>
> ceph -v
>   ceph version 0.80-469-g991f7f1 (991f7f15a6e107b33a24bbef1169f21eb7fcce2c)
>
> ceph osd stat
>   osdmap e357073: 165 osds: 91 up, 165 in
>   flags noout
>
> I have tried doing:
>
> 1. Restarting the problematic OSDs, but no luck.
>
> 2. I restarted the entire host, but no luck; the OSDs are still down and
>    getting the same message:
>
>   2014-05-20 10:19:10.243139 7f22fd005700 0 -- 192.168.1.112:6802/3807 >> 192.168.1.112:6800/14114 pipe(0x56a8f00 sd=146 :43726 s=1 pgs=0 cs=0 l=0 c=0x8304780).connect claims to be 192.168.1.112:6800/2852 not 192.168.1.112:6800/14114 - wrong node!
>   2014-05-20 10:19:10.243190 7f22fd005700 0 -- 192.168.1.112:6802/3807 >> 192.168.1.112:6800/14114 pipe(0x56a8f00 sd=146 :43726 s=1 pgs=0 cs=0 l=0 c=0x8304780).fault with nothing to send, going to standby
>   2014-05-20 10:19:10.349693 7f22fc7fd700 0 -- 192.168.1.112:6802/3807 >> 192.168.1.109:6800/13492 pipe(0x8698c80 sd=156 :0 s=1 pgs=0 cs=0 l=0 c=0x83070c0).fault with nothing to send, going to standby
>   2014-05-20 10:22:23.312473 7f2307e61700 0 osd.158 357781 do_command r=0
>   2014-05-20 10:22:23.326110 7f2307e61700 0 osd.158 357781 do_command r=0 debug_osd=0/5
>   2014-05-20 10:22:23.326123 7f2307e61700 0 log [INF] : debug_osd=0/5
>   2014-05-20 10:34:08.161864 7f230224d700 0 -- 192.168.1.112:6802/3807 >> 192.168.1.102:6808/13276 pipe(0x8698280 sd=22 :41078 s=2 pgs=603 cs=1 l=0 c=0x8301600).fault with nothing to send, going to standby
>
> 3. Disks do not have errors; no messages in dmesg or /var/log/messages.
>
> 4. There was a bug in the past, http://tracker.ceph.com/issues/4006; I
>    don't know if it came back again in Firefly.
>
> 5. Recently no activity was performed on the cluster, except some pool
>    and key creation for Cinder/Glance integration.
>
> 6. Nodes have enough free resources for the OSDs.
>
> 7. No issues with the network; OSDs are down on all cluster nodes, not
>    just on a single node.
>
> ****************************************************************
> Karan Singh
> Systems Specialist, Storage Platforms
> CSC - IT Center for Science,
> Keilaranta 14, P. O. Box 405, FIN-02101 Espoo, Finland
> mobile: +358 503 812758
> tel. +358 9 4572001
> fax +358 9 4572302
> http://www.csc.fi/
> ****************************************************************
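
For step 3 above (finding OSDs that show these messages), a small script can tally the "wrong node!" entries across log files. The following is only a rough sketch, not a Ceph tool: the names `find_wrong_node` and `tally` are hypothetical, and the regular expression is an assumption derived solely from the log lines quoted in this thread, so other Ceph versions may need a different pattern.

```python
import re
from collections import Counter

# Pattern assumed from the log lines quoted in this thread only.
# It captures the local address, the peer address the OSD expected
# (taken from its osdmap), and the address the peer actually claimed.
WRONG_NODE = re.compile(
    r"--\s+(?P<local>\S+)\s+>>\s+(?P<expected>\S+)\s+pipe\(.*?\)"
    r"\.connect claims to be (?P<claimed>\S+) not (?P=expected)"
    r" - wrong node!"
)


def find_wrong_node(lines):
    """Yield (local, expected, claimed) for each 'wrong node!' entry."""
    for line in lines:
        match = WRONG_NODE.search(line)
        if match:
            yield (match.group("local"),
                   match.group("expected"),
                   match.group("claimed"))


def tally(lines):
    """Count how often each (local, expected, claimed) triple appears."""
    return Counter(find_wrong_node(lines))


# Three entries copied from the quoted log: two mismatches and one
# 'fault' line that the pattern should ignore.
SAMPLE = [
    "2014-05-20 10:19:03.700093 7f22ff621700 0 -- 192.168.1.112:6802/3807 >> "
    "192.168.1.109:6802/910005982 pipe(0x8698500 sd=35 :33500 s=1 pgs=0 cs=0 "
    "l=0 c=0x83018c0).connect claims to be 192.168.1.109:6802/63896 not "
    "192.168.1.109:6802/910005982 - wrong node!",
    "2014-05-20 10:19:03.700152 7f22ff621700 0 -- 192.168.1.112:6802/3807 >> "
    "192.168.1.109:6802/910005982 pipe(0x8698500 sd=35 :33500 s=1 pgs=0 cs=0 "
    "l=0 c=0x83018c0).fault with nothing to send, going to standby",
    "2014-05-20 10:19:10.243139 7f22fd005700 0 -- 192.168.1.112:6802/3807 >> "
    "192.168.1.112:6800/14114 pipe(0x56a8f00 sd=146 :43726 s=1 pgs=0 cs=0 "
    "l=0 c=0x8304780).connect claims to be 192.168.1.112:6800/2852 not "
    "192.168.1.112:6800/14114 - wrong node!",
]

for (local, expected, claimed), count in tally(SAMPLE).most_common():
    print(f"{count:4d}  {local} expected {expected}, peer claims {claimed}")
```

To run it against a real log, pass an open file instead of `SAMPLE`, for example `tally(open(path))` with `path` pointing at one OSD's log. The expected addresses can then be compared by hand with the 'ceph osd dump' output captured in step 4.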