Re: [ceph-users] 70+ OSD are DOWN and not coming up

On Tue, 20 May 2014, Karan Singh wrote:
> Hello Cephers, I need your suggestions for troubleshooting.
> 
> My cluster is struggling badly: 70+ of 165 OSDs are down.
> 
> Problem: OSDs are getting marked out of the cluster and are down. The cluster
> is degraded. On checking the logs of the failed OSDs, we see weird entries
> that are generated continuously.

Tracking this at http://tracker.ceph.com/issues/8387

The most recent bits you posted in the ticket don't quite make sense: the 
OSD is trying to connect to an address for an OSD that is currently marked 
down.  I suspect this is just timing between when the logs were captured 
and when the ceph osd dump was captured.  To get a complete picture, 
please:

1) add

 debug osd = 20
 debug ms = 1

in [osd] and restart all osds

2) ceph osd set nodown

(to prevent flapping)

3) find some OSD that is showing these messages

4) capture a 'ceph osd dump' output.
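
For step 3, a small script can pull the mismatched addresses out of an OSD log so they can be checked against the 'ceph osd dump' output from step 4. This is only a rough sketch: the pattern is inferred from the "wrong node!" lines quoted in this thread, the function name find_wrong_node is made up for illustration, and other Ceph versions may format these messages differently.

```python
import re

# Address form seen in the quoted logs: ip:port/nonce, e.g. 192.168.1.109:6802/63896
ADDR = r"\d+\.\d+\.\d+\.\d+:\d+/\d+"

# Assumed message shape (taken from the excerpts in this thread):
#   ...connect claims to be <actual> not <expected> - wrong node!
WRONG_NODE = re.compile(
    rf"claims to be\s+(?P<actual>{ADDR})\s+not\s+(?P<expected>{ADDR})\s+-\s+wrong node!"
)


def find_wrong_node(log_text):
    """Return sorted, de-duplicated (expected, actual) address pairs.

    'expected' is the address this OSD tried to reach; 'actual' is the
    address the peer reported for itself.
    """
    pairs = {(m["expected"], m["actual"]) for m in WRONG_NODE.finditer(log_text)}
    return sorted(pairs)
```

Calling find_wrong_node(open(path).read()) on an OSD log (path being wherever that log lives on your host) gives the list of peers to look up; the "expected" addresses are the ones worth comparing against the osd dump.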

Also happy to debug this interactively over IRC; that will likely be 
faster!

Thanks-
sage



> 
> Osd Debug logs ::  http://pastebin.com/agTKh6zB
> 
> 
>  1. 2014-05-20 10:19:03.699886 7f2328e237a0  0 osd.158 357532 done with
>     init, starting boot process
>  2. 2014-05-20 10:19:03.700093 7f22ff621700  0 -- 192.168.1.112:6802/3807 >>
>     192.168.1.109:6802/910005982 pipe(0x8698500 sd=35 :33500 s=1 pgs=0 cs=0
>     l=0 c=0x83018c0).connect claims to be 192.168.1.109:6802/63896 not
>     192.168.1.109:6802/910005982 - wrong node!
>  3. 2014-05-20 10:19:03.700152 7f22ff621700  0 -- 192.168.1.112:6802/3807 >>
>     192.168.1.109:6802/910005982 pipe(0x8698500 sd=35 :33500 s=1 pgs=0 cs=0
>     l=0 c=0x83018c0).fault with nothing to send, going to standby
>  4. 2014-05-20 10:19:09.551269 7f22fdd12700  0 -- 192.168.1.112:6802/3807 >>
>     192.168.1.109:6803/1176009454 pipe(0x56aee00 sd=53 :40060 s=1 pgs=0 cs=0
>     l=0 c=0x533fd20).connect claims to be 192.168.1.109:6803/63896 not
>     192.168.1.109:6803/1176009454 - wrong node!
>  5. 2014-05-20 10:19:09.551347 7f22fdd12700  0 -- 192.168.1.112:6802/3807 >>
>     192.168.1.109:6803/1176009454 pipe(0x56aee00 sd=53 :40060 s=1 pgs=0 cs=0
>     l=0 c=0x533fd20).fault with nothing to send, going to standby
>  6. 2014-05-20 10:19:09.703901 7f22fd80d700  0 -- 192.168.1.112:6802/3807 >>
>     192.168.1.113:6802/13870 pipe(0x56adf00 sd=137 :42889 s=1 pgs=0 cs=0 l=0
>     c=0x8302aa0).connect claims to be 192.168.1.113:6802/24612 not
>     192.168.1.113:6802/13870 - wrong node!
>  7. 2014-05-20 10:19:09.704039 7f22fd80d700  0 -- 192.168.1.112:6802/3807 >>
>     192.168.1.113:6802/13870 pipe(0x56adf00 sd=137 :42889 s=1 pgs=0 cs=0 l=0
>     c=0x8302aa0).fault with nothing to send, going to standby
>  8. 2014-05-20 10:19:10.243139 7f22fd005700  0 -- 192.168.1.112:6802/3807 >>
>     192.168.1.112:6800/14114 pipe(0x56a8f00 sd=146 :43726 s=1 pgs=0 cs=0 l=0
>     c=0x8304780).connect claims to be 192.168.1.112:6800/2852 not
>     192.168.1.112:6800/14114 - wrong node!
>  9. 2014-05-20 10:19:10.243190 7f22fd005700  0 -- 192.168.1.112:6802/3807 >>
>     192.168.1.112:6800/14114 pipe(0x56a8f00 sd=146 :43726 s=1 pgs=0 cs=0 l=0
>     c=0x8304780).fault with nothing to send, going to standby
> 10. 2014-05-20 10:19:10.349693 7f22fc7fd700  0 -- 192.168.1.112:6802/3807 >>
>     192.168.1.109:6800/13492 pipe(0x8698c80 sd=156 :0 s=1 pgs=0 cs=0 l=0
>     c=0x83070c0).fault with nothing to send, going to standby
> 
> 
> # ceph -v
> ceph version 0.80-469-g991f7f1 (991f7f15a6e107b33a24bbef1169f21eb7fcce2c)
> # ceph osd stat
> osdmap e357073: 165 osds: 91 up, 165 in
> flags noout
> 
> I have tried the following:
> 
> 1. Restarting the problematic OSDs, but no luck.
> 2. Restarting the entire host, but no luck; the OSDs are still down and
> logging the same message:
> 
>  1. 2014-05-20 10:19:10.243139 7f22fd005700  0 -- 192.168.1.112:6802/3807 >>
>     192.168.1.112:6800/14114 pipe(0x56a8f00 sd=146 :43726 s=1 pgs=0 cs=0 l=0
>     c=0x8304780).connect claims to be 192.168.1.112:6800/2852 not
>     192.168.1.112:6800/14114 - wrong node!
>  2. 2014-05-20 10:19:10.243190 7f22fd005700  0 -- 192.168.1.112:6802/3807 >>
>     192.168.1.112:6800/14114 pipe(0x56a8f00 sd=146 :43726 s=1 pgs=0 cs=0 l=0
>     c=0x8304780).fault with nothing to send, going to standby
>  3. 2014-05-20 10:19:10.349693 7f22fc7fd700  0 -- 192.168.1.112:6802/3807 >>
>     192.168.1.109:6800/13492 pipe(0x8698c80 sd=156 :0 s=1 pgs=0 cs=0 l=0
>     c=0x83070c0).fault with nothing to send, going to standby
>  4. 2014-05-20 10:22:23.312473 7f2307e61700  0 osd.158 357781 do_command r=0
>  5. 2014-05-20 10:22:23.326110 7f2307e61700  0 osd.158 357781 do_command r=0
>     debug_osd=0/5
>  6. 2014-05-20 10:22:23.326123 7f2307e61700  0 log [INF] : debug_osd=0/5
>  7. 2014-05-20 10:34:08.161864 7f230224d700  0 -- 192.168.1.112:6802/3807 >>
>     192.168.1.102:6808/13276 pipe(0x8698280 sd=22 :41078 s=2 pgs=603 cs=1
>     l=0 c=0x8301600).fault with nothing to send, going to standby
> 
> 3. Disks do not have errors; there are no messages in dmesg or
> /var/log/messages.
> 
> 4. There was a bug in the past, http://tracker.ceph.com/issues/4006; I don't
> know whether it has come back in Firefly.
> 
> 5. No recent activity has been performed on the cluster, except some pool and
> key creation for cinder/glance integration.
> 
> 6. Nodes have enough free resources for the OSDs.
> 
> 7. No issues with the network; OSDs are down on all cluster nodes, not just
> on a single node.
> 
> 
> ****************************************************************
> Karan Singh 
> Systems Specialist , Storage Platforms
> CSC - IT Center for Science,
> Keilaranta 14, P. O. Box 405, FIN-02101 Espoo, Finland
> mobile: +358 503 812758
> tel. +358 9 4572001
> fax +358 9 4572302
> http://www.csc.fi/
> ****************************************************************
> 
> 
> 
