Hello Sage,

nodown and noout are set on the cluster.

# ceph status
    cluster 009d3518-e60d-4f74-a26d-c08c1976263c
     health HEALTH_WARN 1133 pgs degraded; 44 pgs incomplete; 42 pgs stale; 45 pgs stuck inactive; 42 pgs stuck stale; 2602 pgs stuck unclean; recovery 206/2199 objects degraded (9.368%); 40/165 in osds are down; nodown,noout flag(s) set
     monmap e4: 4 mons at {storage0101-ib=192.168.100.101:6789/0,storage0110-ib=192.168.100.110:6789/0,storage0114-ib=192.168.100.114:6789/0,storage0115-ib=192.168.100.115:6789/0}, election epoch 18, quorum 0,1,2,3 storage0101-ib,storage0110-ib,storage0114-ib,storage0115-ib
     osdmap e358031: 165 osds: 125 up, 165 in
            flags nodown,noout
      pgmap v604305: 4544 pgs, 6 pools, 4309 MB data, 733 objects
            3582 GB used, 357 TB / 361 TB avail
            206/2199 objects degraded (9.368%)
                   1 inactive
                   5 stale+active+degraded+remapped
                1931 active+clean
                   2 stale+incomplete
                  21 stale+active+remapped
                 380 active+degraded+remapped
                  38 incomplete
                1403 active+remapped
                   2 stale+active+degraded
                   1 stale+remapped+incomplete
                 746 active+degraded
                  11 stale+active+clean
                   3 remapped+incomplete

Here is my ceph.conf (debug osd and debug ms are set): http://pastebin.com/KZdgPJm7

I tried restarting all OSD services of node 13; the services came up after several attempts of "service ceph restart": http://pastebin.com/yMk86YHh

For node 14, all services are up:

[root@storage0114-ib ~]# service ceph status
=== osd.142 ===
osd.142: running {"version":"0.80-475-g9e80c29"}
=== osd.36 ===
osd.36: running {"version":"0.80-475-g9e80c29"}
=== osd.83 ===
osd.83: running {"version":"0.80-475-g9e80c29"}
=== osd.107 ===
osd.107: running {"version":"0.80-475-g9e80c29"}
=== osd.47 ===
osd.47: running {"version":"0.80-475-g9e80c29"}
=== osd.130 ===
osd.130: running {"version":"0.80-475-g9e80c29"}
=== osd.155 ===
osd.155: running {"version":"0.80-475-g9e80c29"}
=== osd.60 ===
osd.60: running {"version":"0.80-475-g9e80c29"}
=== osd.118 ===
osd.118: running {"version":"0.80-475-g9e80c29"}
=== osd.98 ===
osd.98: running {"version":"0.80-475-g9e80c29"}
=== osd.70 ===
osd.70: running {"version":"0.80-475-g9e80c29"}
=== mon.storage0114-ib ===
mon.storage0114-ib: running {"version":"0.80-475-g9e80c29"}
[root@storage0114-ib ~]#

But ceph osd tree says osd.118 is down:

-10  29.93    host storage0114-ib
 36   2.63        osd.36    up   1
 47   2.73        osd.47    up   1
 60   2.73        osd.60    up   1
 70   2.73        osd.70    up   1
 83   2.73        osd.83    up   1
 98   2.73        osd.98    up   1
107   2.73        osd.107   up   1
118   2.73        osd.118   down 1
130   2.73        osd.130   up   1
142   2.73        osd.142   up   1
155   2.73        osd.155   up   1

I restarted the osd.118 service and the restart was successful, but it is still shown as down in ceph osd tree. I waited 30 minutes for it to stabilize, and it is still not showing up in ceph osd tree. Moreover, it is generating huge logs: http://pastebin.com/mDYnjAni

The problem now is that if I manually visit every host and check "service ceph status", all services are running on all 15 hosts, but this is not reflected in ceph osd tree or ceph -s, which continue to show the OSDs as down.

My IRC id is ksingh; let me know by email once you are available on IRC (my time zone is Finland, +2).

- Karan Singh
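One way to cross-check what the osd.118 daemon itself thinks against what the monitors' map says would be something along these lines (a minimal sketch; it assumes the admin socket is at its default path and that this build supports the "status" admin socket command):

    # What the monitors' current map records for osd.118
    ceph osd dump | grep '^osd.118 '

    # What the osd.118 daemon itself reports (run on storage0114-ib); "state" and
    # "newest_map" show whether it is still catching up to the cluster's osdmap
    ceph --admin-daemon /var/run/ceph/ceph-osd.118.asok status

    # Current cluster osdmap epoch, for comparison with "newest_map" above
    ceph osd stat

If newest_map is far behind e358031, the daemon is running but has not yet caught up on maps and reported itself up, which would explain it staying "down" in ceph osd tree.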
On 20 May 2014, at 18:18, Sage Weil <sage@inktank.com> wrote:

> On Tue, 20 May 2014, Karan Singh wrote:
>> Hello Cephers, I need your suggestions for troubleshooting.
>>
>> My cluster is struggling terribly; 70+ OSDs are down out of 165.
>>
>> Problem: OSDs are getting marked out of the cluster and are down. The cluster is
>> degraded. On checking the logs of the failed OSDs, we see weird entries that
>> are continuously being generated.
>
> Tracking this at http://tracker.ceph.com/issues/8387
>
> The most recent bits you posted in the ticket don't quite make sense: the
> OSD is trying to connect to an address for an OSD that is currently marked
> down. I suspect this is just timing between when the logs were captured
> and when the ceph osd dump was captured. To get a complete picture,
> please:
>
> 1) add
>
>    debug osd = 20
>    debug ms = 1
>
> in [osd] and restart all osds
>
> 2) ceph osd set nodown
>
> (to prevent flapping)
>
> 3) find some OSD that is showing these messages
>
> 4) capture a 'ceph osd dump' output.
>
> Also happy to debug this interactively over IRC; that will likely be
> faster!
>
> Thanks-
> sage
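Spelled out as commands, steps 1-4 above would look roughly like this (a sketch; the restart line assumes the same sysvinit "service ceph" script shown elsewhere in this thread, and the dump filename is only illustrative):

    # 1) add the debug settings to the [osd] section of ceph.conf on every OSD host:
    #        [osd]
    #            debug osd = 20
    #            debug ms = 1
    #    then restart the local OSDs on each host
    service ceph restart osd

    # (the same values can usually be injected at runtime as well)
    ceph tell osd.* injectargs '--debug-osd 20 --debug-ms 1'

    # 2) keep the monitors from marking OSDs down, to stop the flapping
    ceph osd set nodown

    # 3) pick one OSD that is logging the "wrong node!" messages, and
    # 4) capture the osdmap at the same moment, for comparison with its log
    ceph osd dump > osd-dump.txt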
>
>> Osd Debug logs :: http://pastebin.com/agTKh6zB
>>
>> 1. 2014-05-20 10:19:03.699886 7f2328e237a0 0 osd.158 357532 done with init, starting boot process
>> 2. 2014-05-20 10:19:03.700093 7f22ff621700 0 -- 192.168.1.112:6802/3807 >> 192.168.1.109:6802/910005982 pipe(0x8698500 sd=35 :33500 s=1 pgs=0 cs=0 l=0 c=0x83018c0).connect claims to be 192.168.1.109:6802/63896 not 192.168.1.109:6802/910005982 - wrong node!
>> 3. 2014-05-20 10:19:03.700152 7f22ff621700 0 -- 192.168.1.112:6802/3807 >> 192.168.1.109:6802/910005982 pipe(0x8698500 sd=35 :33500 s=1 pgs=0 cs=0 l=0 c=0x83018c0).fault with nothing to send, going to standby
>> 4. 2014-05-20 10:19:09.551269 7f22fdd12700 0 -- 192.168.1.112:6802/3807 >> 192.168.1.109:6803/1176009454 pipe(0x56aee00 sd=53 :40060 s=1 pgs=0 cs=0 l=0 c=0x533fd20).connect claims to be 192.168.1.109:6803/63896 not 192.168.1.109:6803/1176009454 - wrong node!
>> 5. 2014-05-20 10:19:09.551347 7f22fdd12700 0 -- 192.168.1.112:6802/3807 >> 192.168.1.109:6803/1176009454 pipe(0x56aee00 sd=53 :40060 s=1 pgs=0 cs=0 l=0 c=0x533fd20).fault with nothing to send, going to standby
>> 6. 2014-05-20 10:19:09.703901 7f22fd80d700 0 -- 192.168.1.112:6802/3807 >> 192.168.1.113:6802/13870 pipe(0x56adf00 sd=137 :42889 s=1 pgs=0 cs=0 l=0 c=0x8302aa0).connect claims to be 192.168.1.113:6802/24612 not 192.168.1.113:6802/13870 - wrong node!
>> 7. 2014-05-20 10:19:09.704039 7f22fd80d700 0 -- 192.168.1.112:6802/3807 >> 192.168.1.113:6802/13870 pipe(0x56adf00 sd=137 :42889 s=1 pgs=0 cs=0 l=0 c=0x8302aa0).fault with nothing to send, going to standby
>> 8. 2014-05-20 10:19:10.243139 7f22fd005700 0 -- 192.168.1.112:6802/3807 >> 192.168.1.112:6800/14114 pipe(0x56a8f00 sd=146 :43726 s=1 pgs=0 cs=0 l=0 c=0x8304780).connect claims to be 192.168.1.112:6800/2852 not 192.168.1.112:6800/14114 - wrong node!
>> 9. 2014-05-20 10:19:10.243190 7f22fd005700 0 -- 192.168.1.112:6802/3807 >> 192.168.1.112:6800/14114 pipe(0x56a8f00 sd=146 :43726 s=1 pgs=0 cs=0 l=0 c=0x8304780).fault with nothing to send, going to standby
>> 10. 2014-05-20 10:19:10.349693 7f22fc7fd700 0 -- 192.168.1.112:6802/3807 >> 192.168.1.109:6800/13492 pipe(0x8698c80 sd=156 :0 s=1 pgs=0 cs=0 l=0 c=0x83070c0).fault with nothing to send, going to standby
>>
>> # ceph -v
>> ceph version 0.80-469-g991f7f1 (991f7f15a6e107b33a24bbef1169f21eb7fcce2c)
>>
>> # ceph osd stat
>> osdmap e357073: 165 osds: 91 up, 165 in
>> flags noout
>>
>> I have tried the following:
>>
>> 1. Restarting the problematic OSDs, but no luck.
>>
>> 2. Restarting the entire host, but no luck; the OSDs are still down and keep logging the same messages:
>>
>> 1. 2014-05-20 10:19:10.243139 7f22fd005700 0 -- 192.168.1.112:6802/3807 >> 192.168.1.112:6800/14114 pipe(0x56a8f00 sd=146 :43726 s=1 pgs=0 cs=0 l=0 c=0x8304780).connect claims to be 192.168.1.112:6800/2852 not 192.168.1.112:6800/14114 - wrong node!
>> 2. 2014-05-20 10:19:10.243190 7f22fd005700 0 -- 192.168.1.112:6802/3807 >> 192.168.1.112:6800/14114 pipe(0x56a8f00 sd=146 :43726 s=1 pgs=0 cs=0 l=0 c=0x8304780).fault with nothing to send, going to standby
>> 3. 2014-05-20 10:19:10.349693 7f22fc7fd700 0 -- 192.168.1.112:6802/3807 >> 192.168.1.109:6800/13492 pipe(0x8698c80 sd=156 :0 s=1 pgs=0 cs=0 l=0 c=0x83070c0).fault with nothing to send, going to standby
>> 4. 2014-05-20 10:22:23.312473 7f2307e61700 0 osd.158 357781 do_command r=0
>> 5. 2014-05-20 10:22:23.326110 7f2307e61700 0 osd.158 357781 do_command r=0 debug_osd=0/5
>> 6. 2014-05-20 10:22:23.326123 7f2307e61700 0 log [INF] : debug_osd=0/5
>> 7. 2014-05-20 10:34:08.161864 7f230224d700 0 -- 192.168.1.112:6802/3807 >> 192.168.1.102:6808/13276 pipe(0x8698280 sd=22 :41078 s=2 pgs=603 cs=1 l=0 c=0x8301600).fault with nothing to send, going to standby
>>
>> 3. The disks do not have errors; there are no messages in dmesg or /var/log/messages.
>>
>> 4. There was a similar bug in the past (http://tracker.ceph.com/issues/4006); I don't know whether it has come back in Firefly.
>>
>> 5. No activity was recently performed on the cluster, except creating some pools and keys for the Cinder/Glance integration.
>>
>> 6. The nodes have enough free resources for the OSDs.
>>
>> 7. There are no issues with the network; OSDs are down on all cluster nodes, not just on a single node.
>>
>> ****************************************************************
>> Karan Singh
>> Systems Specialist, Storage Platforms
>> CSC - IT Center for Science,
>> Keilaranta 14, P. O. Box 405, FIN-02101 Espoo, Finland
>> mobile: +358 503 812758
>> tel. +358 9 4572001
>> fax +358 9 4572302
>> http://www.csc.fi/
>> ****************************************************************
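A note on the repeated "wrong node!" lines quoted above: the two addresses differ only in the number after the slash, which is a per-instance nonce, so the sending OSD (osd.158 in these logs) is dialing an instance of the peer that no longer exists, in other words it is acting on a stale view of the cluster, consistent with Sage's comment about timing. A quick cross-check, sketched with one address taken from the logs (the grep pattern is only an example):

    # Which OSD, and which instance nonce, currently owns the address being dialed,
    # according to the latest osdmap
    ceph osd dump | grep '192.168.1.109:6802'

    # Current osdmap epoch, to compare with the "newest_map" the sending OSD reports
    # over its admin socket (the same check sketched earlier for osd.118)
    ceph osd stat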