Hello Sage,

nodown and noout are set on the cluster.

# ceph status
    cluster 009d3518-e60d-4f74-a26d-c08c1976263c
     health HEALTH_WARN 1133 pgs degraded; 44 pgs incomplete; 42 pgs stale; 45 pgs stuck inactive; 42 pgs stuck stale; 2602 pgs stuck unclean; recovery 206/2199 objects degraded (9.368%); 40/165 in osds are down; nodown,noout flag(s) set
     monmap e4: 4 mons at {storage0101-ib=192.168.100.101:6789/0,storage0110-ib=192.168.100.110:6789/0,storage0114-ib=192.168.100.114:6789/0,storage0115-ib=192.168.100.115:6789/0}, election epoch 18, quorum 0,1,2,3 storage0101-ib,storage0110-ib,storage0114-ib,storage0115-ib
     osdmap e358031: 165 osds: 125 up, 165 in
            flags nodown,noout
      pgmap v604305: 4544 pgs, 6 pools, 4309 MB data, 733 objects
            3582 GB used, 357 TB / 361 TB avail
            206/2199 objects degraded (9.368%)
                   1 inactive
                   5 stale+active+degraded+remapped
                1931 active+clean
                   2 stale+incomplete
                  21 stale+active+remapped
                 380 active+degraded+remapped
                  38 incomplete
                1403 active+remapped
                   2 stale+active+degraded
                   1 stale+remapped+incomplete
                 746 active+degraded
                  11 stale+active+clean
                   3 remapped+incomplete

Here is my ceph.conf (debug osd and debug ms are set): http://pastebin.com/KZdgPJm7

I tried restarting all OSD services of node 13; the services came up after several attempts of "service ceph restart": http://pastebin.com/yMk86YHh

For node 14, all services are up:

[root@storage0114-ib ~]# service ceph status
=== osd.142 ===
osd.142: running {"version":"0.80-475-g9e80c29"}
=== osd.36 ===
osd.36: running {"version":"0.80-475-g9e80c29"}
=== osd.83 ===
osd.83: running {"version":"0.80-475-g9e80c29"}
=== osd.107 ===
osd.107: running {"version":"0.80-475-g9e80c29"}
=== osd.47 ===
osd.47: running {"version":"0.80-475-g9e80c29"}
=== osd.130 ===
osd.130: running {"version":"0.80-475-g9e80c29"}
=== osd.155 ===
osd.155: running {"version":"0.80-475-g9e80c29"}
=== osd.60 ===
osd.60: running {"version":"0.80-475-g9e80c29"}
=== osd.118 ===
osd.118: running {"version":"0.80-475-g9e80c29"}
=== osd.98 ===
osd.98: running {"version":"0.80-475-g9e80c29"}
=== osd.70 ===
osd.70: running {"version":"0.80-475-g9e80c29"}
=== mon.storage0114-ib ===
mon.storage0114-ib: running {"version":"0.80-475-g9e80c29"}
[root@storage0114-ib ~]#

But ceph osd tree says osd.118 is down:

-10  29.93    host storage0114-ib
 36   2.63        osd.36    up   1
 47   2.73        osd.47    up   1
 60   2.73        osd.60    up   1
 70   2.73        osd.70    up   1
 83   2.73        osd.83    up   1
 98   2.73        osd.98    up   1
107   2.73        osd.107   up   1
118   2.73        osd.118   down 1
130   2.73        osd.130   up   1
142   2.73        osd.142   up   1
155   2.73        osd.155   up   1

I restarted the osd.118 service and the restart was successful, but it is still shown as down in ceph osd tree. I waited 30 minutes for it to stabilize, and it is still not showing up in ceph osd tree. Moreover, it is generating huge logs: http://pastebin.com/mDYnjAni

The problem now is that if I manually visit every host and check "service ceph status", all services are running on all 15 hosts, but this is not reflected in ceph osd tree or ceph -s, which continue to show the OSDs as down.

My IRC id is ksingh; let me know by email once you are available on IRC (my time zone is Finland, +2).

- Karan Singh
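One way to cross-check what the osd.118 daemon itself thinks against what the monitors' map says would be something along these lines (a minimal sketch; it assumes the admin socket is at its default path and that this build supports the "status" admin socket command):

    # What the monitors' current map records for osd.118
    ceph osd dump | grep '^osd.118 '

    # What the osd.118 daemon itself reports (run on storage0114-ib); "state" and
    # "newest_map" show whether it is still catching up to the cluster's osdmap
    ceph --admin-daemon /var/run/ceph/ceph-osd.118.asok status

    # Current cluster osdmap epoch, for comparison with "newest_map" above
    ceph osd stat

If newest_map is far behind e358031, the daemon is running but has not yet caught up on maps and reported itself up, which would explain it staying "down" in ceph osd tree.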
On 20 May 2014, at 18:18, Sage Weil <sage@inktank.com> wrote:

> On Tue, 20 May 2014, Karan Singh wrote:
>> Hello Cephers, I need your suggestions for troubleshooting.
>>
>> My cluster is struggling terribly; 70+ OSDs are down out of 165.
>>
>> Problem: OSDs are getting marked out of the cluster and are down. The cluster is
>> degraded. On checking the logs of the failed OSDs, we see weird entries that
>> are continuously being generated.
>
> Tracking this at http://tracker.ceph.com/issues/8387
>
> The most recent bits you posted in the ticket don't quite make sense: the
> OSD is trying to connect to an address for an OSD that is currently marked
> down. I suspect this is just timing between when the logs were captured
> and when the ceph osd dump was captured. To get a complete picture,
> please:
>
> 1) add
>
>    debug osd = 20
>    debug ms = 1
>
> in [osd] and restart all osds
>
> 2) ceph osd set nodown
>
> (to prevent flapping)
>
> 3) find some OSD that is showing these messages
>
> 4) capture a 'ceph osd dump' output.
>
> Also happy to debug this interactively over IRC; that will likely be
> faster!
>
> Thanks-
> sage
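Spelled out as commands, steps 1-4 above would look roughly like this (a sketch; the restart line assumes the same sysvinit "service ceph" script shown elsewhere in this thread, and the dump filename is only illustrative):

    # 1) add the debug settings to the [osd] section of ceph.conf on every OSD host:
    #        [osd]
    #            debug osd = 20
    #            debug ms = 1
    #    then restart the local OSDs on each host
    service ceph restart osd

    # (the same values can usually be injected at runtime as well)
    ceph tell osd.* injectargs '--debug-osd 20 --debug-ms 1'

    # 2) keep the monitors from marking OSDs down, to stop the flapping
    ceph osd set nodown

    # 3) pick one OSD that is logging the "wrong node!" messages, and
    # 4) capture the osdmap at the same moment, for comparison with its log
    ceph osd dump > osd-dump.txt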
>
>> Osd Debug logs :: http://pastebin.com/agTKh6zB
>>
>> 1. 2014-05-20 10:19:03.699886 7f2328e237a0 0 osd.158 357532 done with init, starting boot process
>> 2. 2014-05-20 10:19:03.700093 7f22ff621700 0 -- 192.168.1.112:6802/3807 >> 192.168.1.109:6802/910005982 pipe(0x8698500 sd=35 :33500 s=1 pgs=0 cs=0 l=0 c=0x83018c0).connect claims to be 192.168.1.109:6802/63896 not 192.168.1.109:6802/910005982 - wrong node!
>> 3. 2014-05-20 10:19:03.700152 7f22ff621700 0 -- 192.168.1.112:6802/3807 >> 192.168.1.109:6802/910005982 pipe(0x8698500 sd=35 :33500 s=1 pgs=0 cs=0 l=0 c=0x83018c0).fault with nothing to send, going to standby
>> 4. 2014-05-20 10:19:09.551269 7f22fdd12700 0 -- 192.168.1.112:6802/3807 >> 192.168.1.109:6803/1176009454 pipe(0x56aee00 sd=53 :40060 s=1 pgs=0 cs=0 l=0 c=0x533fd20).connect claims to be 192.168.1.109:6803/63896 not 192.168.1.109:6803/1176009454 - wrong node!
>> 5. 2014-05-20 10:19:09.551347 7f22fdd12700 0 -- 192.168.1.112:6802/3807 >> 192.168.1.109:6803/1176009454 pipe(0x56aee00 sd=53 :40060 s=1 pgs=0 cs=0 l=0 c=0x533fd20).fault with nothing to send, going to standby
>> 6. 2014-05-20 10:19:09.703901 7f22fd80d700 0 -- 192.168.1.112:6802/3807 >> 192.168.1.113:6802/13870 pipe(0x56adf00 sd=137 :42889 s=1 pgs=0 cs=0 l=0 c=0x8302aa0).connect claims to be 192.168.1.113:6802/24612 not 192.168.1.113:6802/13870 - wrong node!
>> 7. 2014-05-20 10:19:09.704039 7f22fd80d700 0 -- 192.168.1.112:6802/3807 >> 192.168.1.113:6802/13870 pipe(0x56adf00 sd=137 :42889 s=1 pgs=0 cs=0 l=0 c=0x8302aa0).fault with nothing to send, going to standby
>> 8. 2014-05-20 10:19:10.243139 7f22fd005700 0 -- 192.168.1.112:6802/3807 >> 192.168.1.112:6800/14114 pipe(0x56a8f00 sd=146 :43726 s=1 pgs=0 cs=0 l=0 c=0x8304780).connect claims to be 192.168.1.112:6800/2852 not 192.168.1.112:6800/14114 - wrong node!
>> 9. 2014-05-20 10:19:10.243190 7f22fd005700 0 -- 192.168.1.112:6802/3807 >> 192.168.1.112:6800/14114 pipe(0x56a8f00 sd=146 :43726 s=1 pgs=0 cs=0 l=0 c=0x8304780).fault with nothing to send, going to standby
>> 10. 2014-05-20 10:19:10.349693 7f22fc7fd700 0 -- 192.168.1.112:6802/3807 >> 192.168.1.109:6800/13492 pipe(0x8698c80 sd=156 :0 s=1 pgs=0 cs=0 l=0 c=0x83070c0).fault with nothing to send, going to standby
>>
>> # ceph -v
>> ceph version 0.80-469-g991f7f1 (991f7f15a6e107b33a24bbef1169f21eb7fcce2c)
>>
>> # ceph osd stat
>> osdmap e357073: 165 osds: 91 up, 165 in
>> flags noout
>>
>> I have tried the following:
>>
>> 1. Restarting the problematic OSDs, but no luck.
>>
>> 2. Restarting the entire host, but no luck; the OSDs are still down and keep logging the same messages:
>>
>> 1. 2014-05-20 10:19:10.243139 7f22fd005700 0 -- 192.168.1.112:6802/3807 >> 192.168.1.112:6800/14114 pipe(0x56a8f00 sd=146 :43726 s=1 pgs=0 cs=0 l=0 c=0x8304780).connect claims to be 192.168.1.112:6800/2852 not 192.168.1.112:6800/14114 - wrong node!
>> 2. 2014-05-20 10:19:10.243190 7f22fd005700 0 -- 192.168.1.112:6802/3807 >> 192.168.1.112:6800/14114 pipe(0x56a8f00 sd=146 :43726 s=1 pgs=0 cs=0 l=0 c=0x8304780).fault with nothing to send, going to standby
>> 3. 2014-05-20 10:19:10.349693 7f22fc7fd700 0 -- 192.168.1.112:6802/3807 >> 192.168.1.109:6800/13492 pipe(0x8698c80 sd=156 :0 s=1 pgs=0 cs=0 l=0 c=0x83070c0).fault with nothing to send, going to standby
>> 4. 2014-05-20 10:22:23.312473 7f2307e61700 0 osd.158 357781 do_command r=0
>> 5. 2014-05-20 10:22:23.326110 7f2307e61700 0 osd.158 357781 do_command r=0 debug_osd=0/5
>> 6. 2014-05-20 10:22:23.326123 7f2307e61700 0 log [INF] : debug_osd=0/5
>> 7. 2014-05-20 10:34:08.161864 7f230224d700 0 -- 192.168.1.112:6802/3807 >> 192.168.1.102:6808/13276 pipe(0x8698280 sd=22 :41078 s=2 pgs=603 cs=1 l=0 c=0x8301600).fault with nothing to send, going to standby
>>
>> 3. The disks do not have errors; there are no messages in dmesg or /var/log/messages.
>>
>> 4. There was a similar bug in the past (http://tracker.ceph.com/issues/4006); I don't know whether it has come back in Firefly.
>>
>> 5. No activity was recently performed on the cluster, except creating some pools and keys for the Cinder/Glance integration.
>>
>> 6. The nodes have enough free resources for the OSDs.
>>
>> 7. There are no issues with the network; OSDs are down on all cluster nodes, not just on a single node.
>>
>> ****************************************************************
>> Karan Singh
>> Systems Specialist, Storage Platforms
>> CSC - IT Center for Science,
>> Keilaranta 14, P. O. Box 405, FIN-02101 Espoo, Finland
>> mobile: +358 503 812758
>> tel. +358 9 4572001
>> fax +358 9 4572302
>> http://www.csc.fi/
>> ****************************************************************
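A note on the repeated "wrong node!" lines quoted above: the two addresses differ only in the number after the slash, which is a per-instance nonce, so the sending OSD (osd.158 in these logs) is dialing an instance of the peer that no longer exists, in other words it is acting on a stale view of the cluster, consistent with Sage's comment about timing. A quick cross-check, sketched with one address taken from the logs (the grep pattern is only an example):

    # Which OSD, and which instance nonce, currently owns the address being dialed,
    # according to the latest osdmap
    ceph osd dump | grep '192.168.1.109:6802'

    # Current osdmap epoch, to compare with the "newest_map" the sending OSD reports
    # over its admin socket (the same check sketched earlier for osd.118)
    ceph osd stat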