16 osds: 11 up, 16 in

The 5 OSDs that are down were all marked down for being unresponsive. 
They keep getting marked down again faster than they can complete 
recovery and backfill, so the number of degraded PGs is growing 
over time.
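
My guess is that recovery is hammering the OSDs hard enough that they 
stop reporting pg stats to the mons. One thing I'm tempted to try is 
throttling recovery and backfill so the OSDs stay responsive; something 
like this (the values are just a guess on my part, not tested on this 
cluster):

root@ceph0c:~# ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'

That would slow recovery down even further, but hopefully it keeps the 
OSDs alive long enough to actually finish it.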

root@ceph0c:~# ceph -w
     cluster 1604ec7a-6ceb-42fc-8c68-0a7896c4e120
      health HEALTH_WARN 49 pgs backfill; 926 pgs degraded; 252 pgs down; 30 pgs incomplete; 291 pgs peering; 1 pgs recovery_wait; 175 pgs stale; 255 pgs stuck inactive; 175 pgs stuck stale; 1234 pgs stuck unclean; 66 requests are blocked > 32 sec; recovery 6820014/38055556 objects degraded (17.921%); 4/16 in osds are down; noout flag(s) set
      monmap e2: 2 mons at {ceph0c=10.193.0.6:6789/0,ceph1c=10.193.0.7:6789/0}, election epoch 238, quorum 0,1 ceph0c,ceph1c
      osdmap e38673: 16 osds: 12 up, 16 in
             flags noout
       pgmap v7325233: 2560 pgs, 17 pools, 14090 GB data, 18581 kobjects
             28456 GB used, 31132 GB / 59588 GB avail
             6820014/38055556 objects degraded (17.921%)
                    1 stale+active+clean+scrubbing+deep
                   15 active
                 1247 active+clean
                    1 active+recovery_wait
                   45 stale+active+clean
                   39 peering
                   29 stale+active+degraded+wait_backfill
                  252 down+peering
                  827 active+degraded
                   50 stale+active+degraded
                   20 stale+active+degraded+remapped+wait_backfill
                   30 stale+incomplete
                    4 active+clean+scrubbing+deep
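
In case it helps narrow things down, I can pull more detail on the 
stuck PGs with something like:

root@ceph0c:~# ceph health detail
root@ceph0c:~# ceph pg dump_stuck inactive

Happy to post that output if it's useful.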

Here's a snippet of ceph.log for one of these OSDs:
2014-05-07 09:22:46.747036 mon.0 10.193.0.6:6789/0 39981 : [INF] osd.3 marked down after no pg stats for 901.212859seconds
2014-05-07 09:47:17.930251 mon.0 10.193.0.6:6789/0 40561 : [INF] osd.3 10.193.0.6:6812/2830 boot
2014-05-07 09:47:16.914519 osd.3 10.193.0.6:6812/2830 823 : [WRN] map e38649 wrongly marked me down
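
That 901 seconds looks like the default mon osd report timeout (900s) 
kicking in. I could presumably raise it on the mons with something like 
this in ceph.conf (1800 is an arbitrary number I picked):

[mon]
    mon osd report timeout = 1800

and restart the mons, but that feels like papering over whatever is 
making the OSDs stop reporting in the first place.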

root@ceph0c:~# uname -a
Linux ceph0c 3.5.0-46-generic #70~precise1-Ubuntu SMP Thu Jan 9 23:55:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
root@ceph0c:~# lsb_release -a
No LSB modules are available.
Distributor ID:    Ubuntu
Description:    Ubuntu 12.04.4 LTS
Release:    12.04
Codename:    precise
root@ceph0c:~# ceph -v
ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60)


Any ideas what I can do to make these OSDs stop dying after 15 minutes?




-- 

Craig Lewis
Senior Systems Engineer
Office +1.714.602.1309
Email clewis@centraldesktop.com

