Ceph BIG outage: 200+ OSDs are down, OSDs cannot create threads

Hello community, I need help fixing a long-running Ceph problem.

The cluster is unhealthy and multiple OSDs are down. When I try to restart the OSDs, I get this error:


2015-03-09 12:22:16.312774 7f760dac9700 -1 common/Thread.cc: In function 'void Thread::create(size_t)' thread 7f760dac9700 time 2015-03-09 12:22:16.311970
common/Thread.cc: 129: FAILED assert(ret == 0)


Environment: 4 nodes (OSD + monitor), latest Firefly, CentOS 6.5, kernel 3.17.2-1.el6.elrepo.x86_64

I tried upgrading from 0.80.7 to 0.80.8, but no luck.

I tried the stock CentOS kernel 2.6.32, but no luck.

Memory is not the problem: more than 150 GB is free.
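
With plenty of free memory, other limits on thread creation may be worth ruling out. The sketch below is only a diagnostic aid, not a confirmed cause: it assumes the failed assert reflects the thread-creation call returning an error (the log does not show the error code), and it prints the Linux limits that commonly cap the number of threads. The helper names read_int and thread_limits are hypothetical.

```python
import resource
from pathlib import Path


def read_int(path):
    """Return the integer stored in a /proc file, or None if it is unreadable."""
    try:
        return int(Path(path).read_text().split()[0])
    except (OSError, ValueError, IndexError):
        return None


def thread_limits():
    """Collect the limits that commonly cap thread creation on Linux."""
    soft_nproc, hard_nproc = resource.getrlimit(resource.RLIMIT_NPROC)
    return {
        "kernel.pid_max": read_int("/proc/sys/kernel/pid_max"),
        "kernel.threads-max": read_int("/proc/sys/kernel/threads-max"),
        # resource.RLIM_INFINITY here means "unlimited".
        "RLIMIT_NPROC (soft)": soft_nproc,
        "RLIMIT_NPROC (hard)": hard_nproc,
    }


if __name__ == "__main__":
    for name, value in thread_limits().items():
        print(f"{name}: {value}")
```

Run it as the same user that starts the OSD daemons, since RLIMIT_NPROC is per user.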


Has anyone ever faced this problem?

Cluster status:

   cluster 2bd3283d-67ef-4316-8b7e-d8f4747eae33
     health HEALTH_WARN 7334 pgs degraded; 1185 pgs down; 1 pgs incomplete; 1735 pgs peering; 8938 pgs stale; 1736 pgs stuck inactive; 8938 pgs stuck stale; 10320 pgs stuck unclean; recovery 6061/31080 objects degraded (19.501%); 111/196 in osds are down; clock skew detected on mon.pouta-s02, mon.pouta-s03
     monmap e3: 3 mons at {pouta-s01=10.XXX.50.1:6789/0,pouta-s02=10.XXX.50.2:6789/0,pouta-s03=10.XXX.50.3:6789/0}, election epoch 1312, quorum 0,1,2 pouta-s01,pouta-s02,pouta-s03
     osdmap e26633: 239 osds: 85 up, 196 in
      pgmap v60389: 17408 pgs, 13 pools, 42345 MB data, 10360 objects
            4699 GB used, 707 TB / 711 TB avail
            6061/31080 objects degraded (19.501%)
                  14 down+remapped+peering
                  39 active
                3289 active+clean
                 547 peering
                 663 stale+down+peering
                 705 stale+active+remapped
                   1 active+degraded+remapped
                   1 stale+down+incomplete
                 484 down+peering
                 455 active+remapped
                3696 stale+active+degraded
                   4 remapped+peering
                  23 stale+down+remapped+peering
                  51 stale+active
                3637 active+degraded
                3799 stale+active+clean
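
As a cross-check of the status output above, the per-state PG counts can be tallied by individual flag and compared with the health summary. This is a minimal sketch: the counts are copied by hand from the pgmap section, and the function name tally_flags is hypothetical.

```python
from collections import Counter

# PG state counts copied from the pgmap section of the status output.
PG_STATES = {
    "down+remapped+peering": 14,
    "active": 39,
    "active+clean": 3289,
    "peering": 547,
    "stale+down+peering": 663,
    "stale+active+remapped": 705,
    "active+degraded+remapped": 1,
    "stale+down+incomplete": 1,
    "down+peering": 484,
    "active+remapped": 455,
    "stale+active+degraded": 3696,
    "remapped+peering": 4,
    "stale+down+remapped+peering": 23,
    "stale+active": 51,
    "active+degraded": 3637,
    "stale+active+clean": 3799,
}


def tally_flags(pg_states):
    """Sum PG counts per flag; a PG in 'stale+down' counts toward both flags."""
    totals = Counter()
    for state, count in pg_states.items():
        for flag in state.split("+"):
            totals[flag] += count
    return totals


if __name__ == "__main__":
    flags = tally_flags(PG_STATES)
    print("total PGs:", sum(PG_STATES.values()))
    for flag in ("degraded", "down", "peering", "stale"):
        print(f"{flag}: {flags[flag]}")
```

The totals agree with the health line: 17408 PGs overall, of which 7334 are degraded, 1185 down, 1735 peering and 8938 stale. The OSD figures are consistent in the same way: 196 in minus 85 up gives the 111 in-but-down OSDs reported.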

OSD logs:

2015-03-09 12:22:16.312774 7f760dac9700 -1 common/Thread.cc: In function 'void Thread::create(size_t)' thread 7f760dac9700 time 2015-03-09 12:22:16.311970
common/Thread.cc: 129: FAILED assert(ret == 0)

 ceph version 0.80.8 (69eaad7f8308f21573c604f121956e64679a52a7)
 1: (Thread::create(unsigned long)+0x8a) [0xaf41da]
 2: (SimpleMessenger::add_accept_pipe(int)+0x6a) [0xae84fa]
 3: (Accepter::entry()+0x265) [0xb5c635]
 4: /lib64/libpthread.so.0() [0x3c8a6079d1]
 5: (clone()+0x6d) [0x3c8a2e89dd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
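
The backtrace shows a new thread being created while accepting an incoming connection (Accepter::entry calling SimpleMessenger::add_accept_pipe), so the thread count of each OSD process likely grows with the number of peers. A rough way to see how many threads each daemon holds is to read the Threads field from /proc. This is a sketch for Linux only; the function names are hypothetical and the process name ceph-osd is assumed.

```python
from pathlib import Path


def parse_threads(status_text):
    """Extract the 'Threads:' value from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("Threads:"):
            return int(line.split()[1])
    return None


def threads_by_process(name="ceph-osd", proc_root="/proc"):
    """Map pid -> thread count for every running process with this name."""
    counts = {}
    for status in Path(proc_root).glob("[0-9]*/status"):
        if not status.parent.name.isdigit():
            continue
        try:
            text = status.read_text()
        except OSError:  # the process exited while we were scanning
            continue
        first_line = text.splitlines()[0] if text else ""
        # The first line of the status file looks like "Name:\tceph-osd".
        if first_line.partition(":")[2].strip() == name:
            counts[int(status.parent.name)] = parse_threads(text)
    return counts


if __name__ == "__main__":
    counts = threads_by_process()
    for pid, threads in sorted(counts.items()):
        print(f"pid {pid}: {threads} threads")
    print("total:", sum(t or 0 for t in counts.values()))
```

Comparing the total across all OSDs on a node with the system-wide thread and PID limits gives a quick idea of how much headroom is left.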


More information is in the Ceph tracker issue: http://tracker.ceph.com/issues/10988#change-49018


****************************************************************
Karan Singh 
Systems Specialist , Storage Platforms
CSC - IT Center for Science,
Keilaranta 14, P. O. Box 405, FIN-02101 Espoo, Finland
mobile: +358 503 812758
tel. +358 9 4572001
fax +358 9 4572302
http://www.csc.fi/
****************************************************************
