Re: Ceph BIG outage : 200+ OSD are down , OSD cannot create thread


 



Hi Karan,

We faced the same issue and resolved it by increasing the open file limit and the maximum number of threads.

Config reference

/etc/security/limits.conf

root hard nofile 65535

sysctl -w kernel.pid_max=4194303

http://tracker.ceph.com/issues/10554#change-47024
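
One way to confirm that a new nofile limit has actually reached a running daemon is to read that process's entry under /proc. The following is only a minimal sketch: show_limits is a hypothetical helper name, and the demo inspects the current shell rather than a real ceph-osd process. Entries in /etc/security/limits.conf are applied at session start, so a daemon that was already running may still carry the old values until it is restarted.

```shell
#!/bin/sh
# Print the limits that actually apply to a running process.
# /proc/<pid>/limits reflects the live values, not what is written in
# /etc/security/limits.conf.

# show_limits PID -> prints the "Max open files" and "Max processes" rows
# (hypothetical helper, not part of Ceph or any distribution tooling)
show_limits() {
    grep -E '^Max (open files|processes)' "/proc/$1/limits" 2>/dev/null
}

# Demo on the current shell. For an OSD, pass the PID of a ceph-osd
# process instead.
show_limits "$$"
```

If the "Max open files" row still shows the old value for an OSD process, the daemon probably needs to be restarted from a session or init script that picks up the new limit.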

Cheers

Mohamed Pakkeer

On Mon, Mar 9, 2015 at 4:20 PM, Azad Aliyar <azad.aliyar@xxxxxxxxxxxxxxxx> wrote:
  • Check max thread count: If you have a node with a lot of OSDs, you may be hitting the default maximum number of threads (usually 32k), especially during recovery. You can use sysctl to raise kernel.pid_max to the maximum allowed value (4194303) and see whether that helps. For example:

    sysctl -w kernel.pid_max=4194303

    If increasing the maximum thread count resolves the issue, you can make it permanent by including a kernel.pid_max setting in the /etc/sysctl.conf file. For example:

    kernel.pid_max = 4194303
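
Before raising the limit, it can help to see how close the node already is to it. The sketch below assumes a Linux /proc filesystem and sums the "Threads:" field of every process; pct_used and count_threads are hypothetical helper names chosen for illustration. On Linux, threads draw their IDs from the same pool as processes, which is why kernel.pid_max bounds the total thread count.

```shell
#!/bin/sh
# Rough check of how many thread IDs a node is using relative to
# kernel.pid_max.

# pct_used USED MAX -> integer percentage, 0 if MAX is not positive
pct_used() {
    if [ "$2" -gt 0 ]; then
        echo $(( $1 * 100 / $2 ))
    else
        echo 0
    fi
}

# Sum the per-process thread counts reported in /proc/<pid>/status.
# Processes that exit while we read are silently skipped.
count_threads() {
    cat /proc/[0-9]*/status 2>/dev/null |
        awk '/^Threads:/ { n += $2 } END { print n + 0 }'
}

pid_max=$(cat /proc/sys/kernel/pid_max 2>/dev/null || echo 0)
threads=$(count_threads)

echo "threads in use: $threads"
echo "kernel.pid_max: $pid_max"
echo "used: $(pct_used "$threads" "$pid_max")%"
```

A percentage close to 100 during recovery would be consistent with the thread-creation failures described in this thread. kernel.threads-max is a separate system-wide limit that may also be worth checking.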
    

    On Mon, Mar 9, 2015 at 4:11 PM, Karan Singh <karan.singh@xxxxxx> wrote:
    Hello Community, I need help fixing a long-running Ceph problem.

    The cluster is unhealthy and multiple OSDs are DOWN. When I try to restart the OSDs, I get this error:


    2015-03-09 12:22:16.312774 7f760dac9700 -1 common/Thread.cc: In function 'void Thread::create(size_t)' thread 7f760dac9700 time 2015-03-09 12:22:16.311970
    common/Thread.cc: 129: FAILED assert(ret == 0)


    Environment: 4 nodes, OSD + monitor, Firefly latest, CentOS 6.5, kernel 3.17.2-1.el6.elrepo.x86_64

    Tried upgrading from 0.80.7 to 0.80.8, but no luck.

    Tried the CentOS stock kernel 2.6.32, but no luck.

    Memory is not a problem; more than 150 GB is free.


    Has anyone ever faced this problem?

    Cluster status 

       cluster 2bd3283d-67ef-4316-8b7e-d8f4747eae33
         health HEALTH_WARN 7334 pgs degraded; 1185 pgs down; 1 pgs incomplete; 1735 pgs peering; 8938 pgs stale; 1736 pgs stuck inactive; 8938 pgs stuck stale; 10320 pgs stuck unclean; recovery 6061/31080 objects degraded (19.501%); 111/196 in osds are down; clock skew detected on mon.pouta-s02, mon.pouta-s03
         monmap e3: 3 mons at {pouta-s01=10.XXX.50.1:6789/0,pouta-s02=10.XXX.50.2:6789/0,pouta-s03=10.XXX.50.3:6789/0}, election epoch 1312, quorum 0,1,2 pouta-s01,pouta-s02,pouta-s03
         osdmap e26633: 239 osds: 85 up, 196 in
          pgmap v60389: 17408 pgs, 13 pools, 42345 MB data, 10360 objects
                4699 GB used, 707 TB / 711 TB avail
                6061/31080 objects degraded (19.501%)
                      14 down+remapped+peering
                      39 active
                    3289 active+clean
                     547 peering
                     663 stale+down+peering
                     705 stale+active+remapped
                       1 active+degraded+remapped
                       1 stale+down+incomplete
                     484 down+peering
                     455 active+remapped
                    3696 stale+active+degraded
                       4 remapped+peering
                      23 stale+down+remapped+peering
                      51 stale+active
                    3637 active+degraded
                    3799 stale+active+clean
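
The OSD counts in the status above can be cross-checked with a little arithmetic. The sketch below parses an osdmap line in the format shown and derives how many OSDs are not up; osd_summary is a hypothetical helper, and the in_but_not_up figure assumes every up OSD is also in.

```shell
#!/bin/sh
# Derive OSD counts from an "osdmap" status line such as:
#   osdmap e26633: 239 osds: 85 up, 196 in

# osd_summary: reads status text on stdin, prints one summary line
osd_summary() {
    awk '/osdmap/ {
        for (i = 1; i <= NF; i++) {
            if ($i == "osds:") total = $(i - 1)   # number before "osds:"
            if ($i == "up,")   up    = $(i - 1)   # number before "up,"
            if ($i == "in")    nin   = $(i - 1)   # number before "in"
        }
        # in_but_not_up assumes all up OSDs are also marked in
        printf "total=%d up=%d in=%d not_up=%d in_but_not_up=%d\n",
               total, up, nin, total - up, nin - up
    }'
}

echo '     osdmap e26633: 239 osds: 85 up, 196 in' | osd_summary
# prints: total=239 up=85 in=196 not_up=154 in_but_not_up=111
```

The in_but_not_up value of 111 matches the "111/196 in osds are down" entry in the health line.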

    OSD :  Logs 

    2015-03-09 12:22:16.312774 7f760dac9700 -1 common/Thread.cc: In function 'void Thread::create(size_t)' thread 7f760dac9700 time 2015-03-09 12:22:16.311970
    common/Thread.cc: 129: FAILED assert(ret == 0)

     ceph version 0.80.8 (69eaad7f8308f21573c604f121956e64679a52a7)
     1: (Thread::create(unsigned long)+0x8a) [0xaf41da]
     2: (SimpleMessenger::add_accept_pipe(int)+0x6a) [0xae84fa]
     3: (Accepter::entry()+0x265) [0xb5c635]
     4: /lib64/libpthread.so.0() [0x3c8a6079d1]
     5: (clone()+0x6d) [0x3c8a2e89dd]
     NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.


    More information at Ceph Tracker Issue :  http://tracker.ceph.com/issues/10988#change-49018


    ****************************************************************
    Karan Singh 
    Systems Specialist , Storage Platforms
    CSC - IT Center for Science,
    Keilaranta 14, P. O. Box 405, FIN-02101 Espoo, Finland
    mobile: +358 503 812758
    tel. +358 9 4572001
    fax +358 9 4572302
    http://www.csc.fi/
    ****************************************************************


    _______________________________________________
    ceph-users mailing list
    ceph-users@xxxxxxxxxxxxxx
    http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




    --
    Warm Regards,
    Azad Aliyar
    Linux Server Engineer
    Email: azad.aliyar@xxxxxxxxxxxxxxxx   |   Skype: spark.azad
    3rd Floor, Leela Infopark, Phase -2,Kakanad, Kochi-30, Kerala, India
    Phone:+91 484 6561696 , Mobile:91-8129270421.





    --
    Thanks & Regards   
    K.Mohamed Pakkeer
    Mobile- 0091-8754410114
