Re: Ceph BIG outage : 200+ OSD are down , OSD cannot create thread

Nicheal <zay11022@xxxxxxxxx> · Tue, 10 Mar 2015 11:56:56 +0800



2015-03-10 3:01 GMT+08:00 Sage Weil <sage@xxxxxxxxxxxx>:
> On Mon, 9 Mar 2015, Karan Singh wrote:
>> Thanks guys, kernel.pid_max=4194303 did the trick.
>
> Great to hear!  Sorry we missed that you only had it at 65536.
>
> This is a really common problem that people hit when their clusters start
> to grow.  Is there somewhere in the docs we can put this to catch more
> users?  Or maybe a warning issued by the osds themselves or something if
> they see limits that are low?
>
> sage
>
I think we could add the command to the init script, /etc/init.d/ceph,
similar to how we already deal with the max fd limit there
(ulimit -n 32768). Then, whenever the OSDs are started with
"service ceph start osd.*", kernel.pid_max would automatically be set
to a proper value.
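
A rough sketch of what that could look like, assuming it runs as root
before the daemons are started. The function names (pid_max_too_low,
raise_pid_max) and the PID_MAX_FILE variable are hypothetical and not
part of the existing /etc/init.d/ceph; the target value is simply the
one reported to work in this thread.

```shell
#!/bin/sh
# Hypothetical helper for an init script; all names here are assumptions.

PID_MAX_TARGET=4194303    # value reported to work in this thread
PID_MAX_FILE=${PID_MAX_FILE:-/proc/sys/kernel/pid_max}

# Returns 0 (true) when the current value ($1) is below the target ($2).
pid_max_too_low() {
    [ "$1" -lt "$2" ]
}

# Raise kernel.pid_max only if it is currently below the target.
raise_pid_max() {
    current=$(cat "$PID_MAX_FILE") || return 1
    if pid_max_too_low "$current" "$PID_MAX_TARGET"; then
        # Needs root; has the same effect as: sysctl -w kernel.pid_max=<value>
        echo "$PID_MAX_TARGET" > "$PID_MAX_FILE"
    fi
}
```

Calling raise_pid_max before the OSDs are started would cover the
"service ceph start" case. A line such as kernel.pid_max = 4194303 in
/etc/sysctl.conf would be the usual way to keep the setting across
reboots.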

>> - Karan -
>>
>>       On 09 Mar 2015, at 14:48, Christian Eichelmann
>>       <christian.eichelmann@xxxxxxxx> wrote:
>>
>> Hi Karan,
>>
>> as you actually write in your own book, the problem is the sysctl
>> setting "kernel.pid_max". I've seen in your bug report that you were
>> setting it to 65536, which is still too low for high-density hardware.
>>
>> In our cluster, one OSD server has about 66,000 threads when idle
>> (60 OSDs per server). The number of threads increases when you
>> increase the number of placement groups in the cluster, which I think
>> is what triggered your problem.
>>
>> Set "kernel.pid_max" to 4194303 (the maximum) like Azad Aliyar
>> suggested, and the problem should be gone.
>>
>> Regards,
>> Christian
>>
>> On 09.03.2015 11:41, Karan Singh wrote:
>>       Hello community, I need help fixing a long-running Ceph
>>       problem.
>>
>>       The cluster is unhealthy and multiple OSDs are down. When I
>>       try to restart the OSDs, I get this error:
>>
>>
>>       2015-03-09 12:22:16.312774 7f760dac9700 -1 common/Thread.cc: In function 'void Thread::create(size_t)' thread 7f760dac9700 time 2015-03-09 12:22:16.311970
>>       common/Thread.cc: 129: FAILED assert(ret == 0)
>>
>>
>>       Environment: 4 nodes, OSD+Monitor, latest Firefly, CentOS 6.5,
>>       kernel 3.17.2-1.el6.elrepo.x86_64
>>
>>       Tried upgrading from 0.80.7 to 0.80.8, but no luck.
>>
>>       Tried the CentOS stock kernel 2.6.32, but no luck.
>>
>>       Memory is not a problem; more than 150 GB is free.
>>
>>
>>       Has anyone ever faced this problem?
>>
>>       Cluster status:
>>
>>           cluster 2bd3283d-67ef-4316-8b7e-d8f4747eae33
>>            health HEALTH_WARN 7334 pgs degraded; 1185 pgs down; 1 pgs incomplete; 1735 pgs peering; 8938 pgs stale; 1736 pgs stuck inactive; 8938 pgs stuck stale; 10320 pgs stuck unclean; recovery 6061/31080 objects degraded (19.501%); 111/196 in osds are down; clock skew detected on mon.pouta-s02, mon.pouta-s03
>>            monmap e3: 3 mons at {pouta-s01=10.XXX.50.1:6789/0,pouta-s02=10.XXX.50.2:6789/0,pouta-s03=10.XXX.50.3:6789/0}, election epoch 1312, quorum 0,1,2 pouta-s01,pouta-s02,pouta-s03
>>            osdmap e26633: 239 osds: 85 up, 196 in
>>             pgmap v60389: 17408 pgs, 13 pools, 42345 MB data, 10360 objects
>>                   4699 GB used, 707 TB / 711 TB avail
>>                   6061/31080 objects degraded (19.501%)
>>                         14 down+remapped+peering
>>                         39 active
>>                       3289 active+clean
>>                        547 peering
>>                        663 stale+down+peering
>>                        705 stale+active+remapped
>>                          1 active+degraded+remapped
>>                          1 stale+down+incomplete
>>                        484 down+peering
>>                        455 active+remapped
>>                       3696 stale+active+degraded
>>                          4 remapped+peering
>>                         23 stale+down+remapped+peering
>>                         51 stale+active
>>                       3637 active+degraded
>>                       3799 stale+active+clean
>>
>>       OSD logs:
>>
>>       2015-03-09 12:22:16.312774 7f760dac9700 -1 common/Thread.cc: In function 'void Thread::create(size_t)' thread 7f760dac9700 time 2015-03-09 12:22:16.311970
>>       common/Thread.cc: 129: FAILED assert(ret == 0)
>>
>>        ceph version 0.80.8 (69eaad7f8308f21573c604f121956e64679a52a7)
>>        1: (Thread::create(unsigned long)+0x8a) [0xaf41da]
>>        2: (SimpleMessenger::add_accept_pipe(int)+0x6a) [0xae84fa]
>>        3: (Accepter::entry()+0x265) [0xb5c635]
>>        4: /lib64/libpthread.so.0() [0x3c8a6079d1]
>>        5: (clone()+0x6d) [0x3c8a2e89dd]
>>        NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>
>>
>>       More information at Ceph Tracker issue:
>>       http://tracker.ceph.com/issues/10988#change-49018
>>
>>
>>       ****************************************************************
>>       Karan Singh
>>       Systems Specialist , Storage Platforms
>>       CSC - IT Center for Science,
>>       Keilaranta 14, P. O. Box 405, FIN-02101 Espoo, Finland
>>       mobile: +358 503 812758
>>       tel. +358 9 4572001
>>       fax +358 9 4572302
>>       http://www.csc.fi/
>>       ****************************************************************
>>
>>
>>
>>       _______________________________________________
>>       ceph-users mailing list
>>       ceph-users@xxxxxxxxxxxxxx
>>       http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>>
>> --
>> Christian Eichelmann
>> Systemadministrator
>>
>> 1&1 Internet AG - IT Operations Mail & Media Advertising & Targeting
>> Brauerstraße 48 · DE-76135 Karlsruhe
>> Telefon: +49 721 91374-8026
>> christian.eichelmann@xxxxxxxx
>>
>> Amtsgericht Montabaur / HRB 6484
>> Vorstände: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Robert
>> Hoffmann, Markus Huhn, Hans-Henning Kettler, Dr. Oliver Mauss, Jan
>> Oetjen
>> Aufsichtsratsvorsitzender: Michael Scheeren
>>
>>
>>
>>