2015-03-10 3:01 GMT+08:00 Sage Weil <sage@xxxxxxxxxxxx>:
> On Mon, 9 Mar 2015, Karan Singh wrote:
>> Thanks guys, kernel.pid_max=4194303 did the trick.
>
> Great to hear! Sorry we missed that you only had it at 65536.
>
> This is a really common problem that people hit when their clusters
> start to grow. Is there somewhere in the docs we can put this to
> catch more users? Or maybe a warning issued by the OSDs themselves
> if they see limits that are low?
>
> sage

Um, I think we could add the command to the shell script
/etc/init.d/ceph, similar to how the max-fd limitation is already
handled there (ulimit -n 32768). Then, when we use "service ceph
start osd.*" to start the OSDs, the limit would automatically be
raised to a proper value.
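A rough sketch of what I mean, modelled on the existing ulimit
handling (untested, and the exact spot in the start path may differ
between versions; the variable names are just illustrative):

    # Raise kernel.pid_max before the daemons are forked, next to the
    # existing "ulimit -n 32768" for open file descriptors. 4194303
    # is the kernel's maximum on 64-bit systems.
    want_pid_max=4194303
    cur_pid_max=$(cat /proc/sys/kernel/pid_max)
    if [ "$cur_pid_max" -lt "$want_pid_max" ]; then
        sysctl -q -w kernel.pid_max=$want_pid_max
    fi

That would only cover daemons started through the init script, of
course; putting the same value into /etc/sysctl.conf would also cover
reboots and manually started daemons.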
>> - Karan -
>>
>> On 09 Mar 2015, at 14:48, Christian Eichelmann
>> <christian.eichelmann@xxxxxxxx> wrote:
>>
>> Hi Karan,
>>
>> as you actually write in your own book, the problem is the sysctl
>> setting "kernel.pid_max". I've seen in your bug report that you were
>> setting it to 65536, which is still too low for high-density
>> hardware.
>>
>> In our cluster, one OSD server has about 66,000 threads when idle
>> (60 OSDs per server). The number of threads increases when you
>> increase the number of placement groups in the cluster, which I
>> think is what triggered your problem.
>>
>> Set "kernel.pid_max" to 4194303 (the maximum), as Azad Aliyar
>> suggested, and the problem should be gone.
>>
>> Regards,
>> Christian
>>
>> On 09.03.2015 11:41, Karan Singh wrote:
>>
>> Hello community, I need help fixing a long-running Ceph problem.
>>
>> The cluster is unhealthy and multiple OSDs are DOWN. When I try to
>> restart the OSDs I get this error:
>>
>>     2015-03-09 12:22:16.312774 7f760dac9700 -1 common/Thread.cc: In
>>     function 'void Thread::create(size_t)' thread 7f760dac9700 time
>>     2015-03-09 12:22:16.311970
>>     common/Thread.cc: 129: FAILED assert(ret == 0)
>>
>> Environment: 4 nodes, OSD+Monitor, latest Firefly, CentOS 6.5,
>> kernel 3.17.2-1.el6.elrepo.x86_64
>>
>> Tried upgrading from 0.80.7 to 0.80.8, but no luck.
>> Tried the CentOS stock kernel 2.6.32, but no luck.
>> Memory is not a problem; more than 150 GB is free.
>>
>> Has anyone ever faced this problem?
>>
>> Cluster status:
>>
>>     cluster 2bd3283d-67ef-4316-8b7e-d8f4747eae33
>>      health HEALTH_WARN 7334 pgs degraded; 1185 pgs down; 1 pgs
>>        incomplete; 1735 pgs peering; 8938 pgs stale; 1736 pgs stuck
>>        inactive; 8938 pgs stuck stale; 10320 pgs stuck unclean;
>>        recovery 6061/31080 objects degraded (19.501%); 111/196 in
>>        osds are down; clock skew detected on mon.pouta-s02,
>>        mon.pouta-s03
>>      monmap e3: 3 mons at {pouta-s01=10.XXX.50.1:6789/0,
>>        pouta-s02=10.XXX.50.2:6789/0,pouta-s03=10.XXX.50.3:6789/0},
>>        election epoch 1312, quorum 0,1,2
>>        pouta-s01,pouta-s02,pouta-s03
>>      osdmap e26633: 239 osds: 85 up, 196 in
>>      pgmap v60389: 17408 pgs, 13 pools, 42345 MB data, 10360 objects
>>            4699 GB used, 707 TB / 711 TB avail
>>            6061/31080 objects degraded (19.501%)
>>                  14 down+remapped+peering
>>                  39 active
>>                3289 active+clean
>>                 547 peering
>>                 663 stale+down+peering
>>                 705 stale+active+remapped
>>                   1 active+degraded+remapped
>>                   1 stale+down+incomplete
>>                 484 down+peering
>>                 455 active+remapped
>>                3696 stale+active+degraded
>>                   4 remapped+peering
>>                  23 stale+down+remapped+peering
>>                  51 stale+active
>>                3637 active+degraded
>>                3799 stale+active+clean
>>
>> OSD log:
>>
>>     2015-03-09 12:22:16.312774 7f760dac9700 -1 common/Thread.cc: In
>>     function 'void Thread::create(size_t)' thread 7f760dac9700 time
>>     2015-03-09 12:22:16.311970
>>     common/Thread.cc: 129: FAILED assert(ret == 0)
>>
>>      ceph version 0.80.8 (69eaad7f8308f21573c604f121956e64679a52a7)
>>      1: (Thread::create(unsigned long)+0x8a) [0xaf41da]
>>      2: (SimpleMessenger::add_accept_pipe(int)+0x6a) [0xae84fa]
>>      3: (Accepter::entry()+0x265) [0xb5c635]
>>      4: /lib64/libpthread.so.0() [0x3c8a6079d1]
>>      5: (clone()+0x6d) [0x3c8a2e89dd]
>>      NOTE: a copy of the executable, or `objdump -rdS <executable>`,
>>      is needed to interpret this.
>>
>> More information at Ceph tracker issue:
>> http://tracker.ceph.com/issues/10988#change-49018
>>
>> ****************************************************************
>> Karan Singh
>> Systems Specialist, Storage Platforms
>> CSC - IT Center for Science
>> Keilaranta 14, P. O. Box 405, FIN-02101 Espoo, Finland
>> mobile: +358 503 812758
>> tel. +358 9 4572001
>> fax +358 9 4572302
>> http://www.csc.fi/
>> ****************************************************************
>>
>> --
>> Christian Eichelmann
>> Systemadministrator
>>
>> 1&1 Internet AG - IT Operations Mail & Media Advertising & Targeting
>> Brauerstraße 48 · DE-76135 Karlsruhe
>> Phone: +49 721 91374-8026
>> christian.eichelmann@xxxxxxxx
>>
>> Amtsgericht Montabaur / HRB 6484
>> Executive Board: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich,
>> Robert Hoffmann, Markus Huhn, Hans-Henning Kettler, Dr. Oliver
>> Mauss, Jan Oetjen
>> Chairman of the Supervisory Board: Michael Scheeren

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
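For anyone who lands on this thread with the same "FAILED assert(ret
== 0)" in Thread::create(): a quick check and fix along the lines
Christian describes could look like the following rough sketch (run
as root; untested):

    # pthread_create() starts failing with EAGAIN once the number of
    # threads on the box approaches kernel.pid_max; that is what the
    # failed assert in Thread::create() reflects. Compare the two:
    echo "threads: $(ps -eL --no-headers | wc -l)"
    echo "pid_max: $(cat /proc/sys/kernel/pid_max)"

    # Raise the limit to the kernel maximum now, and persist it so a
    # reboot does not reintroduce the problem:
    sysctl -w kernel.pid_max=4194303
    echo "kernel.pid_max = 4194303" >> /etc/sysctl.conf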