On Mon, 9 Mar 2015, Karan Singh wrote:
> Thanks guys, kernel.pid_max=4194303 did the trick.

Great to hear! Sorry we missed that you only had it at 65536.

This is a really common problem that people hit when their clusters start
to grow. Is there somewhere in the docs we can put this to catch more
users? Or maybe a warning issued by the OSDs themselves if they see
limits that are low?

sage

> - Karan -
>
> On 09 Mar 2015, at 14:48, Christian Eichelmann
> <christian.eichelmann@xxxxxxxx> wrote:
>
> Hi Karan,
>
> as you are actually writing in your own book, the problem is the sysctl
> setting "kernel.pid_max". I've seen in your bug report that you were
> setting it to 65536, which is still too low for high-density hardware.
>
> In our cluster, one OSD server has about 66,000 threads when idle
> (60 OSDs per server). The number of threads increases when you
> increase the number of placement groups in the cluster, which I think
> has triggered your problem.
>
> Set the "kernel.pid_max" setting to 4194303 (the maximum) like Azad
> Aliyar suggested, and the problem should be gone.
>
> Regards,
> Christian
>
> On 09.03.2015 11:41, Karan Singh wrote:
> Hello Community, I need help to fix a long-running Ceph problem.
>
> The cluster is unhealthy and multiple OSDs are DOWN. When I try to
> restart the OSDs I get this error:
>
> 2015-03-09 12:22:16.312774 7f760dac9700 -1 common/Thread.cc: In function 'void Thread::create(size_t)' thread 7f760dac9700 time 2015-03-09 12:22:16.311970
> common/Thread.cc: 129: FAILED assert(ret == 0)
>
> Environment: 4 nodes, OSD+Monitor, Firefly latest, CentOS 6.5,
> 3.17.2-1.el6.elrepo.x86_64
>
> Tried upgrading from 0.80.7 to 0.80.8, but no luck.
>
> Tried the CentOS stock kernel 2.6.32, but no luck.
>
> Memory is not a problem; more than 150 GB is free.
>
> Has anyone ever faced this problem?
>
> Cluster status
>
>     cluster 2bd3283d-67ef-4316-8b7e-d8f4747eae33
>      health HEALTH_WARN 7334 pgs degraded; 1185 pgs down; 1 pgs incomplete; 1735 pgs peering; 8938 pgs stale; 1736 pgs stuck inactive; 8938 pgs stuck stale; 10320 pgs stuck unclean; recovery 6061/31080 objects degraded (19.501%); 111/196 in osds are down; clock skew detected on mon.pouta-s02, mon.pouta-s03
>      monmap e3: 3 mons at {pouta-s01=10.XXX.50.1:6789/0,pouta-s02=10.XXX.50.2:6789/0,pouta-s03=10.XXX.50.3:6789/0}, election epoch 1312, quorum 0,1,2 pouta-s01,pouta-s02,pouta-s03
>      osdmap e26633: 239 osds: 85 up, 196 in
>       pgmap v60389: 17408 pgs, 13 pools, 42345 MB data, 10360 objects
>             4699 GB used, 707 TB / 711 TB avail
>             6061/31080 objects degraded (19.501%)
>                   14 down+remapped+peering
>                   39 active
>                 3289 active+clean
>                  547 peering
>                  663 stale+down+peering
>                  705 stale+active+remapped
>                    1 active+degraded+remapped
>                    1 stale+down+incomplete
>                  484 down+peering
>                  455 active+remapped
>                 3696 stale+active+degraded
>                    4 remapped+peering
>                   23 stale+down+remapped+peering
>                   51 stale+active
>                 3637 active+degraded
>                 3799 stale+active+clean
>
> OSD logs
>
> 2015-03-09 12:22:16.312774 7f760dac9700 -1 common/Thread.cc: In function 'void Thread::create(size_t)' thread 7f760dac9700 time 2015-03-09 12:22:16.311970
> common/Thread.cc: 129: FAILED assert(ret == 0)
>
>  ceph version 0.80.8 (69eaad7f8308f21573c604f121956e64679a52a7)
>  1: (Thread::create(unsigned long)+0x8a) [0xaf41da]
>  2: (SimpleMessenger::add_accept_pipe(int)+0x6a) [0xae84fa]
>  3: (Accepter::entry()+0x265) [0xb5c635]
>  4: /lib64/libpthread.so.0() [0x3c8a6079d1]
>  5: (clone()+0x6d) [0x3c8a2e89dd]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>
> More information at Ceph Tracker Issue:
> http://tracker.ceph.com/issues/10988#change-49018
>
> ****************************************************************
> Karan Singh
> Systems Specialist, Storage Platforms
> CSC - IT Center for Science,
> Keilaranta 14, P. O. Box 405, FIN-02101 Espoo, Finland
> mobile: +358 503 812758
> tel. +358 9 4572001
> fax +358 9 4572302
> http://www.csc.fi/
> ****************************************************************
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> --
> Christian Eichelmann
> System Administrator
>
> 1&1 Internet AG - IT Operations Mail & Media Advertising & Targeting
> Brauerstraße 48 · DE-76135 Karlsruhe
> Phone: +49 721 91374-8026
> christian.eichelmann@xxxxxxxx
>
> Amtsgericht Montabaur / HRB 6484
> Management Board: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich,
> Robert Hoffmann, Markus Huhn, Hans-Henning Kettler, Dr. Oliver Mauss,
> Jan Oetjen
> Chairman of the Supervisory Board: Michael Scheeren
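
On the idea raised in this thread of warning when limits are low: here is a minimal Python sketch, not anything Ceph itself ships, that compares the number of threads currently running on a host with kernel.pid_max. It relies on the fact that on Linux every thread takes an ID from the same pool that pid_max bounds, so a host running about 66,000 threads cannot work with a limit of 65536. The function names and the 50% headroom threshold are assumptions made for illustration only.

```python
from pathlib import Path

# Standard Linux location of the limit discussed in this thread.
PID_MAX_PATH = Path("/proc/sys/kernel/pid_max")


def read_pid_max(path=PID_MAX_PATH):
    """Return the current kernel.pid_max value as an integer."""
    return int(Path(path).read_text().strip())


def count_threads(proc_root=Path("/proc")):
    """Count all threads on the host by listing /proc/<pid>/task."""
    total = 0
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue  # not a process directory
        try:
            total += sum(1 for _ in (entry / "task").iterdir())
        except OSError:
            continue  # process exited or is not readable; skip it
    return total


def pid_max_warning(pid_max, threads, headroom=0.5):
    """Return a warning string if threads use more than `headroom` of pid_max.

    The 0.5 default is an arbitrary illustrative threshold, not a Ceph
    recommendation. Returns None when there is enough headroom.
    """
    if threads >= pid_max * headroom:
        return (
            f"kernel.pid_max={pid_max} is low for {threads} running threads; "
            "consider raising it (this thread used 4194303)"
        )
    return None


if __name__ == "__main__":
    if PID_MAX_PATH.exists():
        message = pid_max_warning(read_pid_max(), count_threads())
        print(message or "kernel.pid_max has enough headroom")
    else:
        print("no /proc/sys/kernel/pid_max here; this check is Linux-only")
```

With the numbers from this thread, pid_max_warning(65536, 66000) returns a warning, while pid_max_warning(4194303, 66000) returns None.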