Please find ceph.conf at [1] and the corresponding OSD log at [2]. To clarify one thing I skipped earlier: while bringing up the OSDs, 'ceph-disk activate' was hanging (due to issue [3]). To get past this, I had to temporarily disable 'journal dio' to get the disk activated (with 'mark-init' set to none) and then explicitly start the OSD service after updating the conf to re-enable 'journal dio'. I am hopeful that this should not cause the present issue (since a few OSDs start successfully on the first attempt and others on subsequent service restarts)!

[1] - http://paste.openstack.org/show/411161/
[2] - http://paste.openstack.org/show/411162/
[3] - http://tracker.ceph.com/issues/9768

Regards,
Unmesh G.
IRC: unmeshg

> -----Original Message-----
> From: Haomai Wang [mailto:haomaiwang@xxxxxxxxx]
> Sent: Thursday, August 06, 2015 6:22 PM
> To: Gurjar, Unmesh
> Cc: ceph-devel@xxxxxxxxxxxxxxx
> Subject: Re: OSD sometimes stuck in init phase
>
> I don't find anything strange.
>
> Could you paste your ceph.conf? And restart this osd with debug_osd=20/20,
> debug_filestore=20/20 :-)
>
> On Thu, Aug 6, 2015 at 8:09 PM, Gurjar, Unmesh <unmesh.gurjar@xxxxxx> wrote:
> > Thanks for the quick response, Haomai! Please find the backtrace here [1].
> >
> > [1] - http://paste.openstack.org/show/411139/
> >
> > Regards,
> > Unmesh G.
> > IRC: unmeshg
> >
> >> -----Original Message-----
> >> From: Haomai Wang [mailto:haomaiwang@xxxxxxxxx]
> >> Sent: Thursday, August 06, 2015 5:31 PM
> >> To: Gurjar, Unmesh
> >> Cc: ceph-devel@xxxxxxxxxxxxxxx
> >> Subject: Re: OSD sometimes stuck in init phase
> >>
> >> Could you print backtraces for all threads via "thread apply all bt"?
> >>
> >> On Thu, Aug 6, 2015 at 7:52 PM, Gurjar, Unmesh <unmesh.gurjar@xxxxxx> wrote:
> >> > Hi,
> >> >
> >> > On a Ceph Firefly cluster (version [1]), OSDs are configured to use separate data and journal disks (using the ceph-disk utility). 
It is observed that a few OSDs start up fine (reach the 'up' and 'in' state); however, others are stuck in the 'init creating/touching snapmapper object' phase. Below is an OSD start-up log snippet:
> >> >
> >> > 2015-08-06 08:58:02.491537 7fd312df97c0  1 journal _open /var/lib/ceph/osd/ceph-0/journal fd 21: 1073741824 bytes, block size 4096 bytes, directio = 1, aio = 1
> >> > 2015-08-06 08:58:02.498447 7fd312df97c0  1 journal _open /var/lib/ceph/osd/ceph-0/journal fd 21: 1073741824 bytes, block size 4096 bytes, directio = 1, aio = 1
> >> > 2015-08-06 08:58:02.498720 7fd312df97c0  2 osd.0 0 boot
> >> > 2015-08-06 08:58:02.498865 7fd312df97c0 10 osd.0 0 read_superblock sb(2645bbf6-16d0-4c42-8835-8ba9f5c95a1d osd.0 a821146f-0742-4724-b4ca-39ea4ccc298d e0 [0,0] lci=[0,0])
> >> > 2015-08-06 08:58:02.498937 7fd312df97c0 10 osd.0 0 init creating/touching snapmapper object
> >> >
> >> > The log statement is inaccurate, though, since the code is actually performing the init operation for the 'infos' object (as can be observed from the source [2]).
> >> >
> >> > Upon debugging further, the thread appears to be blocked in a condition wait under the 'ObjectStore::apply_transaction::my_lock' mutex. 
Below is the debug trace:
> >> >
> >> > (gdb) where
> >> > #0  0x00007fd3122b708f in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
> >> > #1  0x00007fd313132bf4 in ObjectStore::apply_transactions(ObjectStore::Sequencer*, std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, Context*) ()
> >> > #2  0x00007fd313097d08 in ObjectStore::apply_transaction(ObjectStore::Transaction&, Context*) ()
> >> > #3  0x00007fd313076790 in OSD::init() ()
> >> > #4  0x00007fd3130233a7 in main ()
> >> >
> >> > In a few cases, upon restarting the stuck OSD service, it successfully completes the 'init' phase and reaches the 'up' and 'in' state!
> >> >
> >> > Any help is greatly appreciated. Please let me know if any more details are required to root-cause this.
> >> >
> >> > [1] - 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3)
> >> > [2] - https://github.com/ceph/ceph/blob/firefly/src/osd/OSD.cc#L1211
> >> >
> >> > Regards,
> >> > Unmesh G.
> >> > IRC: unmeshg
> >> > --
> >> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@xxxxxxxxxxxxxxx
> >> > More majordomo info at http://vger.kernel.org/majordomo-info.html
> >>
> >>
> >>
> >> --
> >> Best Regards,
> >>
> >> Wheat
> >
>
> --
> Best Regards,
>
> Wheat