Re: OSD sometimes stuck in init phase

Haomai Wang <haomaiwang@xxxxxxxxx> · Thu, 6 Aug 2015 20:52:26 +0800

I don't see anything strange in the backtrace.

Could you paste your ceph.conf? Also, please restart this OSD with
debug_osd=20/20 and debug_filestore=20/20 and post the resulting log :-)
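
For reference, one way to set these debug levels is in ceph.conf before
restarting the daemon. This is only a sketch: the [osd.0] section name is
an assumption taken from the osd.0 entries in the log quoted below, so
adjust it to the id of the stuck OSD, or use [osd] to apply it to every
OSD on the host.

```
[osd.0]
    debug osd = 20/20
    debug filestore = 20/20
```

Level 20 logging is very verbose, so it is probably worth removing these
lines again once the start-up log has been captured.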

On Thu, Aug 6, 2015 at 8:09 PM, Gurjar, Unmesh <unmesh.gurjar@xxxxxx> wrote:
> Thanks for the quick response, Haomai! Please find the backtrace here [1].
>
> [1] - http://paste.openstack.org/show/411139/
>
> Regards,
> Unmesh G.
> IRC: unmeshg
>
>> -----Original Message-----
>> From: Haomai Wang [mailto:haomaiwang@xxxxxxxxx]
>> Sent: Thursday, August 06, 2015 5:31 PM
>> To: Gurjar, Unmesh
>> Cc: ceph-devel@xxxxxxxxxxxxxxx
>> Subject: Re: OSD sometimes stuck in init phase
>>
>> Could you print the backtraces of all threads via "thread apply all bt"?
>>
>> On Thu, Aug 6, 2015 at 7:52 PM, Gurjar, Unmesh <unmesh.gurjar@xxxxxx>
>> wrote:
>> > Hi,
>> >
>> > On a Ceph Firefly cluster (version [1]), OSDs are configured to use separate
>> > data and journal disks (using the ceph-disk utility). We observe that a few
>> > OSDs start up fine (they reach the 'up' and 'in' state); however, others are
>> > stuck in the 'init creating/touching snapmapper object' phase. Below is an
>> > OSD start-up log snippet:
>> >
>> > 2015-08-06 08:58:02.491537 7fd312df97c0  1 journal _open
>> > /var/lib/ceph/osd/ceph-0/journal fd 21: 1073741824 bytes, block size
>> > 4096 bytes, directio = 1, aio = 1
>> > 2015-08-06 08:58:02.498447 7fd312df97c0  1 journal _open
>> > /var/lib/ceph/osd/ceph-0/journal fd 21: 1073741824 bytes, block size
>> > 4096 bytes, directio = 1, aio = 1
>> > 2015-08-06 08:58:02.498720 7fd312df97c0  2 osd.0 0 boot
>> > 2015-08-06 08:58:02.498865 7fd312df97c0 10 osd.0 0 read_superblock
>> > sb(2645bbf6-16d0-4c42-8835-8ba9f5c95a1d osd.0
>> > a821146f-0742-4724-b4ca-39ea4ccc298d e0 [0,0] lci=[0,0])
>> > 2015-08-06 08:58:02.498937 7fd312df97c0 10 osd.0 0 init
>> > creating/touching snapmapper object
>> >
>> > The log statement is inaccurate though, since at that point the OSD is
>> > actually initializing the 'infos' object (as can be seen in the source [2]).
>> >
>> > Upon debugging further, the thread appears to be blocked in
>> > pthread_cond_wait on the condition associated with the
>> > 'ObjectStore::apply_transaction::my_lock' mutex. Below is the debug trace:
>> >
>> > (gdb) where
>> > #0  0x00007fd3122b708f in pthread_cond_wait@@GLIBC_2.3.2 () from
>> > /lib/x86_64-linux-gnu/libpthread.so.0
>> > #1  0x00007fd313132bf4 in
>> > ObjectStore::apply_transactions(ObjectStore::Sequencer*,
>> > std::list<ObjectStore::Transaction*,
>> > std::allocator<ObjectStore::Transaction*> >&, Context*) ()
>> > #2  0x00007fd313097d08 in
>> > ObjectStore::apply_transaction(ObjectStore::Transaction&, Context*) ()
>> > #3  0x00007fd313076790 in OSD::init() ()
>> > #4  0x00007fd3130233a7 in main ()
>> >
>> > In a few cases, restarting the stuck OSD service lets it complete the
>> > 'init' phase successfully and reach the 'up' and 'in' state.
>> >
>> > Any help is greatly appreciated. Please let me know if any more details are
>> > required to find the root cause.
>> >
>> > [1] - 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3)
>> > [2] -  https://github.com/ceph/ceph/blob/firefly/src/osd/OSD.cc#L1211
>> >
>> > Regards,
>> > Unmesh G.
>> > IRC: unmeshg
>> > --
>> > To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>> > in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo
>> > info at  http://vger.kernel.org/majordomo-info.html
>>
>>
>>
>> --
>> Best Regards,
>>
>> Wheat

-- 
Best Regards,

Wheat