Re: rbd map command hangs for 15 minutes during system start up

On 12/26/2012 11:45 AM, Nick Bartos wrote:
> Here's a log with a hang on the updated branch:
> 
> https://gist.github.com/raw/4381750/772476e1bae1e6366347a223f34aa6c440b92765/rdb-hang-1356543132.log

OK, new naming scheme.  Please try:  wip-nick-1

I added another simple fix, but then collapsed three commits
into one, and added one more (somewhat unrelated).

I've done simple testing with this and will subject it to
more rigorous testing shortly.  I wanted to make it available
to you quickly though.

					-Alex

> 
> On Thu, Dec 20, 2012 at 1:59 PM, Alex Elder <elder@xxxxxxxxxxx> wrote:
>> On 12/20/2012 11:48 AM, Nick Bartos wrote:
>>> Unfortunately, we still have a hang:
>>>
>>> https://gist.github.com/4347052/download
>>
>> The saga continues, and each time we get a little more
>> information.  Please try branch: "wip-nick-newerest"
>>
>> Thank you.
>>
>>                                         -Alex
>>
>>
>>> On Wed, Dec 19, 2012 at 2:42 PM, Alex Elder <elder@xxxxxxxxxxx> wrote:
>>>> On 12/19/2012 03:25 PM, Alex Elder wrote:
>>>>> On 12/18/2012 12:05 PM, Nick Bartos wrote:
>>>>>> I've added the output of "ps -ef" in addition to triggering a trace
>>>>>> when a hang is detected.  Not much is generally running at that point,
>>>>>> but you can have a look:
>>>>>>
>>>>>> https://gist.github.com/raw/4330223/2f131ee312ee43cb3d8c307a9bf2f454a7edfe57/rbd-hang-1355851498.txt
>>>>>
>>>>> This helped a lot.  I updated the bug with a little more info.
>>>>>
>>>>>     http://tracker.newdream.net/issues/3519
>>>>>
>>>>> I also think I have now found something that could explain what you
>>>>> are seeing, and am developing a fix.  I'll provide you an update
>>>>> as soon as I have tested what I come up with, almost certainly
>>>>> this afternoon.
>>>>
>>>> Nick, I have a new branch for you to try with a new fix in place.
>>>> As you might have predicted, it's named "wip-nick-newest".
>>>>
>>>> Please give it a try to see if it resolved the hang you've
>>>> been seeing and let me know how it goes.  If it continues
>>>> to hang, please provide the logs as you have before; they've
>>>> been very helpful.
>>>>
>>>> Thanks a lot.
>>>>
>>>>                                         -Alex
>>>>>
>>>>>                                       -Alex
>>>>>
>>>>>> Is it possible that there is some sort of deadlock going on?  We are
>>>>>> doing the rbd maps (and subsequent filesystem mounts) on the same
>>>>>> systems which are running the ceph-osd and ceph-mon processes.  To get
>>>>>> around the 'sync' deadlock problem, we are using a patch from Sage
>>>>>> which ignores system-wide syncs on filesystems mounted with the
>>>>>> 'mand' option (and we mount the underlying osd filesystems with
>>>>>> 'mand').  However I am wondering if there is potential for other types
>>>>>> of deadlocks in this environment.
>>>>>>
>>>>>> Also, we recently saw an rbd hang in a much older version, running
>>>>>> kernel 3.5.3 with only the sync hack patch, alongside ceph 0.48.1.
>>>>>> It's possible that this issue was around for some time, just the
>>>>>> recent patches made it happen more often (and thus more reproducible)
>>>>>> for us.
>>>>>>
>>>>>>
>>>>>> On Tue, Dec 18, 2012 at 8:09 AM, Alex Elder <elder@xxxxxxxxxxx> wrote:
>>>>>>> On 12/17/2012 11:12 AM, Nick Bartos wrote:
>>>>>>>> Here's a log with the rbd debugging enabled:
>>>>>>>>
>>>>>>>> https://gist.github.com/raw/4319962/d9690fd92c169198efc5eecabf275ef1808929d2/rbd-hang-test-1355763470.log
>>>>>>>>
>>>>>>>> On Fri, Dec 14, 2012 at 10:03 AM, Alex Elder <elder@xxxxxxxxxxx> wrote:
>>>>>>>>> On 12/14/2012 10:53 AM, Nick Bartos wrote:
>>>>>>>>>> Yes I was only enabling debugging for libceph.  I'm adding debugging
>>>>>>>>>> for rbd as well.  I'll do a repro later today when a test cluster
>>>>>>>>>> opens up.
>>>>>>>>>
>>>>>>>>> Excellent, thank you.   -Alex
>>>>>>>
>>>>>>> I looked through these debugging messages.  Looking only at the
>>>>>>> rbd debugging, what I see seems to indicate that rbd is idle at
>>>>>>> the point the "hang" seems to start.  This suggests that the hang
>>>>>>> is not due to rbd itself, but rather whatever it is that might
>>>>>>> be responsible for using the rbd image once it has been mapped.
>>>>>>>
>>>>>>> Is that possible?  I don't know what process you have that is
>>>>>>> mapping the rbd image, and what is supposed to be the next thing
>>>>>>> it does.  (I realize this may not make a lot of sense, given
>>>>>>> a patch in rbd seems to have caused the hang to begin occurring.)
>>>>>>>
>>>>>>> Also note that the debugging information available (i.e., the
>>>>>>> lines in the code that can output debugging information) may
>>>>>>> well be incomplete.  So if you don't find anything it may be
>>>>>>> necessary to provide you with another update which might include
>>>>>>> more debugging.
>>>>>>>
>>>>>>> Anyway, could you provide a little more context about what
>>>>>>> is going on sort of *around* rbd when activity seems to stop?
>>>>>>>
>>>>>>> Thanks a lot.
>>>>>>>
>>>>>>>                                         -Alex
>>>>>
>>>>
>>


