Re: rbd map command hangs for 15 minutes during system start up

So far basic things are working fine, and my hang test is at 78 passes
and still going strong.  I'll let you know if any problems crop up with
it.
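
For anyone wanting to reproduce this kind of test, here is a minimal sketch
of a hang detector in the spirit of what is described further down the
thread (run the command, and if it does not return in time, grab "ps -ef").
It is not the script actually used here.  The function name, the timeout
handling, and the diagnostic hook are all assumptions made for illustration.

```python
import subprocess
import sys


def run_with_hang_detection(cmd, timeout_s, diag_cmd=("ps", "-ef")):
    """Run cmd; if it has not exited after timeout_s seconds, treat it as hung.

    Returns (hung, diagnostics).  diagnostics is the stdout of diag_cmd when
    a hang was detected, otherwise an empty string.
    """
    proc = subprocess.Popen(cmd)
    try:
        proc.wait(timeout=timeout_s)
        return False, ""
    except subprocess.TimeoutExpired:
        # Capture diagnostics first, while the hung state still exists.
        diag = subprocess.run(list(diag_cmd), capture_output=True, text=True)
        # A task stuck in uninterruptible sleep may ignore SIGKILL, so do
        # not block forever waiting for it to exit.
        proc.kill()
        try:
            proc.wait(timeout=5)
        except subprocess.TimeoutExpired:
            pass
        return True, diag.stdout


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical): python hangtest.py <command> [args...]
    # The 900 second limit mirrors the 15 minute hang in the subject line.
    hung, diag = run_with_hang_detection(sys.argv[1:], 900)
    if hung:
        print("hang detected; process list follows")
        print(diag)
        sys.exit(1)
```

Running this in a loop across reboots or remaps gives a pass count like the
one above.  A kernel task dump would need a separate, root-only step and is
deliberately left out of the sketch.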

On Mon, Dec 31, 2012 at 10:22 AM, Alex Elder <elder@xxxxxxxxxxx> wrote:
> On 12/26/2012 03:36 PM, Alex Elder wrote:
>> On 12/26/2012 11:45 AM, Nick Bartos wrote:
>>> Here's a log with a hang on the updated branch:
>>>
>>> https://gist.github.com/raw/4381750/772476e1bae1e6366347a223f34aa6c440b92765/rdb-hang-1356543132.log
>>
>> OK, new naming scheme.  Please try:  wip-nick-1
>
> Now that we've got this resolved, I've created an updated
> "stable" branch with ceph-related bug fixes, based on the
> latest 3.5 stable branch, 3.5.7.  It contains a bunch of
> other bug fixes that what you had been working with did
> not have.
>
> I'm starting my own testing with this branch now.  But it
> would be great if you'd give it a try as well, since I
> know you're a "real" user of this code base.
>
> It's available as branch "linux-3.5.7-ceph" on the
> ceph-client git repository.  Thanks a lot.
>
>                                         -Alex
>
>>
>> I added another simple fix, but then collapsed three commits
>> into one, and added one more (somewhat unrelated).
>>
>> I've done simple testing with this and will subject it to
>> more rigorous testing shortly.  I wanted to make it available
>> to you quickly though.
>>
>>                                       -Alex
>>
>>>
>>> On Thu, Dec 20, 2012 at 1:59 PM, Alex Elder <elder@xxxxxxxxxxx> wrote:
>>>> On 12/20/2012 11:48 AM, Nick Bartos wrote:
>>>>> Unfortunately, we still have a hang:
>>>>>
>>>>> https://gist.github.com/4347052/download
>>>>
>>>> The saga continues, and each time we get a little more
>>>> information.  Please try branch: "wip-nick-newerest"
>>>>
>>>> Thank you.
>>>>
>>>>                                         -Alex
>>>>
>>>>
>>>>> On Wed, Dec 19, 2012 at 2:42 PM, Alex Elder <elder@xxxxxxxxxxx> wrote:
>>>>>> On 12/19/2012 03:25 PM, Alex Elder wrote:
>>>>>>> On 12/18/2012 12:05 PM, Nick Bartos wrote:
>>>>>>>> I've added the output of "ps -ef" in addition to triggering a trace
>>>>>>>> when a hang is detected.  Not much is generally running at that point,
>>>>>>>> but you can have a look:
>>>>>>>>
>>>>>>>> https://gist.github.com/raw/4330223/2f131ee312ee43cb3d8c307a9bf2f454a7edfe57/rbd-hang-1355851498.txt
>>>>>>>
>>>>>>> This helped a lot.  I updated the bug with a little more info.
>>>>>>>
>>>>>>>     http://tracker.newdream.net/issues/3519
>>>>>>>
>>>>>>> I also think I have now found something that could explain what you
>>>>>>> are seeing, and am developing a fix.  I'll provide you an update
>>>>>>> as soon as I have tested what I come up with, almost certainly
>>>>>>> this afternoon.
>>>>>>
>>>>>> Nick, I have a new branch for you to try with a new fix in place.
>>>>>> As you might have predicted, it's named "wip-nick-newest".
>>>>>>
>>>>>> Please give it a try to see if it resolves the hang you've
>>>>>> been seeing and let me know how it goes.  If it continues
>>>>>> to hang, please provide the logs as you have before; they've
>>>>>> been very helpful.
>>>>>>
>>>>>> Thanks a lot.
>>>>>>
>>>>>>                                         -Alex
>>>>>>>
>>>>>>>                                       -Alex
>>>>>>>
>>>>>>>> Is it possible that there is some sort of deadlock going on?  We are
>>>>>>>> doing the rbd maps (and subsequent filesystem mounts) on the same
>>>>>>>> systems which are running the ceph-osd and ceph-mon processes.  To get
>>>>>>>> around the 'sync' deadlock problem, we are using a patch from Sage
>>>>>>>> which ignores system-wide syncs on filesystems mounted with the
>>>>>>>> 'mand' option (and we mount the underlying osd filesystems with
>>>>>>>> 'mand').  However I am wondering if there is potential for other types
>>>>>>>> of deadlocks in this environment.
>>>>>>>>
>>>>>>>> Also, we recently saw an rbd hang in a much older version, running
>>>>>>>> kernel 3.5.3 with only the sync hack patch, alongside ceph 0.48.1.
>>>>>>>> It's possible that this issue has been around for some time, and the
>>>>>>>> recent patches just made it happen more often (and thus more
>>>>>>>> reproducibly) for us.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Dec 18, 2012 at 8:09 AM, Alex Elder <elder@xxxxxxxxxxx> wrote:
>>>>>>>>> On 12/17/2012 11:12 AM, Nick Bartos wrote:
>>>>>>>>>> Here's a log with the rbd debugging enabled:
>>>>>>>>>>
>>>>>>>>>> https://gist.github.com/raw/4319962/d9690fd92c169198efc5eecabf275ef1808929d2/rbd-hang-test-1355763470.log
>>>>>>>>>>
>>>>>>>>>> On Fri, Dec 14, 2012 at 10:03 AM, Alex Elder <elder@xxxxxxxxxxx> wrote:
>>>>>>>>>>> On 12/14/2012 10:53 AM, Nick Bartos wrote:
>>>>>>>>>>>> Yes I was only enabling debugging for libceph.  I'm adding debugging
>>>>>>>>>>>> for rbd as well.  I'll do a repro later today when a test cluster
>>>>>>>>>>>> opens up.
>>>>>>>>>>>
>>>>>>>>>>> Excellent, thank you.   -Alex
>>>>>>>>>
>>>>>>>>> I looked through these debugging messages.  Looking only at the
>>>>>>>>> rbd debugging, what I see seems to indicate that rbd is idle at
>>>>>>>>> the point the "hang" seems to start.  This suggests that the hang
>>>>>>>>> is not due to rbd itself, but rather whatever it is that might
>>>>>>>>> be responsible for using the rbd image once it has been mapped.
>>>>>>>>>
>>>>>>>>> Is that possible?  I don't know what process you have that is
>>>>>>>>> mapping the rbd image, and what is supposed to be the next thing
>>>>>>>>> it does.  (I realize this may not make a lot of sense, given
>>>>>>>>> a patch in rbd seems to have caused the hang to begin occurring.)
>>>>>>>>>
>>>>>>>>> Also note that the debugging information available (i.e., the
>>>>>>>>> lines in the code that can output debugging information) may
>>>>>>>>> well be incomplete.  So if you don't find anything it may be
>>>>>>>>> necessary to provide you with another update which might include
>>>>>>>>> more debugging.
>>>>>>>>>
>>>>>>>>> Anyway, could you provide a little more context about what
>>>>>>>>> is going on sort of *around* rbd when activity seems to stop?
>>>>>>>>>
>>>>>>>>> Thanks a lot.
>>>>>>>>>
>>>>>>>>>                                         -Alex
>>>>>>>
>>>>>>
>>>>
>>
>
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html

