On 12/26/2012 11:45 AM, Nick Bartos wrote:
> Here's a log with a hang on the updated branch:
>
> https://gist.github.com/raw/4381750/772476e1bae1e6366347a223f34aa6c440b92765/rdb-hang-1356543132.log

I'm starting to look this over. Thanks a lot for supplying it.
Sorry we still haven't nailed the problem.

-Alex

>
> On Thu, Dec 20, 2012 at 1:59 PM, Alex Elder <elder@xxxxxxxxxxx> wrote:
>> On 12/20/2012 11:48 AM, Nick Bartos wrote:
>>> Unfortunately, we still have a hang:
>>>
>>> https://gist.github.com/4347052/download
>>
>> The saga continues, and each time we get a little more
>> information. Please try branch: "wip-nick-newerest"
>>
>> Thank you.
>>
>> -Alex
>>
>>
>>> On Wed, Dec 19, 2012 at 2:42 PM, Alex Elder <elder@xxxxxxxxxxx> wrote:
>>>> On 12/19/2012 03:25 PM, Alex Elder wrote:
>>>>> On 12/18/2012 12:05 PM, Nick Bartos wrote:
>>>>>> I've added the output of "ps -ef" in addition to triggering a trace
>>>>>> when a hang is detected. Not much is generally running at that point,
>>>>>> but you can have a look:
>>>>>>
>>>>>> https://gist.github.com/raw/4330223/2f131ee312ee43cb3d8c307a9bf2f454a7edfe57/rbd-hang-1355851498.txt
>>>>>
>>>>> This helped a lot. I updated the bug with a little more info.
>>>>>
>>>>> http://tracker.newdream.net/issues/3519
>>>>>
>>>>> I also think I have now found something that could explain what you
>>>>> are seeing, and am developing a fix. I'll provide you an update
>>>>> as soon as I have tested what I come up with, almost certainly
>>>>> this afternoon.
>>>>
>>>> Nick, I have a new branch for you to try with a new fix in place.
>>>> As you might have predicted, it's named "wip-nick-newest".
>>>>
>>>> Please give it a try to see if it resolved the hang you've
>>>> been seeing and let me know how it goes. If it continues
>>>> to hang, please provide the logs as you have before, it's
>>>> been very helpful.
>>>>
>>>> Thanks a lot.
>>>>
>>>> -Alex
>>>>>
>>>>> -Alex
>>>>>
>>>>>> Is it possible that there is some sort of deadlock going on?
>>>>>> We are doing the rbd maps (and subsequent filesystem mounts) on the same
>>>>>> systems which are running the ceph-osd and ceph-mon processes. To get
>>>>>> around the 'sync' deadlock problem, we are using a patch from Sage
>>>>>> which ignores system-wide syncs on filesystems mounted with the
>>>>>> 'mand' option (and we mount the underlying osd filesystems with
>>>>>> 'mand'). However, I am wondering if there is potential for other types
>>>>>> of deadlocks in this environment.
>>>>>>
>>>>>> Also, we recently saw an rbd hang in a much older version, running
>>>>>> kernel 3.5.3 with only the sync hack patch, alongside ceph 0.48.1.
>>>>>> It's possible that this issue was around for some time, just the
>>>>>> recent patches made it happen more often (and thus more reproducible)
>>>>>> for us.
>>>>>>
>>>>>>
>>>>>> On Tue, Dec 18, 2012 at 8:09 AM, Alex Elder <elder@xxxxxxxxxxx> wrote:
>>>>>>> On 12/17/2012 11:12 AM, Nick Bartos wrote:
>>>>>>>> Here's a log with the rbd debugging enabled:
>>>>>>>>
>>>>>>>> https://gist.github.com/raw/4319962/d9690fd92c169198efc5eecabf275ef1808929d2/rbd-hang-test-1355763470.log
>>>>>>>>
>>>>>>>> On Fri, Dec 14, 2012 at 10:03 AM, Alex Elder <elder@xxxxxxxxxxx> wrote:
>>>>>>>>> On 12/14/2012 10:53 AM, Nick Bartos wrote:
>>>>>>>>>> Yes I was only enabling debugging for libceph. I'm adding debugging
>>>>>>>>>> for rbd as well. I'll do a repro later today when a test cluster
>>>>>>>>>> opens up.
>>>>>>>>>
>>>>>>>>> Excellent, thank you. -Alex
>>>>>>>
>>>>>>> I looked through these debugging messages. Looking only at the
>>>>>>> rbd debugging, what I see seems to indicate that rbd is idle at
>>>>>>> the point the "hang" seems to start. This suggests that the hang
>>>>>>> is not due to rbd itself, but rather whatever it is that might
>>>>>>> be responsible for using the rbd image once it has been mapped.
>>>>>>>
>>>>>>> Is that possible?
>>>>>>> I don't know what process you have that is
>>>>>>> mapping the rbd image, and what is supposed to be the next thing
>>>>>>> it does. (I realize this may not make a lot of sense, given
>>>>>>> a patch in rbd seems to have caused the hang to begin occurring.)
>>>>>>>
>>>>>>> Also note that the debugging information available (i.e., the
>>>>>>> lines in the code that can output debugging information) may
>>>>>>> well be incomplete. So if you don't find anything it may be
>>>>>>> necessary to provide you with another update which might include
>>>>>>> more debugging.
>>>>>>>
>>>>>>> Anyway, could you provide a little more context about what
>>>>>>> is going on sort of *around* rbd when activity seems to stop?
>>>>>>>
>>>>>>> Thanks a lot.
>>>>>>>
>>>>>>> -Alex
>>>>>
>>>>
>>

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html