On 3/4/15 4:45 PM, Dave Chinner wrote:
> On Wed, Mar 04, 2015 at 05:27:09PM -0500, J. Bruce Fields wrote:
>> On Thu, Mar 05, 2015 at 09:09:00AM +1100, Dave Chinner wrote:
>>> On Wed, Mar 04, 2015 at 10:54:21AM -0500, J. Bruce Fields wrote:
>>>> On Tue, Mar 03, 2015 at 09:08:26PM -0500, J. Bruce Fields wrote:
>>>>> On Wed, Mar 04, 2015 at 09:44:56AM +1100, Dave Chinner wrote:
>>>>>> On Tue, Mar 03, 2015 at 05:10:33PM -0500, J. Bruce Fields wrote:
>>>>>>> I'm getting mysterious crashes on a server exporting an xfs filesystem.
>>>>>>>
>>>>>>> Strangely, I've reproduced this on
>>>>>>>
>>>>>>> 93aaa830fc17 "Merge tag 'xfs-pnfs-for-linus-3.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs
>>>>>>>
>>>>>>> but haven't yet managed to reproduce on either of its parents
>>>>>>> (24a52e412ef2 or 781355c6e5ae). That might just be chance, I'll try
>>>>>>> again.
>>>>>>
>>>>>> I think you'll find that the bug is only triggered after that XFS
>>>>>> merge because it's what enabled block layout support in the server,
>>>>>> i.e. nfsd4_setup_layout_type() is now setting the export type to
>>>>>> LAYOUT_BLOCK_VOLUME because XFS has added the necessary functions to
>>>>>> its export ops.
>>>>>
>>>>> Doh--after all the discussion I didn't actually pay attention to what
>>>>> happened in the end. OK, I see, you're right, it's all more-or-less
>>>>> dead code till that merge.
>>>>>
>>>>> Christoph's code was passing all my tests before that, so maybe we
>>>>> broke something in the merge process.
>>>>>
>>>>> Alternatively, it could be because I've added more tests--I'll rerun my
>>>>> current tests on his original branch....
>>>>
>>>> The below is on Christoph's pnfsd-for-3.20-4 (at cd4b02e). Doesn't look
>>>> very informative. I'm running xfstests over NFSv4.1 with client and
>>>> server running the same kernel, the filesystem in question is xfs, but
>>>> isn't otherwise available to the client (so the client shouldn't be
>>>> doing pnfs).
>>>>
>>>> --b.
>>>>
>>>> BUG: unable to handle kernel paging request at 00000000757d4900
>>>> IP: [<ffffffff810b59af>] cpuacct_charge+0x5f/0xa0
>>>> PGD 0
>>>> Thread overran stack, or stack corrupted
>>>
>>> Hmmmm. That is not at all informative, especially as it's only
>>> dumped the interrupt stack and not the stack or the task that it
>>> has detected as overrun or corrupted.
>>>
>>> Can you turn on all the stack overrun debug options? Maybe even
>>> turn on the stack tracer to get an idea of whether we are recursing
>>> deeply somewhere we shouldn't be?
>>
>> Digging around under "Kernel hacking".... I already have
>> DEBUG_STACK_USAGE, DEBUG_STACKOVERFLOW, and STACK_TRACER, and I can try
>> turning on the latter. (Will I be able to get information out of it
>> before the panic?)
>
> just keep taking samples of the worst case stack usage as the test
> runs. If there's anything unusual before the failure then it will
> show up, otherwise I'm not sure how to track this down...

I think it should print "maximum stack depth" messages whenever a stack
reaches a new max excursion...

> Cheers,
>
> Dave.
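
For what it's worth, the "keep taking samples" idea above could be scripted
roughly as follows. This is only a sketch, not something anyone in the thread
ran: it assumes the ftrace stack tracer (STACK_TRACER) is built in and enabled,
and that tracefs is visible at /sys/kernel/debug/tracing (on some systems it is
mounted at /sys/kernel/tracing instead). The helper names sample() and watch()
are made up for illustration.

```python
#!/usr/bin/env python3
"""Poll the ftrace stack tracer and log each new worst-case stack depth."""
import time
from pathlib import Path

# Assumed tracefs location; adjust if tracefs is mounted elsewhere.
TRACING = Path("/sys/kernel/debug/tracing")


def sample(tracing_dir, prev_max):
    """Read stack_max_size; return (max_so_far, trace or None).

    The trace text is returned only when the recorded maximum has grown
    past prev_max, so callers log each new excursion exactly once.
    """
    tracing_dir = Path(tracing_dir)
    current = int((tracing_dir / "stack_max_size").read_text().strip())
    if current > prev_max:
        return current, (tracing_dir / "stack_trace").read_text()
    return prev_max, None


def watch(tracing_dir=TRACING, interval=1.0, rounds=None):
    """Sample every `interval` seconds; loop forever when rounds is None."""
    worst = 0
    done = 0
    while rounds is None or done < rounds:
        worst, trace = sample(tracing_dir, worst)
        if trace is not None:
            # Flush immediately so the record is not lost in a buffer
            # if the machine panics shortly afterwards.
            print(f"new max stack depth: {worst} bytes\n{trace}", flush=True)
        time.sleep(interval)
        done += 1
    return worst
```

To use it, enable the tracer first (writing 1 to
/proc/sys/kernel/stack_tracer_enabled), then call watch() as root while the
test runs, with output going somewhere that survives the crash, such as a
serial console or a remote log.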