Finally I've had some time to do the next test. Here is a wireshark
dump (~750 MByte):

http://213.252.12.93/2.6.34-rc5.cap.gz

dmesg output after the page allocation failure:
https://bugzilla.kernel.org/attachment.cgi?id=26371
stack trace before the page allocation failure:
https://bugzilla.kernel.org/attachment.cgi?id=26369
stack trace after the page allocation failure:
https://bugzilla.kernel.org/attachment.cgi?id=26370

I hope the wireshark dump is not too big to download. It was created with

  tshark -f "tcp port 2049" -i eth0 -w 2.6.34-rc5.cap
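In case it helps, the capture can be read back and narrowed down to the
decoded NFS operations with something like this (just a sketch: -R is
the read-filter option in the tshark 1.x builds; newer releases use -Y
instead):

  # decompress the dump, then replay it through the NFS dissector
  gunzip 2.6.34-rc5.cap.gz
  tshark -r 2.6.34-rc5.cap -R nfs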
Thanks!
Robert

On 05/06/10 23:30, Trond Myklebust wrote:
> Sorry. I've been caught up in work in the past few days.
>
> I can certainly help with the soft lockup if you are able to supply
> either a dump that includes all threads stuck in NFS, or a (binary)
> wireshark dump that shows the NFSv4 traffic between the client and
> server around the time of the hang.
>
> Cheers
>   Trond
>
> On Thu, 2010-05-06 at 23:19 +0200, Robert Wimmer wrote:
>> I don't know if anyone is still interested in this, but I think
>> Trond is no longer interested, because the last error was of course
>> a "page allocation failure" and not the "soft lockup" he was trying
>> to solve. But the patch was for 2.6.34, and the "soft lockup" comes
>> up only with some 2.6.30 and maybe some 2.6.31 kernel versions. The
>> first error I reported was a "page allocation failure", which all
>> kernels >= 2.6.32 produce with the configuration I use (NFSv4).
>>
>> Michael suggested solving the "soft lockup" first before further
>> investigating the "page allocation failure". We know that the "soft
>> lockup" only pops up with NFSv4 and not v3. I really want to use v4,
>> but since I'm not a kernel hacker, someone must guide me on what to
>> try next.
>>
>> I know that you all have a lot of other work to do, but if there are
>> no ideas left about what to try next, it's maybe best to close the
>> bug for now; I'll stay with kernel 2.6.30, or go back to NFSv3 if I
>> upgrade to a newer kernel. Maybe the error will be fixed "by
>> accident" in >= 2.6.35 ;-)
>>
>> Thanks!
>> Robert
>>
>> On 05/03/10 10:11, kernel@xxxxxxxxxxx wrote:
>>> Anything we can do to investigate this further?
>>>
>>> Thanks!
>>> Robert
>>>
>>> On Wed, 28 Apr 2010 00:56:01 +0200, Robert Wimmer <kernel@xxxxxxxxxxx> wrote:
>>>> I've applied the patch against the kernel which I got from
>>>> "git clone ....", which resulted in kernel 2.6.34-rc5.
>>>>
>>>> The stack trace after mounting NFS is here:
>>>> https://bugzilla.kernel.org/attachment.cgi?id=26166
>>>> /var/log/messages after the soft lockup:
>>>> https://bugzilla.kernel.org/attachment.cgi?id=26167
>>>>
>>>> I hope there is some useful information in there.
>>>>
>>>> Thanks!
>>>> Robert
>>>>
>>>> On 04/27/10 01:28, Trond Myklebust wrote:
>>>>> On Tue, 2010-04-27 at 00:18 +0200, Robert Wimmer wrote:
>>>>>>> Sure. In addition to what you did above, please do
>>>>>>>
>>>>>>>   mount -t debugfs none /sys/kernel/debug
>>>>>>>
>>>>>>> and then cat the contents of the pseudofile at
>>>>>>>
>>>>>>>   /sys/kernel/debug/tracing/stack_trace
>>>>>>>
>>>>>>> Please do this more or less immediately after you've finished
>>>>>>> mounting the NFSv4 client.
>>>>>>
>>>>>> I've uploaded the stack trace. It was generated directly after
>>>>>> mounting. Here are the stacks:
>>>>>>
>>>>>> After mounting:
>>>>>> https://bugzilla.kernel.org/attachment.cgi?id=26153
>>>>>> After the soft lockup:
>>>>>> https://bugzilla.kernel.org/attachment.cgi?id=26154
>>>>>> The dmesg output of the soft lockup:
>>>>>> https://bugzilla.kernel.org/attachment.cgi?id=26155
>>>>>>
>>>>>>> Does your server have the 'crossmnt' or 'nohide' flags set, or
>>>>>>> does it use the 'refer' export option anywhere? If so, then we
>>>>>>> might have to test further, since those may trigger the NFSv4
>>>>>>> submount feature.
>>>>>>
>>>>>> The server has the following settings:
>>>>>> rw,nohide,insecure,async,no_subtree_check,no_root_squash
>>>>>>
>>>>>> Thanks!
>>>>>> Robert
>>>>>
>>>>> That second trace is more than 5.5K deep, more than half of which
>>>>> is socket overhead :-(((.
>>>>>
>>>>> The process stack does not appear to have overflowed; however,
>>>>> that trace doesn't include any IRQ stack overhead.
>>>>>
>>>>> OK... So what happens if we get rid of half of that trace by
>>>>> forcing asynchronous tasks such as this to run entirely in rpciod
>>>>> instead of first trying to run in the process context?
>>>>>
>>>>> See the attachment...

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html