Finally I've had some time to do the next test. Here is a wireshark
dump (~750 MByte):

http://213.252.12.93/2.6.34-rc5.cap.gz

dmesg output after the page allocation failure:
https://bugzilla.kernel.org/attachment.cgi?id=26371
stack trace before the page allocation failure:
https://bugzilla.kernel.org/attachment.cgi?id=26369
stack trace after the page allocation failure:
https://bugzilla.kernel.org/attachment.cgi?id=26370

I hope the wireshark dump is not too big to download. It was created with

  tshark -f "tcp port 2049" -i eth0 -w 2.6.34-rc5.cap
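In case it helps, the capture can be read back and narrowed down to the
decoded NFS operations with something like this (just a sketch: -R is
the read-filter option in the tshark 1.x builds; newer releases use -Y
instead):

  # decompress the dump, then replay it through the NFS dissector
  gunzip 2.6.34-rc5.cap.gz
  tshark -r 2.6.34-rc5.cap -R nfs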
Thanks!
Robert

On 05/06/10 23:30, Trond Myklebust wrote:
> Sorry. I've been caught up in work in the past few days.
>
> I can certainly help with the soft lockup if you are able to supply
> either a dump that includes all threads stuck in NFS, or a (binary)
> wireshark dump that shows the NFSv4 traffic between the client and
> server around the time of the hang.
>
> Cheers
>   Trond
>
> On Thu, 2010-05-06 at 23:19 +0200, Robert Wimmer wrote:
>> I don't know if anyone is still interested in this, but I think
>> Trond is no longer interested, because the last error was of course
>> a "page allocation failure" and not the "soft lockup" he was trying
>> to solve. But the patch was for 2.6.34, and the "soft lockup" comes
>> up only with some 2.6.30 and maybe some 2.6.31 kernel versions. The
>> first error I reported was a "page allocation failure", which all
>> kernels >= 2.6.32 produce with the configuration I use (NFSv4).
>>
>> Michael suggested solving the "soft lockup" first before further
>> investigating the "page allocation failure". We know that the "soft
>> lockup" only pops up with NFSv4 and not v3. I really want to use v4,
>> but since I'm not a kernel hacker, someone must guide me on what to
>> try next.
>>
>> I know that you all have a lot of other work to do, but if there are
>> no ideas left about what to try next, it's maybe best to close the
>> bug for now; I'll stay with kernel 2.6.30, or go back to NFSv3 if I
>> upgrade to a newer kernel. Maybe the error will be fixed "by
>> accident" in >= 2.6.35 ;-)
>>
>> Thanks!
>> Robert
>>
>> On 05/03/10 10:11, kernel@xxxxxxxxxxx wrote:
>>> Anything we can do to investigate this further?
>>>
>>> Thanks!
>>> Robert
>>>
>>> On Wed, 28 Apr 2010 00:56:01 +0200, Robert Wimmer <kernel@xxxxxxxxxxx> wrote:
>>>> I've applied the patch against the kernel which I got from
>>>> "git clone ....", which resulted in kernel 2.6.34-rc5.
>>>>
>>>> The stack trace after mounting NFS is here:
>>>> https://bugzilla.kernel.org/attachment.cgi?id=26166
>>>> /var/log/messages after the soft lockup:
>>>> https://bugzilla.kernel.org/attachment.cgi?id=26167
>>>>
>>>> I hope there is some useful information in there.
>>>>
>>>> Thanks!
>>>> Robert
>>>>
>>>> On 04/27/10 01:28, Trond Myklebust wrote:
>>>>> On Tue, 2010-04-27 at 00:18 +0200, Robert Wimmer wrote:
>>>>>>> Sure. In addition to what you did above, please do
>>>>>>>
>>>>>>>   mount -t debugfs none /sys/kernel/debug
>>>>>>>
>>>>>>> and then cat the contents of the pseudofile at
>>>>>>>
>>>>>>>   /sys/kernel/debug/tracing/stack_trace
>>>>>>>
>>>>>>> Please do this more or less immediately after you've finished
>>>>>>> mounting the NFSv4 client.
>>>>>>
>>>>>> I've uploaded the stack trace. It was generated directly after
>>>>>> mounting. Here are the stacks:
>>>>>>
>>>>>> After mounting:
>>>>>> https://bugzilla.kernel.org/attachment.cgi?id=26153
>>>>>> After the soft lockup:
>>>>>> https://bugzilla.kernel.org/attachment.cgi?id=26154
>>>>>> The dmesg output of the soft lockup:
>>>>>> https://bugzilla.kernel.org/attachment.cgi?id=26155
>>>>>>
>>>>>>> Does your server have the 'crossmnt' or 'nohide' flags set, or
>>>>>>> does it use the 'refer' export option anywhere? If so, then we
>>>>>>> might have to test further, since those may trigger the NFSv4
>>>>>>> submount feature.
>>>>>>
>>>>>> The server has the following settings:
>>>>>> rw,nohide,insecure,async,no_subtree_check,no_root_squash
>>>>>>
>>>>>> Thanks!
>>>>>> Robert
>>>>>
>>>>> That second trace is more than 5.5K deep, more than half of which
>>>>> is socket overhead :-(((.
>>>>>
>>>>> The process stack does not appear to have overflowed; however,
>>>>> that trace doesn't include any IRQ stack overhead.
>>>>>
>>>>> OK... So what happens if we get rid of half of that trace by
>>>>> forcing asynchronous tasks such as this to run entirely in rpciod
>>>>> instead of first trying to run in the process context?
>>>>>
>>>>> See the attachment...

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html