Re: [Bug 198497] handle_mm_fault / xen_pmd_val / radix_tree_lookup_slot Null pointer

Jason Andryuk <jandryuk@xxxxxxxxx> · Fri, 20 Apr 2018 09:10:11 -0400

On Thu, Apr 12, 2018 at 1:28 PM,  <bugzilla-daemon@xxxxxxxxxxxxxxxxxxx> wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=198497
>
> --- Comment #25 from willy@xxxxxxxxxxxxx ---
> On Thu, Apr 12, 2018 at 10:12:09AM -0700, Andrew Morton wrote:
>> On Fri, 9 Feb 2018 06:47:26 -0800 Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
>>
>> >
>> > ping?
>> >
>>
>> There have been a bunch of updates to this issue in bugzilla
>> (https://bugzilla.kernel.org/show_bug.cgi?id=198497).  Sigh, I don't
>> know what to do about this - maybe there's some way of getting bugzilla
>> to echo everything to linux-mm or something.
>>
>> Anyway, please take a look - we appear to have a bug here.  Perhaps
>> this bug is sufficiently gnarly for you to prepare a debugging patch
>> which we can add to the mainline kernel so we get (much) more debugging
>> info when people hit it?
>
> I have a few thoughts ...
>
>  - The debugging patch I prepared appears to be doing its job well.
>    People get the message and their machine stays working.
>  - The commonality appears to be Xen running 32-bit kernels.  Maybe we
>    can kick the problem over to them to solve?
>  - If we are seeing corruption purely in the lower bits, *we'll never
>    know*.  The radix tree lookup will simply not find anything, and all
>    will be well.  That said, the bad PTE values reported in that bug have
>    the NX bit and one other bit set; generally bit 32, 33 or 34.  I have
>    an idea for adding a parity bit, but haven't had time to implement it.
>    Anyone have an intern who wants an interesting kernel project to work on?
>
> Given that this is happening on Xen, I wonder if Xen is using some of the
> bits in the page table for its own purposes.

The backtraces include do_swap_page().  While I have a swap partition
configured, I don't think it's being used.  Are we somehow
misidentifying the page as a swap page?  I'm not familiar with the
code, but is there an easy way to query global swap usage?  That way
we can see if the check for a swap page is bogus.

My system works with the band-aid patch.  When that patch sets page =
NULL, does that mean userspace is just going to get a zero-ed page?
Userspace still works AFAICT, which makes me think it is a
mis-identified page to start with.

Regards,
Jason