On Thu, Apr 12, 2018 at 1:28 PM, <bugzilla-daemon@xxxxxxxxxxxxxxxxxxx> wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=198497
>
> --- Comment #25 from willy@xxxxxxxxxxxxx ---
> On Thu, Apr 12, 2018 at 10:12:09AM -0700, Andrew Morton wrote:
>> On Fri, 9 Feb 2018 06:47:26 -0800 Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
>>
>> >
>> > ping?
>> >
>>
>> There have been a bunch of updates to this issue in bugzilla
>> (https://bugzilla.kernel.org/show_bug.cgi?id=198497). Sigh, I don't
>> know what to do about this - maybe there's some way of getting bugzilla
>> to echo everything to linux-mm or something.
>>
>> Anyway, please take a look - we appear to have a bug here. Perhaps
>> this bug is sufficiently gnarly for you to prepare a debugging patch
>> which we can add to the mainline kernel so we get (much) more debugging
>> info when people hit it?
>
> I have a few thoughts ...
>
>  - The debugging patch I prepared appears to be doing its job well.
>    People get the message and their machine stays working.
>  - The commonality appears to be Xen running 32-bit kernels. Maybe we
>    can kick the problem over to them to solve?
>  - If we are seeing corruption purely in the lower bits, *we'll never
>    know*. The radix tree lookup will simply not find anything, and all
>    will be well. That said, the bad PTE values reported in that bug have
>    the NX bit and one other bit set; generally bit 32, 33 or 34. I have
>    an idea for adding a parity bit, but haven't had time to implement it.
>    Anyone have an intern who wants an interesting kernel project to work on?
>
> Given that this is happening on Xen, I wonder if Xen is using some of the
> bits in the page table for its own purposes.

The backtraces include do_swap_page(). While I have a swap partition
configured, I don't think it's being used. Are we somehow misidentifying
the page as a swap page? I'm not familiar with the code, but is there an
easy way to query global swap usage? That way we can see if the check for
a swap page is bogus.

My system works with the band-aid patch. When that patch sets page = NULL,
does that mean userspace is just going to get a zero-ed page? Userspace
still works AFAICT, which makes me think it is a mis-identified page to
start with.

Regards,
Jason
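
On the question of querying global swap usage: one userspace option is to
read the SwapTotal and SwapFree fields from /proc/meminfo (reported in kB)
and take the difference. The snippet below is only a rough sketch of that
idea. The helper names parse_meminfo and swap_used_kb are made up for
illustration and are not part of any kernel or library API. It shows only
whether any swap is in use system-wide; it cannot by itself tell whether a
particular PTE was wrongly treated as a swap entry.

```python
import os


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field name: integer value}.

    Most fields are in kB; the unit suffix, if present, is ignored.
    (Hypothetical helper, written for this example only.)
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Name: value" pairs
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def swap_used_kb(fields):
    """Return swap in use (kB), computed as SwapTotal - SwapFree."""
    return fields.get("SwapTotal", 0) - fields.get("SwapFree", 0)


if __name__ == "__main__":
    path = "/proc/meminfo"  # Linux-specific; absent on other systems
    if os.path.exists(path):
        with open(path) as f:
            info = parse_meminfo(f.read())
        print("swap used: %d kB of %d kB"
              % (swap_used_kb(info), info.get("SwapTotal", 0)))
```

If this reports 0 kB used while the backtraces still go through
do_swap_page(), that would be consistent with the misidentified-page
theory, though it would not prove it.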