On Fri, Jul 13, 2018 at 8:04 PM Pavel Tatashin <pasha.tatashin@xxxxxxxxxx> wrote:
>
> > You can't just memset() the 'struct page' to zero after it's been set up.
>
> That should not be happening, unless there is a bug.

Well, it does seem to happen.

My memory stress-tester has been running for about half an hour now
with the revert I posted - it used to trigger the problem in maybe ~5
minutes before.

So I do think that revert fixes it for me. No guarantees, but since I
figured out how to trigger it, it's been fairly reliable.

> We want to zero those struct pages so we do not have uninitialized
> data accessed by various parts of the code that rounds down large
> pages and access the first page in section without verifying that the
> page is valid. The example of this is described in commit that
> introduced zero_resv_unavail()

I'm attaching the relevant (?) parts of dmesg, which has the node
ranges, maybe you can see what the problem with the code is.

(NOTE! This dmesg is with that "mem=6G" command line option, which causes that

    e820: remove [mem 0x180000000-0xfffffffffffffffe] usable

line - that's just because it's my stress-test boot. It happens with
or without it, but without the "mem=6G" it took days to trigger).

I'm more than willing to test patches (either for added information or
for testing fixes), although I think I'm getting off the computer for
today.

                Linus
Attachment:
dmesg.out