On 15.02.21 10:00, Michal Hocko wrote:
On Sun 14-02-21 20:00:16, Mike Rapoport wrote:
On Fri, Feb 12, 2021 at 02:18:20PM +0100, Michal Hocko wrote:
On Fri 12-02-21 11:42:15, David Hildenbrand wrote:
On 12.02.21 11:33, Michal Hocko wrote:
[...]
I have to digest this but my first impression is that this is more heavy
weight than it needs to. Pfn walkers should normally obey node range at
least. The first pfn is usually excluded but I haven't seen real
We've seen examples where this is not sufficient. Simple example:
Have your physical memory end within a memory section. Easy via QEMU, just
do a "-m 4000M". The remaining part of the last section has fake/wrong
node/zone info.
Does this really matter though. If those pages are reserved then nobody
will touch them regardless of their node/zone ids.
Hotplug memory. The node/zone gets resized such that PFN walkers might
stumble over it.
The basic idea is to make sure that any initialized/"online" pfn belongs to
exactly one node/zone and that the node/zone spans that PFN.
Yeah, this sounds like a good idea but what is the poper node for hole
between two ranges associated with a different nodes/zones? This will
always be a random number. We should have a clear way to tell "do not
touch those pages" and PageReserved sounds like a good way to tell that.
Nobody should touch reserved pages, but I don't think we can ensure that.
Touching a reserved page which doesn't belong to you is a bug. Sure we
cannot enforce that rule by runtime checks. But incorrect/misleading zone/node
association is the least of the problem when somebody already does that.
We can correctly set the zone links for the reserved pages for holes in the
middle of a zone based on the architecture constraints and with only the
holes in the beginning/end of the memory will be not spanned by any
node/zone which in practice does not seem to be a problem as the VM_BUG_ON
in set_pfnblock_flags_mask() never triggered on pfn 0.
I really fail to see what you mean by correct zone/node for a memory
range which is not associated with any real node.
I believe that any improvement in memory map consistency is a step forward.
I do agree but we are talking about a subtle bug (VM_BUG_ON) which would
be better of with a simplistic fix first. You can work on consistency
improvements on top of that.
problems with that. The VM_BUG_ON blowing up is really bad but as said
above we can simply make it less offensive in presence of reserved pages
as those shouldn't reach that path AFAICS normally.
Andrea tried tried working around if via PG_reserved pages and it resulted
in quite some ugly code. Andrea also noted that we cannot rely on any random
page walker to do the right think when it comes to messed up node/zone info.
I am sorry, I haven't followed previous discussions. Has the removal of
the VM_BUG_ON been considered as an immediate workaround?
It was never discussed, but I'm not sure it's a good idea.
Judging by the commit message that introduced the VM_BUG_ON (commit
86051ca5eaf5 ("mm: fix usemap initialization")) there was yet another
inconsistency in the memory map that required a special care.
Can we actually explore that path before adding yet additional
complexity and potentially a very involved fix for a subtle problem?
Mel who is author of this code might help us out. I have to say I do not
see the point for the VM_BUG_ON other than a better debuggability. If
there is a real incosistency problem as a result then we should be
handling that situation for non debugging kernels as well.
I have no time to summarize, you can find the complete discussion (also
involving Mel) at
https://lkml.kernel.org/r/20201121194506.13464-1-aarcange@xxxxxxxxxx
--
Thanks,
David / dhildenb