Re: + mm-initialize-struct-pages-in-reserved-regions-outside-of-the-zone-ranges.patch added to -mm tree

Mike Rapoport <rppt@xxxxxxxxxxxxx> · Tue, 8 Dec 2020 23:46:14 +0200

On Mon, Dec 07, 2020 at 09:57:37PM -0500, Andrea Arcangeli wrote:
> On Mon, Dec 07, 2020 at 06:50:37PM +0200, Mike Rapoport wrote:
> > On Sun, Dec 06, 2020 at 02:30:11PM -0500, Andrea Arcangeli wrote:
> > > On Sun, Dec 06, 2020 at 08:48:49AM +0200, Mike Rapoport wrote:
> > > > I don't see why we need all this complexity when a simple fixup was
> > > > enough.
> > > 
> > > Note the memblock bug crashed my systems:
> > > 
> > > 	VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn), page);
> > > 
> > > Please note how zone_spans_pfn expands:
> > > 
> > > static inline bool zone_spans_pfn(const struct zone *zone, unsigned long pfn)
> > > {
> > > 	return zone->zone_start_pfn <= pfn && pfn < zone_end_pfn(zone);
> > > }
> > > 
> > > For pfn_to_page(0), the zone_start_pfn is 1 for DMA zone. pfn is 0.
> > 
> > I don't know why you keep calling it "memblock bug" as pfn 0 was not
> > part of DMA zone at least for ten years.
> > 
> > If you backport this VM_BUG_ON_PAGE() to 2.6.36 that has no memblock on
> > x86 it will happily trigger there.
> 
> I guess it wasn't considered a bug until your patch that was meant to
> enforce this invariant for all pfn_valid was merged in -mm?

What was not considered a bug? That node 0 and DMA zone on x86 started
at pfn 1 and pfn 0 zone/node links were set by chance to the right
values?

There was no intention to engfoce any invariant of overlaping or not
ovelapping ranges in memblock.memory and memblock.reserved.

The change in 73a6e474cb376921a311786652782155eac2fdf0 was to change the
loop over all pfns in a zone that among other things checked if that pfn
is in memblock.memory to a loop over intersection of pfns in a
zone with memblock.memory.

This exposed that before 73a6e474cb376921a311786652782155eac2fdf0:
* small holes, like your "7a17b000-7a216fff : Unknown E820 type" get
their zone/node links right because of an optimizaiton in
__early_pfn_to_nid() that didn't check if a pfn is in memblock.memory
for every pfn but rather cached the results.
* on x86 pfn 0 was neither in node 0 nor in DMA zone because they both
started with pfn 1; it got zone/node links set to 0 because that's what
init_unavailable_range() did for all pfns in "unavailable ranges".

So, if you add something like this

static int check_pfn0(void)
{
	unsigned long pfn = 0;
	struct page *page = pfn_to_page(pfn);

	VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn), page);

	return 0;
}
late_initcall(check_pfn0);

to an older kernel, this will blow up.

So, my guess is that we've never got to this check for pfn 0 and I
really don't know how your system crashed with both my patches applied,
unless you had a patch that checked pfn 0.

> > The problem, as I see it, in, hmm, questionable semantics for
> > !E820_TYPE_RAM ranges on x86.
> > 
> > I keep asking this simple question again and again and I still have no
> > clear "yes" or "no" answer:
> > 
> > 	Is PFN 0 on x86 memory?
> 
> The pfn is the same regardless if it's RAM or non RAM, so I don't
> think it makes any difference in this respect if it's RAM or not.
> 
> > Let's step back for a second and instead of blaming memblock in lack of
> > enforcement for .memory and .reserved to overlap and talking about how
> > memblock core was not robust enough we'll get to the point where we have
> > clear defintions of what e820 ranges represent.
> 
> I don't see why we should go back into the direction that the caller
> needs improvement like in commit
> 124049decbb121ec32742c94fb5d9d6bed8f24d8 and to try to massage the raw
> e820 data before giving it to the linux common code.

That's the entire point that the e820 data is massaged before it gets to
the common code. 124049decbb121ec32742c94fb5d9d6bed8f24d8 tried to alter
the way the raw e820 was transformed before going to the common code and
my feeling it was a shot in the dark.

The current abstraction we have in the common code to describe physical
memory is memblock and I want to make it explicit that every range in
e820 table is a part of system memory and hence a part of
memblock.memory.

> Even if you wanted to enforce overlap at all times (not possible
> because of the direct mapping issue) you could do it simply by calling
> memblock_add first thing in memblock_reserve (if it wasn't already
> fully overlapping) without the need to add an explicit memblock_add in
> the caller.

The purpose here is not to enforce overlap at all times [and it is
possible despite the direct mapping issues ;-) ].
The idea is to have the abstract description of the physical memory
(i.e. memblock) as close as possible to actual layout. The "simple" code
in e820__memblock_setup() only made things obfuscated and inconsistent.

If you have memory populated at address 0 you would expect that
for_each_mem_pfn_range() will start from pfn 0.

In addition, and implicit call to memblock_add() in memblock_reserve()
would be a waste of CPU time as most of the times memblock_reserve() is
called as the part of memory allocation before page allocator is ready.

> > > > > +	/*
> > > > > +	 * memblock.reserved isn't enforced to overlap with
> > > > > +	 * memblock.memory so initialize the struct pages for
> > > > > +	 * memblock.reserved too in case it wasn't overlapping.
> > > > > +	 *
> > > > > +	 * If any struct page associated with a memblock.reserved
> > > > > +	 * range isn't overlapping with a zone range, it'll be left
> > > > > +	 * uninitialized, ideally with PagePoison, and it'll be a more
> > > > > +	 * easily detectable error.
> > > > > +	 */
> > > > > +	for_each_res_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
> > > > > +		start_pfn = clamp(start_pfn, range_start_pfn, range_end_pfn);
> > > > > +		end_pfn = clamp(end_pfn, range_start_pfn, range_end_pfn);
> > > > > +
> > > > > +		if (end_pfn > start_pfn)
> > > > > +			pgcnt += init_unavailable_range(start_pfn, end_pfn,
> > > > > +							zone, nid);
> > > > > +	}
> > > > 
> > > > This means we are going iterate over all memory allocated before
> > > > free_area_ini() from memblock extra time. One time here and another time
> > > > in reserve_bootmem_region().
> > > > And this can be substantial for CMA and alloc_large_system_hash().
> > > 
> > > The above loops over memory.reserved only.
> > 
> > Right, this loops over memory.reserved which happens to include CMA and
> > the memory allocated in alloc_large_system_hash(). This can be a lot.
> > And memmap_init() runs before threads are available so this cannot be
> > parallelized.
> > The deferred memmap initialization starts much later than this.
> 
> Can you convert the memblock.memory and memblock.reserved to an
> interval tree? interval tree shall work perfectly since the switch to
> long mode, no need of dynamic RAM allocations for its inner
> working. There's no need to keep memblock a O(N) list.

Sorry, I was not clear. The penalty here is not the traversal of
memblock.reserved array. The penalty is for redundant initialization of
struct page for ranges memblock.reserved describes.
For instance, the if user specified hugetlb_cma=nnG loop over
memblock.reserved will cause redundant initialization pass over the
pages in this nnG.

> Such change would benefit the code even if we remove the loop over
> memblock.reserved.
> 
> > > > > @@ -7126,7 +7155,13 @@ unsigned long __init node_map_pfn_alignm
> > > > >   */
> > > > >  unsigned long __init find_min_pfn_with_active_regions(void)
> > > > >  {
> > > > > -	return PHYS_PFN(memblock_start_of_DRAM());
> > > > > +	/*
> > > > > +	 * reserved regions must be included so that their page
> > > > > +	 * structure can be part of a zone and obtain a valid zoneid
> > > > > +	 * before __SetPageReserved().
> > > > > +	 */
> > > > > +	return min(PHYS_PFN(memblock_start_of_DRAM()),
> > > > > +		   PHYS_PFN(memblock.reserved.regions[0].base));
> > > > 
> > > > So this implies that reserved memory starts before memory. Don't you
> > > > find this weird?
> > > 
> > > The above is a more complex way to write the below, which will make it
> > > more clear there's never overlap:
> > > 
> > > 		if (entry->type == E820_TYPE_SOFT_RESERVED)
> > > 			memblock_reserve(entry->addr, entry->size);
> > > 		else if (entry->type == E820_TYPE_RAM || entry->type == E820_TYPE_RESERVED_KERN)
> > > 			memblock_add(entry->addr, entry->size);
> > 
> > So what about E820_TYPE_ACPI or E820_TYPE_NVS?
> > How should they be treated?
> > 
> > It happend that we memblock_reserve() the first page, but what if the
> > BIOS marks pfn 1 as E820_TYPE_ACPI? It's neither in .memory nor in
> > .reserved. Does it belong to zone 0 and node 0?
> > 
> > Or, if, say, node 1 starts at 4G and the BIOS claimed several first pages
> > of node 1 as type 20?
> > What should we do then?
> 
> I see we still have dependencies on the caller to send down good data,
> which would be ideal to remove.

Regardless of particular implementation, the generic code cannot detect
physical memory layout, this is up to architecture to detect what memory
banks are populated and provide this information to the generic code.

So we'll always have what you call "dependencies on the caller" here.
The good thing is that we can fix "the caller" when we find it's wrong.

> I guess before worrying of pfn 1 not being added to memblock.reserved,
> I'd worry about pfn 0 itself not being added to memblock.reserved.

My point was that with a bit different e820 table we may again hit a
corner case that we didn't anticipate.

pfn 0 is in memblock.reserved by a coincidence and not by design and using
memblock.reserved to detect zone/node boundaries is not reliable.

Moreover, if you look for usage of for_each_mem_pfn_range() in
page_alloc.c, it's looks very much that it is presumed to iterate over
*all* physical memory, including mysterious pfn 0.

> Still with the complex patch applied, if something goes wrong
> DEBUG_VM=y will make it reproducible, with the simple fix we return in
> non-reproducible land.

I still think that the complex patch is going in the wrong direction.
We cannot rely on memblock.reserved to calculate zones and nodes span.

This:

+       /*
+        * reserved regions must be included so that their page
+        * structure can be part of a zone and obtain a valid zoneid
+        * before __SetPageReserved().
+        */
+       return min(PHYS_PFN(memblock_start_of_DRAM()),
+                  PHYS_PFN(memblock.reserved.regions[0].base));

is definitely a hack that worked because "the caller" has this:

	/*
	 * Make sure page 0 is always reserved because on systems with
	 * L1TF its contents can be leaked to user processes.
	 */
	memblock_reserve(0, PAGE_SIZE);

So, what would have happen if L1TF was not yet disclosed? ;-)

Anyway, I think we need somthing like:

* add everything in e820 to memblock.memory while keeping any
!E820_TYPE_RAM out of the direct map
* refactor initialization of struct page for holes in memory layout to
have proper state for struct pages not backed by actual memory
* add validation of memory/nodes/zones to make sure the generic code can
deal with data that architecture supplied; this maybe only for
DEBUG_VM=y case.

> Thanks,
> Andrea
> 

-- 
Sincerely yours,
Mike.