Re: + mm-initialize-struct-pages-in-reserved-regions-outside-of-the-zone-ranges.patch added to -mm tree

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Sun, Dec 06, 2020 at 02:30:11PM -0500, Andrea Arcangeli wrote:
> On Sun, Dec 06, 2020 at 08:48:49AM +0200, Mike Rapoport wrote:
> > I don't see why we need all this complexity when a simple fixup was
> > enough.
> 
> Note the memblock bug crashed my systems:
> 
> 	VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn), page);
> 
> Please note how zone_spans_pfn expands:
> 
> static inline bool zone_spans_pfn(const struct zone *zone, unsigned long pfn)
> {
> 	return zone->zone_start_pfn <= pfn && pfn < zone_end_pfn(zone);
> }
> 
> For pfn_to_page(0), the zone_start_pfn is 1 for DMA zone. pfn is 0.

I don't know why you keep calling it "memblock bug" as pfn 0 was not
part of DMA zone at least for ten years.

If you backport this VM_BUG_ON_PAGE() to 2.6.36 that has no memblock on
x86 it will happily trigger there.

The problem, as I see it, in, hmm, questionable semantics for
!E820_TYPE_RAM ranges on x86.

I keep asking this simple question again and again and I still have no
clear "yes" or "no" answer:

	Is PFN 0 on x86 memory?

Let's step back for a second and instead of blaming memblock in lack of
enforcement for .memory and .reserved to overlap and talking about how
memblock core was not robust enough we'll get to the point where we have
clear defintions of what e820 ranges represent.

> > > +	/*
> > > +	 * memblock.reserved isn't enforced to overlap with
> > > +	 * memblock.memory so initialize the struct pages for
> > > +	 * memblock.reserved too in case it wasn't overlapping.
> > > +	 *
> > > +	 * If any struct page associated with a memblock.reserved
> > > +	 * range isn't overlapping with a zone range, it'll be left
> > > +	 * uninitialized, ideally with PagePoison, and it'll be a more
> > > +	 * easily detectable error.
> > > +	 */
> > > +	for_each_res_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
> > > +		start_pfn = clamp(start_pfn, range_start_pfn, range_end_pfn);
> > > +		end_pfn = clamp(end_pfn, range_start_pfn, range_end_pfn);
> > > +
> > > +		if (end_pfn > start_pfn)
> > > +			pgcnt += init_unavailable_range(start_pfn, end_pfn,
> > > +							zone, nid);
> > > +	}
> > 
> > This means we are going iterate over all memory allocated before
> > free_area_ini() from memblock extra time. One time here and another time
> > in reserve_bootmem_region().
> > And this can be substantial for CMA and alloc_large_system_hash().
> 
> The above loops over memory.reserved only.

Right, this loops over memory.reserved which happens to include CMA and
the memory allocated in alloc_large_system_hash(). This can be a lot.
And memmap_init() runs before threads are available so this cannot be
parallelized.
The deferred memmap initialization starts much later than this.

> > > @@ -7126,7 +7155,13 @@ unsigned long __init node_map_pfn_alignm
> > >   */
> > >  unsigned long __init find_min_pfn_with_active_regions(void)
> > >  {
> > > -	return PHYS_PFN(memblock_start_of_DRAM());
> > > +	/*
> > > +	 * reserved regions must be included so that their page
> > > +	 * structure can be part of a zone and obtain a valid zoneid
> > > +	 * before __SetPageReserved().
> > > +	 */
> > > +	return min(PHYS_PFN(memblock_start_of_DRAM()),
> > > +		   PHYS_PFN(memblock.reserved.regions[0].base));
> > 
> > So this implies that reserved memory starts before memory. Don't you
> > find this weird?
> 
> The above is a more complex way to write the below, which will make it
> more clear there's never overlap:
> 
> 		if (entry->type == E820_TYPE_SOFT_RESERVED)
> 			memblock_reserve(entry->addr, entry->size);
> 		else if (entry->type == E820_TYPE_RAM || entry->type == E820_TYPE_RESERVED_KERN)
> 			memblock_add(entry->addr, entry->size);

So what about E820_TYPE_ACPI or E820_TYPE_NVS?
How should they be treated?

It happend that we memblock_reserve() the first page, but what if the
BIOS marks pfn 1 as E820_TYPE_ACPI? It's neither in .memory nor in
.reserved. Does it belong to zone 0 and node 0?

Or, if, say, node 1 starts at 4G and the BIOS claimed several first pages
of node 1 as type 20?
What should we do then?

> Thanks,
> Andrea
> 

-- 
Sincerely yours,
Mike.



[Index of Archives]     [Linux Kernel]     [Kernel Development Newbies]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite Hiking]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux