On Tue, Dec 08, 2020 at 06:13:47PM -0500, Andrea Arcangeli wrote:
> On Tue, Dec 08, 2020 at 11:46:14PM +0200, Mike Rapoport wrote:
> > Sorry, I was not clear. The penalty here is not the traversal of the
> > memblock.reserved array. The penalty is for redundant initialization
> > of struct page for the ranges memblock.reserved describes.
> 
> But that's the fix... How can you avoid removing the PagePoison for
> all pfn_valid in memblock.reserved, when the crash of
> https://lore.kernel.org/lkml/20201201181502.2340-1-rppt@xxxxxxxxxx is
> in __SetPageReserved of reserve_bootmem_region?

With your fix, the pages in memblock.reserved that intersect with
memblock.memory will be initialized twice: the first time in
for_each_mem_pfn_range() and the second time in
for_each_res_pfn_range() (see the sketch further below).

> > For instance, if the user specified hugetlb_cma=nnG, the loop over
> > memblock.reserved will cause a redundant initialization pass over
> > the pages in this nnG.
> 
> And if that didn't happen, then I'm left wondering if
> free_low_memory_core_early would crash again with specified
> hugetlb_cma=nnG, not just in pfn 0 anymore, even with the simple fix
> applied.
> 
> Where does PagePoison get removed from the CMA range, or is somehow
> __SetPageReserved not called with the simple fix applied on the CMA
> reserved range?

The CMA range would be in both memblock.memory and memblock.reserved,
so the pages for it will be initialized in for_each_mem_pfn_range().

> > Regardless of the particular implementation, the generic code cannot
> > detect the physical memory layout; it is up to the architecture to
> > detect what memory banks are populated and provide this information
> > to the generic code.
> > 
> > So we'll always have what you call "dependencies on the caller" here.
> > The good thing is that we can fix "the caller" when we find it's
> > wrong.
> 
> The caller that massages the e820 bios table is still software.
> Software can go from arch code to common code. There's no asm or arch
> dependency in there.
> 
> The caller in my view needs to know what e820 is and send down the raw
> info... it's a 1:1 API translation, the caller's massaging of the data
> can all go in the callee and benefit all callers.

Well, the translation between e820 and memblock could have been 1:1,
but it is not. I think e820, and more broadly x86 memory
initialization, could benefit from a more straightforward translation
to memblock. For instance, I don't see why pfn 0 should be marked as
reserved in both e820 and memblock, or why some of the e820 reserved
areas never make it to memblock.reserved.

> The benefit is that once the callee does all the validation, all archs
> that make a mistake will be told, boot will not fail, and it'll be
> more likely to boot stable even in the presence of inconsistent
> hardware tables.

That's true. Making the generic code more robust will benefit
everybody.

> > > I guess before worrying about pfn 1 not being added to
> > > memblock.reserved, I'd worry about pfn 0 itself not being added to
> > > memblock.reserved.
> > 
> > My point was that with a slightly different e820 table we may again
> > hit a corner case that we didn't anticipate.
> > 
> > pfn 0 is in memblock.reserved by coincidence and not by design, and
> > using memblock.reserved to detect zone/node boundaries is not
> > reliable.
> > 
> > Moreover, if you look for usage of for_each_mem_pfn_range() in
> > page_alloc.c, it looks very much like it is presumed to iterate over
> > *all* physical memory, including mysterious pfn 0.
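To illustrate the double initialization: a rough sketch of the two
passes with your fix applied (for_each_res_pfn_range() is the iterator
your patch adds, and I assume it has the same signature as
for_each_mem_pfn_range(); the loop bodies are simplified, not copied
from the actual patch):

	int i, nid;
	unsigned long pfn, start, end;
	unsigned long zone = 0;	/* simplified; the real code computes it */

	/* first pass: every pfn in memblock.memory gets initialized */
	for_each_mem_pfn_range(i, MAX_NUMNODES, &start, &end, &nid)
		for (pfn = start; pfn < end; pfn++)
			__init_single_page(pfn_to_page(pfn), pfn, zone, nid);

	/*
	 * second pass over memblock.reserved: every pfn that is also
	 * covered by memblock.memory (e.g. the hugetlb_cma area) is
	 * initialized a second time here
	 */
	for_each_res_pfn_range(i, MAX_NUMNODES, &start, &end, &nid)
		for (pfn = start; pfn < end; pfn++)
			__init_single_page(pfn_to_page(pfn), pfn, zone, nid);
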
> All the complex patch does is that it guarantees all the reserved
> pages can get a "right" zoneid.
> 
> Then we certainly should consider rounding it down by the pageblock in
> case pfn 0 isn't reserved.
> 
> In fact, if you add all memblock.reserved ranges to memblock.memory,
> you'll obtain exactly the same result in the zone_start_pfn
> calculation as with the complex patch, and the exact same issue of
> PagePoison being left on pfn 0 will happen if pfn 0 isn't added to
> memblock.memory with { memblock_add; memblock_reserve }.
> 
> > > Still with the complex patch applied, if something goes wrong,
> > > DEBUG_VM=y will make it reproducible; with the simple fix we
> > > return to non-reproducible land.
> > 
> > I still think that the complex patch is going in the wrong
> > direction. We cannot rely on memblock.reserved to calculate zones
> > and nodes span.
> > 
> > This:
> > 
> > +	/*
> > +	 * reserved regions must be included so that their page
> > +	 * structure can be part of a zone and obtain a valid zoneid
> > +	 * before __SetPageReserved().
> > +	 */
> > +	return min(PHYS_PFN(memblock_start_of_DRAM()),
> > +		   PHYS_PFN(memblock.reserved.regions[0].base));
> > 
> > is definitely a hack that worked because "the caller" has this:
> > 
> > 	/*
> > 	 * Make sure page 0 is always reserved because on systems with
> > 	 * L1TF its contents can be leaked to user processes.
> > 	 */
> > 	memblock_reserve(0, PAGE_SIZE);
> > 
> > So, what would have happened if L1TF was not yet disclosed? ;-)
> 
> It's actually a fine thing to delete the above memblock_reserve(0,
> PAGE_SIZE) for all testing so we can move on to the remaining issues.

There are a few more places in arch/x86/kernel/setup.c that
memblock_reserve() one or several pages from address 0 :-)

> The end result of the "hack" is precisely what you obtain if you add
> all of memblock.reserved to memblock.memory, except it doesn't require
> adding memblock.reserved to memblock.memory.

I still do not agree that there can be a minimum between
memblock_start_of_DRAM() and anything else. I think it's semantically
wrong.

And I really dislike the addition of memblock.reserved traversals,
mostly because of the areas that actually overlap.

However, I do agree that adding the non-overlapping parts of
memblock.reserved to memblock.memory will make the generic code more
robust. I just don't want to do this implicitly by calling
memblock_add() from memblock_reserve(). I think the cleaner way is to
join them just before free_area_init() starts zone/node size
detection, along the lines of the sketch below.

> Thanks,
> Andrea
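A minimal sketch of what I have in mind (the helper name is
hypothetical; memblock_add() already merges ranges, so only the parts
of memblock.reserved that are not yet covered by memblock.memory would
actually be added):

	/*
	 * Hypothetical helper, called just before free_area_init():
	 * make every reserved range known to memblock.memory so that
	 * zone/node extents cover it.
	 */
	static void __init memblock_join_reserved(void)
	{
		struct memblock_region *r;

		/*
		 * memblock_add() modifies memblock.memory only, so
		 * iterating memblock.reserved here is safe; the parts
		 * that already overlap memblock.memory are skipped by
		 * memblock_add_range().
		 */
		for (r = memblock.reserved.regions;
		     r < memblock.reserved.regions + memblock.reserved.cnt;
		     r++)
			memblock_add(r->base, r->size);
	}

-- 
Sincerely yours,
Mike.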