On Tue, Dec 08, 2020 at 06:13:47PM -0500, Andrea Arcangeli wrote:
> On Tue, Dec 08, 2020 at 11:46:14PM +0200, Mike Rapoport wrote:
> > Sorry, I was not clear. The penalty here is not the traversal of the
> > memblock.reserved array. The penalty is for redundant initialization
> > of struct page for the ranges memblock.reserved describes.
> 
> But that's the fix... How can you avoid removing the PagePoison for
> all pfn_valid in memblock.reserved, when the crash of
> https://lore.kernel.org/lkml/20201201181502.2340-1-rppt@xxxxxxxxxx is
> in __SetPageReserved of reserve_bootmem_region?

With your fix, the pages in memblock.reserved that intersect with
memblock.memory will be initialized twice: the first time in
for_each_mem_pfn_range() and the second time in
for_each_res_pfn_range() (see the sketch further below).

> > For instance, if the user specified hugetlb_cma=nnG, the loop over
> > memblock.reserved will cause a redundant initialization pass over
> > the pages in this nnG.
> 
> And if that didn't happen, then I'm left wondering if
> free_low_memory_core_early would crash again with specified
> hugetlb_cma=nnG, not just in pfn 0 anymore, even with the simple fix
> applied.
> 
> Where does PagePoison get removed from the CMA range, or is somehow
> __SetPageReserved not called with the simple fix applied on the CMA
> reserved range?

The CMA range would be in both memblock.memory and memblock.reserved,
so the pages for it will be initialized in for_each_mem_pfn_range().

> > Regardless of the particular implementation, the generic code cannot
> > detect the physical memory layout; it is up to the architecture to
> > detect what memory banks are populated and provide this information
> > to the generic code.
> > 
> > So we'll always have what you call "dependencies on the caller" here.
> > The good thing is that we can fix "the caller" when we find it's
> > wrong.
> 
> The caller that massages the e820 bios table is still software.
> Software can go from arch code to common code. There's no asm or arch
> dependency in there.
> 
> The caller in my view needs to know what e820 is and send down the raw
> info... it's a 1:1 API translation, the caller's massaging of the data
> can all go in the callee and benefit all callers.

Well, the translation between e820 and memblock could have been 1:1,
but it is not. I think e820, and more broadly x86 memory
initialization, could benefit from a more straightforward translation
to memblock. For instance, I don't see why pfn 0 should be marked as
reserved in both e820 and memblock, or why some of the e820 reserved
areas never make it to memblock.reserved.

> The benefit is that once the callee does all the validation, all archs
> that make a mistake will be told, boot will not fail, and it'll be
> more likely to boot stable even in the presence of inconsistent
> hardware tables.

That's true. Making the generic code more robust will benefit
everybody.

> > > I guess before worrying about pfn 1 not being added to
> > > memblock.reserved, I'd worry about pfn 0 itself not being added to
> > > memblock.reserved.
> > 
> > My point was that with a slightly different e820 table we may again
> > hit a corner case that we didn't anticipate.
> > 
> > pfn 0 is in memblock.reserved by coincidence and not by design, and
> > using memblock.reserved to detect zone/node boundaries is not
> > reliable.
> > 
> > Moreover, if you look for usage of for_each_mem_pfn_range() in
> > page_alloc.c, it looks very much like it is presumed to iterate over
> > *all* physical memory, including mysterious pfn 0.
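To illustrate the double initialization: a rough sketch of the two
passes with your fix applied (for_each_res_pfn_range() is the iterator
your patch adds, and I assume it has the same signature as
for_each_mem_pfn_range(); the loop bodies are simplified, not copied
from the actual patch):

	int i, nid;
	unsigned long pfn, start, end;
	unsigned long zone = 0;	/* simplified; the real code computes it */

	/* first pass: every pfn in memblock.memory gets initialized */
	for_each_mem_pfn_range(i, MAX_NUMNODES, &start, &end, &nid)
		for (pfn = start; pfn < end; pfn++)
			__init_single_page(pfn_to_page(pfn), pfn, zone, nid);

	/*
	 * second pass over memblock.reserved: every pfn that is also
	 * covered by memblock.memory (e.g. the hugetlb_cma area) is
	 * initialized a second time here
	 */
	for_each_res_pfn_range(i, MAX_NUMNODES, &start, &end, &nid)
		for (pfn = start; pfn < end; pfn++)
			__init_single_page(pfn_to_page(pfn), pfn, zone, nid);
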
> All the complex patch does is that it guarantees all the reserved
> pages can get a "right" zoneid.
> 
> Then we certainly should consider rounding it down by the pageblock in
> case pfn 0 isn't reserved.
> 
> In fact, if you add all memblock.reserved ranges to memblock.memory,
> you'll obtain exactly the same result in the zone_start_pfn
> calculation as with the complex patch, and the exact same issue of
> PagePoison being left on pfn 0 will happen if pfn 0 isn't added to
> memblock.memory with { memblock_add; memblock_reserve }.
> 
> > > Still with the complex patch applied, if something goes wrong,
> > > DEBUG_VM=y will make it reproducible; with the simple fix we
> > > return to non-reproducible land.
> > 
> > I still think that the complex patch is going in the wrong
> > direction. We cannot rely on memblock.reserved to calculate zones
> > and nodes span.
> > 
> > This:
> > 
> > +	/*
> > +	 * reserved regions must be included so that their page
> > +	 * structure can be part of a zone and obtain a valid zoneid
> > +	 * before __SetPageReserved().
> > +	 */
> > +	return min(PHYS_PFN(memblock_start_of_DRAM()),
> > +		   PHYS_PFN(memblock.reserved.regions[0].base));
> > 
> > is definitely a hack that worked because "the caller" has this:
> > 
> > 	/*
> > 	 * Make sure page 0 is always reserved because on systems with
> > 	 * L1TF its contents can be leaked to user processes.
> > 	 */
> > 	memblock_reserve(0, PAGE_SIZE);
> > 
> > So, what would have happened if L1TF was not yet disclosed? ;-)
> 
> It's actually a fine thing to delete the above memblock_reserve(0,
> PAGE_SIZE) for all testing so we can move on to the remaining issues.

There are a few more places in arch/x86/kernel/setup.c that
memblock_reserve() one or several pages from address 0 :-)

> The end result of the "hack" is precisely what you obtain if you add
> all of memblock.reserved to memblock.memory, except it doesn't require
> adding memblock.reserved to memblock.memory.

I still do not agree that there can be a minimum between
memblock_start_of_DRAM() and anything else. I think it's semantically
wrong.

And I really dislike the addition of memblock.reserved traversals,
mostly because of the areas that actually overlap.

However, I do agree that adding the non-overlapping parts of
memblock.reserved to memblock.memory will make the generic code more
robust. I just don't want to do this implicitly by calling
memblock_add() from memblock_reserve(). I think the cleaner way is to
join them just before free_area_init() starts zone/node size
detection, along the lines of the sketch below.

> Thanks,
> Andrea
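A minimal sketch of what I have in mind (the helper name is
hypothetical; memblock_add() already merges ranges, so only the parts
of memblock.reserved that are not yet covered by memblock.memory would
actually be added):

	/*
	 * Hypothetical helper, called just before free_area_init():
	 * make every reserved range known to memblock.memory so that
	 * zone/node extents cover it.
	 */
	static void __init memblock_join_reserved(void)
	{
		struct memblock_region *r;

		/*
		 * memblock_add() modifies memblock.memory only, so
		 * iterating memblock.reserved here is safe; the parts
		 * that already overlap memblock.memory are skipped by
		 * memblock_add_range().
		 */
		for (r = memblock.reserved.regions;
		     r < memblock.reserved.regions + memblock.reserved.cnt;
		     r++)
			memblock_add(r->base, r->size);
	}

-- 
Sincerely yours,
Mike.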