On Wed, Nov 25, 2020 at 07:45:30AM +0100, David Hildenbrand wrote: > Before that change, the memmap of memory holes were only zeroed > out. So the zones/nid was 0, however, pages were not reserved and > had a refcount of zero - resulting in other issues. So maybe that "0,0" zoneid/nid was not actually the thing that introduced the regression? Note: I didn't bisect anything yet, it was just a guess. In fact, I need first thing to boot with an old v5.3 kernel and to dump the page->flags around the "Unknown E820 type" region to be 100% certain that the older kernels had a correct page->flags for reserved pages. I will clear this up before the end of the day. > Most pfn walkers shouldn‘t mess with reserved pages and simply skip > them. That would be the right fix here. What would then be the advantage of keeping wrong page->flags in reserved pages compared to actually set it correct? How do you plan to enforce that every VM bit will call PageReserved (as it should be introduced in pageblock_pfn_to_page in that case) if it's not possible to disambiguate a real DMA zone and nid 0 from the uninitialized value? How can we patch page_nodenum to enforce it won't read an uninitialized value because a PageReserved check was missing in the caller? I don't see this easily enforceable by code review, it'd be pretty flakey to leave a 0,0 wrong value in there with no way to verify if a PageResered check in the caller was missing. > > It'd be preferable if the pfn_valid fails and the > > pfn_to_section_nr(pfn) returns an invalid section for the intermediate > > step. Even better memset 0xff over the whole page struct until the > > second stage comes around. > > I recently discussed with Baoquan to > 1. Using a new pagetype to mark memory holes > 2. Using special nid/zid to mark memory holes > > Setting the memmap to 0xff would be even more dangerous - pfn_zone() might simply BUG_ON. What would need to call pfn_zone in between first and second stage? If something calls pfn_zone in between first and second stage isn't it a feature if it crashes the kernel at boot? Note: I suggested 0xff kernel crashing "until the second stage comes around" during meminit at boot, not permanently. /* * Use a fake node/zone (0) for now. Some of these pages * (in memblock.reserved but not in memblock.memory) will * get re-initialized via reserve_bootmem_region() later. */ Specifically I relied on the comment "get re-initialized via reserve_bootmem_region() later". I assumed the second stage overwrites the 0,0 to the real zoneid/nid value, which is clearly not happening, hence it'd be preferable to get a crash at boot reliably. Now I have CONFIG_DEFERRED_STRUCT_PAGE_INIT=n so the second stage calling init_reserved_page(start_pfn) won't do much with CONFIG_DEFERRED_STRUCT_PAGE_INIT=n but I already tried to enable CONFIG_DEFERRED_STRUCT_PAGE_INIT=y yesterday and it didn't help, the page->flags were still wrong for reserved pages in the "Unknown E820 type" region. > > Whenever pfn_valid is true, it's better that the zoneid/nid is correct > > all times, otherwise if the second stage fails we end up in a bug with > > weird side effects. > > Memory holes with a valid memmap might not have a zone/nid. For now, skipping reserved pages should be good enough, no? zoneid is always a pure abstract concept, not just the RAM-holes but the RAM itself doesn't have a zoneid at all. Nid is hardware defined for RAM, but for reserved RAM holes it becomes a pure software construct as the zoneid. So while it's right that they have no zone/nid in the hardware, they still have a very well defined zoneid/nid in the software implementation. The page struct belongs to one and only one mem_map array, that has a specific nodeid and nid. So it can be always initialized right even for non-RAM. Only if the pfn is !pfn_valid, then it has no page struct and then there's no zone/nid association, but in that case there's no reason to even worry about it since nobody can possibly get a wrong value out of the page struct because there's no page struct in the case. Last but not the least, RAM pages can be marked reserved and assigned to hardware and so it'd be really messy if real reserved RAM pages given to hw, behaved different than non-RAM that accidentally got a struct page because of section alignment issues. Thanks, Andrea