On Sun, Dec 06, 2020 at 08:48:49AM +0200, Mike Rapoport wrote:
> I don't see why we need all this complexity when a simple fixup was
> enough.

Note the memblock bug crashed my systems:

	VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn), page);

Please note how zone_spans_pfn expands:

static inline bool zone_spans_pfn(const struct zone *zone, unsigned long pfn)
{
	return zone->zone_start_pfn <= pfn && pfn < zone_end_pfn(zone);
}

For pfn_to_page(0), zone_start_pfn is 1 for the DMA zone and pfn is 0.

How is the above not going to trigger a false positive once again with
the simple fix?

Not just pfn 0 in fact: any reserved pfn that happens to be at the
start of a nid.

Doesn't the simple fix reintroduce the same broken invariant that was
meant to be fixed by the very patch that requires the simple fix in
order to boot?

> > +	/*
> > +	 * memblock.reserved isn't enforced to overlap with
> > +	 * memblock.memory so initialize the struct pages for
> > +	 * memblock.reserved too in case it wasn't overlapping.
> > +	 *
> > +	 * If any struct page associated with a memblock.reserved
> > +	 * range isn't overlapping with a zone range, it'll be left
> > +	 * uninitialized, ideally with PagePoison, and it'll be a more
> > +	 * easily detectable error.
> > +	 */
> > +	for_each_res_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
> > +		start_pfn = clamp(start_pfn, range_start_pfn, range_end_pfn);
> > +		end_pfn = clamp(end_pfn, range_start_pfn, range_end_pfn);
> > +
> > +		if (end_pfn > start_pfn)
> > +			pgcnt += init_unavailable_range(start_pfn, end_pfn,
> > +							zone, nid);
> > +	}
>
> This means we are going to iterate over all memory allocated before
> free_area_init() from memblock an extra time. One time here and another
> time in reserve_bootmem_region().
> And this can be substantial for CMA and alloc_large_system_hash().

The above loops over memblock.reserved only. The simple fix also has
to loop over some memblock.reserved ranges or it still wouldn't boot.

You worry about memblock.reserved, which is bound by RAM. I worry
about the holes, which aren't bound by RAM.

The simple fix won't restrict itself to the RAM in memblock.reserved:
it also goes over the holes (that's all it can do in order to cover
the memblock.reserved ranges without looking at them).

The simple fix doesn't only go over pfn 0 either: it can keep looping
from the last page in the previous nid to the first page in the next
nid, across regions that weren't even in memblock.reserved and that
may be larger than any reserved region.

Reserved regions aren't holes, so the complex fix is more likely to
behave predictably: unlike the simple fix, it won't waste any CPU over
non-RAM holes that are part of neither memblock.reserved nor
memblock.memory.

In addition, the "static unsigned long hole_start_pfn" of the simple
fix pushes complexity into the common code caller: it makes
memmap_init non thread safe, and it requires memmap_init to be called
in paddr physical order or it won't even boot anymore.

Maybe that's safe today, but the memmap already gets initialized in
the background during boot, so the moment more than one thread does it
in parallel the simple patch will break, because it's non-reentrant.
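To make the false positive concrete outside the kernel, here's a
minimal userspace sketch; the struct zone below is a simplified
stand-in with made-up sizes (only zone_start_pfn and spanned_pages),
not the real kernel structure:

/* standalone model: compiles with any C compiler, no kernel headers */
#include <stdbool.h>
#include <stdio.h>

struct zone {
	unsigned long zone_start_pfn;
	unsigned long spanned_pages;
};

static unsigned long zone_end_pfn(const struct zone *zone)
{
	return zone->zone_start_pfn + zone->spanned_pages;
}

/* same expression as the kernel's zone_spans_pfn quoted above */
static bool zone_spans_pfn(const struct zone *zone, unsigned long pfn)
{
	return zone->zone_start_pfn <= pfn && pfn < zone_end_pfn(zone);
}

int main(void)
{
	/* DMA zone starting at pfn 1 because pfn 0 stayed reserved-only */
	struct zone dma = { .zone_start_pfn = 1, .spanned_pages = 4095 };

	/* prints 0: exactly the condition that trips VM_BUG_ON_PAGE */
	printf("zone_spans_pfn(&dma, 0) = %d\n", zone_spans_pfn(&dma, 0));
	/* prints 1: any pfn actually inside the zone passes the check */
	printf("zone_spans_pfn(&dma, 1) = %d\n", zone_spans_pfn(&dma, 1));
	return 0;
}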
> > @@ -7126,7 +7155,13 @@ unsigned long __init node_map_pfn_alignm
> >   */
> >  unsigned long __init find_min_pfn_with_active_regions(void)
> >  {
> > -	return PHYS_PFN(memblock_start_of_DRAM());
> > +	/*
> > +	 * reserved regions must be included so that their page
> > +	 * structure can be part of a zone and obtain a valid zoneid
> > +	 * before __SetPageReserved().
> > +	 */
> > +	return min(PHYS_PFN(memblock_start_of_DRAM()),
> > +		   PHYS_PFN(memblock.reserved.regions[0].base));
>
> So this implies that reserved memory starts before memory. Don't you
> find this weird?

That happens because the current API doesn't require overlap:

	for (i = 0; i < e820_table->nr_entries; i++) {
		struct e820_entry *entry = &e820_table->entries[i];

		end = entry->addr + entry->size;
		if (end != (resource_size_t)end)
			continue;

		if (entry->type == E820_TYPE_SOFT_RESERVED)
			memblock_reserve(entry->addr, entry->size);

		if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN)
			continue;

		memblock_add(entry->addr, entry->size);
	}

The above is a more complex way to write the below, which makes it
clearer that there's never overlap:

		if (entry->type == E820_TYPE_SOFT_RESERVED)
			memblock_reserve(entry->addr, entry->size);
		else if (entry->type == E820_TYPE_RAM ||
			 entry->type == E820_TYPE_RESERVED_KERN)
			memblock_add(entry->addr, entry->size);

If you want to enforce that there is always overlap, and in turn that
a zone cannot start with a reserved pfn, then you could add a bugcheck
that fires whenever memblock.reserved doesn't overlap with an already
registered memblock.memory range, but then the above e820 code will
break.

I thought the reason there can't be full overlap is the direct mapping
issue, so there's a very good reason for it.

So given there's currently zero overlap, I'm not exactly sure what's
weird about memblock.reserved starting before memblock.memory. The
fact that pfn 0 is reserved is a BIOS thing; weird or not, it's not in
our control.

I don't see any alternative for giving a consistent zoneid to pfn 0 if
zoneid 0 still starts at pfn 1 and you're not going to enforce full
overlap between memblock.reserved and memblock.memory either.

The only alternative I can see is to leave PagePoison in there and add
PagePoison or PageReserved checks all over the place, starting from
reserve_bootmem_region like this:

void __meminit reserve_bootmem_region(phys_addr_t start, phys_addr_t end)
{
	unsigned long start_pfn = PFN_DOWN(start);
	unsigned long end_pfn = PFN_UP(end);

	for (; start_pfn < end_pfn; start_pfn++) {
		if (pfn_valid(start_pfn)) {
			struct page *page = pfn_to_page(start_pfn);

+			if (PagePoison(page))
+				continue;
			init_reserved_page(start_pfn);

The above would still be preferable, since we could then add
PagePoison checks all over the VM code like this:

	VM_BUG_ON(!PagePoison(page) &&
		  page_to_pfn(page) < page_zone(page)->zone_start_pfn);

The above PagePoison mention assumes we'd need to extend PagePoison to
DEBUG_VM=n, or switch to PageReserved or a NO_NODEID.

The whole point is that we don't need to do any of the above, since
zones can include all reserved pages too. Once the kernel has finished
booting, it'll be simpler if every valid pfn has a valid zoneid, so
the invariant bugchecks will pass for all valid pfns and stop crashing
kernels with false positives.

Keeping all the complexity in the memblock common code and keeping the
arch caller raw is not a bad thing, since memblock is common code and
it'll benefit all archs. I wouldn't go in the direction of
124049decbb121ec32742c94fb5d9d6bed8f24d8 once again.

Thanks,
Andrea
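PS: for completeness, here's the same kind of standalone sketch for
the find_min_pfn_with_active_regions hunk quoted above; the addresses
and the simplified PHYS_PFN below are made-up illustrations, not the
real memblock data:

#include <stdio.h>

#define PAGE_SHIFT	12
#define PHYS_PFN(x)	((unsigned long)((x) >> PAGE_SHIFT))

static unsigned long min_pfn(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

int main(void)
{
	/* e820-style layout: pfn 0 reserved by the BIOS, RAM starts at 4K */
	unsigned long long reserved_base = 0x0;    /* memblock.reserved.regions[0].base */
	unsigned long long dram_start = 0x1000;    /* memblock_start_of_DRAM() */

	/* old behaviour: the first zone starts at pfn 1, pfn 0 is left out */
	unsigned long old_min = PHYS_PFN(dram_start);
	/* patched behaviour: the min() pulls pfn 0 into the first zone */
	unsigned long new_min = min_pfn(PHYS_PFN(dram_start),
					PHYS_PFN(reserved_base));

	printf("min pfn without reserved: %lu\n", old_min);
	printf("min pfn with reserved:    %lu\n", new_min);
	return 0;
}

With the first zone starting at pfn 0, zone_spans_pfn(zone, 0) from
the earlier sketch becomes true, so the bugcheck can't fire on the
reserved page anymore.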