On Mon, Mar 03, 2025 at 03:06:54PM +0100, David Hildenbrand wrote:
> On 03.03.25 14:55, Brendan Jackman wrote:
> > On Mon, Mar 03, 2025 at 02:11:23PM +0100, David Hildenbrand wrote:
> > > On 03.03.25 13:13, Brendan Jackman wrote:
> > > > Since the migratetype hygiene patches [0], the locking here is
> > > > a bit more formalised.
> > > >
> > > > For other stuff, it's pretty obvious that it would be protected by the
> > > > zone lock. But it didn't seem totally self-evident that it should
> > > > protect the pageblock type. So it seems particularly helpful to have it
> > > > written in the code.
> > >
> > > [...]
> > >
> > > > +
> > > >  u64 max_mem_size = U64_MAX;
> > > >  /* add this memory to iomem resource */
> > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > > index 579789600a3c7bfb7b0d847d51af702a9d4b139a..1ed21179676d05c66f77f9dbebf88e36bbe402e9 100644
> > > > --- a/mm/page_alloc.c
> > > > +++ b/mm/page_alloc.c
> > > > @@ -417,6 +417,10 @@ void set_pfnblock_flags_mask(struct page *page, unsigned long flags,
> > > >  void set_pageblock_migratetype(struct page *page, int migratetype)
> > > >  {
> > > > +	lockdep_assert_once(system_state == SYSTEM_BOOTING ||
> > > > +			    in_mem_hotplug() ||
> > > > +			    lockdep_is_held(&page_zone(page)->lock));
> > > > +
> > >
> > > I assume the call chain on the memory hotplug path is mostly
> > >
> > > move_pfn_range_to_zone()->memmap_init_range()->set_pageblock_migratetype()
> > >
> > > either when onlining a memory block, or from pagemap_range() while holding
> > > the hotplug lock.
> > >
> > > But there is also the memmap_init_zone_device()->memmap_init_compound()->__init_zone_device_page()->set_pageblock_migratetype()
> > > one, called from pagemap_range() *without* holding the hotplug lock, and your
> > > assertion would be missing that.
> > >
> > > I'm not too happy about that assertion in general.
> >
> > Hmm, thanks for pointing that out.
> >
> > I guess if we really wanted the assertion the approach would be to
> > replace in_mem_hotplug() with some more fine-grained logic about the
> > state of the pageblock? But that seems like it would require rework
> > that isn't really justified.
>
> I was wondering if we could just grab the zone lock while initializing, then
> assert that we either hold that or are in boot.

Would that be because you want to avoid creating in_mem_hotplug()? Or
is it more about just simplifying the synchronization in general?

FWIW I don't think the in_mem_hotplug() is really that bad in the
assertion; it feels natural to me that memory hotplug would be an
exception to the locking rules in the same way that startup would be.

> In move_pfn_range_to_zone() it should likely not cause too much harm, and we
> could just grab it around all zone modification stuff.
>
> memmap_init_zone_device() might take longer and be more problematic.
>
> But I am not sure why memmap_init_zone_device() would have to set the
> migratetype at all? Because migratetype is a buddy concept, and
> ZONE_DEVICE does not interact with the buddy to that degree.
>
> The comment in __init_zone_device_page states:
>
> "Mark the block movable so that blocks are reserved for movable at
> startup. This will force kernel allocations to reserve their blocks
> rather than leaking throughout the address space during boot when
> many long-lived kernel allocations are made."

Uh, yeah I was pretty mystified by that. It would certainly be nice if
we can just get rid of this modification path.

> But that just dates back to 966cf44f637e where we copy-pasted that code.
>
> So I wonder if we could just
>
> diff --git a/mm/mm_init.c b/mm/mm_init.c
> index 57933683ed0d1..b95f545846e6e 100644
> --- a/mm/mm_init.c
> +++ b/mm/mm_init.c
> @@ -1002,19 +1002,11 @@ static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
>  	page->zone_device_data = NULL;
>  	/*
> -	 * Mark the block movable so that blocks are reserved for
> -	 * movable at startup. This will force kernel allocations
> -	 * to reserve their blocks rather than leaking throughout
> -	 * the address space during boot when many long-lived
> -	 * kernel allocations are made.
> -	 *
> -	 * Please note that MEMINIT_HOTPLUG path doesn't clear memmap
> -	 * because this is done early in section_activate()
> +	 * Note that we leave pageblock migratetypes uninitialized, because
> +	 * they don't apply to ZONE_DEVICE.
>  	 */
> -	if (pageblock_aligned(pfn)) {
> -		set_pageblock_migratetype(page, MIGRATE_MOVABLE);
> +	if (pageblock_aligned(pfn))
>  		cond_resched();
> -	}
>  	/*
>  	 * ZONE_DEVICE pages other than MEMORY_TYPE_GENERIC are released

memory-model.rst says:

> Since the
> page reference count never drops below 1 the page is never tracked as
> free memory and the page's `struct list_head lru` space is repurposed
> for back referencing to the host device / driver that mapped the memory.

And this code seems to assume that the whole pageblock is part of the
ZONE_DEVICE dance, so it would certainly make sense to me...