>> I dislike this for three reasons >> >> a) It does not protect against any races, really, it does not improve things. >> b) We do have the exact same problem with pfn_to_online_page(). As long as we >> don't hold the memory hotplug lock, memory can get offlined and remove any time. Racy. > > True, we need to solve that problem too. That seems to want something > lighter weight than the hotplug lock that can be held over pfn lookups > + use rather than requiring a page lookup in paths where it's not > clear that a page reference would prevent unplug. > >> c) We mix in ZONE specific stuff into the core. It should be "just another zone" > > Not sure I grok this when the RFC is sprinkling zone-specific > is_zone_device_page() throughout the core? Most users should not care about the zone. pfn_active() would be enough in most situations, especially most PFN walkers - "this memmap is valid and e.g., contains a valid zone ...". > >> >> What I propose instead (already discussed in https://lkml.org/lkml/2019/10/10/87) > > Sorry I missed this earlier... > >> >> 1. Convert SECTION_IS_ONLINE to SECTION_IS_ACTIVE >> 2. Convert SECTION_IS_ACTIVE to a subsection bitmap >> 3. Introduce pfn_active() that checks against the subsection bitmap >> 4. Once the memmap was initialized / prepared, set the subsection active >> (similar to SECTION_IS_ONLINE in the buddy right now) >> 5. Before the memmap gets invalidated, set the subsection inactive >> (similar to SECTION_IS_ONLINE in the buddy right now) >> 5. pfn_to_online_page() = pfn_active() && zone != ZONE_DEVICE >> 6. pfn_to_device_page() = pfn_active() && zone == ZONE_DEVICE > > This does not seem to reduce any complexity because it still requires > a pfn to zone lookup at the end of the process. > > I.e. converting pfn_to_online_page() to use a new pfn_active() > subsection map plus looking up the zone from pfn_to_page() is more > steps than just doing a direct pfn to zone lookup. What am I missing? That a real "pfn to zone" lookup without going via the struct page will require to have more than just a single bitmap. IMHO, keeping the information at a single place (memmap) is the clean thing to do (not replicating it somewhere else). Going via the memmap might not be as fast as a direct lookup, but do we actually care? We are already looking at "random PFNs we are not sure if there is a valid memmap". >> >> Especially, driver-reserved device memory will not get set active in >> the subsection bitmap. >> >> Now to the race. Taking the memory hotplug lock at random places is ugly. I do >> wonder if we can use RCU: > > Ah, yes, exactly what I was thinking above. > >> >> The user of pfn_active()/pfn_to_online_page()/pfn_to_device_page(): >> >> /* the memmap is guaranteed to remain active under RCU */ >> rcu_read_lock(); >> if (pfn_active(random_pfn)) { >> page = pfn_to_page(random_pfn); >> ... use the page, stays valid >> } >> rcu_unread_lock(); >> >> Memory offlining/memremap code: >> >> set_subsections_inactive(pfn, nr_pages); /* clears the bit atomically */ >> synchronize_rcu(); >> /* all users saw the bitmap update, we can invalide the memmap */ >> remove_pfn_range_from_zone(zone, pfn, nr_pages); > > Looks good to me. > >> >>> >>>> >>>> I only gave it a quick test with DIMMs on x86-64, but didn't test the >>>> ZONE_DEVICE part at all (any tips for a nice QEMU setup?). Compile-tested >>>> on x86-64 and PPC. >>> >>> I'll give it a spin, but I don't think the kernel wants to grow more >>> is_zone_device_page() users. >> >> Let's recap. In this RFC, I introduce a total of 4 (!) users only. >> The other parts can rely on pfn_to_online_page() only. >> >> 1. "staging: kpc2000: Prepare transfer_complete_cb() for PG_reserved changes" >> - Basically never used with ZONE_DEVICE. >> - We hold a reference! >> - All it protects is a SetPageDirty(page); >> >> 2. "staging/gasket: Prepare gasket_release_page() for PG_reserved changes" >> - Same as 1. >> >> 3. "mm/usercopy.c: Prepare check_page_span() for PG_reserved changes" >> - We come via virt_to_head_page() / virt_to_head_page(), not sure about >> references (I assume this should be fine as we don't come via random >> PFNs) >> - We check that we don't mix Reserved (including device memory) and CMA >> pages when crossing compound pages. >> >> I think we can drop 1. and 2., resulting in a total of 2 new users in >> the same context. I think that is totally tolerable to finally clean >> this up. > > ...but more is_zone_device_page() doesn't "finally clean this up". > Like we discussed above it's the missing locking that's the real > cleanup, the pfn_to_online_page() internals are secondary. It's a different cleanup IMHO. We can't do everything in one shot. But maybe I can drop the is_zone_device_page() parts from this patch and completely rely on pfn_to_online_page(). Yes, that needs fixing to, but it's a different story. The important part of this patch: While pfn_to_online_page() will always exclude ZONE_DEVICE pages, checking PG_reserved on ZONE_DEVICE pages (what we do right now!) is racy as hell (especially when concurrently initializing the memmap). This does improve the situation. >> >> However, I think we also have to clarify if we need the change in 3 at all. >> It comes from >> >> commit f5509cc18daa7f82bcc553be70df2117c8eedc16 >> Author: Kees Cook <keescook@xxxxxxxxxxxx> >> Date: Tue Jun 7 11:05:33 2016 -0700 >> >> mm: Hardened usercopy >> >> This is the start of porting PAX_USERCOPY into the mainline kernel. This >> is the first set of features, controlled by CONFIG_HARDENED_USERCOPY. The >> work is based on code by PaX Team and Brad Spengler, and an earlier port >> from Casey Schaufler. Additional non-slab page tests are from Rik van Riel. >> [...] >> - otherwise, object must not span page allocations (excepting Reserved >> and CMA ranges) >> >> Not sure if we really have to care about ZONE_DEVICE at this point. > > That check needs to be careful to ignore ZONE_DEVICE pages. There's > nothing wrong with a copy spanning ZONE_DEVICE and typical pages. Please note that the current check would *forbid* this (AFAIKs for a single heap object). As discussed in the relevant patch, we might be able to just stop doing that and limit it to real PG_reserved pages (without ZONE_DEVICE). I'd be happy to not introduce new is_zone_device_page() users. -- Thanks, David / dhildenb