On Thu, Jan 30, 2025 at 04:43:08PM +0100, David Hildenbrand wrote:
> > > Assume you have a THP (or any mTHP today). You can easily trigger the
> > > scenario that folio_mapcount() != 0 with active device-exclusive entries,
> > > and you start doing rmap walks and stumble over these device-exclusive
> > > entries and *not* handle them properly. Note that more and more systems are
> > > configured to just give you THP unless you explicitly opted-out using
> > > MADV_NOHUGEPAGE early.
> > >
> > > Note that b756a3b5e7ea added that hunk that still walks these
> > > device-exclusive entries in rmap code, but didn't actually update the rmap
> > > walkers:
> > >
> > > @@ -102,7 +104,8 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw)
> > >
> > >  	/* Handle un-addressable ZONE_DEVICE memory */
> > >  	entry = pte_to_swp_entry(*pvmw->pte);
> > > -	if (!is_device_private_entry(entry))
> > > +	if (!is_device_private_entry(entry) &&
> > > +	    !is_device_exclusive_entry(entry))
> > >  		return false;
> > >
> > >  	pfn = swp_offset(entry);
> > >
> > > That was the right thing to do, because they resemble PROT_NONE entries and
> > > not migration entries or anything else that doesn't hold a folio reference.
> >
> > Yeah I got that part. What I meant is that doubling down on this needs a
> > full audit and cannot rely on "we already have device private entries
> > going through these paths for much longer", which was the impression I
> > got. I guess it worked, thanks for doing that below :-)
>
> I know I know, I shouldn't have touched it ... :)
>
> So yeah, I'll spend some extra work on sorting out the other cases.

Thanks :-)

> > And at least from my very rough understanding of mm, at least around all
> > this gpu stuff, tracking device exclusive mappings like real cpu mappings
> > makes sense, they do indeed act like PROT_NONE with some magic to restore
> > access on fault.
> >
> > I do wonder a bit though what else is all not properly tracked because
> > they should be like prot_none except aren't. I guess we'll find those as
> > we hit them :-/
>
> Likely a lot of stuff. But more in an "entry gets ignored -- functionality
> not implemented, move along" way, because all page table walkers have to
> care about !pte_present() already; it's just RMAP code that so far never
> required it.

I think it'd be good to include a terse summary of this in the commit
messages - I'd expect this is code other gpu folks will need to crawl
through in the future again, and I had no idea where I should even start
looking to figure this out.

> > [...]
> >
> > > > If thp constantly reassembles a pmd entry because hey all the
> > > > memory is contig and userspace allocated a chunk of memory to place
> > > > atomics that alternate between cpu and gpu nicely separated by 4k pages,
> > > > then we'll thrash around invalidating ptes to no end. So might be more
> > > > fallout here.
> > >
> > > khugepaged will back off once it sees an exclusive entry, so collapsing
> > > could only happen once everything is non-exclusive. See
> > > __collapse_huge_page_isolate() as an example.
> >
> > Ah ok. I think it might be good to add that to the commit message, so that
> > people who don't understand mm deeply (like me) aren't worried when they
> > stumble over this change in the future again when digging around.
>
> Will do, thanks for raising that concern!
>
> > > It's really only page_vma_mapped_walk() callers that are affected by this
> > > change, not any other page table walkers.
> >
> > I guess my mm understanding is just not up to that, but I couldn't figure
> > out why just looking at page_vma_mapped_walk() only is good enough?
>
> See above: these never had to handle !pte_present() before -- in contrast
> to the other page table walkers.
>
> So nothing bad happens when these page table walkers traverse these PTEs,
> it's just that the functionality will usually not be implemented.
>
> Take MADV_PAGEOUT as an example: madvise_cold_or_pageout_pte_range() will
> simply skip "!pte_present()", because it wouldn't know what to do in that
> case.
>
> Of course, there could be page table walkers that check all cases and bail
> out if they find something unexpected: do_swap_page() cannot make forward
> progress and will inject a VM_FAULT_SIGBUS if it doesn't recognize the
> entry. But these are rather rare.

Yeah this all makes sense to me now. Thanks a lot for your explanation,
I'll try to pay it back by trying to review the next version of the
series a bit.

> We could enlighten selected page table walkers to handle device-exclusive
> where it really makes sense later.

I think rmap for eviction/migration is really the big one that obviously
should be fixed. All the other cases I could think of I think just end up
in handle_mm_fault() to sort out the situation and then retry.

Cheers, Sima
--
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch