On Wed, Jan 29, 2025 at 11:48:03AM +0100, Simona Vetter wrote:
> On Tue, Jan 28, 2025 at 09:24:33PM +0100, David Hildenbrand wrote:
> > On 28.01.25 21:14, Simona Vetter wrote:
> > > On Tue, Jan 28, 2025 at 11:09:24AM +1100, Alistair Popple wrote:
> > > > On Fri, Jan 24, 2025 at 06:54:02PM +0100, David Hildenbrand wrote:
> > > > > > > > On integrated the gpu is tied into the coherency
> > > > > > > > fabric, so there it's not needed.
> > > > > > > >
> > > > > > > > I think the more fundamental question with both this function here and
> > > > > > > > with forced migration to device memory is that there's no guarantee it
> > > > > > > > will work out.
> > > > > > > >
> > > > > > > Yes, in particular with device-exclusive, it doesn't really work with THP
> > > > > > > and is only limited to anonymous memory. I have patches to at least make it
> > > > > > > work reliably with THP.
> > > > > > >
> > > > > > I should have crawled through the implementation first before replying.
> > > > > > Since it only looks at folio_mapcount() make_device_exclusive() should at
> > > > > > least in theory work reliably on anon memory, and not be impacted by
> > > > > > elevated refcounts due to migration/ksm/thp/whatever.
> > > > > >
> > > > > Yes, there is -- in theory -- nothing blocking the conversion except the
> > > > > folio lock. That's different than page migration.
> > > > >
> > > > Indeed - this was the entire motivation for make_device_exclusive() - that we
> > > > needed a way to reliably exclude CPU access that couldn't be blocked in the same
> > > > way page migration can (otherwise we could have just migrated to a device page,
> > > > even if that may have added unwanted overhead).
> > > >
> > > The folio_trylock worries me a bit. I guess this is to avoid deadlocks
> > > when locking multiple folios, but I think at least on the first one we
> > > need an unconditional folio_lock to guarantee forward progress.
> > >
> > At least on the hmm path I was able to trigger the EBUSY a couple of times
> > due to concurrent swapout. But the hmm-tests selftest fails immediately
> > instead of retrying.
> >
> My worry with just retrying is that it's very hard to assess whether
> there's a livelock or whether the retry has a good chance of success. As
> an example the ->migrate_to_ram path has some trylocks, and the window
> where all other threads got halfway and then fail the trylock is big
> enough that once you pile up enough threads that spin through there,
> you're stuck forever. Which isn't great.
>
> So if we could convert at least the first folio_trylock into a plain lock
> then forward progress is obviously assured and there's no need to crawl
> through large chunks of mm/ code to hunt for corner cases where we could
> be too unlucky to ever win the race.
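
[A minimal sketch of the trylock-vs-sleeping-lock distinction discussed
above. folio_lock() and folio_trylock() are the real <linux/pagemap.h>
primitives; the helper, its "first" parameter and the -EBUSY convention
are assumptions made up for illustration, not the current mm/rmap.c
code.]

#include <linux/errno.h>
#include <linux/pagemap.h>	/* folio_lock(), folio_trylock() */

/*
 * Sleep on the lock for the first folio so forward progress is
 * guaranteed even under heavy contention; keep trylock-and-retry for
 * any further folios, where sleeping while already holding another
 * folio lock could deadlock.
 */
static int lock_folio_for_exclusive(struct folio *folio, bool first)
{
	if (first) {
		folio_lock(folio);	/* waits for the current holder */
		return 0;
	}
	if (!folio_trylock(folio))
		return -EBUSY;		/* caller has to retry from scratch */
	return 0;
}
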
>
> > > Since
> > > atomics can't cross 4k boundaries (or the hw is just really broken) this
> > > should be enough to avoid being stuck in a livelock. I'm also not seeing
> > > any other reason why a folio_lock shouldn't work here, but then my
> > > understanding of mm/ stuff is really just scratching the surface.
> > >
> > > I did crawl through all the other code and it looks like everything else
> > > is unconditional locks. So looks all good and I didn't spot anything else
> > > that seemed problematic.
> > >
> > > Somewhat aside, I do wonder whether we really want to require callers to
> > > hold the mmap lock, or whether with all the work towards lockless fastpath
> > > that shouldn't instead just be an implementation detail.
> >
> > We might be able to use the VMA lock in the future, but that will require
> > GUP support and a bunch more. Until then, the mm_lock in read mode is
> > required.
>
> Yup. I also don't think we should try to improve before benchmarks show an
> actual need. It's more about future proofing and making sure mmap_lock
> doesn't leak into driver data structures that I'm worried about. Because
> I've seen some hmm/gpu rfc patches that heavily relied on mmap_lock to
> keep everything correct on the driver side, which is not a clean design.
>
> > I was not able to convince myself that we'll really need the folio lock, but
> > that's also a separate discussion.
>
> This is way above my pay understanding of mm/ unfortunately.

I pondered this some more, and I think it's to make sure we get a stable
reading of folio_mapcount() and are not racing with new rmaps being
established. But I also got lost a few times in the maze ...
-Sima
>
> > > At least for the
> > > gpu hmm code I've seen I've tried to push hard towards a world where the
> > > gpu side does not rely on mmap_read_lock being held at all, to future
> > > proof this all. And currently we only have one caller of
> > > make_device_exclusive_range() so would be simple to do.
> >
> > We could likely move the mmap_lock into that function, but avoiding it is
> > more effort.
>
> I didn't mean more than just that, which would make sure drivers at least
> do not rely on mmap_lock being held. That then allows us to switch over to
> vma lock or anything else entirely within mm/ code.
>
> If we leave it as-is then more drivers accidentally or intentionally will
> rely on this, like I think is the case for ->migrate_to_ram for hmm
> already. And then it's more pain to untangle.
>
> > In any case, I'll send something out probably tomorrow to fix page
> > migration/swapout of pages with device-exclusive entries and a bunch of
> > other things (THP, interaction with hugetlb, ...).
>
> Thanks a lot!
>
> Cheers, Sima
>
> > -- 
> > Cheers,
> >
> > David / dhildenb
> >
>
> -- 
> Simona Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

-- 
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
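
[A rough sketch of the "move the mmap_lock into that function" idea from
the exchange above, so that the one driver caller no longer needs to
know about mmap_lock and mm/ is free to switch to the VMA lock later.
The wrapper name is invented; make_device_exclusive_range() is assumed
to keep the signature it has in <linux/rmap.h> at the time of this
thread.]

#include <linux/mm_types.h>	/* struct mm_struct, struct page */
#include <linux/mmap_lock.h>	/* mmap_read_lock(), mmap_read_unlock() */
#include <linux/rmap.h>		/* make_device_exclusive_range() */

/*
 * Hypothetical mm/-internal wrapper: the driver passes the mm and the
 * range, and whatever locking the address-space walk needs stays an
 * implementation detail that can later be swapped for the VMA lock
 * without touching drivers.
 */
static int device_exclusive_range_locked(struct mm_struct *mm,
					 unsigned long start,
					 unsigned long end,
					 struct page **pages, void *owner)
{
	int npages;

	mmap_read_lock(mm);
	npages = make_device_exclusive_range(mm, start, end, pages, owner);
	mmap_read_unlock(mm);

	return npages;
}

[With something along these lines exported instead, the driver-side call
site drops its own mmap_read_lock()/mmap_read_unlock() pair, which is
what the thread above is asking for.]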