Re: [Question] Are "device exclusive non-swap entries" / "SVM atomics in Nouveau" still getting used in practice?

Alistair Popple <apopple@xxxxxxxxxx> · Tue, 28 Jan 2025 11:09:24 +1100

On Fri, Jan 24, 2025 at 06:54:02PM +0100, David Hildenbrand wrote:
> > > > On integrated the gpu is tied into the coherency
> > > > fabric, so there it's not needed.
> > > > 
> > > > I think the more fundamental question with both this function here and
> > > > with forced migration to device memory is that there's no guarantee it
> > > > will work out.
> > > 
> > > Yes, in particular with device-exclusive, it doesn't really work with THP
> > > and is only limited to anonymous memory. I have patches to at least make it
> > > work reliably with THP.
> > 
> > I should have crawled through the implementation first before replying.
> > Since it only looks at folio_mapcount() make_device_exclusive() should at
> > least in theory work reliably on anon memory, and not be impacted by
> > elevated refcounts due to migration/ksm/thp/whatever.
> 
> Yes, there is -- in theory -- nothing blocking the conversion except the
> folio lock. That's different than page migration.

Indeed - this was the entire motivation for make_device_exclusive() - that we
needed a way to reliably exclude CPU access that couldn't be blocked in the same
way page migration can (otherwise we could have just migrated to a device page,
even if that may have added unwanted overhead).
> [...]
> 
> > 
> > > Then, we seem to give up too easily if we cannot lock the folio when wanting
> > > to convert to device-exclusive, which also looks rather odd. But well, maybe
> > > it just works good enough in the common case, or there is some other retry
> > > logic that makes it fly.
> > 
> > I've crawled through the path to migrate pages from device memory back to
> > system memory a few months ago, and found some livelock issues in there.
> > Wouldn't be surprised if m_d_e has some of the same, but I didn't dig
> > through it (least because intel can't use it because not-so-great hw
> > design).
> 
> Yes, that might be possible. Maybe something keeps spinning while the folio
> is locked instead of properly waiting for the lock.
> 
> ... or someone is trying to convert a tail page of a THP, in which case we
> will also keep failing the conversion right now.
>
> > > There are other concerns I have (what if the page is pinned and access
> > > outside of the user space page tables?). Maybe there was not need to handle
> > > these cases so far.
> > 
> > I think that's also ok, but might be good to document this clearly that
> > concurrent direct i/o or rdma registered buffer or whatever will mess with
> > this. The promise is only between the gpu and the cpu, not anything else,
> > in current apis. At least to my knowledge.

That is correct - we never came up with a good solution for direct i/o or
rdma so the promise is only between cpu and gpu. That said direct i/o probably
implies a shared filebacked mapping. We weren't convinced there was a useful use
case there because anything could potentially map a file and use non-atomic CPU
instructions to access shared the shared mapping anyway.

> Well, the issue is that e.g., iouring fixed buffers can be accessed from the
> CPU using the direct map from the CPU, not necessarily using DMA from a
> device.

Right. This was all targetted at private anonymous memory and it's assumed that
both cpu and gpu are using atomic operations. As soon as you're talking shared
non-anonymous mappings it seems impossible for userspace to guarantee nothing
is accessing that memory with non-atomic ops anyway, even with a coherent
interconnect.

> In any case, I'm planning on adding some code-level documentation for that
> and look into extending the high-level doc while at it.

Thanks! Please CC me in.

> > 
> > > So best I can do is make anonymous memory more reliable with
> > > device-exclusive and fixup some of the problematic parts that I see (e.g.,
> > > broken page reclaim, page migration, ...).
> > > 
> > > But before starting to cleanup+improve the existing handling of anonymous
> > > memory, I was wondering if this whole thing is getting used at all.
> > 
> > Yeah if this can be made reliable (under the limitation of only anon
> > memory and only excluding userspace access) then I expect we'll need this
> > for a very long time. I just had no idea whether even that is possible.
> > 
> > What isn't good is if it's only mostly reliable, like the current
> > pgmap->ops->migrate_to_ram() path in do_swap_page() still is.

Right. This is _supposed_ to be 100% reliable under those limitations although
I will admit I didn't consider THP deeply as that is not supported for
DEVICE_PRIVATE anyway.

> I'll cc you on patches once I figure out some details on how to fix some
> page table walkers that really don't expect these non-swap entries.
> 
> Fortunately, the hmm test device is in place to trigger some shaky
> scenarios.
> 
> -- 
> Cheers,
> 
> David / dhildenb
>