On Fri, Jan 24, 2025 at 06:54:02PM +0100, David Hildenbrand wrote: > > > > On integrated the gpu is tied into the coherency > > > > fabric, so there it's not needed. > > > > > > > > I think the more fundamental question with both this function here and > > > > with forced migration to device memory is that there's no guarantee it > > > > will work out. > > > > > > Yes, in particular with device-exclusive, it doesn't really work with THP > > > and is only limited to anonymous memory. I have patches to at least make it > > > work reliably with THP. > > > > I should have crawled through the implementation first before replying. > > Since it only looks at folio_mapcount() make_device_exclusive() should at > > least in theory work reliably on anon memory, and not be impacted by > > elevated refcounts due to migration/ksm/thp/whatever. > > Yes, there is -- in theory -- nothing blocking the conversion except the > folio lock. That's different than page migration. Indeed - this was the entire motivation for make_device_exclusive() - that we needed a way to reliably exclude CPU access that couldn't be blocked in the same way page migration can (otherwise we could have just migrated to a device page, even if that may have added unwanted overhead). > [...] > > > > > > Then, we seem to give up too easily if we cannot lock the folio when wanting > > > to convert to device-exclusive, which also looks rather odd. But well, maybe > > > it just works good enough in the common case, or there is some other retry > > > logic that makes it fly. > > > > I've crawled through the path to migrate pages from device memory back to > > system memory a few months ago, and found some livelock issues in there. > > Wouldn't be surprised if m_d_e has some of the same, but I didn't dig > > through it (least because intel can't use it because not-so-great hw > > design). > > Yes, that might be possible. Maybe something keeps spinning while the folio > is locked instead of properly waiting for the lock. > > ... or someone is trying to convert a tail page of a THP, in which case we > will also keep failing the conversion right now. > > > > There are other concerns I have (what if the page is pinned and access > > > outside of the user space page tables?). Maybe there was not need to handle > > > these cases so far. > > > > I think that's also ok, but might be good to document this clearly that > > concurrent direct i/o or rdma registered buffer or whatever will mess with > > this. The promise is only between the gpu and the cpu, not anything else, > > in current apis. At least to my knowledge. That is correct - we never came up with a good solution for direct i/o or rdma so the promise is only between cpu and gpu. That said direct i/o probably implies a shared filebacked mapping. We weren't convinced there was a useful use case there because anything could potentially map a file and use non-atomic CPU instructions to access shared the shared mapping anyway. > Well, the issue is that e.g., iouring fixed buffers can be accessed from the > CPU using the direct map from the CPU, not necessarily using DMA from a > device. Right. This was all targetted at private anonymous memory and it's assumed that both cpu and gpu are using atomic operations. As soon as you're talking shared non-anonymous mappings it seems impossible for userspace to guarantee nothing is accessing that memory with non-atomic ops anyway, even with a coherent interconnect. > In any case, I'm planning on adding some code-level documentation for that > and look into extending the high-level doc while at it. Thanks! Please CC me in. > > > > > So best I can do is make anonymous memory more reliable with > > > device-exclusive and fixup some of the problematic parts that I see (e.g., > > > broken page reclaim, page migration, ...). > > > > > > But before starting to cleanup+improve the existing handling of anonymous > > > memory, I was wondering if this whole thing is getting used at all. > > > > Yeah if this can be made reliable (under the limitation of only anon > > memory and only excluding userspace access) then I expect we'll need this > > for a very long time. I just had no idea whether even that is possible. > > > > What isn't good is if it's only mostly reliable, like the current > > pgmap->ops->migrate_to_ram() path in do_swap_page() still is. Right. This is _supposed_ to be 100% reliable under those limitations although I will admit I didn't consider THP deeply as that is not supported for DEVICE_PRIVATE anyway. > I'll cc you on patches once I figure out some details on how to fix some > page table walkers that really don't expect these non-swap entries. > > Fortunately, the hmm test device is in place to trigger some shaky > scenarios. > > -- > Cheers, > > David / dhildenb >