On Wed, Jan 29, 2025 at 11:48:03AM +0100, Simona Vetter wrote:
> On Tue, Jan 28, 2025 at 09:24:33PM +0100, David Hildenbrand wrote:
> > On 28.01.25 21:14, Simona Vetter wrote:
> > > On Tue, Jan 28, 2025 at 11:09:24AM +1100, Alistair Popple wrote:
> > > > On Fri, Jan 24, 2025 at 06:54:02PM +0100, David Hildenbrand wrote:
> > > > > > > > On integrated the gpu is tied into the coherency
> > > > > > > > fabric, so there it's not needed.
> > > > > > > >
> > > > > > > > I think the more fundamental question with both this function here and
> > > > > > > > with forced migration to device memory is that there's no guarantee it
> > > > > > > > will work out.
> > > > > > > >
> > > > > > > Yes, in particular with device-exclusive, it doesn't really work with THP
> > > > > > > and is only limited to anonymous memory. I have patches to at least make it
> > > > > > > work reliably with THP.
> > > > > > >
> > > > > > I should have crawled through the implementation first before replying.
> > > > > > Since it only looks at folio_mapcount() make_device_exclusive() should at
> > > > > > least in theory work reliably on anon memory, and not be impacted by
> > > > > > elevated refcounts due to migration/ksm/thp/whatever.
> > > > > >
> > > > > Yes, there is -- in theory -- nothing blocking the conversion except the
> > > > > folio lock. That's different than page migration.
> > > > >
> > > > Indeed - this was the entire motivation for make_device_exclusive() - that we
> > > > needed a way to reliably exclude CPU access that couldn't be blocked in the same
> > > > way page migration can (otherwise we could have just migrated to a device page,
> > > > even if that may have added unwanted overhead).
> > > >
> > > The folio_trylock worries me a bit. I guess this is to avoid deadlocks
> > > when locking multiple folios, but I think at least on the first one we
> > > need an unconditional folio_lock to guarantee forward progress.
> > >
> > At least on the hmm path I was able to trigger the EBUSY a couple of times
> > due to concurrent swapout. But the hmm-tests selftest fails immediately
> > instead of retrying.
> >
> My worry with just retrying is that it's very hard to assess whether
> there's a livelock or whether the retry has a good chance of success. As
> an example the ->migrate_to_ram path has some trylocks, and the window
> where all other threads got halfway and then fail the trylock is big
> enough that once you pile up enough threads that spin through there,
> you're stuck forever. Which isn't great.
>
> So if we could convert at least the first folio_trylock into a plain lock
> then forward progress is obviously assured and there's no need to crawl
> through large chunks of mm/ code to hunt for corner cases where we could
> be too unlucky to ever win the race.
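
[A minimal sketch of the trylock-vs-sleeping-lock distinction discussed
above. folio_lock() and folio_trylock() are the real <linux/pagemap.h>
primitives; the helper, its "first" parameter and the -EBUSY convention
are assumptions made up for illustration, not the current mm/rmap.c
code.]

#include <linux/errno.h>
#include <linux/pagemap.h>	/* folio_lock(), folio_trylock() */

/*
 * Sleep on the lock for the first folio so forward progress is
 * guaranteed even under heavy contention; keep trylock-and-retry for
 * any further folios, where sleeping while already holding another
 * folio lock could deadlock.
 */
static int lock_folio_for_exclusive(struct folio *folio, bool first)
{
	if (first) {
		folio_lock(folio);	/* waits for the current holder */
		return 0;
	}
	if (!folio_trylock(folio))
		return -EBUSY;		/* caller has to retry from scratch */
	return 0;
}
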
>
> > > Since
> > > atomics can't cross 4k boundaries (or the hw is just really broken) this
> > > should be enough to avoid being stuck in a livelock. I'm also not seeing
> > > any other reason why a folio_lock shouldn't work here, but then my
> > > understanding of mm/ stuff is really just scratching the surface.
> > >
> > > I did crawl through all the other code and it looks like everything else
> > > is unconditional locks. So looks all good and I didn't spot anything else
> > > that seemed problematic.
> > >
> > > Somewhat aside, I do wonder whether we really want to require callers to
> > > hold the mmap lock, or whether with all the work towards lockless fastpath
> > > that shouldn't instead just be an implementation detail.
> >
> > We might be able to use the VMA lock in the future, but that will require
> > GUP support and a bunch more. Until then, the mm_lock in read mode is
> > required.
>
> Yup. I also don't think we should try to improve before benchmarks show an
> actual need. It's more about future proofing and making sure mmap_lock
> doesn't leak into driver data structures that I'm worried about. Because
> I've seen some hmm/gpu rfc patches that heavily relied on mmap_lock to
> keep everything correct on the driver side, which is not a clean design.
>
> > I was not able to convince myself that we'll really need the folio lock, but
> > that's also a separate discussion.
>
> This is way above my pay understanding of mm/ unfortunately.

I pondered this some more, and I think it's to make sure we get a stable
reading of folio_mapcount() and are not racing with new rmaps being
established. But I also got lost a few times in the maze ...
-Sima
>
> > > At least for the
> > > gpu hmm code I've seen I've tried to push hard towards a world where the
> > > gpu side does not rely on mmap_read_lock being held at all, to future
> > > proof this all. And currently we only have one caller of
> > > make_device_exclusive_range() so would be simple to do.
> >
> > We could likely move the mmap_lock into that function, but avoiding it is
> > more effort.
>
> I didn't mean more than just that, which would make sure drivers at least
> do not rely on mmap_lock being held. That then allows us to switch over to
> vma lock or anything else entirely within mm/ code.
>
> If we leave it as-is then more drivers accidentally or intentionally will
> rely on this, like I think is the case for ->migrate_to_ram for hmm
> already. And then it's more pain to untangle.
>
> > In any case, I'll send something out probably tomorrow to fix page
> > migration/swapout of pages with device-exclusive entries and a bunch of
> > other things (THP, interaction with hugetlb, ...).
>
> Thanks a lot!
>
> Cheers, Sima
>
> > -- 
> > Cheers,
> >
> > David / dhildenb
> >
>
> -- 
> Simona Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

-- 
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
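
[A rough sketch of the "move the mmap_lock into that function" idea from
the exchange above, so that the one driver caller no longer needs to
know about mmap_lock and mm/ is free to switch to the VMA lock later.
The wrapper name is invented; make_device_exclusive_range() is assumed
to keep the signature it has in <linux/rmap.h> at the time of this
thread.]

#include <linux/mm_types.h>	/* struct mm_struct, struct page */
#include <linux/mmap_lock.h>	/* mmap_read_lock(), mmap_read_unlock() */
#include <linux/rmap.h>		/* make_device_exclusive_range() */

/*
 * Hypothetical mm/-internal wrapper: the driver passes the mm and the
 * range, and whatever locking the address-space walk needs stays an
 * implementation detail that can later be swapped for the VMA lock
 * without touching drivers.
 */
static int device_exclusive_range_locked(struct mm_struct *mm,
					 unsigned long start,
					 unsigned long end,
					 struct page **pages, void *owner)
{
	int npages;

	mmap_read_lock(mm);
	npages = make_device_exclusive_range(mm, start, end, pages, owner);
	mmap_read_unlock(mm);

	return npages;
}

[With something along these lines exported instead, the driver-side call
site drops its own mmap_read_lock()/mmap_read_unlock() pair, which is
what the thread above is asking for.]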