Re: [RFC PATCH 05/28] drm/gpusvm: Add support for GPU Shared Virtual Memory

On Mon, Sep 02, 2024 at 05:03:20PM +0000, Matthew Brost wrote:
> On Mon, Sep 02, 2024 at 01:53:14PM +0200, Daniel Vetter wrote:
> > On Thu, Aug 29, 2024 at 05:27:13PM +0000, Matthew Brost wrote:
> > > On Thu, Aug 29, 2024 at 11:45:08AM +0200, Daniel Vetter wrote:
> > > > On Tue, Aug 27, 2024 at 07:48:38PM -0700, Matthew Brost wrote:
> > > > > This patch introduces support for GPU Shared Virtual Memory (SVM) in the
> > > > > Direct Rendering Manager (DRM) subsystem. SVM allows for seamless
> > > > > sharing of memory between the CPU and GPU, enhancing performance and
> > > > > flexibility in GPU computing tasks.
> > > > > 
> > > > > The patch adds the necessary infrastructure for SVM, including data
> > > > > structures and functions for managing SVM ranges and notifiers. It also
> > > > > provides mechanisms for allocating, deallocating, and migrating memory
> > > > > regions between system RAM and GPU VRAM.
> > > > > 
> > > > > This mid-layer is largely inspired by GPUVM.
> > > > > 
> > > > > Cc: Dave Airlie <airlied@xxxxxxxxxx>
> > > > > Cc: Thomas Hellström <thomas.hellstrom@xxxxxxxxxxxxxxx>
> > > > > Cc: Christian König <christian.koenig@xxxxxxx>
> > > > > Cc: <dri-devel@xxxxxxxxxxxxxxxxxxxxx>
> > > > > Signed-off-by: Matthew Brost <matthew.brost@xxxxxxxxx>
> > > > 
> > > > Still not sure I've got the right race that you paper over with
> > > > mmap_write_lock, but I spotted a few things, comments inline.
> > > > 
> > > 
> > > I've replied to this issue several times; let's table the
> > > mmap_write_lock issue in this reply - there are a lot of other things
> > > to get through. Current thinking is to try to add a range->migrate_lock
> > > like AMD, which I state here [1]. Let's continue discussing the mmap
> > > lock issue there if possible.
> > 
> > Yeah I wrote replies as I read the code, so there's a bit of a mess from
> > my side here. Apologies for that.
> > 
> 
> All good, has been quite helpful thus far.
> 
> > > [1] https://patchwork.freedesktop.org/patch/610957/?series=137870&rev=1#comment_1111169
> > 
> > Some more replies below that I think we haven't covered anywhere else yet.
> > 
> > > > > + * 2) Garbage Collector.
> > > > > + *
> > > > > + *	void __driver_garbage_collector(struct drm_gpusvm *gpusvm,
> > > > > + *					struct drm_gpusvm_range *range)
> > > > > + *	{
> > > > > + *		struct drm_gpusvm_ctx ctx = {};
> > > > > + *
> > > > > + *		assert_driver_svm_locked(gpusvm);
> > > > > + *
> > > > > + *		// Partial unmap, migrate any remaining VRAM pages back to SRAM
> > > > > + *		if (range->flags.partial_unmap)
> > > > > + *			drm_gpusvm_migrate_to_sram(gpusvm, range, &ctx);
> > > > 
> > > > Note that the migration back to sram isn't guaranteed to succeed, so you
> > > > might still be stuck with a partially migrated range. This might be a case
> > > > where hmm gives you vram pfns, but the range you have doesn't have any
> > > > vram allocation anymore because you dropped it here. Not sure tbh.
> > > >
> > > 
> > > Hmm isn't in the picture here, nor is a VMA, once the
> > > drm_gpusvm_evict_to_sram path is always taken, as discussed here [2]. I
> > > might have a corner-case BO refcounting / TTM resource lookup bug
> > > somewhere in here which needs to be resolved though (e.g. eviction
> > > racing with this code path); will try to close on that.
> > > 
> > > [2] https://patchwork.freedesktop.org/patch/610955/?series=137870&rev=1#comment_1111164
> > 
> > So maybe my understanding is wrong, but from my reading of the device
> > migration code the exact same non-guarantees as for the sram2sram
> > migration code apply:
> > 
> > - There's no guarantee the page/folio doesn't have an elevated refcount,
> >   which makes the migration fail (in try_to_migrate, where it checks for
> >   surplus refcounts).
> > 
> > - There's no guarantee you'll get the page/folio lock, which makes the
> >   migration fail. Worse, the core mm seems to fall back to per-page
> >   locking as its extremely crude "get out of deadlocks due to acquiring
> >   multiple page locks" card.
> >
> 
> I think this circles back to the basic point that the design must be able
> to move VRAM -> SRAM because the host can't access VRAM. Certainly in the
> CPU page fault path this can't fail on the faulting page at least, or if
> it does the app gets segfaulted. I'll investigate more here but that is
> still my current thinking. If VRAM -> SRAM can fail / make partial
> progress in eviction paths, then mixed mappings likely need to be
> supported, which shouldn't be all that painful - basically we just need a
> cursor in the bind code which can walk mixed mappings.
> 
> SRAM -> VRAM certainly can fail which is handled by just aborting the
> migration.
> 
> > > > > +map_pages:
> > > > > +	if (is_device_private_page(hmm_pfn_to_page(pfns[0]))) {
> > > > > +		WARN_ON_ONCE(!range->vram_allocation);
> > > > > +
> > > > > +		for (i = 0; i < npages; ++i) {
> > > > > +			pages[i] = hmm_pfn_to_page(pfns[i]);
> > > > > +
> > > > > +			if (WARN_ON_ONCE(!is_device_private_page(pages[i]))) {
> > > > > +				err = -EOPNOTSUPP;
> > > > > +				goto err_free;
> > > > > +			}
> > > > > +		}
> > > > 
> > > > You can't do the above, because the pfns you get from hmm come with zero
> > > > guarantees: you neither hold a page reference nor the page lock. The only
> > > > thing you can do is grab the pagetable lock (or mmu notifier locks) and
> > > > check it's still valid, before you can touch any state. I think the
> > > > range->vram_allocation is probably always valid since you clean that up
> > > > under the same lock/thread, but there's a good chance the vram allocation
> > > > is otherwise already gone for good. Or you get an inconsistent snapshot.
> > > > 
> > > 
> > > I haven't seen this pop in my testing yet, which is fairly thorough. My
> > > thinking was that with migration always being enforced at range
> > > granularity we'd never get mixed mappings from the core, as migration is
> > > completely under the driver's control. Maybe I'm not understanding what
> > > you are saying here...
> > 
> > So one scenario is that you race (without the mmap write lock or the
> > migration_mutex design ofc) with another invalidate, and get a partial
> > view here of mixed vram and sram pages. Until you acquire the mmu notifier
> > lock and have made sure your pages are still valid, there's essentially no
> > guarantee.
> 
> The pages are collected in a notifier-stable state via the hmm locking +
> seqno begin and recheck. Before they can be used (e.g. to program a bind),
> yes, the notifier lock needs to be taken to ensure they haven't changed
> between collection and use - at least this is my understanding.
> 
> > > 
> > > > > +
> > > > > +		/* Do not race with notifier unmapping pages */
> > > > > +		drm_gpusvm_notifier_lock(gpusvm);
> > > > > +		range->flags.has_vram_pages = true;
> > > > > +		range->pages = pages;
> > > > > +		if (mmu_interval_read_retry(notifier, hmm_range.notifier_seq)) {
> > > > > +			err = -EAGAIN;
> > > > > +			__drm_gpusvm_range_unmap_pages(gpusvm, range);
> > > > > +		}
> > > > > +		drm_gpusvm_notifier_unlock(gpusvm);
> > > > > +	} else {
> > > > > +		dma_addr_t *dma_addr = (dma_addr_t *)pfns;
> > > > > +
> > > > > +		for_each_dma_page(i, j, npages, order) {
> > > > > +			if (WARN_ON_ONCE(i && order !=
> > > > > +					 hmm_pfn_to_map_order(pfns[i]))) {
> > > > > +				err = -EOPNOTSUPP;
> > > > > +				npages = i;
> > > > > +				goto err_unmap;
> > > > > +			}
> > > > > +			order = hmm_pfn_to_map_order(pfns[i]);
> > > > > +
> > > > > +			pages[j] = hmm_pfn_to_page(pfns[i]);
> > > > > +			if (WARN_ON_ONCE(is_zone_device_page(pages[j]))) {
> > > > > +				err = -EOPNOTSUPP;
> > > > > +				npages = i;
> > > > > +				goto err_unmap;
> > > > > +			}
> > > > > +
> > > > > +			set_page_dirty_lock(pages[j]);
> > > > > +			mark_page_accessed(pages[j]);
> > > > 
> > > > You can't do these, because you don't hold a page reference. They're also
> > > > not needed because hmm_range_fault goes through the full mkwrite dance,
> > > > which takes care of these, unlike the gup family of functions.
> > > >
> > > 
> > > This is a leftover from our existing userptr code and it does appear to
> > > be incorrect. Let me remove this and fix up our userptr code while I'm
> > > at it.
> > 
> > Ack.
> > 
> > > > > +	vas = vma_lookup(mm, start);
> > > > > +	if (!vas) {
> > > > > +		err = -ENOENT;
> > > > > +		goto err_mmunlock;
> > > > > +	}
> > > > > +
> > > > > +	if (end > vas->vm_end || start < vas->vm_start) {
> > > > > +		err = -EINVAL;
> > > > > +		goto err_mmunlock;
> > > > > +	}
> > > > > +
> > > > > +	if (!vma_is_anonymous(vas)) {
> > > > > +		err = -EBUSY;
> > > > > +		goto err_mmunlock;
> > > > > +	}
> > > > > +
> > > > > +	buf = kvcalloc(npages, 2 * sizeof(*migrate.src) + sizeof(*dma_addr) +
> > > > > +		       sizeof(*pages), GFP_KERNEL);
> > > > > +	if (!buf) {
> > > > > +		err = -ENOMEM;
> > > > > +		goto err_mmunlock;
> > > > > +	}
> > > > > +	dma_addr = buf + (2 * sizeof(*migrate.src) * npages);
> > > > > +	pages = buf + (2 * sizeof(*migrate.src) + sizeof(*dma_addr)) * npages;
> > > > > +
> > > > > +	zdd = drm_gpusvm_zdd_alloc(range);
> > > > > +	if (!zdd) {
> > > > > +		err = -ENOMEM;
> > > > > +		goto err_free;
> > > > > +	}
> > > > > +
> > > > > +	migrate.vma = vas;
> > > > > +	migrate.src = buf;
> > > > > +	migrate.dst = migrate.src + npages;
> > > > > +
> > > > > +	err = migrate_vma_setup(&migrate);
> > > > > +	if (err)
> > > > > +		goto err_free;
> > > > > +
> > > > > +	/*
> > > > > +	 * FIXME: Below cases, !migrate.cpages and migrate.cpages != npages, not
> > > > > +	 * always an error. Need to revisit possible cases and how to handle. We
> > > > > +	 * could prefault on migrate.cpages != npages via hmm_range_fault.
> > > 
> > > This is a bit stale, can update this comment.
> > > 
> > > > > +	 */
> > > > 
> > > > Yeah I think especially under contention partial migrations, at least back
> > > > to sram due to cpu faults, are pretty much expected. And you need to cope
> > > > somehow.
> > > > 
> > > 
> > > I have seen these pop if the IGT calls mlock on the memory. My thinking
> > > is that migration to VRAM is basically optional and we fall back to
> > > leaving the range in SRAM if an error occurs rather than doing a partial
> > > migration. This is what currently happens, so it is coped with.
> > > 
> > > If the memory is marked as must-be-in-VRAM (NIY), well then the user
> > > program has done something wrong and we can kill the app (akin to a
> > > segfault).
> > 
> > Yeah SIGBUS for "must be in VRAM" sounds like ok semantics.
> > 
> > > > > +
> > > > > +	if (!migrate.cpages) {
> > > > > +		err = -EFAULT;
> > > > > +		goto err_free;
> > > > > +	}
> > > > > +
> > > > > +	if (migrate.cpages != npages) {
> > > > > +		err = -EBUSY;
> > > > > +		goto err_finalize;
> > > > > +	}
> > 
> > What I think is more fundamental is that this one here doesn't work. For
> > migrate_to_ram you cannot assume that you can always migrate the entire
> > block; I think to uphold the core mm forward-progress rules we need to
> > allow partial migrations there. And I think your current code allows
> > that.
> >
> 
> Yes. I had similar checks in migrate_to_ram at one point and that did
> not work when multiple CPU faults from different threads occurred in
> parallel. Each thread can grab a random set of VRAM pages to migrate I
> think.
>  
> > But that then means you also are stuck with partial migration state here.
> > That was the point I tried to make.
> >
> 
> The error path with migrate_vma_pages/finalize safely unwinds the
> migration in these cases leaving all pages in SRAM.
> 
> > > > > +/**
> > > > > + * __drm_gpusvm_migrate_to_sram - Migrate GPU SVM range to SRAM (internal)
> > > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > > + * @vas: Pointer to the VM area structure
> > > > > + * @page: Pointer to the page for fault handling (can be NULL)
> > > > > + * @start: Start address of the migration range
> > > > > + * @end: End address of the migration range
> > > > > + *
> > > > > + * This internal function performs the migration of the specified GPU SVM range
> > > > > + * to SRAM. It sets up the migration, populates + dma maps SRAM PFNs, and
> > > > > + * invokes the driver-specific operations for migration to SRAM.
> > > > > + *
> > > > > + * Returns:
> > > > > + * 0 on success, negative error code on failure.
> > > > > + */
> > > > > +static int __drm_gpusvm_migrate_to_sram(struct drm_gpusvm *gpusvm,
> > > > > +					struct vm_area_struct *vas,
> > > > > +					struct page *page,
> > > > > +					u64 start, u64 end)
> > > > > +{
> > > > > +	struct migrate_vma migrate = {
> > > > > +		.vma		= vas,
> > > > > +		.pgmap_owner	= gpusvm->device_private_page_owner,
> > > > > +		.flags		= MIGRATE_VMA_SELECT_DEVICE_PRIVATE,
> > > > > +		.fault_page	= page,
> > > > > +	};
> > > > > +	unsigned long npages;
> > > > > +	struct page **pages;
> > > > > +	dma_addr_t *dma_addr;
> > > > > +	void *buf;
> > > > > +	int i, err = 0;
> > > > > +
> > > > > +	mmap_assert_locked(gpusvm->mm);
> > > > 
> > > > That's the wrong mm, at least for the ->migrate_to_ram path. You might be
> > > > called on an anon mapping from a child process. That also means that the
> > > > vma you're looking at might have no relationship with anything you're
> > > > tracking in your gpusvm.
> > > >
> > > 
> > > Hmm, as discussed [3] I haven't added tests with child processes yet.
> > > Let me do that and update the design as needed. This likely isn't
> > > correct as you say.
> > > 
> > > [3] https://patchwork.freedesktop.org/patch/610957/?series=137870&rev=1#comment_1111169 
> > 
> > Ack. More tests should definitely help here to figure out what's up, and
> > what's just me being confused.
> > 
> 
> Starting to add tests - fork() appears to work after dropping these
> asserts. More thorough testing is needed though.
> 
> > > > > +/**
> > > > > + * drm_gpusvm_migrate_to_ram - Migrate GPU SVM range to RAM (page fault handler)
> > > > > + * @vmf: Pointer to the fault information structure
> > > > > + *
> > > > > + * This function is a page fault handler used to migrate a GPU SVM range to RAM.
> > > > > + * It retrieves the GPU SVM range information from the faulting page and invokes
> > > > > + * the internal migration function to migrate the range back to RAM.
> > > > > + *
> > > > > + * Returns:
> > > > > + * VM_FAULT_SIGBUS on failure, 0 on success.
> > > > > + */
> > > > > +static vm_fault_t drm_gpusvm_migrate_to_ram(struct vm_fault *vmf)
> > > > > +{
> > > > > +	struct drm_gpusvm_zdd *zdd = vmf->page->zone_device_data;
> > > > > +	int err;
> > > > > +
> > > > > +	err = __drm_gpusvm_migrate_to_sram(zdd->range->gpusvm,
> > > > 
> > > > So I think zdd->range doesn't work, because even within a single mm the
> > > > vma mapping a given piece of anon memory does not need to be unique, you
> > > > can duplicate them with mremap.
> > > > 
> > > 
> > > This is attached to a page, not a VMA. Both AMD and Nvidia drivers use a
> > > similar lookup mechanism.
> > 
> > Yeah the page->zone_device_data is fine. It's the zone_device_data->range
> > which I think isn't ok.
> > 
> 
> Yes, this gets a little confusing with fork() and mremap. The range's
> start / end can be nonsense in the remap case. Also, as you mention, a
> range->migrate_mutex doesn't seem correct either. I can make it work but
> it is maybe not worth even typing out why here (I can provide a little
> more detail in another reply). New thinking is that the zdd stores a size
> field and has the locking - I think it is akin to a VRAM folio then.
> 
> > > > So all you have here is the physical memory and the vma, which might or
> > > > might not be from the same process as gpusvm->mm.
> > > > 
> > > > Also the child process scenario means you using mmap_write on the fault
> > > > side doesn't stop all cpu faults migrating stuff back.
> > > > 
> > > > Somewhat aside, but I think that means amdkfd's svm_range->migration_mutex
> > > > is busted, because it's va based and so misses concurrently ongoing
> > > > different mappings moving physical storage around underneath.
> > > >
> > > 
> > > I think all of the above falls into the fork() + child process issues
> > > which you have raised. Until I test this out I can't speak to this with
> > > any level of confidence, so I won't. Thanks for raising this issue; let
> > > me write test cases as discussed and educate myself. Once I do that, we
> > > can engage in further discussions.
> > 
> > I think fork + children will still result in zdd->range being unique (albeit
> > confused about which mm). You need mremap of some of these mappings to
> 
> Agree for fork + child based on initial testing.
> 
> > change the addresses and really cause confusion, which I /think/ (but
> > didn't test) is doable with a single process even and duplicating anon
> 
> Yep, mremap changes the address so the range is confusing; really a size
> is sufficient, aligned within the VMA's start / end upon CPU fault. AMD
> does this but with a VMA search, which I think is a bit overkill.
> 

Sima gave me something to investigate over the past week or so and asked
me to write up my findings and share them with the list. I'm replying here
because this seems like as good a place as any.

A. Investigate possible livelock with do_swap_page taking a device page
reference and folio_migrate_mapping aborting in migrate_vma_* if
multiple references are held.

	Sima was correct in identifying this livelock. I was able to reproduce a
	stable livelock with a test where multiple CPU threads faulted the same
	device page in parallel, and an exclusive lock was taken in
	migrate_to_ram. Without an exclusive lock, forward progress is made, but
	on average, there were ~32k calls to migrate_to_ram before a thread
	succeeded. This issue appears to affect all implementations that use
	device pages.

	I have posted a patch with Sima’s suggested core MM fix on the list [1]
	and verified in the local Xe branch that this patch resolves the
	livelock and reduces multiple calls to migrate_to_ram on the same
	faulting page. It would be helpful to get AMD's input and testing on
	this patch.
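
	A minimal sketch of the reproduction idea (not the actual IGT): N CPU
	threads concurrently read the same 4K page of an anonymous buffer. The
	GPU step that first migrates that page to VRAM is driver-specific and
	only indicated by a comment; everything else is plain userspace.

	#include <pthread.h>
	#include <sys/mman.h>

	#define NUM_THREADS 8

	static void *fault_same_page(void *arg)
	{
		volatile char *page = arg;

		/* A CPU read of a device-private page triggers migrate_to_ram */
		(void)page[0];
		return NULL;
	}

	int main(void)
	{
		pthread_t threads[NUM_THREADS];
		void *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		int i;

		if (buf == MAP_FAILED)
			return 1;

		/* <GPU access to buf goes here so the page lands in VRAM> */

		for (i = 0; i < NUM_THREADS; i++)
			pthread_create(&threads[i], NULL, fault_same_page, buf);
		for (i = 0; i < NUM_THREADS; i++)
			pthread_join(threads[i], NULL);

		munmap(buf, 4096);
		return 0;
	}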

B. Test out fork

	I added a few test sections to my IGT.

	This basically worked due to the COW (Copy-On-Write) semantics of fork.
	Both the parent and child processes fault on their first CPU access,
	getting their own new copy of any memory allocated before the fork.

	I believe the only change needed was dropping a lockdep assert in
	migrate_to_ram, as the MM can change.
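
	For reference, the CoW behaviour relied on above, shown with plain
	anonymous memory and no GPU involved (a hypothetical stand-alone
	snippet, not part of the IGT): after fork() both processes see the
	pre-fork contents, and a write in the child faults in the child's own
	copy without disturbing the parent.

	#include <stdio.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <sys/wait.h>
	#include <unistd.h>

	int main(void)
	{
		char *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (buf == MAP_FAILED)
			return 1;
		strcpy(buf, "before fork");

		if (fork() == 0) {
			/* Child: first write faults, creating the child's copy */
			strcpy(buf, "child copy");
			_exit(0);
		}
		wait(NULL);

		/* Parent still sees the pre-fork contents */
		printf("parent sees: %s\n", buf);
		return 0;
	}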

C. MMAP shared with anonymous memory

	I found that this is actually not anonymous memory but rather
	shmem-backed [2] [3]. Only anonymous memory is eligible for migration,
	so the corner cases discussed around multiple CPU mappings for device
	pages do not exist.

D. MREMAP Behavior

	I added a few test sections to my IGT to explore possible MREMAP cases.

	In all cases (e.g., DONTUNMAP, DONTUNMAP with read-only...), the old
	memory generates an MMU_NOTIFY_UNMAP event. This aligns well with the
	design, as the old range is simply unmapped. Subsequent CPU or GPU
	access to the old memory has zero-fill semantics.

	The new memory can point to previously allocated device pages. With a
	simple update to the design, the next GPU fault can find these pages and
	map them.

	MREMAP did reveal some issues with zdd (device page zone_device_data)
	pointing to a range. It was pointed out that having something physical
	pointing to something virtual is nonsensical. I have fixed this locally
	and agree that all references from physical to virtual will be removed
	in the common layer.
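
	A minimal stand-alone sketch of the mremap case exercised above, using
	plain anonymous memory (in the IGT the buffer would be GPU-touched
	first so its pages sit in VRAM). MREMAP_DONTUNMAP moves the pages to a
	new VA while keeping the old VA mapped but empty, so the old range is
	unmapped from the driver's point of view and reads back as zeroes,
	while the new range keeps the contents.

	#define _GNU_SOURCE
	#include <assert.h>
	#include <string.h>
	#include <sys/mman.h>

	#ifndef MREMAP_DONTUNMAP
	#define MREMAP_DONTUNMAP 4	/* requires kernel >= 5.7 */
	#endif

	#define SZ (64 * 1024)

	int main(void)
	{
		char *old = mmap(NULL, SZ, PROT_READ | PROT_WRITE,
				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		char *new;

		if (old == MAP_FAILED)
			return 1;
		memset(old, 0xaa, SZ);

		/* Move the pages to a new VA, leaving the old VA mapped but empty */
		new = mremap(old, SZ, SZ, MREMAP_MAYMOVE | MREMAP_DONTUNMAP);
		if (new == MAP_FAILED)
			return 1;

		assert(new[0] == (char)0xaa);	/* contents follow the pages */
		assert(old[0] == 0);		/* old range is zero-fill on fault */
		return 0;
	}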

E. Locking issues

	Sima strongly suggested not inventing locks for migration to avoid
	races, but rather to accept the core MM races. I removed all locks
	except for the existing Xe locks and eliminated mmap write abuse. With a
	more robust retry loop in the GPU page fault handler, I was able to
	successfully avoid mixed mappings. Whether mixed mappings will be
	supported is a different topic, but in my opinion, this demonstrates
	that a design can work with minimal locking. This will be the design
	moving forward.
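
	A very rough sketch of that retry-based GPU fault handling flow, not
	the actual Xe code: drm_gpusvm_notifier_lock()/unlock() come from this
	series, drm_gpusvm_range_get_pages() is assumed here to follow the
	(gpusvm, range, &ctx) shape of the other calls in the series, and the
	xe_* helpers are hypothetical stand-ins for driver code.

	static int __xe_svm_handle_fault(struct drm_gpusvm *gpusvm,
					 struct drm_gpusvm_range *range)
	{
		struct drm_gpusvm_ctx ctx = {};
		int err;

		/* Migration to VRAM is optional; on failure the range stays in SRAM */
		if (xe_svm_range_wants_vram(range))
			xe_svm_migrate_range_to_vram(gpusvm, range, &ctx);

		/* Collect pages under the hmm seqno begin/recheck protocol */
		err = drm_gpusvm_range_get_pages(gpusvm, range, &ctx);
		if (err)
			return err;

		/*
		 * Bind under the notifier lock so an invalidate racing between
		 * page collection and bind turns into a retry rather than a
		 * stale GPU mapping.
		 */
		drm_gpusvm_notifier_lock(gpusvm);
		if (!xe_svm_range_pages_valid(range)) {
			drm_gpusvm_notifier_unlock(gpusvm);
			return -EAGAIN;
		}
		err = xe_vm_bind_range(gpusvm, range);	/* program GPU page tables */
		drm_gpusvm_notifier_unlock(gpusvm);

		return err;
	}

	static int xe_svm_handle_fault(struct drm_gpusvm *gpusvm,
				       struct drm_gpusvm_range *range)
	{
		int err;

		/* Races with CPU faults / invalidations show up as -EAGAIN; retry */
		do {
			err = __xe_svm_handle_fault(gpusvm, range);
		} while (err == -EAGAIN);

		return err;
	}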

Matt

[1] https://patchwork.freedesktop.org/series/138497/
[2] https://elixir.bootlin.com/linux/v6.10.7/source/mm/mmap.c#L2934
[3] https://elixir.bootlin.com/linux/v6.10.7/source/mm/shmem.c#L4941

> Matt
> 
> > memory mappings with mremap.
> > 
> > Cheers, Sima
> > -- 
> > Simona Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch


