Re: [RFC PATCH 05/28] drm/gpusvm: Add support for GPU Shared Virtual Memory

Matthew Brost <matthew.brost@xxxxxxxxx> · Mon, 2 Sep 2024 17:03:20 +0000

On Mon, Sep 02, 2024 at 01:53:14PM +0200, Daniel Vetter wrote:
> On Thu, Aug 29, 2024 at 05:27:13PM +0000, Matthew Brost wrote:
> > On Thu, Aug 29, 2024 at 11:45:08AM +0200, Daniel Vetter wrote:
> > > On Tue, Aug 27, 2024 at 07:48:38PM -0700, Matthew Brost wrote:
> > > > This patch introduces support for GPU Shared Virtual Memory (SVM) in the
> > > > Direct Rendering Manager (DRM) subsystem. SVM allows for seamless
> > > > sharing of memory between the CPU and GPU, enhancing performance and
> > > > flexibility in GPU computing tasks.
> > > > 
> > > > The patch adds the necessary infrastructure for SVM, including data
> > > > structures and functions for managing SVM ranges and notifiers. It also
> > > > provides mechanisms for allocating, deallocating, and migrating memory
> > > > regions between system RAM and GPU VRAM.
> > > > 
> > > > This mid-layer is largely inspired by GPUVM.
> > > > 
> > > > Cc: Dave Airlie <airlied@xxxxxxxxxx>
> > > > Cc: Thomas Hellström <thomas.hellstrom@xxxxxxxxxxxxxxx>
> > > > Cc: Christian König <christian.koenig@xxxxxxx>
> > > > Cc: <dri-devel@xxxxxxxxxxxxxxxxxxxxx>
> > > > Signed-off-by: Matthew Brost <matthew.brost@xxxxxxxxx>
> > > 
> > > Still not sure I've got the right race that you paper over with
> > > mmap_write_lock, but I spotted a few things, commments inline.
> > > 
> > 
> > I've replied to this issue several times, let's table the
> > mmap_write_lock issue in this reply - a lot of other things to get
> > through. Current thinking is try to add a range->migrate_lock like AMD
> > which I state here [1]. Let's continue discussing the mmap lock issue
> > there if possible.
> 
> Yeah I wrote replies as I read code, so there's a bit a mess from my side
> here. Apologies for that.
> 

All good, has been quite helpful thus far.

> > [1] https://patchwork.freedesktop.org/patch/610957/?series=137870&rev=1#comment_1111169
> 
> Some more replies below that I think we haven't covered anywhere else yet.
> 
> > > > + * 2) Garbage Collector.
> > > > + *
> > > > + *	void __driver_garbage_collector(struct drm_gpusvm *gpusvm,
> > > > + *					struct drm_gpusvm_range *range)
> > > > + *	{
> > > > + *		struct drm_gpusvm_ctx ctx = {};
> > > > + *
> > > > + *		assert_driver_svm_locked(gpusvm);
> > > > + *
> > > > + *		// Partial unmap, migrate any remaining VRAM pages back to SRAM
> > > > + *		if (range->flags.partial_unmap)
> > > > + *			drm_gpusvm_migrate_to_sram(gpusvm, range, &ctx);
> > > 
> > > Note that the migration back to sram isn't guaranteed to succeed, so you
> > > might be still stuck with partially migrated range. This might be a case
> > > where hmm gives you vram pfns, but the range you have doesn't have any
> > > vram allocation anymore because you droppped it here. Not sure tbh.
> > >
> > 
> > Hmm isn't the picture here nor will a VMA once the
> > drm_gpusvm_evict_to_sram path is always taken as discussed here [2]. I
> > might have a corner case BO refcounting / TTM resource lookup bug in
> > somewhere in here which needs to be resolved though (e.g. eviction
> > racing with this code path), will try to close on that.
> > 
> > [2] https://patchwork.freedesktop.org/patch/610955/?series=137870&rev=1#comment_1111164
> 
> So maybe my understanding is wrong, but from my reading of the device
> migration code the exact same non-guarantees as for the sram2sram
> migration code apply:
> 
> - There's no guarantee the page/folio doesn't have an elevated refcount,
>   which makes the migration fail (in try_to_migrate, where it checks for
>   surplus refcounts).
> 
> - There's no guarantee you'll get the page/folio lock, which makes the
>   migration fail. Worse the core mm seems to use a fallback to per-page
>   locking as it's extremely crude "get out of deadlocks due to acquiring
>   multiple page locks" card.
>

I think this circles back to basically the design must be able to move
VRAM -> SRAM because the host can't access VRAM. Certainly in the CPU
page fault path this can't fail on the fauling page at least or if it
does the app gets segfaulted. I'll investigate more here but that is
still my current thinking. If VRAM -> SRAM can fail / make partial
progress in eviction paths, then mixed mappings likely need to be
supported which shouldn't be all that painful - basically just need
cursor in the bind code which can walk mixed mappings.

SRAM -> VRAM certainly can fail which is handled by just aborting the
migration.

> > > > +map_pages:
> > > > +	if (is_device_private_page(hmm_pfn_to_page(pfns[0]))) {
> > > > +		WARN_ON_ONCE(!range->vram_allocation);
> > > > +
> > > > +		for (i = 0; i < npages; ++i) {
> > > > +			pages[i] = hmm_pfn_to_page(pfns[i]);
> > > > +
> > > > +			if (WARN_ON_ONCE(!is_device_private_page(pages[i]))) {
> > > > +				err = -EOPNOTSUPP;
> > > > +				goto err_free;
> > > > +			}
> > > > +		}
> > > 
> > > You can't do the above, because the pfn you get from hmm come with zero
> > > guarantees, you neither hold a page reference nor the page lock. The only
> > > thing you can do is grab the pagetable lock (or mmu notifier locks) and
> > > check it's still valid, before you can touch any state. I think the
> > > range->vram_allocation is probably always valid since you clean that up
> > > under the same lock/thread, but there's good chances the vram allocation
> > > is otherwise already gone for good. Or you get an inconsistent snapshot.
> > > 
> > 
> > I haven't seen this pop in my testing yet which is fairly thorough. My
> > thinking was migration always being enforced at range grainularity we'd
> > never get mixed mappings from the core as migration is completely under
> > control of the driver. Maybe I'm not understanding what you are saying
> > here...
> 
> So one scenario is that you race (without the mmap write lock or the
> migration_mutex design ofc) with another invalidate, and get a partial
> view here of mixed vram and sram pages. Until you acquire the mmu notifier
> lock and have made sure your pages are still valid, there's essentially no
> guarantee.

The pages are collected in notifier stable state via the hmm locking +
seqno begin and recheck. Before they can used (e.g. program a bind) yes
the notifier lock needs to be taken to ensure they haven't changed
between collection and used - at least this my understanding.

> > 
> > > > +
> > > > +		/* Do not race with notifier unmapping pages */
> > > > +		drm_gpusvm_notifier_lock(gpusvm);
> > > > +		range->flags.has_vram_pages = true;
> > > > +		range->pages = pages;
> > > > +		if (mmu_interval_read_retry(notifier, hmm_range.notifier_seq)) {
> > > > +			err = -EAGAIN;
> > > > +			__drm_gpusvm_range_unmap_pages(gpusvm, range);
> > > > +		}
> > > > +		drm_gpusvm_notifier_unlock(gpusvm);
> > > > +	} else {
> > > > +		dma_addr_t *dma_addr = (dma_addr_t *)pfns;
> > > > +
> > > > +		for_each_dma_page(i, j, npages, order) {
> > > > +			if (WARN_ON_ONCE(i && order !=
> > > > +					 hmm_pfn_to_map_order(pfns[i]))) {
> > > > +				err = -EOPNOTSUPP;
> > > > +				npages = i;
> > > > +				goto err_unmap;
> > > > +			}
> > > > +			order = hmm_pfn_to_map_order(pfns[i]);
> > > > +
> > > > +			pages[j] = hmm_pfn_to_page(pfns[i]);
> > > > +			if (WARN_ON_ONCE(is_zone_device_page(pages[j]))) {
> > > > +				err = -EOPNOTSUPP;
> > > > +				npages = i;
> > > > +				goto err_unmap;
> > > > +			}
> > > > +
> > > > +			set_page_dirty_lock(pages[j]);
> > > > +			mark_page_accessed(pages[j]);
> > > 
> > > You can't do these, because you don't hold a page reference. They're also
> > > not needed because hmm_range_fault goes thorugh the full mkwrite dance,
> > > which takes care of these, unlike the gup family of functions.
> > >
> > 
> > This is a left over from our existing userpte code and it does appear to
> > be incorrect. Let me remove this and fixup our userptr code while I'm at
> > it.
> 
> Ack.
> 
> > > > +	vas = vma_lookup(mm, start);
> > > > +	if (!vas) {
> > > > +		err = -ENOENT;
> > > > +		goto err_mmunlock;
> > > > +	}
> > > > +
> > > > +	if (end > vas->vm_end || start < vas->vm_start) {
> > > > +		err = -EINVAL;
> > > > +		goto err_mmunlock;
> > > > +	}
> > > > +
> > > > +	if (!vma_is_anonymous(vas)) {
> > > > +		err = -EBUSY;
> > > > +		goto err_mmunlock;
> > > > +	}
> > > > +
> > > > +	buf = kvcalloc(npages, 2 * sizeof(*migrate.src) + sizeof(*dma_addr) +
> > > > +		       sizeof(*pages), GFP_KERNEL);
> > > > +	if (!buf) {
> > > > +		err = -ENOMEM;
> > > > +		goto err_mmunlock;
> > > > +	}
> > > > +	dma_addr = buf + (2 * sizeof(*migrate.src) * npages);
> > > > +	pages = buf + (2 * sizeof(*migrate.src) + sizeof(*dma_addr)) * npages;
> > > > +
> > > > +	zdd = drm_gpusvm_zdd_alloc(range);
> > > > +	if (!zdd) {
> > > > +		err = -ENOMEM;
> > > > +		goto err_free;
> > > > +	}
> > > > +
> > > > +	migrate.vma = vas;
> > > > +	migrate.src = buf;
> > > > +	migrate.dst = migrate.src + npages;
> > > > +
> > > > +	err = migrate_vma_setup(&migrate);
> > > > +	if (err)
> > > > +		goto err_free;
> > > > +
> > > > +	/*
> > > > +	 * FIXME: Below cases, !migrate.cpages and migrate.cpages != npages, not
> > > > +	 * always an error. Need to revisit possible cases and how to handle. We
> > > > +	 * could prefault on migrate.cpages != npages via hmm_range_fault.
> > 
> > This is a bit stale, can update this comment.
> > 
> > > > +	 */
> > > 
> > > Yeah I think especially under contention partial migrations, at least back
> > > to sram due to cpu faults, are pretty much expected. And you need to cope
> > > somehow.
> > > 
> > 
> > I have seen these pop if the IGT calls mlock on the memory. My thinking
> > is migration to VRAM is basically optional and fallback to leaving range
> > in SRAM if an error occurs rather than doing a partial migration. This
> > is what currently happens so it is coped with.
> > 
> > If the memory is marked as must be in VRAM (NIY), well then the user
> > program has done something wrong and can kill the app (akin to
> > segfault).
> 
> Yeah SIGBUS for "must be in VRAM" sounds like ok semantics.
> 
> > > > +
> > > > +	if (!migrate.cpages) {
> > > > +		err = -EFAULT;
> > > > +		goto err_free;
> > > > +	}
> > > > +
> > > > +	if (migrate.cpages != npages) {
> > > > +		err = -EBUSY;
> > > > +		goto err_finalize;
> > > > +	}
> 
> What I think is more fundamental is that I think this one here doesn't
> work. For migrate_to_ram you cannot assume that you can always migrate the
> entire block, I think to uphold the core mm forward progress rules we need
> to allow partial migrations there. And I think your current code allows
> that.
>

Yes. I had similar checks in migrate_to_ram at one point and that did
not work when multiple CPU faults from different threads occured in
parallel. Each thread can grab a random set of VRAM pages to migrate I
think.

> But that then means you also are stuck with partial migration state here.
> That was the point I tried to make.
>

The error path with migrate_vma_pages/finalize safely unwinds the
migration in these cases leaving all pages in SRAM.

> > > > +/**
> > > > + * __drm_gpusvm_migrate_to_sram - Migrate GPU SVM range to SRAM (internal)
> > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > + * @vas: Pointer to the VM area structure
> > > > + * @page: Pointer to the page for fault handling (can be NULL)
> > > > + * @start: Start address of the migration range
> > > > + * @end: End address of the migration range
> > > > + *
> > > > + * This internal function performs the migration of the specified GPU SVM range
> > > > + * to SRAM. It sets up the migration, populates + dma maps SRAM PFNs, and
> > > > + * invokes the driver-specific operations for migration to SRAM.
> > > > + *
> > > > + * Returns:
> > > > + * 0 on success, negative error code on failure.
> > > > + */
> > > > +static int __drm_gpusvm_migrate_to_sram(struct drm_gpusvm *gpusvm,
> > > > +					struct vm_area_struct *vas,
> > > > +					struct page *page,
> > > > +					u64 start, u64 end)
> > > > +{
> > > > +	struct migrate_vma migrate = {
> > > > +		.vma		= vas,
> > > > +		.pgmap_owner	= gpusvm->device_private_page_owner,
> > > > +		.flags		= MIGRATE_VMA_SELECT_DEVICE_PRIVATE,
> > > > +		.fault_page	= page,
> > > > +	};
> > > > +	unsigned long npages;
> > > > +	struct page **pages;
> > > > +	dma_addr_t *dma_addr;
> > > > +	void *buf;
> > > > +	int i, err = 0;
> > > > +
> > > > +	mmap_assert_locked(gpusvm->mm);
> > > 
> > > That's the wrong mm, at least for the ->migrate_to_ram path. You might be
> > > called on a anon mapping from a child process. That also means that the
> > > vma you're looking at might have no relationship with anythign you're
> > > tracking in your gpusvm.
> > >
> > 
> > Hmm, as discussed [3] I haven't added tests with child processes yet.
> > Let me do that and update the design as needed. This likely isn't
> > correct as you say.
> > 
> > [3] https://patchwork.freedesktop.org/patch/610957/?series=137870&rev=1#comment_1111169 
> 
> Ack. More tests should definitely help here to figure out what's up, and
> what's just me being confused.
> 

Starting to add tests this fork() appears to work after dropping these
asserts. More thorough testing is needed though.

> > > > +/**
> > > > + * drm_gpusvm_migrate_to_ram - Migrate GPU SVM range to RAM (page fault handler)
> > > > + * @vmf: Pointer to the fault information structure
> > > > + *
> > > > + * This function is a page fault handler used to migrate a GPU SVM range to RAM.
> > > > + * It retrieves the GPU SVM range information from the faulting page and invokes
> > > > + * the internal migration function to migrate the range back to RAM.
> > > > + *
> > > > + * Returns:
> > > > + * VM_FAULT_SIGBUS on failure, 0 on success.
> > > > + */
> > > > +static vm_fault_t drm_gpusvm_migrate_to_ram(struct vm_fault *vmf)
> > > > +{
> > > > +	struct drm_gpusvm_zdd *zdd = vmf->page->zone_device_data;
> > > > +	int err;
> > > > +
> > > > +	err = __drm_gpusvm_migrate_to_sram(zdd->range->gpusvm,
> > > 
> > > So I think zdd->range doesn't work, because even within a single mm the
> > > vma mapping a given piece of anon memory does not need to be unique, you
> > > can duplicate them with mremap.
> > > 
> > 
> > This is attached to a page, not a VMA. Both AMD and Nvidia drivers use a
> > similar lookup mechanism.
> 
> Yeah the page->zone_device_data is fine. It's the zone_device_rage->range
> which I think isn't ok.
> 

Yes, this gets a little confusing with fork() and mremap. The range's
start / end can be nonsense in the remap case. Also as you mention a
range->migrate_mutex doesn't seem correct either. I can make it work but
maybe not worth even typing out why here (I can provide a little more
detail in another reply). New thinking is zdd stores a size field and
has the locking - I think is akin to a VRAM folio then.

> > > So all you have here is the physical memory and the vma, which might or
> > > might not be from the same process as gpusvm->mm.
> > > 
> > > Also the child process scenario means you using mmap_write on the fault
> > > side doesn't stop all cpu faults migrating stuff back.
> > > 
> > > Somewhat aside, but I think that means amdkfd's svm_range->migration_mutex
> > > is busted, because it's va based and so misses concurrently ongoing
> > > different mappings moving physical storage around underneath.
> > >
> > 
> > I think all of the above which falls into the fork() + child process
> > issues which you have raise. Until I test this out I can't speak to this
> > any level of confidence so I won't. Thanks for raising this issue and
> > let me write test cases as discussed and educate myself. Once I do that,
> > we can engage in further discussions.
> 
> I think fork + childs will still result in zdd->range being unique (albeit
> confused about which mm). You need mremap of some of these mappings to

Agree for fork + child based on initial testing.

> change the addresses and really cause confusion, which I /think/ (but
> didn't test) is doable with a single process even and duplicating anon

Yep, remap changes the address so range is confusing and really size is
sufficient aligning within VMA's start / end upon CPU fault. AMD does
this but with a VMA search which I think is a bit overkill.

Matt

> memory mappings with mremap.
> 
> Cheers, Sima
> -- 
> Simona Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch