Re: [RFC PATCH 23/28] drm/xe: Add SVM VRAM migration

Christian König <christian.koenig@xxxxxxx> · Mon, 2 Sep 2024 13:01:45 +0200

Am 30.08.24 um 00:12 schrieb Matthew Brost:
On Thu, Aug 29, 2024 at 01:02:54PM +0200, Daniel Vetter wrote:
On Thu, Aug 29, 2024 at 11:53:58AM +0200, Thomas Hellström wrote:
But as Sima pointed out in private communication, exhaustive eviction
is not really needed for faulting to make (crawling) progress.
Watermarks and VRAM trylock shrinking should suffice, since we're
strictly only required to service a single gpu page granule at a time.

However, ordinary bo-based jobs would still like to be able to
completely evict SVM vram. Whether that is important enough to strive
for is ofc up for discussion.
My take is that you don't win anything for exhaustive eviction by having
the dma_resv somewhere in there for svm allocations. Roughly for split lru
world, where svm ignores bo/dma_resv:

When evicting vram from the ttm side we'll fairly switch between selecting
bo and throwing out svm pages. With drm_exec/ww_acquire_ctx selecting bo
will eventually succeed in vacuuming up everything (with a few retries
perhaps, if we're not yet at the head of the ww ticket queue).

svm pages we need to try to evict anyway - there's no guarantee, becaue
the core mm might be holding temporary page references (which block
Yea, but think you can could kill the app then - not suggesting we
should but could. To me this is akin to a CPU fault and not being able
to migrate the device pages - the migration layer doc says when this
happens kick this to user space and segfault the app.

That's most likely a bad idea. That the core holds a temporary page 
reference can happen any time without any bad doing from the 
application. E.g. for direct I/O, swapping etc...

So you can't punish the application with a segfault if you happen to not 
be able to migrate a page because it has a reference.

Regards,
Christian.

My last patch in the series adds some asserts to see if this ever
happens, thus far never. If it occurs we could gracefully handle it by
aborting the migration I guess... I think the user really needs to
something a bit crazy to trigger this condition - I don't think the core
randomly grabs refs to device pages but could be wrong.

migration) or have the page locked (which also block the migration). But
as long as those two steps succeed, we'll win and get the pages. There
might be some thrashing against concurrent svm faults stealing them again,
but they have a disadvantage since they can't steal dma_resv_locked bo.
And if it's still too much we can stall them in the page allocator.

So it's not entirely reliable, but should be close enough.

Now for bo based svm the picture isn't any different, because holding
dma_resv is not actually enough to migrate svm mappings. We still need to
hope there's no temporary page references around, and we still need to
succeed at locking the page. And the migration code only does trylocks,
because that's it's deadlock prevent algorithm if different migrations
needing the same set of pages, but acquiring them in a different order. So
we win nothing.
Ok, maybe my statement above is false...

Wouldn't be the only time this falls is if another migration is in
flight (e.g. CPU fault) and they race? Then the eviction will naturally
happen via refcount being dropped from the other migration. I guess I
likely need to update my eviction code to not free the TTM resource if
all pages are not migrated.

Worse, if dma_resv does actually hold up svm migration and reclaim, then
we potentially deadlock because that lock is for a bigger range than
individual pages (or folios). And the core mm assumes that it can get out
of a deadlock bind by (at least stochastically) eventually succeeding in
acquiring/locking down a single page.

This means we cannot use dma_resv tricks to give the ttm world an
advantage in exhaustive eviction against concurrent svm faults. Or at
least not more than we can do without by just stalling svm faults that
need to allocate gpu memory (but that must happen without holding locks or
we're busted).

I'm a little lost here on the deadlock case. Do you mean when we try to
evict SVM BO we trigger reclaim by allocating system pages and can
deadlock? Doesn't TTM already have this dependency when evicting non-SVM
BOs?

So the only benefit I'm seeing is the unified lru, which I'm not sure is
worth it. There's also a bit a lru design tension here, because for the bo
Well also not rewriting the world...

Matt

world we want objects that are locked to stay on the lru, so that the
competing processes can figure out who has the winning ww ticket. The core
mm design otoh does isolate pages and remove them from the lru when
they're acquired, so that they don't gunk up other processes from trying
to make forward progress and are better hidden. Which reduces temporary
page references (from lru walk) preventing migration and stuff like that.

Cheers, Sima
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch