Hi

On Thu, 2025-02-13 at 16:23 -0500, Demi Marie Obenour wrote:
> On Wed, Feb 12, 2025 at 06:10:40PM -0800, Matthew Brost wrote:
> > Version 5 of GPU SVM. Thanks to everyone (especially Sima, Thomas,
> > Alistair, Himal) for their numerous reviews on revisions 1, 2, 3
> > and for helping to address many design issues.
> >
> > This version has been tested with IGT [1] on PVC, BMG, and LNL.
> > Also tested with level0 (UMD) PR [2].
>
> What is the plan to deal with not being able to preempt while a page
> fault is pending? This seems like an easy DoS vector. My
> understanding is that SVM is mostly used by compute workloads on
> headless systems. Recent AMD client GPUs don't support SVM, so
> programs that want to run on client systems should not require SVM
> if they wish to be portable.
>
> Given the potential for abuse, I think it would be best to require
> explicit administrator opt-in to enable SVM, along with possibly
> having a timeout to resolve a page fault (after which the context is
> killed). Since I expect most uses of SVM to be in the datacenter
> space (for the reasons mentioned above), I don't believe this will
> be a major limitation in practice. Programs that wish to run on
> client systems already need to use explicit memory transfer or
> pinned userptr, and administrators of compute clusters should be
> willing to enable this feature because only one workload will be
> using a GPU at a time.

While this doesn't directly address the potential DoS issue you
mention, there is a related deadlock possibility that can arise from
not being able to preempt a pending pagefault: a dma-fence job may
require the same resources held up by the pending pagefault, while
servicing of that pagefault in turn depends, in one way or another, on
that dma-fence being signaled.

That deadlock is handled by allowing only one job type at a time,
either pagefaulting jobs or dma-fence jobs, on a resource (hw engine
or hw engine group) that can be used by both, blocking synchronously
in the exec IOCTL until the resource is available for the job type.
That means LR jobs wait for all dma-fence jobs to complete, and
dma-fence jobs wait for all LR jobs to preempt. So a dma-fence job
wait could easily mean "wait for all outstanding pagefaults to be
serviced". (A rough sketch of this gating follows as a PS below.)

Whether, on the other hand, that is a real DoS we need to care about
is probably a topic for debate. The direction we've had so far is that
it's not: nothing is held up indefinitely, whatever is held up can be
Ctrl-C'd by the user, and core mm memory management is not blocked,
since mmu_notifiers can execute to completion and shrinkers / eviction
can run while a pagefault is pending.

Thanks,
Thomas
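
PS: For illustration only, here is a minimal sketch of the kind of
per-resource job-type gating described above. All names here (struct
hw_resource, hw_resource_acquire(), etc.) are made up for this mail
and do not reflect the actual Xe implementation, which also has to
handle engine groups, preemption completion and so on:

#include <linux/spinlock.h>
#include <linux/wait.h>

/*
 * Hypothetical sketch: one job type at a time per resource.
 * Pagefaulting (LR) jobs and dma-fence jobs mutually exclude each
 * other, and the exec IOCTL blocks until the resource is free for
 * the submitted job type.
 */
struct hw_resource {
	spinlock_t lock;
	unsigned int nr_dma_fence_jobs;	/* dma-fence jobs in flight */
	unsigned int nr_lr_jobs;	/* LR jobs running, not preempted */
	wait_queue_head_t wq;
};

static void hw_resource_init(struct hw_resource *res)
{
	spin_lock_init(&res->lock);
	init_waitqueue_head(&res->wq);
	res->nr_dma_fence_jobs = 0;
	res->nr_lr_jobs = 0;
}

/* Try to claim the resource for the given job type. */
static bool hw_resource_free_for(struct hw_resource *res, bool lr_job)
{
	bool ret;

	spin_lock(&res->lock);
	/*
	 * An LR job must wait for all dma-fence jobs to complete;
	 * a dma-fence job must wait for all LR jobs to preempt.
	 */
	ret = lr_job ? !res->nr_dma_fence_jobs : !res->nr_lr_jobs;
	if (ret) {
		if (lr_job)
			res->nr_lr_jobs++;
		else
			res->nr_dma_fence_jobs++;
	}
	spin_unlock(&res->lock);

	return ret;
}

/*
 * Called from the exec IOCTL: blocks synchronously until the
 * resource is available for the job type. Interruptible, so a
 * stuck wait can be Ctrl-C'd by the user.
 */
static int hw_resource_acquire(struct hw_resource *res, bool lr_job)
{
	return wait_event_interruptible(res->wq,
					hw_resource_free_for(res, lr_job));
}

/* Called when a dma-fence job completes or an LR job preempts. */
static void hw_resource_release(struct hw_resource *res, bool lr_job)
{
	spin_lock(&res->lock);
	if (lr_job)
		res->nr_lr_jobs--;
	else
		res->nr_dma_fence_jobs--;
	spin_unlock(&res->lock);
	wake_up_all(&res->wq);
}

A dma-fence job would pair hw_resource_acquire(res, false) in the exec
IOCTL with hw_resource_release(res, false) on job completion; an LR
job pairs acquire(res, true) with release(res, true) once it has been
preempted, which is why a dma-fence job wait can effectively mean
waiting for all outstanding pagefaults to be serviced.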