Hi!

On Fri, 2025-02-14 at 11:14 -0500, Demi Marie Obenour wrote:
> On Fri, Feb 14, 2025 at 09:47:13AM +0100, Thomas Hellström wrote:
> > Hi
> >
> > On Thu, 2025-02-13 at 16:23 -0500, Demi Marie Obenour wrote:
> > > On Wed, Feb 12, 2025 at 06:10:40PM -0800, Matthew Brost wrote:
> > > > Version 5 of GPU SVM. Thanks to everyone (especially Sima,
> > > > Thomas, Alistair, Himal) for their numerous reviews on
> > > > revision 1, 2, 3 and for helping to address many design
> > > > issues.
> > > >
> > > > This version has been tested with IGT [1] on PVC, BMG, and
> > > > LNL. Also tested with level0 (UMD) PR [2].
> > >
> > > What is the plan to deal with not being able to preempt while a
> > > page fault is pending? This seems like an easy DoS vector. My
> > > understanding is that SVM is mostly used by compute workloads on
> > > headless systems. Recent AMD client GPUs don't support SVM, so
> > > programs that want to run on client systems should not require
> > > SVM if they wish to be portable.
> > >
> > > Given the potential for abuse, I think it would be best to
> > > require explicit administrator opt-in to enable SVM, along with
> > > possibly having a timeout to resolve a page fault (after which
> > > the context is killed). Since I expect most uses of SVM to be in
> > > the datacenter space (for the reasons mentioned above), I don't
> > > believe this will be a major limitation in practice. Programs
> > > that wish to run on client systems already need to use explicit
> > > memory transfer or pinned userptr, and administrators of compute
> > > clusters should be willing to enable this feature because only
> > > one workload will be using a GPU at a time.
> >
> > While not directly having addressed the potential DoS issue you
> > mention, there is an associated deadlock possibility that may
> > happen due to not being able to preempt a pending pagefault. That
> > is if a dma-fence job is requiring the same resources held up by
> > the pending page-fault, and then the pagefault servicing is
> > dependent on that dma-fence to be signaled in one way or another.
> >
> > That deadlock is handled by only allowing either page-faulting
> > jobs or dma-fence jobs on a resource (hw engine or hw engine
> > group) that can be used by both at a time, blocking synchronously
> > in the exec IOCTL until the resource is available for the job
> > type. That means LR jobs waits for all dma-fence jobs to complete,
> > and dma-fence jobs wait for all LR jobs to preempt. So a dma-fence
> > job wait could easily mean "wait for all outstanding pagefaults to
> > be serviced".
> >
> > Whether, on the other hand, that is a real DoS we need to care
> > about, is probably a topic for debate. The directions we've had so
> > far are that it's not. Nothing is held up indefinitely, what's
> > held up can be Ctrl-C'd by the user and core mm memory management
> > is not blocked since mmu_notifiers can execute to completion and
> > shrinkers / eviction can execute while a page-fault is pending.
>
> The problem is that a program that uses a page-faulting job can lock
> out all other programs on the system from using the GPU for an
> indefinite period of time. In a GUI session, this means a frozen UI,
> which makes recovery basically impossible without drastic measures
> (like rebooting or logging in over SSH). That counts as a quite
> effective denial of service from an end-user perspective, and unless
> I am mistaken it would be very easy to trigger by accident: just
> start a page-faulting job that loops forever.
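To make the exclusion scheme I described above a bit more concrete:
conceptually the exec IOCTL does something along the lines of the
sketch below. This is a much simplified illustration with made-up
names, not the actual xe implementation:

#include <linux/mutex.h>
#include <linux/wait.h>

/*
 * Illustrative sketch only: a hw resource (engine or engine group)
 * runs either dma-fence jobs or long-running (page-faulting) jobs at
 * any one time, never both.
 */
enum job_mode {
        MODE_NONE,
        MODE_DMA_FENCE,         /* jobs that publish dma-fences */
        MODE_LONG_RUNNING,      /* preemptible, page-faulting jobs */
};

struct hw_resource {
        struct mutex lock;
        enum job_mode mode;
        unsigned int active_jobs;
        wait_queue_head_t idle_wq;
};

static void hw_resource_init(struct hw_resource *res)
{
        mutex_init(&res->lock);
        init_waitqueue_head(&res->idle_wq);
        res->mode = MODE_NONE;
        res->active_jobs = 0;
}

/* Called synchronously from the exec IOCTL before the job is queued. */
static int hw_resource_acquire(struct hw_resource *res, enum job_mode mode)
{
        int ret;

        mutex_lock(&res->lock);
        while (res->mode != MODE_NONE && res->mode != mode) {
                mutex_unlock(&res->lock);
                /*
                 * dma-fence jobs wait for all LR jobs to preempt,
                 * LR jobs wait for all dma-fence jobs to complete.
                 * Interruptible, so the user can Ctrl-C out of it.
                 */
                ret = wait_event_interruptible(res->idle_wq,
                                READ_ONCE(res->mode) == MODE_NONE);
                if (ret)
                        return ret;
                mutex_lock(&res->lock);
        }
        WRITE_ONCE(res->mode, mode);
        res->active_jobs++;
        mutex_unlock(&res->lock);
        return 0;
}

/* Called when a dma-fence job completes or an LR job has preempted. */
static void hw_resource_release(struct hw_resource *res)
{
        mutex_lock(&res->lock);
        if (--res->active_jobs == 0) {
                WRITE_ONCE(res->mode, MODE_NONE);
                wake_up_all(&res->idle_wq);
        }
        mutex_unlock(&res->lock);
}

The key point is that the blocking is synchronous and interruptible:
nothing is held up indefinitely, and a stuck wait can be Ctrl-C'd by
the user, as mentioned above.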
I think the easiest remedy for the DoS scenario you describe is that if
a page-faulting job, either on purpose or by mistake, is crafted in
such a way that it holds up preemption when preemption is needed (as in
the case I described, where a dma-fence job is submitted), the driver
will hit a preemption timeout and kill the page-faulting job. (I think
that is already handled in all cases in the xe driver, but I would need
to double-check.) So this would then boil down to the system
administrator configuring the preemption timeout.
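For reference, xe exposes per-engine-class scheduling tunables in
sysfs, so capping the preemption timeout would look something like the
snippet below. The exact path is an assumption from memory and depends
on the card / tile / GT / engine-class layout of the system:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
        /*
         * Assumed sysfs layout; adjust card, tile, gt and engine class
         * (e.g. ccs, rcs, bcs, vcs) for the actual system.
         */
        const char *path =
                "/sys/class/drm/card0/device/tile0/gt0/engines/ccs/preempt_timeout_us";
        FILE *f = fopen(path, "w");

        if (!f) {
                perror(path);
                return EXIT_FAILURE;
        }

        /* Example value only: allow at most 640 ms for preemption. */
        fprintf(f, "%u\n", 640000U);

        return fclose(f) ? EXIT_FAILURE : EXIT_SUCCESS;
}

Thanks,
Thomas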