Re: [PATCH v2 3/3] drm/msm: Temporarily disable stall-on-fault after a page fault

Jason Gunthorpe <jgg@xxxxxxxx> · Tue, 21 Jan 2025 17:08:18 -0400

On Mon, Jan 20, 2025 at 10:46:47AM -0500, Connor Abbott wrote:

> To work around these problems, disable stall-on-fault as soon as we get a
> page fault until a cooldown period after page faults stop. This allows
> the GMU some guaranteed time to continue working. We also keep it
> disabled so long as the current devcoredump hasn't been deleted, because
> in that case we likely won't capture another one if there's a fault.

I don't have any particular interest here, but I'm surprised to read
this paragraph. Maybe you could explain this some more in the commit
message?

I would think terminating transactions and returning a failure to the
GPU would be fatal to the GPU operating model when the entire point of
stall and fault handling is to make OS paging transparent to the GPU?

What happens on the GPU side when it gets this spurious failure?

Jason