Re: [PATCH 6/9] drm/i915: driver based PASID handling

Tomas Elf <tomas.elf@xxxxxxxxx> · Thu, 08 Oct 2015 11:28:47 +0100

On 07/10/2015 17:14, Daniel Vetter wrote:
On Wed, Oct 07, 2015 at 08:16:42AM -0700, Jesse Barnes wrote:
On 10/07/2015 06:00 AM, David Woodhouse wrote:
On Fri, 2015-09-04 at 09:59 -0700, Jesse Barnes wrote:
+
+       ret = handle_mm_fault(mm, vma, address,
+                             desc.wr_req ? FAULT_FLAG_WRITE : 0);
+       if (ret & VM_FAULT_ERROR) {
+               gpu_mm_segv(tsk, address, SEGV_ACCERR); /* ? */
+               goto out_unlock;
+       }
+

Hm, do you need to force the SEGV there, in what ought to be generic
IOMMU code?

Can you instead just let the fault handler return an appropriate
failure code to the IOMMU request queue and then deal with the
resulting error on the i915 device side?

I'm not sure if we get enough info on the i915 side to handle it
reasonably, we'll have to test that out.

We do know precisely which context blew up, but without the TDR work we
can't yet just kill the offender selective without affecting the other
active gpu contexts.

Could someone clarify what this means from the TDR point of view, 
please? When you say "context blew up" I'm guessing that you mean that 
come context caused the fault handler to get involved somehow?

Does this imply that the offending context will hang and the driver will 
have to detect this hang? If so, then yes - if we have the per-engine 
hang recovery mode as part of the upcoming TDR work in place then we 
could handle it by stepping over the offending batch buffer and moving 
on with a minimum of side-effects on the rest of the driver/GPU.

Or does this imply that we have some new integration path to deal with? 
(something that I should be aware of when upstreaming TDR?)

Sorry if I missed something blatantly obvious in the patch ;).

Thanks,
Tomas

But besides that I really don't see a reason why we need to kill the
process if the gpu faults. After all if a thread sigfaults then signal
goes to that thread and not some random one (or the one thread that forked
the thread that blew up). And we do have interfaces to tell userspace that
something bad happened with the gpu work it submitted.

Chris made a similar patch for userptr and I didn't like that one either.
Worst case userspace has a special SEGV handler and then things really go
down badly when that handler gets triggered at an unexpected place.
-Daniel

That way, you should hopefully get to gracefully cope with reporting
errors for a specific *context*, rather than killing the whole process.

It would be best to get per-context error info, but killing the process
may be unavoidable (just as if a single thread clobbers memory in your
process).

Jesse
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
http://lists.freedesktop.org/mailman/listinfo/intel-gfx