On 10/07/2015 10:17 AM, David Woodhouse wrote: > On Wed, 2015-10-07 at 09:28 -0700, Jesse Barnes wrote: >> On 10/07/2015 09:14 AM, Daniel Vetter wrote: >>> On Wed, Oct 07, 2015 at 08:16:42AM -0700, Jesse Barnes wrote: >>>> On 10/07/2015 06:00 AM, David Woodhouse wrote: >>>>> On Fri, 2015-09-04 at 09:59 -0700, Jesse Barnes wrote: >>>>>> + >>>>>> + ret = handle_mm_fault(mm, vma, address, >>>>>> + desc.wr_req ? FAULT_FLAG_WRITE : 0); >>>>>> + if (ret & VM_FAULT_ERROR) { >>>>>> + gpu_mm_segv(tsk, address, SEGV_ACCERR); /* ? */ >>>>>> + goto out_unlock; >>>>>> + } >>>>>> + >>>>> >>>>> Hm, do you need to force the SEGV there, in what ought to be generic >>>>> IOMMU code? >>>>> >>>>> Can you instead just let the fault handler return an appropriate >>>>> failure code to the IOMMU request queue and then deal with the >>>>> resulting error on the i915 device side? >>>> >>>> I'm not sure if we get enough info on the i915 side to handle it >>>> reasonably, we'll have to test that out. >>> >>> We do know precisely which context blew up, but without the TDR work we >>> can't yet just kill the offender selective without affecting the other >>> active gpu contexts. >> >> How? The notification from the IOMMU queue is asynchronous... > > The page request, and the response, include 'private data' which an > endpoint can use to carry that kind of information. > > In $7.5.1.1 of the VT-d specification it tells us: > > "Private Data: The Private Data field can be used by > Root-Complex integrated endpoints to uniquely identify > device-specific private information associated with an > individual page request. > > "For Intel ® Processor Graphics device, the Private Data field > specifies the identity of the GPU advanced-context (see > Section 3.10) sending the page request." Ah right so we could put our private context ID in there if the PASID doesn't end up being per-context. That would work fine (though as Daniel said we still need fancier reset to handle things more gracefully). >>> But besides that I really don't see a reason why we need to kill the >>> process if the gpu faults. After all if a thread sigfaults then signal >>> goes to that thread and not some random one (or the one thread that forked >>> the thread that blew up). And we do have interfaces to tell userspace that >>> something bad happened with the gpu work it submitted. > > I certainly don't want the core IOMMU code killing things. I really > want to just complete the page request with an appropriate failure > code, and let the endpoint device deal with it from there. I think that will work, but I want to test and make sure. In the driver mode version I took advantage of the fact that I got an unambiguous page request failure from the IOMMU along with a unique PASID to send the signal. Getting it on the GPU side means looking at some of our existing error state bits, which is something I've been avoiding... Jesse _______________________________________________ Intel-gfx mailing list Intel-gfx@xxxxxxxxxxxxxxxxxxxxx http://lists.freedesktop.org/mailman/listinfo/intel-gfx