Re: [PATCH v2 10/12] drm/i915/ttm: handle blitter failure on DG2

Matthew Auld <matthew.auld@xxxxxxxxx> · Thu, 23 Jun 2022 16:31:17 +0100

On 23/06/2022 15:52, Christian König wrote:
Am 23.06.22 um 16:13 schrieb Matthew Auld:
[SNIP]
TTM_BO_VM_NUM_PREFAULT);
+               /*
+                * Ensure we check for any fatal errors if we had to
move/clear
+                * the object. The device should already be wedged if
we hit
+                * such an error.
+                */
+               if (i915_gem_object_wait_moving_fence(obj, true))
+                       ret = VM_FAULT_SIGBUS;

We should check with Christian here whether it's ok to export
ttm_bo_vm_fault_idle() as a helper, so that we release the proper locks
while waiting. The above is not a bug, but causes us to wait for the
moving fence under the mmap_lock, which is considered bad.

Christian, any chance we can export ttm_bo_vm_fault_idle() for use 
here? Or is that NACK?

Well question is why you want to do this? E.g. what's the background?

Right, so basically we need to prevent userspace from being able to 
access the pages for the object, if the ttm blit/move hits an error 
(some kind of GPU error). Normally we can just fall back to 
memcpy/memset to ensure we never leak anything (i915 is never allowed to 
hand userspace non-zeroed memory even for VRAM), but with small-BAR 
systems this might not be possible. Anyway, if we do hit an error during 
the ttm move we might now mark the object as being in an "unknown state" 
before signalling the fence. Later when binding the GPU page-tables we 
check for the "unknown state" and skip the bind (it will end up just 
pointing to some scratch pages instead). And then here on the CPU side, 
we need to sync against all the kernel fences, before then checking for 
the potential "unknown state", which is then handled by returning SIBUS. 
The i915_gem_object_wait_moving_fence() is basically doing exactly that, 
but it looks dumb compared to what ttm_bo_vm_fault_idle() is doing. And 
then while all this going on we then also "wedge" the device to 
basically signal that it's busted, which should prevent further work 
being submitted to the gpu.

Regards,
Christian.