Re: [PATCH v2 10/12] drm/i915/ttm: handle blitter failure on DG2

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 23/06/2022 15:52, Christian König wrote:
Am 23.06.22 um 16:13 schrieb Matthew Auld:
[SNIP]
TTM_BO_VM_NUM_PREFAULT);
+               /*
+                * Ensure we check for any fatal errors if we had to
move/clear
+                * the object. The device should already be wedged if
we hit
+                * such an error.
+                */
+               if (i915_gem_object_wait_moving_fence(obj, true))
+                       ret = VM_FAULT_SIGBUS;

We should check with Christian here whether it's ok to export
ttm_bo_vm_fault_idle() as a helper, so that we release the proper locks
while waiting. The above is not a bug, but causes us to wait for the
moving fence under the mmap_lock, which is considered bad.

Christian, any chance we can export ttm_bo_vm_fault_idle() for use here? Or is that NACK?

Well question is why you want to do this? E.g. what's the background?

Right, so basically we need to prevent userspace from being able to access the pages for the object, if the ttm blit/move hits an error (some kind of GPU error). Normally we can just fall back to memcpy/memset to ensure we never leak anything (i915 is never allowed to hand userspace non-zeroed memory even for VRAM), but with small-BAR systems this might not be possible. Anyway, if we do hit an error during the ttm move we might now mark the object as being in an "unknown state" before signalling the fence. Later when binding the GPU page-tables we check for the "unknown state" and skip the bind (it will end up just pointing to some scratch pages instead). And then here on the CPU side, we need to sync against all the kernel fences, before then checking for the potential "unknown state", which is then handled by returning SIBUS. The i915_gem_object_wait_moving_fence() is basically doing exactly that, but it looks dumb compared to what ttm_bo_vm_fault_idle() is doing. And then while all this going on we then also "wedge" the device to basically signal that it's busted, which should prevent further work being submitted to the gpu.


Regards,
Christian.



[Index of Archives]     [AMD Graphics]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux