On 23/06/2022 15:52, Christian König wrote:
Am 23.06.22 um 16:13 schrieb Matthew Auld:
[SNIP]
TTM_BO_VM_NUM_PREFAULT);
+ /*
+ * Ensure we check for any fatal errors if we had to
move/clear
+ * the object. The device should already be wedged if
we hit
+ * such an error.
+ */
+ if (i915_gem_object_wait_moving_fence(obj, true))
+ ret = VM_FAULT_SIGBUS;
We should check with Christian here whether it's ok to export
ttm_bo_vm_fault_idle() as a helper, so that we release the proper locks
while waiting. The above is not a bug, but causes us to wait for the
moving fence under the mmap_lock, which is considered bad.
Christian, any chance we can export ttm_bo_vm_fault_idle() for use
here? Or is that NACK?
Well question is why you want to do this? E.g. what's the background?
Right, so basically we need to prevent userspace from being able to
access the pages for the object, if the ttm blit/move hits an error
(some kind of GPU error). Normally we can just fall back to
memcpy/memset to ensure we never leak anything (i915 is never allowed to
hand userspace non-zeroed memory even for VRAM), but with small-BAR
systems this might not be possible. Anyway, if we do hit an error during
the ttm move we might now mark the object as being in an "unknown state"
before signalling the fence. Later when binding the GPU page-tables we
check for the "unknown state" and skip the bind (it will end up just
pointing to some scratch pages instead). And then here on the CPU side,
we need to sync against all the kernel fences, before then checking for
the potential "unknown state", which is then handled by returning SIBUS.
The i915_gem_object_wait_moving_fence() is basically doing exactly that,
but it looks dumb compared to what ttm_bo_vm_fault_idle() is doing. And
then while all this going on we then also "wedge" the device to
basically signal that it's busted, which should prevent further work
being submitted to the gpu.
Regards,
Christian.