On Thu, 23 Jun 2022 at 16:31, Matthew Auld <matthew.auld@xxxxxxxxx> wrote: > > On 23/06/2022 15:52, Christian König wrote: > > Am 23.06.22 um 16:13 schrieb Matthew Auld: > >> [SNIP] > >>>> TTM_BO_VM_NUM_PREFAULT); > >>>> + /* > >>>> + * Ensure we check for any fatal errors if we had to > >>>> move/clear > >>>> + * the object. The device should already be wedged if > >>>> we hit > >>>> + * such an error. > >>>> + */ > >>>> + if (i915_gem_object_wait_moving_fence(obj, true)) > >>>> + ret = VM_FAULT_SIGBUS; > >>> > >>> We should check with Christian here whether it's ok to export > >>> ttm_bo_vm_fault_idle() as a helper, so that we release the proper locks > >>> while waiting. The above is not a bug, but causes us to wait for the > >>> moving fence under the mmap_lock, which is considered bad. > >> > >> Christian, any chance we can export ttm_bo_vm_fault_idle() for use > >> here? Or is that NACK? > > > > Well question is why you want to do this? E.g. what's the background? > > Right, so basically we need to prevent userspace from being able to > access the pages for the object, if the ttm blit/move hits an error > (some kind of GPU error). Normally we can just fall back to > memcpy/memset to ensure we never leak anything (i915 is never allowed to > hand userspace non-zeroed memory even for VRAM), but with small-BAR > systems this might not be possible. Anyway, if we do hit an error during > the ttm move we might now mark the object as being in an "unknown state" > before signalling the fence. Later when binding the GPU page-tables we > check for the "unknown state" and skip the bind (it will end up just > pointing to some scratch pages instead). And then here on the CPU side, > we need to sync against all the kernel fences, before then checking for > the potential "unknown state", which is then handled by returning SIBUS. > The i915_gem_object_wait_moving_fence() is basically doing exactly that, > but it looks dumb compared to what ttm_bo_vm_fault_idle() is doing. And > then while all this going on we then also "wedge" the device to > basically signal that it's busted, which should prevent further work > being submitted to the gpu. Gentle ping? > > > > > Regards, > > Christian.