Hi all, I've dragged my feet for years on this, hoping that cross-release lockdep would do this for us, but well that never really happened unfortunately. So here we are. Cc'ed quite a pile of people since this is about the cross-driver contract around dma_fences. Which is heavily used for dma_buf, and I'm hearing more noises that rdma folks are looking into this, hence also on cc. There's a bunch of different parts to this RFC: - The annotations itself, in the 2nd patch after the prep patch to add might_sleep annotations. Commit message has all the motivation for what kind of deadlocks I want to catch, best you just read it. Since lockdep doesn't understand cross-release natively we need to cobble something together using rwlocks and a few more tricks, but from the test rollout in a few places in drm/vkms, amdgpu & i915 I think what I have now seems to actually work. Downside is that we have to explicitly annotate all code involved in eventual dma_fence signalling. - Second important part is locking down the current dma-fence cross-driver contract, using lockdep priming like we already do for dma_resv_lock. I've just started with my own take on what we probably need to make the current code work (-ish), but both amdgpu and i915 have issues with that. So this needs some careful discussions, and also some thought on how we land it all eventually to not break lockdep completely for everyone. The important patch for that is "dma-fence: prime lockdep annotations" plus of course the various annotations patches and driver hacks to highlight some of the issues caught. Note that depending upon what exactly we end up deciding we might need to improve the annotations for fs_reclaim_acquire/release - for dma_fence_wait in mmu notifiers we can only allow GFP_NOWAIT (afaiui), and currently fs_reclaim_acquire/release only has a lockdep class for __GFP_FS only, we'd need to add another one for __GFP_DIRECT_RECLAIM in general maybe. - Finally there's clearly some gaps in the current dma_fence driver interfaces: Amdgpu's hang recovery is essentially impossible to fix as-is - it needs to reset the display state and you can't get at modeset locks from tdr without deadlock potential. i915 has an internal trick (but it stops working once we involve real cross-driver fences) for this issues, but then for i915 modeset reset is only needed on very ancient gen2/3. Modern hw is a lot more reasonable. I'm kinda hoping that the annotations and priming for basic command submission and atomic modeset paths could be merged soonish, while we the tdr side clearly needs a pile more work to get going. But since we have to explicitly annotate all code paths anyway we can hide bugs in e.g. tdr code by simply not yet annotating those functions. I'm trying to lay out at least one idea for solving the tdr issue in the patch titled "drm/scheduler: use dma-fence annotations in tdr work". Finally, once we have some agreement on where we're going with all this, we also need some documentation. Currently that's missing because I don't want to re-edit the text all the time while we still figure out the details of the exact cross-driver semantics. My goal here is that with this we can lock down the cross-driver contract for the last bit of the dma_buf/resv/fence story and make sure this stops being such a wobbly thing where everyone just does whatever they feel like. Ideas, thoughts, reviews, testing (with specific annotations for that driver) on other drivers very much welcome. Cheers, Daniel Cc: linux-media@xxxxxxxxxxxxxxx Cc: linaro-mm-sig@xxxxxxxxxxxxxxxx Cc: linux-rdma@xxxxxxxxxxxxxxx Cc: amd-gfx@xxxxxxxxxxxxxxxxxxxxx Cc: intel-gfx@xxxxxxxxxxxxxxxxxxxxx Cc: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> Cc: Maarten Lankhorst <maarten.lankhorst@xxxxxxxxxxxxxxx> Cc: Christian König <christian.koenig@xxxxxxx> Daniel Vetter (17): dma-fence: add might_sleep annotation to _wait() dma-fence: basic lockdep annotations dma-fence: prime lockdep annotations drm/vkms: Annotate vblank timer drm/vblank: Annotate with dma-fence signalling section drm/atomic-helper: Add dma-fence annotations drm/amdgpu: add dma-fence annotations to atomic commit path drm/scheduler: use dma-fence annotations in main thread drm/amdgpu: use dma-fence annotations in cs_submit() drm/amdgpu: s/GFP_KERNEL/GFP_ATOMIC in scheduler code drm/amdgpu: DC also loves to allocate stuff where it shouldn't drm/amdgpu/dc: Stop dma_resv_lock inversion in commit_tail drm/scheduler: use dma-fence annotations in tdr work drm/amdgpu: use dma-fence annotations for gpu reset code Revert "drm/amdgpu: add fbdev suspend/resume on gpu reset" drm/amdgpu: gpu recovery does full modesets drm/i915: Annotate dma_fence_work drivers/dma-buf/dma-fence.c | 56 +++++++++++++++++++ drivers/dma-buf/dma-resv.c | 1 + drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 5 ++ drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 22 ++++++-- drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 2 +- drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c | 2 +- drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c | 2 +- drivers/gpu/drm/amd/amdgpu/atom.c | 2 +- .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 18 +++++- drivers/gpu/drm/amd/display/dc/core/dc.c | 4 +- drivers/gpu/drm/drm_atomic_helper.c | 16 ++++++ drivers/gpu/drm/drm_vblank.c | 8 ++- drivers/gpu/drm/i915/i915_sw_fence_work.c | 3 + drivers/gpu/drm/scheduler/sched_main.c | 11 ++++ drivers/gpu/drm/vkms/vkms_crtc.c | 8 ++- include/linux/dma-fence.h | 13 +++++ 16 files changed, 160 insertions(+), 13 deletions(-) -- 2.26.2