The issue with stale virtual breadcrumbs remains. Now we have the problem that if the irq-signaler is still referencing the stale breadcrumb as we transfer it to a new sibling, the list becomes spaghetti. This is a very small window, but that doesn't stop it being hit, albeit infrequently. To prevent the lists from being tangled (the iterator starting on one engine's b->signalers but walking onto another engine's list), always decouple the virtual breadcrumb on schedule-out and make sure that the walker has stepped out of the lists. Otherwise, we end up corrupting the lists upon signaling:

<4> [412.623150] list_add corruption. prev->next should be next (ffff888284a98470), but was dead000000000100. (prev=ffff8882586fb870).
<4> [412.623167] WARNING: CPU: 7 PID: 0 at lib/list_debug.c:28 __list_add_valid+0x4d/0x70
<4> [412.623169] Modules linked in: i915(+) vgem snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio mei_hdcp x86_pkg_temp_thermal ax88179_178a usbnet coretemp snd_intel_dspcfg mii snd_hda_codec crct10dif_pclmul crc32_pclmul snd_hwdep snd_hda_core ghash_clmulni_intel snd_pcm e1000e mei_me mei ptp pps_core prime_numbers intel_lpss_pci [last unloaded: i915]
<4> [412.623191] CPU: 7 PID: 0 Comm: swapper/7 Tainted: G U 5.9.0-rc2-CI-CI_DRM_8925+ #1
<4> [412.623194] Hardware name: Intel Corporation Ice Lake Client Platform/IceLake Y LPDDR4x T4 RVP TLC, BIOS ICLSFWR1.R00.3212.A00.1905212112 05/21/2019
<4> [412.623197] RIP: 0010:__list_add_valid+0x4d/0x70
<4> [412.623200] Code: c3 48 89 d1 48 c7 c7 18 57 33 82 48 89 c2 e8 fa 78 b6 ff 0f 0b 31 c0 c3 48 89 c1 4c 89 c6 48 c7 c7 68 57 33 82 e8 e3 78 b6 ff <0f> 0b 31 c0 c3 48 89 f2 4c 89 c1 48 89 fe 48 c7 c7 b8 57 33 82 e8
<4> [412.623202] RSP: 0018:ffffc90000280e18 EFLAGS: 00010086
<4> [412.623205] RAX: 0000000000000000 RBX: ffff88828dbd8d00 RCX: 0000000000000105
<4> [412.623208] RDX: 0000000000000105 RSI: ffffffff82324b63 RDI: 00000000ffffffff
<4> [412.623210] RBP: ffff8882586fb870 R08: ffff88829a144998 R09: 00000000fffffffe
<4> [412.623212] R10: 00000000c4bbeb58 R11: 00000000eb0f6cb0 R12: ffff888284a98428
<4> [412.623215] R13: ffff88827d7db880 R14: ffff88827d7db848 R15: ffff888284a98470
<4> [412.623217] FS:  0000000000000000(0000) GS:ffff88829c180000(0000) knlGS:0000000000000000
<4> [412.623220] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4> [412.623222] CR2: 0000561612b6e6e0 CR3: 0000000005610002 CR4: 0000000000770ee0
<4> [412.623224] PKRU: 55555554
<4> [412.623227] Call Trace:
<4> [412.623229]  <IRQ>
<4> [412.623307]  i915_request_enable_breadcrumb+0x278/0x400 [i915]
<4> [412.623391]  __i915_request_submit+0xca/0x2a0 [i915]
<4> [412.623461]  __execlists_submission_tasklet+0x480/0x1830 [i915]
<4> [412.623572]  execlists_submission_tasklet+0xc4/0x130 [i915]
<4> [412.623585]  tasklet_action_common.isra.17+0x198/0x1d0
<4> [412.623606]  __do_softirq+0xe0/0x497
<4> [412.623619]  ? handle_fasteoi_irq+0x150/0x150
<4> [412.623622]  asm_call_on_stack+0xf/0x20
<4> [412.623630]  </IRQ>
<4> [412.623634]  do_softirq_own_stack+0x73/0x90
<4> [412.623645]  irq_exit_rcu+0xa9/0xb0
<4> [412.623648]  common_interrupt+0xdd/0x210
<4> [412.623652]  asm_common_interrupt+0x1e/0x40

So far this has only been seen in our selftests that emphasize lots of fast sibling switches for the virtual engines.
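For reference, the shape of the fix is a decouple-then-flush pattern: unlink the entry under the lock, then synchronise against the irq work before the entry may be reused on another list. A minimal self-contained sketch of that pattern follows; the signaler/node types, names and init helper are illustrative stand-ins, not the real i915 structures:

	/* Hypothetical sketch of decouple-then-flush; not i915 code. */
	#include <linux/irq_work.h>
	#include <linux/rculist.h>
	#include <linux/spinlock.h>

	struct signaler {
		spinlock_t lock;
		struct list_head signalers;	/* walked under RCU from irq work */
		struct irq_work irq_work;
	};

	struct node {
		struct list_head link;		/* lives on signaler->signalers */
	};

	/* irq_work handler: walks the list under RCU, concurrently with unlink */
	static void signaler_irq_work(struct irq_work *work)
	{
		struct signaler *s = container_of(work, struct signaler, irq_work);
		struct node *n;

		rcu_read_lock();
		list_for_each_entry_rcu(n, &s->signalers, link) {
			/* ... signal completed entries ... */
		}
		rcu_read_unlock();
	}

	static void signaler_init(struct signaler *s)
	{
		spin_lock_init(&s->lock);
		INIT_LIST_HEAD(&s->signalers);
		init_irq_work(&s->irq_work, signaler_irq_work);
	}

	/*
	 * Before handing the node over to another signaler, unlink it and
	 * wait for any concurrent walker to step out of the old list, so
	 * n->link can be re-added elsewhere without tangling both lists.
	 */
	static void transfer_node(struct signaler *old, struct node *n)
	{
		unsigned long flags;

		spin_lock_irqsave(&old->lock, flags);
		list_del_rcu(&n->link);
		spin_unlock_irqrestore(&old->lock, flags);

		/* flush any in-flight iterator before reusing n->link */
		irq_work_sync(&old->irq_work);
	}

Note that list_del_rcu() alone is not enough: a walker that entered the list before the unlink may still be standing on the node, which is why the irq_work_sync() is needed before the node may join another list.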
Signed-off-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
---
 drivers/gpu/drm/i915/gt/intel_breadcrumbs.c |  5 +++--
 drivers/gpu/drm/i915/gt/intel_lrc.c         | 15 +++++++++++++++
 2 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_breadcrumbs.c b/drivers/gpu/drm/i915/gt/intel_breadcrumbs.c
index f5f6feed0fa6..5b84f51931d9 100644
--- a/drivers/gpu/drm/i915/gt/intel_breadcrumbs.c
+++ b/drivers/gpu/drm/i915/gt/intel_breadcrumbs.c
@@ -461,15 +461,16 @@ void i915_request_cancel_breadcrumb(struct i915_request *rq)
 {
 	struct intel_breadcrumbs *b = READ_ONCE(rq->engine)->breadcrumbs;
 	struct intel_context *ce = rq->context;
+	unsigned long flags;
 	bool release;
 
 	if (!test_and_clear_bit(I915_FENCE_FLAG_SIGNAL, &rq->fence.flags))
 		return;
 
-	spin_lock(&ce->signal_lock);
+	spin_lock_irqsave(&ce->signal_lock, flags);
 	list_del_rcu(&rq->signal_link);
 	release = remove_signaling_context(b, ce);
-	spin_unlock(&ce->signal_lock);
+	spin_unlock_irqrestore(&ce->signal_lock, flags);
 	if (release)
 		intel_context_put(ce);
 
diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
index f89e4f62c88d..6e57f30c80f0 100644
--- a/drivers/gpu/drm/i915/gt/intel_lrc.c
+++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
@@ -1387,6 +1387,21 @@ static inline void execlists_schedule_in(struct i915_request *rq, int idx)
 static void kick_siblings(struct i915_request *rq, struct intel_context *ce)
 {
 	struct virtual_engine *ve = container_of(ce, typeof(*ve), context);
+	struct intel_engine_cs *engine = rq->engine;
+
+	/* Flush concurrent rcu iterators in signal_irq_work */
+	if (test_bit(DMA_FENCE_FLAG_ENABLE_SIGNAL_BIT, &rq->fence.flags)) {
+		/*
+		 * After this point, the rq may be transferred to a new
+		 * sibling, so before we clear ce->inflight make sure that
+		 * the context has been removed from the b->signalers and
+		 * furthermore we need to make sure that the concurrent
+		 * iterator in signal_irq_work is no longer following
+		 * ce->signal_link.
+		 */
+		i915_request_cancel_breadcrumb(rq);
+		irq_work_sync(&engine->breadcrumbs->irq_work);
+	}
 
 	if (READ_ONCE(ve->request))
 		tasklet_hi_schedule(&ve->base.execlists.tasklet);
-- 
2.20.1
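An aside on the intel_breadcrumbs.c hunk: ce->signal_lock is also taken from the irq-side signaling path, so the likely motivation for the irqsave variants is that the cancel path can now be entered with interrupts enabled (via kick_siblings() above); a plain spin_lock() there could self-deadlock if the interrupt fires on the same CPU while the lock is held. A minimal sketch of that locking rule, with hypothetical names (shared_lock, irq_side, process_side are illustrative, not i915 identifiers):

	/* Hypothetical sketch of the irqsave rule; not i915 code. */
	#include <linux/spinlock.h>

	static DEFINE_SPINLOCK(shared_lock);

	/* runs in interrupt context (e.g. an irq_work handler) */
	static void irq_side(void)
	{
		spin_lock(&shared_lock);	/* irqs already disabled here */
		/* ... */
		spin_unlock(&shared_lock);
	}

	/* may be called with irqs enabled (e.g. process or tasklet context) */
	static void process_side(void)
	{
		unsigned long flags;

		/*
		 * A plain spin_lock() here could deadlock: if the interrupt
		 * arrives on this CPU while we hold the lock, irq_side()
		 * spins forever. Disabling local interrupts closes that
		 * window.
		 */
		spin_lock_irqsave(&shared_lock, flags);
		/* ... */
		spin_unlock_irqrestore(&shared_lock, flags);
	}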