Hi Janusz, On Tue, Jun 04, 2024 at 05:27:24PM +0200, Janusz Krzysztofik wrote: > On Tuesday, 4 June 2024 02:48:43 GMT+2 Andi Shyti wrote: > > On Mon, Jun 03, 2024 at 09:54:45PM +0200, Janusz Krzysztofik wrote: > > > CI has been sporadically reporting the following issue triggered by > > > igt@i915_selftest@live@hangcheck on ADL-P and similar machines: > > > > > > <6> [414.049203] i915: Running intel_hangcheck_live_selftests/igt_reset_evict_fence > > > ... > > > <6> [414.068804] i915 0000:00:02.0: [drm] GT0: GUC: submission enabled > > > <6> [414.068812] i915 0000:00:02.0: [drm] GT0: GUC: SLPC enabled > > > <3> [414.070354] Unable to pin Y-tiled fence; err:-4 > > > <3> [414.071282] i915_vma_revoke_fence:301 GEM_BUG_ON(!i915_active_is_idle(&fence->active)) > > > ... > > > <4>[ 609.603992] ------------[ cut here ]------------ > > > <2>[ 609.603995] kernel BUG at drivers/gpu/drm/i915/gt/intel_ggtt_fencing.c:301! > > > <4>[ 609.604003] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI > > > <4>[ 609.604006] CPU: 0 PID: 268 Comm: kworker/u64:3 Tainted: G U W 6.9.0-CI_DRM_14785-g1ba62f8cea9c+ #1 > > > <4>[ 609.604008] Hardware name: Intel Corporation Alder Lake Client Platform/AlderLake-P DDR4 RVP, BIOS RPLPFWI1.R00.4035.A00.2301200723 01/20/2023 > > > <4>[ 609.604010] Workqueue: i915 __i915_gem_free_work [i915] > > > <4>[ 609.604149] RIP: 0010:i915_vma_revoke_fence+0x187/0x1f0 [i915] > > > ... > > > <4>[ 609.604271] Call Trace: > > > <4>[ 609.604273] <TASK> > > > ... > > > <4>[ 609.604716] __i915_vma_evict+0x2e9/0x550 [i915] > > > <4>[ 609.604852] __i915_vma_unbind+0x7c/0x160 [i915] > > > <4>[ 609.604977] force_unbind+0x24/0xa0 [i915] > > > <4>[ 609.605098] i915_vma_destroy+0x2f/0xa0 [i915] > > > <4>[ 609.605210] __i915_gem_object_pages_fini+0x51/0x2f0 [i915] > > > <4>[ 609.605330] __i915_gem_free_objects.isra.0+0x6a/0xc0 [i915] > > > <4>[ 609.605440] process_scheduled_works+0x351/0x690 > > > ... > > > > > > In the past, there were similar failures reported by CI from other IGT > > > tests, observed on other platforms. > > > > > > Before commit 63baf4f3d587 ("drm/i915/gt: Only wait for GPU activity > > > before unbinding a GGTT fence"), i915_vma_revoke_fence() was waiting for > > > idleness of vma->active via fence_update(). That commit introduced > > > vma->fence->active in order for the fence_update() to be able to wait > > > selectively on that one instead of vma->active since only idleness of > > > fence registers was needed. But then, another commit 0d86ee35097a > > > ("drm/i915/gt: Make fence revocation unequivocal") replaced the call to > > > fence_update() in i915_vma_revoke_fence() with only fence_write(), and > > > also added that GEM_BUG_ON(!i915_active_is_idle(&fence->active)) in front. > > > No justification was provided on why we might then expect idleness of > > > vma->fence->active without first waiting on it. > > > > > > The issue can be potentially caused by a race among revocation of fence > > > registers on one side and sequential execution of signal callbacks invoked > > > on completion of a request that was using them on the other, still > > > processed in parallel to revocation of those fence registers. Fix it by > > > waiting for idleness of vma->fence->active in i915_vma_revoke_fence(). > > > > > > Fixes: 0d86ee35097a ("drm/i915/gt: Make fence revocation unequivocal") > > > Closes: https://gitlab.freedesktop.org/drm/intel/issues/10021 > > > Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@xxxxxxxxxxxxxxx> > > > Cc: stable@xxxxxxxxxxxxxxx # v5.8+ > > > > Just wondering whether we really need the stable kernel here. > > > > We have just an alleged failure reported on a selftest. I think > > we can drop the stable requirement. > > Please note there were similar failures from other tests reported by CI in the > past, e.g., https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/8846, > https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/10638. Would you like > me to mention them in commit description. I guess that's fine. > OTOH, stable will probably pick this patch up themselves, based on the Fixes: > commit ID, even if we drop the Cc: stable. I was referring to the whole Fixes+stable... but you are right, let's keep them... just an oversight from me. I retriggered another test... but I'm pretty sure this is good to go. Thanks, Andi > Anyway, please feel free to drop Cc: stable, you or anyone who will be > pushing.