On Fri, 27 May 2022 11:55:42 +0100
Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxxxxxxxx> wrote:

> On 27/05/2022 10:09, Mauro Carvalho Chehab wrote:
> > i915 selftest hangcheck is causing the i915 driver timeouts, as
> > reported by Intel CI:
> >
> > http://gfx-ci.fi.intel.com/cibuglog-ng/issuefilterassoc/24297?query_key=42a999f48fa6ecce068bc8126c069be7c31153b4
> >
> > When such test runs, the only output is:
> >
> > [ 68.811639] i915: Performing live selftests with st_random_seed=0xe138eac7 st_timeout=500
> > [ 68.811792] i915: Running hangcheck
> > [ 68.811859] i915: Running intel_hangcheck_live_selftests/igt_hang_sanitycheck
> > [ 68.816910] i915 0000:00:02.0: [drm] Cannot find any crtc or sizes
> > [ 68.841597] i915: Running intel_hangcheck_live_selftests/igt_reset_nop
> > [ 69.346347] igt_reset_nop: 80 resets
> > [ 69.362695] i915: Running intel_hangcheck_live_selftests/igt_reset_nop_engine
> > [ 69.863559] igt_reset_nop_engine(rcs0): 709 resets
> > [ 70.364924] igt_reset_nop_engine(bcs0): 903 resets
> > [ 70.866005] igt_reset_nop_engine(vcs0): 659 resets
> > [ 71.367934] igt_reset_nop_engine(vcs1): 549 resets
> > [ 71.869259] igt_reset_nop_engine(vecs0): 553 resets
> > [ 71.882592] i915: Running intel_hangcheck_live_selftests/igt_reset_idle_engine
> > [ 72.383554] rcs0: Completed 16605 idle resets
> > [ 72.884599] bcs0: Completed 18641 idle resets
> > [ 73.385592] vcs0: Completed 17517 idle resets
> > [ 73.886658] vcs1: Completed 15474 idle resets
> > [ 74.387600] vecs0: Completed 17983 idle resets
> > [ 74.387667] i915: Running intel_hangcheck_live_selftests/igt_reset_active_engine
> > [ 74.889017] rcs0: Completed 747 active resets
> > [ 75.174240] intel_engine_reset(bcs0) failed, err:-110
> > [ 75.174301] bcs0: Completed 525 active resets
> >
> > After that, the machine just silently hangs.
> >
> > The root cause is that the flush TLB logic is not working as
> > expected on GEN8.
> >
> > Tested on an Intel NUC5i7RYB with an i7-5557U Broadwell CPU.
> >
> > This patch partially reverts the logic by skipping GEN8 from
> > the TLB cache flush.
>
> Since I am pretty sure no such failures were spotted when merging the
> feature I assume the failure is sporadic and/or limited to some
> configurations? Do you have any details there? Because it is an
> important security issue we should not revert it lightly.

It occurs every time here:

https://intel-gfx-ci.01.org/tree/drm-tip/fi-bdw-5557u.html

It also happens every time on my own NUC5i7RYB when the TLB patch is
applied. Reverting it (or applying this fix) is enough for hangcheck to
pass.

I suspect that the TLB flush never happens there, causing the ETIMEDOUT
(err:-110 in the log above) at hangcheck.

It could indeed be limited to some specific setups. I dunno. The only
Gen8 machine I have access to is my own NUC, so I can't test it
elsewhere.

Regards,
Mauro