> -----Original Message-----
> From: Richard Fitzgerald <rf@xxxxxxxxxxxxxxxxxxxxx>
> Sent: Wednesday, January 31, 2024 4:05 PM
> To: Borah, Chaitanya Kumar <chaitanya.kumar.borah@xxxxxxxxx>
> Cc: intel-gfx@xxxxxxxxxxxxxxxxxxxxx; Kurmi, Suresh Kumar
> <suresh.kumar.kurmi@xxxxxxxxx>; Saarinen, Jani <jani.saarinen@xxxxxxxxx>;
> David Gow <davidgow@xxxxxxxxxx>; kunit-dev@xxxxxxxxxxxxxxxx;
> linux-kselftest@xxxxxxxxxxxxxxx
> Subject: Re: Regression on drm-tip
>
> On 31/1/24 05:34, Borah, Chaitanya Kumar wrote:
> > Hello Richard,
> >
> > Hope you are doing well. I am Chaitanya from the Linux graphics team in
> > Intel.
> >
> > This mail is regarding a regression we are seeing in our CI runs[1] on
> > drm-tip[2] repository.
> > These are captured by gitlab issues[3].
> >
> > We bisected the issue and have found the following commit to be the first
> > bad commit.
> > ``````````````````````````````````````````````````````````````````````
> > ```````````````````````````````````
> > commit a0b84213f947176ddcd0e96e0751a109f28cde21
> > Author: Richard Fitzgerald rf@xxxxxxxxxxxxxxxxxxxxx
> > Date: Mon Dec 18 15:17:29 2023 +0000
> >
> >     kunit: Fix NULL-dereference in kunit_init_suite() if suite->log is NULL
> >
> >     suite->log must be checked for NULL before passing it to
> >     string_stream_clear(). This was done in kunit_init_test() but was missing
> >     from kunit_init_suite().
> >
> >     Signed-off-by: Richard Fitzgerald rf@xxxxxxxxxxxxxxxxxxxxx
> >     Fixes: 6d696c4695c5 ("kunit: add ability to run tests after boot using debugfs")
> >     Reviewed-by: Rae Moar rmoar@xxxxxxxxxx
> >     Acked-by: David Gow davidgow@xxxxxxxxxx
> >     Reviewed-by: Muhammad Usama Anjum usama.anjum@xxxxxxxxxxxxx
> >     Signed-off-by: Shuah Khan skhan@xxxxxxxxxxxxxxxxxxx
> >
> >  lib/kunit/test.c | 4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> > ``````````````````````````````````````````````````````````````````````
> > ```````````````````````````````````
> > We tried reverting the patch and the original issue is not seen but it
> > results in NULL pointer dereference[4] which I am guessing is expected.
> >
> > Could you please check why the patch causes this regression and provide a
> > fix if necessary?
> >
> > [1] https://intel-gfx-ci.01.org/tree/drm-tip/index.html?testfilter=drm
> > [2] https://cgit.freedesktop.org/drm-tip/
> > [3] https://gitlab.freedesktop.org/drm/intel/-/issues/10140
> >     https://gitlab.freedesktop.org/drm/intel/-/issues/10143
> > [4]
> > [ 179.849411] [IGT] drm_buddy: executing
> > [ 179.856385] [IGT] drm_buddy: starting subtest drm_buddy
> > [ 179.862594] KTAP version 1
> > [ 179.862600] 1..1
> > [ 179.863375] BUG: kernel NULL pointer dereference, address: 0000000000000030
> > [ 179.863381] #PF: supervisor read access in kernel mode
> > [ 179.863384] #PF: error_code(0x0000) - not-present page
> > [ 179.863387] PGD 0 P4D 0
> > [ 179.863391] Oops: 0000 [#1] PREEMPT SMP NOPTI
> > [ 179.863395] CPU: 1 PID: 1319 Comm: drm_buddy Not tainted 6.8.0-rc1-bisecttrail015 #16
> > [ 179.863398] Hardware name: Intel Corporation Meteor Lake Client Platform/MTL-P DDR5 SODIMM SBS RVP, BIOS MTLPFWI1.R00.3471.D81.2311291340 11/29/2023
> > [ 179.863400] RIP: 0010:__lock_acquire+0x71f/0x2300
> > [ 179.863408] Code: 84 03 06 00 00 44 8b 15 27 f6 72 01 45 85 d2 0f 84 9c 00 00 00 f6 45 22 10 0f 84 63 03 00 00 41 bf 01 00 00 00 e9 8a 00 00 00 <48> 81 3f 40 d7 fa 82 41 b9 00 00 00 00 45 0f 45 c8 83 fe 01 0f 87
> > ...
> > [ 179.863445] PKRU: 55555554
> > [ 179.863448] Call Trace:
> > [ 179.863450] <TASK>
> > [ 179.863453] ? __die_body+0x1a/0x60
> > [ 179.863459] ? page_fault_oops+0x156/0x450
> > [ 179.863465] ? do_user_addr_fault+0x65/0x9e0
> > [ 179.863472] ? exc_page_fault+0x68/0x1a0
> > [ 179.863479] ? asm_exc_page_fault+0x26/0x30
> > [ 179.863487] ? __lock_acquire+0x71f/0x2300
> > [ 179.863493] ? __pfx_do_sync_core+0x10/0x10
> > [ 179.863500] lock_acquire+0xd8/0x2d0
> > [ 179.863505] ? string_stream_clear+0x29/0xb0 [kunit]
> > [ 179.863523] _raw_spin_lock+0x2e/0x40
> > [ 179.863528] ? string_stream_clear+0x29/0xb0 [kunit]
> > [ 179.863540] string_stream_clear+0x29/0xb0 [kunit]
> > [ 179.863554] __kunit_test_suites_init+0x7e/0xe0 [kunit]
> > [ 179.863568] kunit_module_notify+0x20f/0x220 [kunit]
> > [ 179.863583] notifier_call_chain+0x46/0x130
> > [ 179.863591] notifier_call_chain_robust+0x3e/0x90
> > [ 179.863598] blocking_notifier_call_chain_robust+0x42/0x60
> > [ 179.863605] load_module+0x1bcd/0x1f80
> > [ 179.863617] ? init_module_from_file+0x86/0xd0
> > [ 179.863621] init_module_from_file+0x86/0xd0
> > [ 179.863629] idempotent_init_module+0x17c/0x230
> > [ 179.863637] __x64_sys_finit_module+0x56/0xb0
> > [ 179.863642] do_syscall_64+0x6f/0x140
> > [ 179.863649] entry_SYSCALL_64_after_hwframe+0x6e/0x76
> > [ 179.863654] RIP: 0033:0x7f0e6676195d
>
> Looking at the gitlab bug reports compared to the crash log above:
>
> [3] You have hit a failure on the 3rd test case:
>
> <6> [59.039608] [IGT] drm_buddy: starting dynamic subtest drm_test_buddy_alloc_limit
> <6> [59.077701] KTAP version 1
> <6> [59.077705] 1..1
> <6> [59.078487] KTAP version 1
> <6> [59.078494] # Subtest: drm_buddy
> <6> [59.078496] # module: drm_buddy_test
> <6> [59.078498] 1..4
> <6> [59.079321] ok 1 drm_test_buddy_alloc_limit
> <6> [59.079973] ok 2 drm_test_buddy_alloc_optimistic
> <6> [59.080479] [IGT] drm_buddy: finished subtest drm_test_buddy_alloc_limit, SUCCESS
>
> When you revert my NULL-dereference bugfix, you are hitting the NULL
> dereference crash immediately, before executing the test case that is
> causing [3].
>
> > [ 179.862594] KTAP version 1
> > [ 179.862600] 1..1
> > [ 179.863375] BUG: kernel NULL pointer dereference
>
> So, my commit is not causing your [3]. It is allowing you to reach your test
> case that is causing [3].

Understood. I think we pulled the trigger too soon on this one. I see that
David has sent a quick patch. We will check if that helps.

Regards
Chaitanya