RE: Regression on drm-tip

> -----Original Message-----
> From: Richard Fitzgerald <rf@xxxxxxxxxxxxxxxxxxxxx>
> Sent: Wednesday, January 31, 2024 4:05 PM
> To: Borah, Chaitanya Kumar <chaitanya.kumar.borah@xxxxxxxxx>
> Cc: intel-gfx@xxxxxxxxxxxxxxxxxxxxx; Kurmi, Suresh Kumar
> <suresh.kumar.kurmi@xxxxxxxxx>; Saarinen, Jani <jani.saarinen@xxxxxxxxx>;
> David Gow <davidgow@xxxxxxxxxx>; kunit-dev@xxxxxxxxxxxxxxxx;
> linux-kselftest@xxxxxxxxxxxxxxx
> Subject: Re: Regression on drm-tip
> 
> On 31/1/24 05:34, Borah, Chaitanya Kumar wrote:
> > Hello Richard,
> >
> > Hope you are doing well. I am Chaitanya from the Linux graphics team in
> Intel.
> >
> > This mail is regarding a regression we are seeing in our CI runs[1] on the
> > drm-tip[2] repository.
> > These are captured by gitlab issues[3].
> >
> > We bisected the issue and have found the following commit to be the first
> bad commit.
> > ``````````````````````````````````````````````````````````````````````
> > ```````````````````````````````````
> > commit a0b84213f947176ddcd0e96e0751a109f28cde21
> > Author: Richard Fitzgerald rf@xxxxxxxxxxxxxxxxxxxxx
> > Date:   Mon Dec 18 15:17:29 2023 +0000
> >
> >      kunit: Fix NULL-dereference in kunit_init_suite() if suite->log is NULL
> >
> >      suite->log must be checked for NULL before passing it to
> >      string_stream_clear(). This was done in kunit_init_test() but was missing
> >      from kunit_init_suite().
> >
> >      Signed-off-by: Richard Fitzgerald rf@xxxxxxxxxxxxxxxxxxxxx
> >      Fixes: 6d696c4695c5 ("kunit: add ability to run tests after boot using debugfs")
> >      Reviewed-by: Rae Moar rmoar@xxxxxxxxxx
> >      Acked-by: David Gow davidgow@xxxxxxxxxx
> >      Reviewed-by: Muhammad Usama Anjum usama.anjum@xxxxxxxxxxxxx
> >      Signed-off-by: Shuah Khan skhan@xxxxxxxxxxxxxxxxxxx
> >
> > lib/kunit/test.c | 4 +++-
> > 1 file changed, 3 insertions(+), 1 deletion(-)
> > ``````````````````````````````````````````````````````````````````````
> > ```````````````````````````````````
> > We tried reverting the patch, and the original issue is no longer seen, but the
> > revert results in a NULL pointer dereference[4], which I am guessing is expected.
> >
> > Could you please check why the patch causes this regression and provide a
> fix if necessary?
> >
> > [1] https://intel-gfx-ci.01.org/tree/drm-tip/index.html?testfilter=drm
> > [2] https://cgit.freedesktop.org/drm-tip/
> > [3] https://gitlab.freedesktop.org/drm/intel/-/issues/10140
> >        https://gitlab.freedesktop.org/drm/intel/-/issues/10143
> > [4]
> > 	[  179.849411] [IGT] drm_buddy: executing
> > 	[  179.856385] [IGT] drm_buddy: starting subtest drm_buddy
> > 	[  179.862594] KTAP version 1
> > 	[  179.862600] 1..1
> > 	[  179.863375] BUG: kernel NULL pointer dereference, address: 0000000000000030
> > 	[  179.863381] #PF: supervisor read access in kernel mode
> > 	[  179.863384] #PF: error_code(0x0000) - not-present page
> > 	[  179.863387] PGD 0 P4D 0
> > 	[  179.863391] Oops: 0000 [#1] PREEMPT SMP NOPTI
> > 	[  179.863395] CPU: 1 PID: 1319 Comm: drm_buddy Not tainted 6.8.0-rc1-bisecttrail015 #16
> > 	[  179.863398] Hardware name: Intel Corporation Meteor Lake Client Platform/MTL-P DDR5 SODIMM SBS RVP, BIOS MTLPFWI1.R00.3471.D81.2311291340 11/29/2023
> > 	[  179.863400] RIP: 0010:__lock_acquire+0x71f/0x2300
> > 	[  179.863408] Code: 84 03 06 00 00 44 8b 15 27 f6 72 01 45 85 d2 0f
> 84 9c 00 00 00 f6 45 22 10 0f 84 63 03 00 00 41 bf 01 00 00 00 e9 8a 00 00 00
> <48> 81 3f 40 d7 fa 82 41 b9 00 00 00 00 45 0f 	45 c8 83 fe 01 0f 87
> > 	...
> > 	[  179.863445] PKRU: 55555554
> > 	[  179.863448] Call Trace:
> > 	[  179.863450]  <TASK>
> > 	[  179.863453]  ? __die_body+0x1a/0x60
> > 	[  179.863459]  ? page_fault_oops+0x156/0x450
> > 	[  179.863465]  ? do_user_addr_fault+0x65/0x9e0
> > 	[  179.863472]  ? exc_page_fault+0x68/0x1a0
> > 	[  179.863479]  ? asm_exc_page_fault+0x26/0x30
> > 	[  179.863487]  ? __lock_acquire+0x71f/0x2300
> > 	[  179.863493]  ? __pfx_do_sync_core+0x10/0x10
> > 	[  179.863500]  lock_acquire+0xd8/0x2d0
> > 	[  179.863505]  ? string_stream_clear+0x29/0xb0 [kunit]
> > 	[  179.863523]  _raw_spin_lock+0x2e/0x40
> > 	[  179.863528]  ? string_stream_clear+0x29/0xb0 [kunit]
> > 	[  179.863540]  string_stream_clear+0x29/0xb0 [kunit]
> > 	[  179.863554]  __kunit_test_suites_init+0x7e/0xe0 [kunit]
> > 	[  179.863568]  kunit_module_notify+0x20f/0x220 [kunit]
> > 	[  179.863583]  notifier_call_chain+0x46/0x130
> > 	[  179.863591]  notifier_call_chain_robust+0x3e/0x90
> > 	[  179.863598]  blocking_notifier_call_chain_robust+0x42/0x60
> > 	[  179.863605]  load_module+0x1bcd/0x1f80
> > 	[  179.863617]  ? init_module_from_file+0x86/0xd0
> > 	[  179.863621]  init_module_from_file+0x86/0xd0
> > 	[  179.863629]  idempotent_init_module+0x17c/0x230
> > 	[  179.863637]  __x64_sys_finit_module+0x56/0xb0
> > 	[  179.863642]  do_syscall_64+0x6f/0x140
> > 	[  179.863649]  entry_SYSCALL_64_after_hwframe+0x6e/0x76
> > 	[  179.863654] RIP: 0033:0x7f0e6676195d
> 
> Looking at the gitlab bug reports compared to the crash log above:
> 
> [3] You have hit a failure on the 3rd test case:
> 
>      <6> [59.039608] [IGT] drm_buddy: starting dynamic subtest drm_test_buddy_alloc_limit
>      <6> [59.077701] KTAP version 1
>      <6> [59.077705] 1..1
>      <6> [59.078487]     KTAP version 1
>      <6> [59.078494]     # Subtest: drm_buddy
>      <6> [59.078496]     # module: drm_buddy_test
>      <6> [59.078498]     1..4
>      <6> [59.079321]     ok 1 drm_test_buddy_alloc_limit
>      <6> [59.079973]     ok 2 drm_test_buddy_alloc_optimistic
>      <6> [59.080479] [IGT] drm_buddy: finished subtest drm_test_buddy_alloc_limit, SUCCESS
> 
> When you revert my NULL-dereference bugfix, you are hitting the NULL
> dereference crash immediately, before executing the test case that is causing
> [3].
> 
>      > [  179.862594] KTAP version 1
>      > [  179.862600] 1..1
>      > [  179.863375] BUG: kernel NULL pointer dereference
> 
> So, my commit is not causing your [3]. It is allowing you to reach your test
> case that is causing [3].

Understood. I think we pulled the trigger too soon on this one.

I see that David has sent a quick patch. We will check if that helps.

Regards

Chaitanya



