Hello,
On 6/8/2023 9:55 PM, Paul E. McKenney wrote:
On Thu, Jun 08, 2023 at 10:33:18AM +0200, Peter Zijlstra wrote:
On Wed, Jun 07, 2023 at 12:24:12PM -0700, Paul E. McKenney wrote:
On Wed, Jun 07, 2023 at 04:48:19PM +0530, Jain, Ayush wrote:
Hello All,
I observed a NULL pointer dereference during an rcutorture test on the linux-next tree
from next-20230602.
Commit ID: bc708bbd8260ee4eb3428b0109f5f3be661fae46 (HEAD, tag: next-20230602)
The log trace follows:
[12133.344278] rcu-torture: rcu_torture_read_exit: Start of test
[12133.344282] rcu-torture: rcu_torture_read_exit: Start of episode
[12138.350637] rcu-torture: rcu_torture_read_exit: End of episode
[12143.419412] smpboot: CPU 1 is now offline
[12143.427996] BUG: kernel NULL pointer dereference, address: 0000000000000128
[12143.435777] #PF: supervisor read access in kernel mode
[12143.441517] #PF: error_code(0x0000) - not-present page
[12143.447256] PGD 0 P4D 0
[12143.450087] Oops: 0000 [#1] PREEMPT SMP NOPTI
[12143.454955] CPU: 68 PID: 978653 Comm: rcu_torture_rea Kdump: loaded Not tainted 6.4.0-rc5-next-20230606-1686061107994 #1
[12143.467095] Hardware name: AMD Corporation Speedway/Speedway, BIOS RSW1009C 07/27/2018
[12143.475934] RIP: 0010:__bitmap_and+0x18/0x70
[12143.480713] Code: 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 41 89 cb 49 89 f9 41 c1 eb 06 74 51 45 89 da 31 c0 45 31 c0 <48> 8b 3c c6 48 23 3c c2 49 89 3c c1 48 83 c0 01 49 09 f8 49 39 c2
[12143.501675] RSP: 0018:ffffa3a90db70d90 EFLAGS: 00010046
[12143.507510] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000080
[12143.515468] RDX: ffff8a1ec17a1d68 RSI: 0000000000000128 RDI: ffff8a1e800429c0
[12143.523425] RBP: ffff8a1ec17a1980 R08: 0000000000000000 R09: ffff8a1e800429c0
[12143.531385] R10: 0000000000000002 R11: 0000000000000002 R12: ffff8a1e800429c0
[12143.539352] R13: 0000000000000000 R14: 0000000000032580 R15: 0000000000000000
[12143.547320] FS: 0000000000000000(0000) GS:ffff8a2dbf100000(0000) knlGS:0000000000000000
[12143.556354] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[12143.562770] CR2: 0000000000000128 CR3: 0000003089e50000 CR4: 00000000003506e0
[12143.570729] Call Trace:
[12143.573463] <IRQ>
[12143.575714] ? __die+0x24/0x70
[12143.579130] ? page_fault_oops+0x82/0x150
[12143.583615] ? exc_page_fault+0x69/0x150
[12143.588001] ? asm_exc_page_fault+0x26/0x30
[12143.592678] ? __bitmap_and+0x18/0x70
[12143.596768] select_idle_cpu+0x84/0x3d0
[12143.601059] select_idle_sibling+0x1b7/0x500
[12143.605831] select_task_rq_fair+0x1b2/0x2e0
[12143.610603] select_task_rq+0x7a/0xc0
[12143.614696] try_to_wake_up+0xe8/0x550
[12143.618885] ? update_process_times+0x83/0x90
[12143.623747] ? __pfx_hrtimer_wakeup+0x10/0x10
[12143.628615] hrtimer_wakeup+0x22/0x30
[12143.632706] __hrtimer_run_queues+0x112/0x2b0
[12143.637574] hrtimer_interrupt+0x100/0x240
[12143.642152] __sysvec_apic_timer_interrupt+0x63/0x130
[12143.647796] sysvec_apic_timer_interrupt+0x71/0x90
[12143.653149] </IRQ>
[12143.655493] <TASK>
[12143.657834] asm_sysvec_apic_timer_interrupt+0x1a/0x20
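For anyone triaging a similar oops, here is a minimal sketch (not part of the original report) that checks which general-purpose register holds the faulting address reported in CR2. The helper names parse_registers and registers_matching_fault are hypothetical. The argument mapping assumes the x86-64 System V calling convention and that the registers still hold their function-entry values at the faulting instruction, which is not guaranteed in general.

```python
import re

# Register lines copied from the oops above (timestamps dropped).
OOPS = """\
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000080
RDX: ffff8a1ec17a1d68 RSI: 0000000000000128 RDI: ffff8a1e800429c0
RBP: ffff8a1ec17a1980 R08: 0000000000000000 R09: ffff8a1e800429c0
CR2: 0000000000000128 CR3: 0000003089e50000 CR4: 00000000003506e0
"""

# Assumption: x86-64 System V calling convention, first four integer arguments.
ARG_REGS = ["RDI", "RSI", "RDX", "RCX"]


def parse_registers(text):
    """Return {name: value} for every 'NAME: <16 hex digits>' pair in the dump."""
    pairs = re.findall(r"\b([A-Z][A-Z0-9]{1,2}): ([0-9a-f]{16})\b", text)
    return {name: int(value, 16) for name, value in pairs}


def registers_matching_fault(regs):
    """Names of non-control registers whose value equals the fault address (CR2)."""
    fault = regs["CR2"]
    return sorted(n for n, v in regs.items()
                  if v == fault and not n.startswith("CR"))


if __name__ == "__main__":
    regs = parse_registers(OOPS)
    for name in registers_matching_fault(regs):
        if name in ARG_REGS:
            print(f"{name} holds the fault address; argument {ARG_REGS.index(name) + 1}")
        else:
            print(f"{name} holds the fault address")
```

Here RSI matches CR2, so under those assumptions the faulting read went through the second argument of __bitmap_and(), a pointer value of 0x128. Such a small address usually suggests a member access through a NULL base pointer, though nothing in this thread confirms which pointer that was.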
I'm thinking this is because of ("sched/fair: Multi-LLC
select_idle_sibling()") which I've already dropped from tip/sched/core
(and should hopefully also disappear from -next if it hasn't already).
Yes, that was very likely it; I don't see any rcutorture test failures in
today's next (next-20230609) build.
Also see:
https://lkml.kernel.org/r/20230605175636.GA4253@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
That sounds a lot easier than bisecting, thank you!
Thanx, Paul
Thank you both for your help.
Thanks & Regards,
Ayush Jain