Re: NULL pointer issue in rcu_do_batch()

"Paul E. McKenney" <paulmck@xxxxxxxxxx> · Thu, 22 Dec 2022 08:40:24 -0800

On Thu, Dec 22, 2022 at 08:11:23AM -0500, Joel Fernandes wrote:
> 
> 
> > On Dec 22, 2022, at 6:34 AM, Mukesh Ojha <quic_mojha@xxxxxxxxxxx> wrote:
> > 
> > Hi All,
> > 
> > We are observing a NULL pointer dereference in rcu_do_batch() on 5.15, although it is very hard to hit.
> > 
> > Wanted to check whether it has been reported and fixed in a recent kernel?
> 
> What is the test case? I have not seen such corruption. Is it possible for you to run with CONFIG_PROVE_RCU?

What Joel said!

Another common cause of this is double call_rcu(), free-after-call_rcu(),
or similar.  CONFIG_DEBUG_OBJECTS_RCU_HEAD can help track these down,
and KASAN can also be helpful.
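
To make the double call_rcu() case concrete, here is a minimal user-space
sketch, not kernel code.  The names cb_head, cb_list, cb_enqueue() and
cb_do_batch() are hypothetical stand-ins for rcu_head, the callback list,
call_rcu() and rcu_do_batch().  It assumes the invocation loop clears
->func before calling it, which is my reading of rcu_do_batch() and should
be checked against the tree in question.  Under that assumption, queueing
the same head twice makes the head point to itself, so the second visit
finds ->func already NULL:

```c
#include <stddef.h>

/* Hypothetical stand-in for struct rcu_head. */
struct cb_head {
	struct cb_head *next;
	void (*func)(struct cb_head *head);
};

/* Hypothetical singly linked callback list with a tail pointer. */
struct cb_list {
	struct cb_head *head;
	struct cb_head **tail;
};

static int invoked;	/* number of callbacks actually run */

static void count_cb(struct cb_head *head)
{
	(void)head;
	invoked++;
}

static void cb_list_init(struct cb_list *l)
{
	l->head = NULL;
	l->tail = &l->head;
}

/* Sketch of call_rcu(): append h to the list with no double-queue check. */
static void cb_enqueue(struct cb_list *l, struct cb_head *h,
		       void (*func)(struct cb_head *head))
{
	h->func = func;
	h->next = NULL;
	*l->tail = h;		/* if h is already the tail, this sets h->next = h */
	l->tail = &h->next;
}

/*
 * Sketch of the invocation loop.  Returns 1 if it meets a head whose
 * ->func is NULL (where a real kernel would branch to address 0),
 * or 0 if every callback ran.
 */
static int cb_do_batch(struct cb_list *l)
{
	struct cb_head *h = l->head;

	while (h) {
		struct cb_head *next = h->next;
		void (*f)(struct cb_head *head) = h->func;

		if (!f)
			return 1;
		h->func = NULL;	/* assumption: cleared before invocation */
		f(h);
		h = next;
	}
	cb_list_init(l);
	return 0;
}
```

Enqueueing two distinct heads and running cb_do_batch() invokes both and
returns 0.  Enqueueing the same head twice invokes it once and then
returns 1 on the second visit, which matches a NULL ->func with the rest
of the structure intact.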

							Thanx, Paul

> This looks like an Android kernel; I can tell from the VendorHooks in the log. So with all that GKI stuff, are we sure that is not causing some unforeseen side effect?
> 
> Thanks,
> 
>  - Joel
> 
> 
> > <1>[16.814014] [pid:    58] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
> > <0>[16.814027] [pid:    58] PC Code: bad value
> > <0>[16.814034] [pid:    58] LR Code: f81e03a8 b5000068 d10083a8 f81e83a8 aa1f03f6 91127319 d10083b7 f9434b68 d503201f f9400408 910006d6 f900041f d63f0100 (91004308) b8bfc108 374001c8 97ffff2b 9111e308 38bfc108 72001d1f
> > 
> > <4>[16.814359] [pid:    58] CPU: 7 PID: 58 Comm: rcuop/5 Tainted: G S   W  OE     5.15.41-android13-8-25574579-abS911USQU1AVLL #1
> > <4>[16.814361] [pid:    58] Hardware name: XXXXX
> > <4>[16.814362] [pid:    58] pstate: 42400805 (nZcv daif +PAN -UAO +TCO -DIT -SSBS BTYPE=-c)
> > <4>[16.814364] [pid:    58] pc : 0x0
> > <4>[16.814365] [pid:    58] lr : rcu_do_batch+0x328/0xcd8
> > 
> > 
> > rcu_data for CPU5 contains 12 additional RCU callback heads in the RCU_DONE_TAIL segment whose func is NULL. It doesn’t seem to be random memory corruption, since only rhp->func is set to NULL across multiple objects.
> > 
> > There is one more occurrence with CONFIG_CFI_CLANG enabled.
> > 
> > [123587.101222][   T44] Kernel panic - not syncing: CFI failure (target: 0x0)
> > [123587.101249][   T44] CPU: 0 PID: 44 Comm: rcuop/3 Tainted: G S WC OE     5.15.41 #1
> > [123587.101263][   T44] Hardware name: XXXXX
> > [123587.101274][   T44] Call trace:
> > [123587.101283][   T44]  dump_backtrace.cfi_jt+0x0/0x8
> > [123587.101298][   T44]  show_stack+0x1c/0x2c
> > [123587.101311][   T44]  dump_stack_lvl+0x94/0x100
> > [123587.101326][   T44]  panic+0x17c/0x450
> > [123587.101338][   T44]  find_check_fn+0x0/0x210
> > [123587.101349][   T44]  rcu_do_batch+0x368/0x6f8
> > [123587.101362][   T44]  nocb_cb_wait+0x80/0x450
> > [123587.101374][   T44]  rcu_nocb_cb_kthread+0x54/0x90
> > [123587.101386][   T44]  kthread+0x174/0x1d8
> > [123587.101398][   T44]  ret_from_fork+0x10/0x20
> > [123587.101410][   T44] SMP: stopping secondary CPUs
> > [123587.101670][    C4] VendorHooks: CPU4: stopping
> > 
> > -Mukesh