On Wed, Sep 16, 2020 at 01:48:22PM -0700, Nick Desaulniers wrote:
> Hey Paul and RCU folks,
> I noticed we have a bug report from 2 users that seem to have similar
> stack traces in SRCU code;
> https://github.com/ClangBuiltLinux/linux/issues/1081
>
> Is there a way we should go about starting to debug this?

Hello, Nick,

Huh.  It looks like the per-CPU memory referenced by the srcu_struct
structure's ->sda field is unmapped.  That would certainly leave the
next __srcu_read_lock() dazed and confused!

The trapping instruction is the increment instruction that I would
expect to be there.  The source code is as follows:

        idx = READ_ONCE(ssp->srcu_idx) & 0x1;
        this_cpu_inc(ssp->sda->srcu_lock_count[idx]);
        smp_mb();

Looking at the assembly:

  1e:   55                      push   %ebp
  1f:   89 e5                   mov    %esp,%ebp

The above is function preamble.

  21:   8b 48 68                mov    0x68(%eax),%ecx

The above instruction does READ_ONCE(ssp->srcu_idx).

  24:   8b 40 7c                mov    0x7c(%eax),%eax

The above instruction fetches ssp->sda into %eax.  I therefore find
it quite surprising that the dump contains "EAX: 00000000".  Or is
this register value inaccurate?

  27:   83 e1 01                and    $0x1,%ecx

The above instruction does the "& 0x1".  Therefore, at this point,
%eax contains the address of the per-CPU srcu_data structure, but
without the per-CPU offset having been applied.  Also, %ecx contains
the array index, either 0 or 1.  Here we have zero, which is perfectly
legitimate.

  2a:*  64 ff 04 88             incl   %fs:(%eax,%ecx,4)

The above instruction does the this_cpu_inc().  Here %fs is presumably
this CPU's offset from the base address of the per-CPU ->sda pointer.

  2e:   f0 83 44 24 fc 00       lock addl $0x0,-0x4(%esp)

The above instruction is the smp_mb().

So here are a few questions that I would ask:

1.      Did the init_srcu_struct() for this srcu_struct report an
        error?  (Though with current mainline, that memory-allocation
        failure would more likely have page-faulted in
        init_srcu_struct().)

2.      Has the srcu_struct in question already been passed to
        cleanup_srcu_struct()?

3.      Has the value of %fs been clobbered?  Though that seems
        unlikely given that it also happens on aarch64.  Plus, the
        smoking gun seems to me to be the zero value of %eax.

4.      If the above three questions fail to provide enlightenment,
        I suggest adding debug checks to anything that can unmap
        memory, and recording the value of ->sda somewhere so that
        you can check whether it is being changed.  (It should remain
        constant from init_srcu_struct()'s return through the
        corresponding call to cleanup_srcu_struct().)

Please let me know how it goes!

                                                        Thanx, Paul
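
For what it is worth, the ->sda check from question 4 and the address
arithmetic of the trapping instruction can be sketched in userspace C.
This is only an illustration under stated assumptions, not kernel code:
struct srcu_model, struct sda_guard, and all three function names are
hypothetical, and incl_target() follows the reading above that %fs
supplies this CPU's offset.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct srcu_struct; only ->sda is modeled. */
struct srcu_model {
	void *sda;	/* per-CPU base pointer, before the per-CPU offset */
};

/* Hypothetical debug record: the ->sda value seen right after init. */
struct sda_guard {
	const struct srcu_model *ssp;
	void *sda_at_init;
};

/* Snapshot ->sda once initialization has returned successfully. */
static void sda_guard_record(struct sda_guard *g, const struct srcu_model *ssp)
{
	g->ssp = ssp;
	g->sda_at_init = ssp->sda;
}

/* True if ->sda is non-NULL and unchanged since sda_guard_record(). */
static bool sda_guard_ok(const struct sda_guard *g)
{
	return g->ssp->sda != NULL && g->ssp->sda == g->sda_at_init;
}

/*
 * Address touched by "incl %fs:(%eax,%ecx,4)":
 * segment base + %eax + %ecx * 4, in 32-bit wraparound arithmetic.
 */
static uint32_t incl_target(uint32_t fs_base, uint32_t eax, uint32_t ecx)
{
	return fs_base + eax + ecx * 4u;
}
```

With %eax and %ecx both zero, as in the dump, incl_target() returns the
%fs base itself, so the increment would land at the very start of this
CPU's per-CPU area rather than inside an srcu_data structure.  Calling
something like sda_guard_ok() just before the read-side increment would
show whether ->sda was already NULL or had changed by then.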