On Wed, Sep 16, 2020 at 02:37:30PM -0700, Paul E. McKenney wrote:
> On Wed, Sep 16, 2020 at 01:48:22PM -0700, Nick Desaulniers wrote:
> > Hey Paul and RCU folks,
> > I noticed we have a bug report from 2 users that seem to have similar
> > stack traces in SRCU code;
> > https://github.com/ClangBuiltLinux/linux/issues/1081
> >
> > Is there a way we should go about starting to debug this?
>
> Hello, Nick,
>
> Huh.  It looks like the per-CPU memory referenced by the srcu_struct
> structure's ->sda field is unmapped.  That would certainly leave
> the next __srcu_read_lock() dazed and confused!
>
> The trapping instruction is the increment instruction that I would
> expect to be there.  The source code is as follows:
>
>     idx = READ_ONCE(ssp->srcu_idx) & 0x1;
>     this_cpu_inc(ssp->sda->srcu_lock_count[idx]);
>     smp_mb();
>
> Looking at the assembly:
>
>   1e:   55                      push   %ebp
>   1f:   89 e5                   mov    %esp,%ebp
>
> The above is function preamble.
>
>   21:   8b 48 68                mov    0x68(%eax),%ecx
>
> The above instruction does READ_ONCE(ssp->srcu_idx).
>
>   24:   8b 40 7c                mov    0x7c(%eax),%eax
>
> The above instruction fetches ssp->sda into %eax.  I therefore find it
> quite surprising that the dump contains "EAX: 00000000".  Or is this
> register value inaccurate?
>
>   27:   83 e1 01                and    $0x1,%ecx
>
> The above instruction does the "& 0x1".  Therefore, at this point,
> %eax contains the address of the per-CPU srcu_data structure, but
> without the per-CPU offset having been applied.  Also, %ecx contains
> the array index, either 0 or 1.  Here we have zero, which is perfectly
> legitimate.
>
>   2a:*  64 ff 04 88             incl   %fs:(%eax,%ecx,4)
>
> The above instruction does the this_cpu_inc().  Here %fs is presumably
> this CPU's offset from the base address of the per-CPU ->sda pointer.
>
>   2e:   f0 83 44 24 fc 00       lock addl $0x0,-0x4(%esp)
>
> The above instruction is the smp_mb().
>
> So here are a few questions that I would ask:

Oh, and this one:

0.  Did someone call srcu_read_lock() before init_srcu_struct() had
    been called on this srcu_struct structure?

                            Thanx, Paul

> 1.  Did the init_srcu_struct() for this srcu_struct report an error?
>     (Though with current mainline, that memory-allocation failure
>     would more likely have page-faulted in init_srcu_struct().)
>
> 2.  Has the srcu_struct in question already been passed to
>     cleanup_srcu_struct()?
>
> 3.  Has the value of %fs been clobbered?  Though that seems
>     unlikely given that it also happens on aarch64.  Plus, the
>     smoking gun seems to me to be the zero value of %eax.
>
> 4.  If the above three questions fail to provide enlightenment,
>     I suggest recording the ->sda value and adding debug checks
>     to anything that can unmap memory...  And recording the value
>     of ->sda somewhere to check to see if it is being changed (it
>     should remain constant from init_srcu_struct()'s return through
>     the corresponding call to cleanup_srcu_struct()).
>
> Please let me know how it goes!
>
>                             Thanx, Paul
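
For anyone who wants to play with the address arithmetic and the suggested ->sda check outside the kernel, here is a rough user-space sketch. Everything prefixed model_ is hypothetical and made up for illustration; none of it is kernel API. The per-CPU area is just an array indexed by CPU number, the counters are 4 bytes wide to match the scale factor in the trapping instruction, and the model assumes that cleanup leaves ->sda NULL. In the kernel the two checks would presumably be WARN_ON_ONCE()-style debug checks rather than error returns.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space model; none of these names are kernel APIs. */
#define MODEL_NR_CPUS 4

struct model_srcu_data {
	uint32_t srcu_lock_count[2];	/* 4-byte counters, matching the scale of 4 */
};

struct model_srcu_struct {
	unsigned int srcu_idx;
	struct model_srcu_data *sda;		/* NULL until model_init() runs */
	struct model_srcu_data *sda_recorded;	/* debug copy taken at init time */
};

/* Stand-in for the per-CPU area: one srcu_data per modeled CPU. */
static struct model_srcu_data model_percpu_area[MODEL_NR_CPUS];

/* Address formed by "incl %fs:(%eax,%ecx,4)": segment base + eax + ecx * 4. */
static uintptr_t model_incl_address(uintptr_t fs_base, uintptr_t eax,
				    uintptr_t ecx)
{
	return fs_base + eax + ecx * 4;
}

static void model_init(struct model_srcu_struct *ssp)
{
	ssp->srcu_idx = 0;
	ssp->sda = model_percpu_area;
	ssp->sda_recorded = ssp->sda;	/* question 4: record ->sda */
}

static void model_cleanup(struct model_srcu_struct *ssp)
{
	ssp->sda = NULL;	/* assumption: cleanup leaves ->sda NULL */
}

/*
 * Returns idx (0 or 1) on success, -1 if ->sda is NULL (questions 0 to 2),
 * or -2 if ->sda no longer matches the value recorded at init (question 4).
 */
static int model_read_lock_checked(struct model_srcu_struct *ssp,
				   unsigned int cpu)
{
	int idx;

	assert(cpu < MODEL_NR_CPUS);
	if (ssp->sda == NULL)
		return -1;
	if (ssp->sda != ssp->sda_recorded)
		return -2;
	idx = ssp->srcu_idx & 0x1;
	ssp->sda[cpu].srcu_lock_count[idx]++;	/* stands in for this_cpu_inc() */
	return idx;
}
```

With %eax and %ecx both zero, as in the dump, model_incl_address() returns the segment base unchanged, so the increment targets whatever the bare per-CPU offset happens to point at. A zero-initialized model_srcu_struct passed to model_read_lock_checked() returns -1, which is the case that questions 0 through 2 are probing for.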