On Tue, Jul 11, 2023 at 04:48:06PM +0800, kernel test robot wrote: > > > Hello, > > kernel test robot noticed "WARNING:at_kernel/rcu/tree_nocb.h:#rcu_nocb_try_bypass" on: > > commit: 7625926086765123251f765d91fc3a70617d334d ("rcu/nocb: Fix shrinker race against callback enqueuer") > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master > > [test failed on linus/master Linux 6.5-rc1] > [test failed on linux-next/master fe57d0d86f03a8b2afe2869a95477d0ed1824c96] > > in testcase: trinity > version: trinity-i386-abe9de86-1_20230429 > with following parameters: > > runtime: 300s > group: group-00 > > test-description: Trinity is a linux system call fuzz tester. > test-url: http://codemonkey.org.uk/projects/trinity/ > > > compiler: gcc-12 > test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G > > (please refer to attached dmesg/kmsg for entire log/backtrace): > > > the WARNING only happens in a low rate (10 out of 204 runs as below). > but since we never observed the issue on parent, and we also noticed > there is a WARN_ON_ONCE in code change, so just report what we observed > in our tests FYI. > > 5c83cedbaaad6dfe 7625926086765123251f765d91f > ---------------- --------------------------- > fail:runs %reproduction fail:runs > | | | > :202 5% 10:204 dmesg.RIP:rcu_nocb_bypass_lock > :202 5% 10:204 dmesg.RIP:rcu_nocb_try_bypass > :202 5% 10:204 dmesg.WARNING:at_kernel/rcu/tree_nocb.h:#rcu_nocb_bypass_lock > :202 5% 10:204 dmesg.WARNING:at_kernel/rcu/tree_nocb.h:#rcu_nocb_try_bypass > > > If you fix the issue in a separate patch/commit (i.e. not just a new version of > the same patch/commit), kindly add following tags > | Reported-by: kernel test robot <oliver.sang@xxxxxxxxx> > | Closes: https://lore.kernel.org/oe-lkp/202307111535.4823c255-oliver.sang@xxxxxxxxx > > > > [ 315.569574][ T16] WARNING: CPU: 1 PID: 16 at kernel/rcu/tree_nocb.h:124 > rcu_nocb_try_bypass (kernel/rcu/tree_nocb.h:124 kernel/rcu/tree_nocb.h:513) It's: static void rcu_nocb_wait_contended(struct rcu_data *rdp) { WARN_ON_ONCE(smp_processor_id() != rdp->cpu); while (WARN_ON_ONCE(atomic_read(&rdp->nocb_lock_contended))) <<< HERE cpu_relax(); } This is called on enqueue time by rcu_nocb_try_bypass(). The places that can increment ->nocb_lock_contended are: * rcu_nocb_try_bypass() itself, but it's not re-entrant * rcu_nocb_rdp_deoffload(), but it executes on the local CPU with IRQs disabled, so rcu_nocb_try_bypass() can't race * lazy_rcu_shrink_scan(): that's indeed the issue here * rcu_barrier_entrain(), executed either locally with IRQs disabled or remotely when the target is dead, so can't race with rcu_nocb_try_bypass() * rcutree_migrate_callback(): locally with IRQs disabled, can't race So indeed lazy_rcu_shrink_scan() is the only possible culprit. But then I wonder, what is the purpose of this ->nocb_lock_contended if it's actually not an allowed behaviour? Paul, any recollection of that? Thanks.