On Tue, 6 Dec 2022 at 04:34, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>
> On Mon, Dec 05, 2022 at 07:12:15PM -0800, syzbot wrote:
> > Hello,
> >
> > syzbot has tested the proposed patch but the reproducer is still triggering an issue:
> > INFO: rcu detected stall in corrupted
> >
> > rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { P4122 } 2641 jiffies s: 2877 root: 0x0/T
> > rcu: blocking rcu_node structures (internal RCU debug):
>
> I'm pretty sure this has nothing to do with the reproducer - the
> console log here:
>
> > Tested on:
> >
> > commit: bce93322 proc: proc_skip_spaces() shouldn't think it i..
> > git tree: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
> > console output: https://syzkaller.appspot.com/x/log.txt?x=1566216b880000
>
> indicates that syzbot is screwing around with bluetooth, HCI,
> netdevsim, bridging, bonding, etc.
>
> There's no evidence that it actually ran the reproducer for the bug
> reported in this thread - there's no record of a single XFS
> filesystem being mounted in the log....
>
> It looks like someone else also tried a private patch to fix this
> problem (which was obviously broken) and it failed with exactly the
> same RCU warnings. That was run from the same commit id as the
> original reproducer, so this looks like either syzbot is broken or
> there's some other completely unrelated problem that syzbot is
> tripping over here.
>
> Over to the syzbot people to debug the syzbot failure....

Hi Dave,

It's not uncommon for a single program to trigger multiple bugs.
That's what happens here: the RCU stall is reproducible with this
test program. In such cases you can either submit more test requests
or test manually.

I think what fires here is the RCU expedited stall detection. For
some reason CONFIG_RCU_EXP_CPU_STALL_TIMEOUT is limited to 21 seconds,
and that's not enough for reliable, flake-free stress testing. We bump
other timeouts to 100+ seconds.

+RCU maintainers, do you mind removing the overly restrictive limit on CONFIG_RCU_EXP_CPU_STALL_TIMEOUT? Or do you think there is something to fix in the kernel so that it does not stall? I see the test writes to /proc/sys/vm/drop_caches; maybe there is some issue in that code.
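
For anyone who wants to poke at the drop_caches path in isolation, here is a minimal sketch in C of that kind of write. It is not the syzbot reproducer: the helper names write_string() and drop_caches() are made up for illustration, and the value "3" (drop page cache plus dentries and inodes) is my assumption, since I have not checked which value the test program actually writes.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: write a NUL-terminated string to an existing file.
 * Returns 0 if the whole string was written, -1 otherwise.
 */
int write_string(const char *path, const char *value)
{
	int fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;

	size_t len = strlen(value);
	ssize_t n = write(fd, value, len);
	close(fd);

	return (n == (ssize_t)len) ? 0 : -1;
}

/*
 * Hypothetical helper: ask the kernel to drop clean caches.
 * "3" is an assumed value (page cache + dentries/inodes); needs root.
 */
int drop_caches(void)
{
	return write_string("/proc/sys/vm/drop_caches", "3");
}
```

Calling drop_caches() in a loop as root, next to whatever else the reproducer does, might help tell whether the stall comes from that code or from elsewhere.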