Re: blktests with zbd/006 ZNS triggers a possible false positive RCU stall

Davidlohr Bueso <dave@xxxxxxxxxxxx> · Thu, 14 Apr 2022 18:09:45 -0700

On Thu, 14 Apr 2022, Luis Chamberlain wrote:

Hey folks,

While enhancing kdevops [0] to embrace automation of testing with
blktests for ZNS I ended up spotting a possible false positive RCU stall
when running zbd/006 after zbd/005. The curious thing though is that
this possible RCU stall is only possible when using the qemu
ZNS drive, not when using nbd. In so far as kdevops is concerned
it creates ZNS drives for you when you enable the config option
CONFIG_QEMU_ENABLE_NVME_ZNS=y. So picking any of the ZNS drives
suffices. When configuring blktests you can just enable the zbd
guest, so only a pair of guests are reated the zbd guest and the
respective development guest, zbd-dev guest. When using
CONFIG_KDEVOPS_HOSTS_PREFIX="linux517" this means you end up with
just two guests:

 * linux517-blktests-zbd
 * linux517-blktests-zbd-dev

The RCU stall can be triggered easily as follows:

make menuconfig # make sure to enable CONFIG_QEMU_ENABLE_NVME_ZNS=y and blktests
make
make bringup # bring up guests
make linux # build and boot into v5.17-rc7
make blktests # build and install blktests

Now let's ssh to the guest while leaving a console attached
with `sudo virsh vagrant_linux517-blktests-zbd` in a window:

ssh linux517-blktests-zbd
sudo su -
cd /usr/local/blktests
export TEST_DEVS=/dev/nvme9n1
i=0; while true; do ./check zbd/005 zbd/006; if [[ $? -ne 0 ]]; then echo "BAD at $i"; break; else echo GOOOD $i ; fi; let i=$i+1; done;

The above should never fail, but you should eventually see an RCU
stall candidate on the console. The full details can be observed on the
gist [1] but for completeness I list some of it below. It may be a false
positive at this point, not sure.

[493272.711271] run blktests zbd/005 at 2022-04-14 20:03:22
[493305.769531] run blktests zbd/006 at 2022-04-14 20:03:55
[493336.979482] nvme nvme9: I/O 192 QID 5 timeout, aborting
[493336.981666] nvme nvme9: Abort status: 0x0
[493367.699440] nvme nvme9: I/O 192 QID 5 timeout, reset controller
[493388.819341] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[493425.817272] rcu:    4-....: (0 ticks this GP) idle=c48/0/0x0 softirq=11316030/11316030 fqs=939  (false positive?)
[493425.819275]         (detected by 7, t=14522 jiffies, g=31237493, q=6271)

Ok so CPU-7 detected stalls on CPU-4, which is in dyntick-idle mode,
which is an extended quiescent state (EQS) to overcome the limitations of
not having a tick (NO_HZ). So the false positive looks correct here in
that idle threads in this state are not in fact blocking the grace period
kthread.

No idea, however, why this would happen when using qemu as opposed to
nbd.

Thanks,
Davidlohr