On Thu, Apr 14, 2022 at 06:09:45PM -0700, Davidlohr Bueso wrote:
> No idea, however, why this would happen when using qemu as opposed to
> nbd.

Sorry, I keep saying nbd when I mean null_blk, but yeah, same concept.
What I meant is that an nvme controller is involved *with* the RCU splat.
The qemu nvme driver and the qemu controller would be one runtime code
surface to consider here; the two consecutive nvme controller IO timeouts
seem to be telling of a backlog of some sort:

[493305.769531] run blktests zbd/006 at 2022-04-14 20:03:55
[493336.979482] nvme nvme9: I/O 192 QID 5 timeout, aborting
[493336.981666] nvme nvme9: Abort status: 0x0
[493367.699440] nvme nvme9: I/O 192 QID 5 timeout, reset controller

<-- rcu possible stall and NMIs -->

[493426.706021] nvme nvme9: 8/0/0 default/read/poll queues

So that seems to be the pattern here. The first puzzle is *why* we end up
with the IO timeout on the qemu nvme controller, and for this reason I
CC'd Klaus, in case he might have any ideas.

The first "timeout, aborting" happens when a request for IO tag 192 sent
to the controller times out for the first time. This seems to happen
about 30 seconds after test zbd/006 starts. The "reset controller" will
happen if the IO command with tag 192 had already been issued before,
had already timed out, and still hadn't been returned to the driver.
Resetting the controller is the stop-gap measure. nvme_timeout() is just
the shared blk_mq_ops timeout handler between nvme_mq_admin_ops and
nvme_mq_ops.

Inspecting the logs from running tests zbd/005 and zbd/006 in a loop, I
can see that not all nvme controller resets end up in the RCU stall splat
though. The IO timeout happens about 30 seconds after the test starts,
then the reset 30 seconds after that first timeout, then the RCU stall
about 30 seconds after that. For instance, follow these now-mapped times:

Apr 13 23:02:33 linux517-blktests-zbd unknown: run blktests zbd/006 at 2022-04-13 23:02:33

30 seconds go by ...

Apr 13 23:03:05 linux517-blktests-zbd kernel: nvme nvme9: I/O 651 QID 3 timeout, aborting
Apr 13 23:03:05 linux517-blktests-zbd kernel: nvme nvme9: Abort status: 0x0

30 seconds go by ...

Apr 13 23:03:36 linux517-blktests-zbd kernel: nvme nvme9: I/O 651 QID 3 timeout, reset controller

30 seconds go by ...

Apr 13 23:04:04 linux517-blktests-zbd kernel: rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:

So the reset happens and yet we stay put. The reset work for the PCI nvme
controller is nvme_reset_work() in drivers/nvme/host/pci.c, and I don't
see that issuing stuff back up the admin queue. However, the first IO
timeout did send an admin abort, in the same file:

drivers/nvme/host/pci.c

static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
{
	...
	dev_warn(nvmeq->dev->ctrl.device,
		 "I/O %d QID %d timeout, aborting\n",
		 req->tag, nvmeq->qid);

	abort_req = blk_mq_alloc_request(dev->ctrl.admin_q, nvme_req_op(&cmd),
					 BLK_MQ_REQ_NOWAIT);
	if (IS_ERR(abort_req)) {
		atomic_inc(&dev->ctrl.abort_limit);
		return BLK_EH_RESET_TIMER;
	}

	nvme_init_request(abort_req, &cmd);

	abort_req->end_io_data = NULL;
	blk_execute_rq_nowait(abort_req, false, abort_endio);
	...
}

So this all seems by design. What's puzzling is why it takes about 30
seconds to get to the possible RCU stall splat *after* we reset the nvme
controller. It makes me wonder if we left something lingering in a black
hole somewhere after our first reset.

Is it right to gather from the RCU splat that whatever it was, it was
stuck in a long spin lock?

  Luis
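
PS: in case it helps to have the escalation in one place, below is a tiny
userspace model of how I read the nvme_timeout() decision quoted above:
the first timeout sends an Abort on the admin queue and re-arms the
timer, a second timeout of the same tag escalates to a controller reset.
Everything in it (model_timeout(), struct req_state, the enum) is made up
for illustration; only the ordering is meant to mirror the driver.

/*
 * Standalone model (userspace, not kernel code) of the timeout
 * escalation described above. Compile with: cc -o model model.c
 */
#include <stdbool.h>
#include <stdio.h>

enum eh_ret { EH_RESET_TIMER, EH_DONE };

struct req_state {
	int tag;
	int qid;
	bool aborted;	/* has an Abort already been sent for this tag? */
};

static enum eh_ret model_timeout(struct req_state *req)
{
	if (req->aborted) {
		/* Second timeout of the same tag: stop-gap, reset controller */
		printf("I/O %d QID %d timeout, reset controller\n",
		       req->tag, req->qid);
		return EH_DONE;
	}

	/* First timeout: send an Abort on the admin queue and wait again */
	printf("I/O %d QID %d timeout, aborting\n", req->tag, req->qid);
	req->aborted = true;
	return EH_RESET_TIMER;
}

int main(void)
{
	struct req_state req = { .tag = 192, .qid = 5, .aborted = false };

	/* ~30s after the test starts */
	model_timeout(&req);	/* -> abort */
	/* ~30s later, the same request is still outstanding */
	model_timeout(&req);	/* -> reset controller */
	/* ~30s after that the RCU stall splat shows up; the model stops here */
	return 0;
}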