Re: blktests with zbd/006 ZNS triggers a possible false positive RCU stall

On Fri, Apr 15, 2022 at 10:33:00AM -0700, Luis Chamberlain wrote:
> On Thu, Apr 14, 2022 at 06:09:45PM -0700, Davidlohr Bueso wrote:
> > No idea, however, why this would happen when using qemu as opposed to
> > nbd.
> 
> Sorry, I keep saying nbd when I mean null_blk, but yeah, same concept;
> what I meant is that an nvme controller is involved *with* the RCU splat.
> 
> The qemu nvme driver and qemu controller would be one runtime code surface
> to consider here; the two consecutive nvme controller IO timeouts seem
> to be telling of a backlog of some sort:
> 
> [493305.769531] run blktests zbd/006 at 2022-04-14 20:03:55
> [493336.979482] nvme nvme9: I/O 192 QID 5 timeout, aborting
> [493336.981666] nvme nvme9: Abort status: 0x0
> [493367.699440] nvme nvme9: I/O 192 QID 5 timeout, reset controller
> <-- rcu possible stall and NMIs-->
> [493426.706021] nvme nvme9: 8/0/0 default/read/poll queues
> 
> So that seems to be the pattern here. The first puzzle is *why* we end up
> with the IO timeout on the qemu nvme controller, and for this reason I CC'd
> Klaus, in case he has any ideas.
> 
> The first "timeout, aborting" happens when the request with IO tag 192,
> sent to the controller, times out for the first time. This seems to happen
> about 30 seconds after test zbd/006 starts.
> 
> The "reset controller" happens if the IO command with tag 192 had already
> timed out once before and still hadn't been returned to the driver.
> Resetting the controller is the stopgap measure.
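> 
> The branch that fires the reset looks roughly like this (paraphrasing
> drivers/nvme/host/pci.c from memory, so the exact condition may differ
> slightly):
> 
> static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
> {
> 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
> 	struct nvme_queue *nvmeq = iod->nvmeq;
> 	struct nvme_dev *dev = nvmeq->dev;
> 	...
> 	/*
> 	 * Shut the controller down and schedule a reset if the command
> 	 * was already aborted once before and still hasn't returned.
> 	 */
> 	if (!nvmeq->qid || iod->aborted) {
> 		dev_warn(dev->ctrl.device,
> 			 "I/O %d QID %d timeout, reset controller\n",
> 			 req->tag, nvmeq->qid);
> 		nvme_req(req)->flags |= NVME_REQ_CANCELLED;
> 		nvme_dev_disable(dev, false);
> 		nvme_reset_ctrl(&dev->ctrl);
> 		return BLK_EH_DONE;
> 	}
> 	...
> }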
> 
> nvme_timeout() is just the .timeout callback shared between the
> nvme_mq_admin_ops and nvme_mq_ops blk_mq_ops.
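> 
> That is, both ops structs in drivers/nvme/host/pci.c point .timeout at it
> (abbreviated, quoting from memory):
> 
> static const struct blk_mq_ops nvme_mq_admin_ops = {
> 	.queue_rq	= nvme_queue_rq,
> 	...
> 	.timeout	= nvme_timeout,
> };
> 
> static const struct blk_mq_ops nvme_mq_ops = {
> 	.queue_rq	= nvme_queue_rq,
> 	...
> 	.timeout	= nvme_timeout,
> };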
> 
> Inspecting the logs from running tests zbd/005 and zbd/006 in a loop, I can
> see that not all nvme controller resets end up in the RCU stall splat,
> though.
> 
> The IO timeout happens about 30 seconds after the test starts, then the
> reset 30 seconds after that first timeout, then the RCU stall about 30
> seconds after that (more on that 30-second cadence below the log). For
> instance, follow these now-mapped times:
> 
> Apr 13 23:02:33 linux517-blktests-zbd unknown: run blktests zbd/006 at 2022-04-13 23:02:33
> 30 seconds go by ...
> Apr 13 23:03:05 linux517-blktests-zbd kernel: nvme nvme9: I/O 651 QID 3 timeout, aborting
> Apr 13 23:03:05 linux517-blktests-zbd kernel: nvme nvme9: Abort status: 0x0
> 30 seconds go by ...
> Apr 13 23:03:36 linux517-blktests-zbd kernel: nvme nvme9: I/O 651 QID 3 timeout, reset controller
> 30 seconds go by ...
> Apr 13 23:04:04 linux517-blktests-zbd kernel: rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
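> 
> Those 30-second steps line up with the nvme driver's default IO timeout.
> If I remember right, it is just a module parameter default in
> drivers/nvme/host/core.c (quoting from memory, so treat this as a sketch):
> 
> unsigned int nvme_io_timeout = 30;
> module_param_named(io_timeout, nvme_io_timeout, uint, 0644);
> MODULE_PARM_DESC(io_timeout, "timeout in seconds for I/O");
> 
> which nvme.h turns into NVME_IO_TIMEOUT (nvme_io_timeout * HZ) for the
> request queue timeout.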
> 
> So the reset happens and yet we stay put. The reset work for the PCI
> nvme controller is nvme_reset_work() in drivers/nvme/host/pci.c, and
> I don't see that issuing anything back up the admin queue; however,
> the first IO timeout did send an admin abort, in the same file:
> 
> drivers/nvme/host/pci.c
> static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
> {
> 	...
> 	dev_warn(nvmeq->dev->ctrl.device,
> 		 "I/O %d QID %d timeout, aborting\n",
> 		 req->tag, nvmeq->qid);
> 
> 	abort_req = blk_mq_alloc_request(dev->ctrl.admin_q, nvme_req_op(&cmd),
> 					 BLK_MQ_REQ_NOWAIT);
> 	if (IS_ERR(abort_req)) {
> 		atomic_inc(&dev->ctrl.abort_limit);
> 		return BLK_EH_RESET_TIMER;
> 	}
> 	nvme_init_request(abort_req, &cmd);
> 
> 	abort_req->end_io_data = NULL;
> 	blk_execute_rq_nowait(abort_req, false, abort_endio);
> 	...
> }
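> 
> And the "Abort status: 0x0" lines in the log are just the abort command's
> completion handler reporting the Abort command's status, roughly (same
> file, paraphrased from memory):
> 
> static void abort_endio(struct request *req, blk_status_t error)
> {
> 	...
> 	dev_warn(nvmeq->dev->ctrl.device,
> 		 "Abort status: 0x%x", nvme_req(req)->status);
> 	atomic_inc(&nvmeq->dev->ctrl.abort_limit);
> 	blk_mq_free_request(req);
> }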
> 
> So this all seems by design. What's puzzling is why it takes
> about 30 seconds to get to the possible RCU stall splat *after* we
> reset the nvme controller. It makes me wonder if we left something
> lingering in a black hole somewhere after our first reset. Is it
> right to gather from the RCU splat that whatever it was, it was stuck
> in a long spin lock?

One way that this could happen would be for an earlier RCU grace period
to have taken up the first 9 seconds of that 30-second interval: the stall
warning normally fires 21 seconds into a grace period (the default RCU CPU
stall timeout), which would put the splat right at the 30-second mark.

Tracing might help here.  The rcu_gp_init() function is invoked at the
beginning of a grace period and the rcu_gp_cleanup() function is invoked
at its end.  Either way, it might be interesting to see how the grace
periods line up with those other events.
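For reference, both functions fire the rcu_grace_period tracepoint, roughly
like this (from memory, so treat it as a sketch of kernel/rcu/tree.c rather
than an exact quote):

	/* In rcu_gp_init(), once the new grace period has started: */
	trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq, TPS("start"));

	/* In rcu_gp_cleanup(), once the grace period has ended: */
	trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq, TPS("end"));

Enabling the rcu:rcu_grace_period trace event should show those "start"/"end"
transitions alongside the nvme timeout and reset messages.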

Oh, and another way that this sort of thing can happen is if something
prevents the jiffies counter from advancing.  Rare, but it does happen,
for example: https://paulmck.livejournal.com/62071.html

							Thanx, Paul


