Re: nvme double __blk_mq_complete_request() bugs

Keith Busch <kbusch@xxxxxxxxxx> · Mon, 25 May 2020 10:45:16 -0600

On Sun, May 24, 2020 at 07:33:02AM -0700, Dongli Zhang wrote:
> >> After code analysis, I think this is for nvme-pci as well.
> >>
> >>                                         nvme_process_cq()
> >>                                         -> nvme_handle_cqe()
> >>                                            -> nvme_end_request()
> >>                                               -> blk_mq_complete_request()
> >> nvme_reset_work()
> >> -> nvme_dev_disable()
> >>     -> nvme_reap_pending_cqes()
> >>        -> nvme_process_cq()
> >>           -> nvme_handle_cqe()
> >>              -> nvme_end_request()
> >>                 -> blk_mq_complete_request()
> >>                    -> __blk_mq_complete_request()
> >>                                                  -> __blk_mq_complete_request()
> > 
> > nvme_dev_disable will first disable the queues before reaping the pending cqes so
> > it shouldn't have this issue.
> > 
> 
> Would you mind helping to explain how nvme_dev_disable() would avoid this issue?
> 
> nvme_dev_disable() would:
> 
> 1. freeze all the queues so that new requests cannot enter and be submitted
> 2. NOT wait for the freeze to finish during live reset, so q->q_usage_counter
> is not guaranteed to be zero.
> 3. quiesce all the queues so that new requests are not dispatched
> 4. delete the queues and free the irqs
> 
> However, I do not see a mechanism that covers the case where an
> nvme_end_request() is already in progress.
> 
> E.g., suppose __blk_mq_complete_request() is already triggered on cpu 3 and
> waiting for its first line "WRITE_ONCE(rq->state, MQ_RQ_COMPLETE)" to be
> executed ... while another cpu is doing live reset. I do not see how to prevent
> such a race.

The queues and their interrupts are torn down and synchronized before the reset
reclaims uncompleted requests. There's no other context that can be running
completions at that point.
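
To make the ordering argument concrete, here is a minimal userspace sketch. It
is not driver code. Every name in it (model_queue, model_process_cq,
model_reset, ...) is hypothetical, a pthread stands in for the interrupt
handler, and pthread_join() is only a loose analogue of the irq being freed and
synchronized. The assumption it illustrates is the one stated above: the
completion context is stopped and waited for before the reset path reaps what
is left.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_REQS 64

enum { RQ_IN_FLIGHT = 0, RQ_COMPLETE = 1 };

/* Hypothetical stand-in for one queue; not the layout of struct nvme_queue. */
struct model_queue {
	atomic_int state[NR_REQS];	/* per-request state */
	size_t cq_head;			/* touched by one context at a time */
	atomic_bool irq_enabled;	/* cleared by the reset path */
	atomic_int completed;
	atomic_int double_completions;
};

/* Complete one request and record whether it had already been completed. */
static void model_complete(struct model_queue *q, size_t tag)
{
	int old = atomic_exchange(&q->state[tag], RQ_COMPLETE);

	if (old == RQ_COMPLETE)
		atomic_fetch_add(&q->double_completions, 1);
	else
		atomic_fetch_add(&q->completed, 1);
}

/* Consume up to 'budget' completion entries starting at cq_head. */
static void model_process_cq(struct model_queue *q, size_t budget)
{
	size_t i;

	for (i = 0; i < budget && q->cq_head < NR_REQS; i++)
		model_complete(q, q->cq_head++);
}

/* Stand-in for the interrupt handler: runs only while the "irq" is enabled. */
static void *model_irq_thread(void *arg)
{
	struct model_queue *q = arg;

	while (atomic_load(&q->irq_enabled) && q->cq_head < NR_REQS)
		model_process_cq(q, 1);
	return NULL;
}

/*
 * Model of the reset path. Returns the number of double completions seen
 * (expected to be 0), or -1 if the thread could not be created.
 */
int model_reset(int *completed_out)
{
	struct model_queue q;
	pthread_t irq;
	size_t i;

	for (i = 0; i < NR_REQS; i++)
		atomic_init(&q.state[i], RQ_IN_FLIGHT);
	q.cq_head = 0;
	atomic_init(&q.irq_enabled, true);
	atomic_init(&q.completed, 0);
	atomic_init(&q.double_completions, 0);

	if (pthread_create(&irq, NULL, model_irq_thread, &q))
		return -1;

	/* Step 1: stop the completion context and wait for it to finish. */
	atomic_store(&q.irq_enabled, false);
	pthread_join(irq, NULL);

	/* Step 2: only now reap whatever the handler did not get to. */
	model_process_cq(&q, NR_REQS);

	*completed_out = atomic_load(&q.completed);
	return atomic_load(&q.double_completions);
}
```

Because the join happens before the reap, only one context ever advances
cq_head, so each entry is consumed once no matter how far the handler thread
got. Build with -pthread if your toolchain needs it.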