On Sun, May 24, 2020 at 07:33:02AM -0700, Dongli Zhang wrote:
> >> After code analysis, I think this applies to nvme-pci as well.
> >>
> >> nvme_process_cq()
> >>  -> nvme_handle_cqe()
> >>   -> nvme_end_request()
> >>    -> blk_mq_complete_request()
> >>     -> __blk_mq_complete_request()
> >>
> >> nvme_reset_work()
> >>  -> nvme_dev_disable()
> >>   -> nvme_reap_pending_cqes()
> >>    -> nvme_process_cq()
> >>     -> nvme_handle_cqe()
> >>      -> nvme_end_request()
> >>       -> blk_mq_complete_request()
> >>        -> __blk_mq_complete_request()
> >
> > nvme_dev_disable will first disable the queues before reaping the pending
> > cqes, so it shouldn't have this issue.
>
> Would you mind helping explain how nvme_dev_disable() would avoid this issue?
>
> nvme_dev_disable() would:
>
> 1. freeze all the queues so that no new request can enter and be submitted
> 2. NOT wait for the freeze during a live reset, so q->q_usage_counter is not
>    guaranteed to be zero
> 3. quiesce all the queues so that no new request can be dispatched
> 4. delete the queues and free the irqs
>
> However, I do not find a mechanism that handles an nvme_end_request() that is
> already in progress.
>
> E.g., suppose __blk_mq_complete_request() has already been triggered on cpu 3
> and is waiting for its first line, "WRITE_ONCE(rq->state, MQ_RQ_COMPLETE)", to
> be executed ... while another cpu is doing a live reset. I do not see how such
> a race is prevented.

The queues and their interrupts are torn down and synchronized before the reset
reclaims uncompleted requests. There's no other context that can be running
completions at that point.
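
To make the ordering concrete, here is a small user-space toy model. This is
not the driver code: every name in it (fake_rq, irq_handler_thread,
teardown_irq, reap_pending) is made up for illustration. The "irq" thread
plays the role of the interrupt handler completing requests; the reset path
stops and joins that thread first (the analogue of freeing the irq, which
waits for a running handler to return) and only then reaps whatever is still
uncompleted, so the reap step can never observe a completion in flight:

/* Toy model only -- not nvme driver code. Build with: cc -pthread toy.c */
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

#define NR_RQS 1024

struct fake_rq {
	atomic_int completions;		/* times this request was "completed" */
};

static struct fake_rq rqs[NR_RQS];
static atomic_bool irq_alive = true;
static atomic_int double_completions;

/* Stand-in for blk_mq_complete_request(): completing twice is the bug. */
static void complete_rq(struct fake_rq *rq)
{
	if (atomic_fetch_add(&rq->completions, 1) != 0)
		atomic_fetch_add(&double_completions, 1);
}

/* "Interrupt handler": one pass over the CQ while it is still alive. */
static void *irq_handler_thread(void *arg)
{
	(void)arg;
	for (int i = 0; i < NR_RQS && atomic_load(&irq_alive); i++)
		complete_rq(&rqs[i]);
	return NULL;
}

/* Analogue of tearing down the irq: stop it and wait for it to finish. */
static void teardown_irq(pthread_t t)
{
	atomic_store(&irq_alive, false);
	pthread_join(t, NULL);
}

/* Analogue of reaping pending cqes: complete whatever is still pending. */
static void reap_pending(void)
{
	for (int i = 0; i < NR_RQS; i++)
		if (atomic_load(&rqs[i].completions) == 0)
			complete_rq(&rqs[i]);
}

int main(void)
{
	pthread_t t;

	pthread_create(&t, NULL, irq_handler_thread, NULL);

	/* Reset path: tear the "irq" down first, then reap. */
	teardown_irq(t);
	reap_pending();

	printf("double completions: %d\n", atomic_load(&double_completions));
	return 0;
}

Swapping teardown_irq() and reap_pending() reopens the window described above:
both contexts can then see the same request as uncompleted and complete it
twice (whether the toy actually hits that on a given run is timing dependent).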