Patch "nvme: make keep-alive synchronous operation" has been added to the 6.1-stable tree

Sasha Levin <sashal@xxxxxxxxxx> · Thu, 14 Nov 2024 07:43:41 -0500

This is a note to let you know that I've just added the patch titled

    nvme: make keep-alive synchronous operation

to the 6.1-stable tree which can be found at:
    http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=summary

The filename of the patch is:
     nvme-make-keep-alive-synchronous-operation.patch
and it can be found in the queue-6.1 subdirectory.

If you, or anyone else, feels it should not be added to the stable tree,
please let <stable@xxxxxxxxxxxxxxx> know about it.



commit 94e0d57b1c74287318fc37cdae24f943e4da088f
Author: Nilay Shroff <nilay@xxxxxxxxxxxxx>
Date:   Wed Oct 16 08:33:15 2024 +0530

    nvme: make keep-alive synchronous operation
    
    [ Upstream commit d06923670b5a5f609603d4a9fee4dec02d38de9c ]
    
    The nvme keep-alive operation, which executes at a periodic interval,
    could potentially sneak in while shutting down a fabric controller.
    This may lead to a race between the fabric controller admin queue
    destroy code path (invoked while shutting down controller) and hw/hctx
    queue dispatcher called from the nvme keep-alive async request queuing
    operation. This race could lead to the kernel crash shown below:
    
    Call Trace:
        autoremove_wake_function+0x0/0xbc (unreliable)
        __blk_mq_sched_dispatch_requests+0x114/0x24c
        blk_mq_sched_dispatch_requests+0x44/0x84
        blk_mq_run_hw_queue+0x140/0x220
        nvme_keep_alive_work+0xc8/0x19c [nvme_core]
        process_one_work+0x200/0x4e0
        worker_thread+0x340/0x504
        kthread+0x138/0x140
        start_kernel_thread+0x14/0x18
    
    While shutting down fabric controller, if nvme keep-alive request sneaks
    in then it would be flushed off. The nvme_keep_alive_end_io function is
    then invoked to handle the end of the keep-alive operation which
    decrements the admin->q_usage_counter and assuming this is the last/only
    request in the admin queue then the admin->q_usage_counter becomes zero.
    If that happens then blk-mq destroy queue operation (blk_mq_destroy_
    queue()) which could be potentially running simultaneously on another
    cpu (as this is the controller shutdown code path) would forward
    progress and deletes the admin queue. So, now from this point onward
    we are not supposed to access the admin queue resources. However the
    issue here's that the nvme keep-alive thread running hw/hctx queue
    dispatch operation hasn't yet finished its work and so it could still
    potentially access the admin queue resource while the admin queue had
    been already deleted and that causes the above crash.
    
    This fix helps avoid the observed crash by implementing keep-alive as a
    synchronous operation so that we decrement admin->q_usage_counter only
    after keep-alive command finished its execution and returns the command
    status back up to its caller (blk_execute_rq()). This would ensure that
    fabric shutdown code path doesn't destroy the fabric admin queue until
    keep-alive request finished execution and also keep-alive thread is not
    running hw/hctx queue dispatch operation.
    
    Reviewed-by: Christoph Hellwig <hch@xxxxxx>
    Signed-off-by: Nilay Shroff <nilay@xxxxxxxxxxxxx>
    Signed-off-by: Keith Busch <kbusch@xxxxxxxxxx>
    Signed-off-by: Sasha Levin <sashal@xxxxxxxxxx>

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index dc25d91891327..92ffeb6605618 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1231,10 +1231,9 @@ static void nvme_queue_keep_alive_work(struct nvme_ctrl *ctrl)
 			   nvme_keep_alive_work_period(ctrl));
 }
 
-static enum rq_end_io_ret nvme_keep_alive_end_io(struct request *rq,
-						 blk_status_t status)
+static void nvme_keep_alive_finish(struct request *rq,
+		blk_status_t status, struct nvme_ctrl *ctrl)
 {
-	struct nvme_ctrl *ctrl = rq->end_io_data;
 	unsigned long flags;
 	bool startka = false;
 	unsigned long rtt = jiffies - (rq->deadline - rq->timeout);
@@ -1252,13 +1251,11 @@ static enum rq_end_io_ret nvme_keep_alive_end_io(struct request *rq,
 		delay = 0;
 	}
 
-	blk_mq_free_request(rq);
-
 	if (status) {
 		dev_err(ctrl->device,
 			"failed nvme_keep_alive_end_io error=%d\n",
 				status);
-		return RQ_END_IO_NONE;
+		return;
 	}
 
 	ctrl->ka_last_check_time = jiffies;
@@ -1270,7 +1267,6 @@ static enum rq_end_io_ret nvme_keep_alive_end_io(struct request *rq,
 	spin_unlock_irqrestore(&ctrl->lock, flags);
 	if (startka)
 		queue_delayed_work(nvme_wq, &ctrl->ka_work, delay);
-	return RQ_END_IO_NONE;
 }
 
 static void nvme_keep_alive_work(struct work_struct *work)
@@ -1279,6 +1275,7 @@ static void nvme_keep_alive_work(struct work_struct *work)
 			struct nvme_ctrl, ka_work);
 	bool comp_seen = ctrl->comp_seen;
 	struct request *rq;
+	blk_status_t status;
 
 	ctrl->ka_last_check_time = jiffies;
 
@@ -1301,9 +1298,9 @@ static void nvme_keep_alive_work(struct work_struct *work)
 	nvme_init_request(rq, &ctrl->ka_cmd);
 
 	rq->timeout = ctrl->kato * HZ;
-	rq->end_io = nvme_keep_alive_end_io;
-	rq->end_io_data = ctrl;
-	blk_execute_rq_nowait(rq, false);
+	status = blk_execute_rq(rq, false);
+	nvme_keep_alive_finish(rq, status, ctrl);
+	blk_mq_free_request(rq);
 }
 
 static void nvme_start_keep_alive(struct nvme_ctrl *ctrl)