On 1/17/24 13:40, Jens Axboe wrote:
> On 1/17/24 2:33 PM, Bart Van Assche wrote:
>> Please note that whether or not spin_trylock() is used, there is a
>> race condition in this approach: if dd_dispatch_request() is called
>> just before another CPU calls spin_unlock() from inside
>> dd_dispatch_request(), then some requests won't be dispatched until
>> the next time dd_dispatch_request() is called.
> Sure, that's not surprising. What I cared most about here is that we
> should not have a race such that we'd stall. Since we haven't returned
> this request just yet if we race, we know at least one will be issued
> and we'll re-run at completion. So yeah, we may very well skip an issue;
> that's a known consequence of that change, and the skipped dispatch will
> be postponed to the next queue run.
>
> The patch is more to demonstrate that it would not take much to fix this
> case, at least; it's a proof-of-concept.
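To make the trylock-or-defer behavior concrete, below is a minimal
userspace sketch of the pattern used by the patch further down. It is
only an illustration under stated assumptions: pthread mutexes stand in
for the kernel spinlock, request_rerun() stands in for
blk_mq_delay_run_hw_queue(hctx, 0), and the cpu0/cpu1 names are
hypothetical.

/* trylock_defer.c - compile with: gcc -O2 -pthread trylock_defer.c */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t dispatch_lock = PTHREAD_MUTEX_INITIALIZER;

/* Stands in for blk_mq_delay_run_hw_queue(hctx, 0). */
static void request_rerun(const char *who)
{
	printf("%s: dispatcher busy, asking for an async re-run\n", who);
}

static void dispatch(const char *who)
{
	if (pthread_mutex_trylock(&dispatch_lock) != 0) {
		/* Another CPU is already dispatching; don't wait for it. */
		request_rerun(who);
		return;
	}
	printf("%s: dispatching requests\n", who);
	usleep(1000);		/* pretend dispatching takes a while */
	pthread_mutex_unlock(&dispatch_lock);
}

static void *other_cpu(void *arg)
{
	(void)arg;
	dispatch("cpu1");
	return NULL;
}

int main(void)
{
	pthread_t t;

	pthread_mutex_lock(&dispatch_lock);	/* cpu0 is mid-dispatch */
	pthread_create(&t, NULL, other_cpu, NULL);
	pthread_join(&t, NULL);			/* cpu1 defers instead of spinning */
	pthread_mutex_unlock(&dispatch_lock);
	dispatch("cpu0");			/* a later run picks up what was skipped */
	return 0;
}

The second caller never spins on the dispatch lock; anything it would
have dispatched is left for the re-run, which is exactly the window
discussed above.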
The patch below implements what has been discussed in this e-mail
thread. I do not recommend applying this patch since it reduces single-
threaded performance by 11% on an Intel Xeon Processor (Skylake, IBRS):
diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index f958e79277b8..d83831ced69a 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -84,6 +84,10 @@ struct deadline_data {
 	 * run time data
 	 */
 
+	spinlock_t lock;
+	spinlock_t dispatch_lock;
+	spinlock_t zone_lock;
+
 	struct dd_per_prio per_prio[DD_PRIO_COUNT];
 
 	/* Data direction of latest dispatched request. */
@@ -100,9 +104,6 @@ struct deadline_data {
 	int front_merges;
 	u32 async_depth;
 	int prio_aging_expire;
-
-	spinlock_t lock;
-	spinlock_t zone_lock;
 };
 
 /* Maps an I/O priority class to a deadline scheduler priority. */
@@ -600,6 +601,16 @@ static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
 	struct request *rq;
 	enum dd_prio prio;
 
+	/*
+	 * Reduce lock contention on dd->lock by re-running the queue
+	 * asynchronously if another CPU core is already executing
+	 * dd_dispatch_request().
+	 */
+	if (!spin_trylock(&dd->dispatch_lock)) {
+		blk_mq_delay_run_hw_queue(hctx, 0);
+		return NULL;
+	}
+
 	spin_lock(&dd->lock);
 	rq = dd_dispatch_prio_aged_requests(dd, now);
 	if (rq)
@@ -617,6 +628,7 @@ static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
 
 unlock:
 	spin_unlock(&dd->lock);
+	spin_unlock(&dd->dispatch_lock);
 
 	return rq;
 }
@@ -723,6 +735,7 @@ static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
 	dd->fifo_batch = fifo_batch;
 	dd->prio_aging_expire = prio_aging_expire;
 	spin_lock_init(&dd->lock);
+	spin_lock_init(&dd->dispatch_lock);
 	spin_lock_init(&dd->zone_lock);
 
 	/* We dispatch from request queue wide instead of hw queue */
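As an aside, one possible contributor to the single-threaded regression
is simply that every dispatch call now performs an extra uncontended
trylock/unlock pair on dd->dispatch_lock in addition to taking dd->lock.
The userspace sketch below only illustrates that kind of per-call cost;
pthread spinlocks stand in for spinlock_t, the file name and loop count
are made up, and it is not a measurement of the kernel code:

/* extra_lock_cost.c - compile with: gcc -O2 -pthread extra_lock_cost.c */
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define ITERATIONS 10000000UL

static pthread_spinlock_t lock, dispatch_lock;

static double time_loop(int extra_lock)
{
	struct timespec start, end;
	unsigned long i;

	clock_gettime(CLOCK_MONOTONIC, &start);
	for (i = 0; i < ITERATIONS; i++) {
		if (extra_lock) {
			/* Always succeeds here: there is only one thread. */
			if (pthread_spin_trylock(&dispatch_lock) != 0)
				continue;
			pthread_spin_lock(&lock);
			pthread_spin_unlock(&lock);
			pthread_spin_unlock(&dispatch_lock);
		} else {
			pthread_spin_lock(&lock);
			pthread_spin_unlock(&lock);
		}
	}
	clock_gettime(CLOCK_MONOTONIC, &end);
	return (end.tv_sec - start.tv_sec) +
	       (end.tv_nsec - start.tv_nsec) / 1e9;
}

int main(void)
{
	pthread_spin_init(&lock, PTHREAD_PROCESS_PRIVATE);
	pthread_spin_init(&dispatch_lock, PTHREAD_PROCESS_PRIVATE);
	printf("dd->lock only:            %.3f s\n", time_loop(0));
	printf("dispatch_lock + dd->lock: %.3f s\n", time_loop(1));
	return 0;
}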