Re: [PATCH v5 1/2] blk-mq: add tagset quiesce interface

+static void blk_mq_quiesce_blocking_queue_async(struct request_queue *q)
+{
+	struct blk_mq_hw_ctx *hctx;
+	unsigned int i;
+
+	blk_mq_quiesce_queue_nowait(q);
+
+	queue_for_each_hw_ctx(q, hctx, i) {
+		WARN_ON_ONCE(!(hctx->flags & BLK_MQ_F_BLOCKING));
+		hctx->rcu_sync = kmalloc(sizeof(*hctx->rcu_sync), GFP_KERNEL);
+		if (!hctx->rcu_sync)
+			continue;
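
For context, the hunk above marks the queue quiesced without waiting and
allocates a per-hctx rcu_sync, which presumably has to survive until a
later wait for the SRCU grace period; that scratch state is what the
discussion below is about. A minimal sketch of the generic
call_srcu()/rcu_synchronize pattern this builds on (helper names here
are mine, not from the patch):

#include <linux/completion.h>
#include <linux/rcupdate_wait.h>
#include <linux/srcu.h>

/*
 * Sketch only: start an SRCU grace period now, wait for it later.
 * The rcu_synchronize must stay allocated until the wait side runs.
 */
static void example_srcu_gp_start(struct srcu_struct *srcu,
                                  struct rcu_synchronize *sync)
{
        init_completion(&sync->completion);
        init_rcu_head(&sync->head);
        /* wakeme_after_rcu() completes sync->completion at GP end */
        call_srcu(srcu, &sync->head, wakeme_after_rcu);
}

static void example_srcu_gp_wait(struct rcu_synchronize *sync)
{
        wait_for_completion(&sync->completion);
        destroy_rcu_head(&sync->head);
}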

This quiesce/unquiesce tagset approach is a good abstraction.

Just one more thing: please allocate an rcu_sync array, because hctx is
not supposed to store scratch data.

I'd be all for not stuffing this in the hctx, but how would that work?
The only thing I can think of that would work reliably is batching the
queue+wait into units of N. We could potentially have many thousands of
queues, and it could get iffy (and/or unreliable) in terms of allocation
size. Looks like rcu_synchronize is 48 bytes on my local install, and it
doesn't take a lot of devices at current CPU counts to make an
allocation covering all of them huge. Let's say 64 threads and 32
devices: then we're already at 64*32*48 bytes, which is an order-5
allocation. Not friendly, and not going to be reliable when you need it.
And if we start batching in reasonable counts, then we're _almost_ back
to doing a queue or two at a time... 32 * 48 is 1536 bytes, so we could
only do two at a time with single-page allocations.
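
To make that concrete, the flat per-tag-set scratch array being
discussed would amount to roughly the following (hypothetical sketch;
nr_queues stands for the number of request queues sharing the tag set,
set->nr_hw_queues is the hw queue count per device):

/* e.g. 32 queues * 64 hw queues * 48 bytes ~= 96KB -> order-5 allocation */
struct rcu_synchronize *sync;

sync = kmalloc_array(nr_queues * set->nr_hw_queues, sizeof(*sync),
                     GFP_KERNEL);
if (!sync)
        return -ENOMEM;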

We can convert that to order-0 allocations with one extra indirect
array.

I guess that could work, and it would just be one extra alloc + free if
we still retain the batching. That'd take it to 16 devices (at 32 CPUs)
per round, potentially way fewer of course if we have more CPUs. So it's
still somewhat limiting, rather than doing them all at once.
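
For illustration, that indirection could look something like the sketch
below (hypothetical; nr_hctxs stands for the total number of hw queues
to quiesce). A 4K page holds 512 pointers, i.e. 512 hw queues per round,
which is where the 16 devices at 32 CPUs figure comes from:

struct rcu_synchronize **syncs;
unsigned long i, batch = PAGE_SIZE / sizeof(*syncs); /* 512 on 4K pages */

syncs = (void *)__get_free_page(GFP_KERNEL);         /* order 0 */
if (!syncs)
        return -ENOMEM;

/* one round; a large tag set still needs several of these */
for (i = 0; i < batch && i < nr_hctxs; i++)
        syncs[i] = kmalloc(sizeof(*syncs[i]), GFP_KERNEL);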

With the approach used in blk_mq_alloc_rqs(), each allocated page can be
added to a list, so the indirect array can be avoided. Then it is
possible to cover any number of queues/devices, since each allocation is
just a single page, made only when it is actually needed, and no
pre-calculation is required.
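
A rough sketch of that page-list scheme, purely for illustration (the
struct and helper names are made up, not taken from blk-mq):

struct sync_page {
        struct list_head lru;
        unsigned int used;
        struct rcu_synchronize sync[];
};

#define SYNCS_PER_PAGE \
        ((PAGE_SIZE - sizeof(struct sync_page)) / sizeof(struct rcu_synchronize))

/* hand out the next free entry, grabbing a fresh page only when needed */
static struct rcu_synchronize *sync_page_get(struct list_head *pages)
{
        struct sync_page *p;

        p = list_first_entry_or_null(pages, struct sync_page, lru);
        if (!p || p->used == SYNCS_PER_PAGE) {
                p = (void *)__get_free_page(GFP_KERNEL);
                if (!p)
                        return NULL;
                p->used = 0;
                list_add(&p->lru, pages);
        }
        return &p->sync[p->used++];
}

Teardown would then just walk the list and free_page() each entry.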

As long as we watch the complexity; I don't think we need to go
overboard here at the risk of adding issues in the failure path.

No, we don't. I prefer not to do it. And if this turns out to be that
bad, we can later convert it to a complicated page vector.

I'll move forward with this approach.


