Re: [RFC PATCH] blk-mq: implement queue quiesce via percpu_ref for BLK_MQ_F_BLOCKING

Sagi Grimberg <sagi@xxxxxxxxxxx> · Thu, 30 Jul 2020 09:10:48 -0700

In case of BLK_MQ_F_BLOCKING, blk-mq uses SRCU to mark read critical
section during dispatching request, then request queue quiesce is based on
SRCU. What we want to get is low cost added in fast path.

However, from srcu_read_lock/srcu_read_unlock implementation, not see
it is quicker than percpu refcount, so use percpu_ref to implement
queue quiesce. This usage is cleaner and simpler & enough for implementing
queue quiesce. The main requirement is to make sure all read sections to observe
QUEUE_FLAG_QUIESCED once blk_mq_quiesce_queue() returns.

Also it becomes much easier to add interface of async queue quiesce.

BTW, no obvious IOPS difference is observed with this patch applied when running
io on null_blk(blocking, submit_queues=32) in one dual-socket, 32cores system.

Thanks Ming, can you test for non-blocking on the same setup?

OK, I can do that, but this patch supposes to not affect non-blocking,
care to share your motivation for testing non-blocking?

I think it will be a significant improvement to have a single code path.
The code will be more robust and we won't need to face issues that are
specific for blocking.

If the cost is negligible, I think the upside is worth it.

rcu_read_lock and rcu_read_unlock has been proved as efficient enough,
and I don't think percpu_refcount is better than it, so I'd suggest to
not switch non-blocking into this way.

It's not a matter of which is better, its a matter of making the code
more robust because it has a single code-path. If moving to percpu_ref
is negligible, I would suggest to move both, I don't want to have two
completely different mechanism for blocking vs. non-blocking.

BTW, in case of blocking, one hctx may dispatch at most one request because there
is only single .run_work, even though when .queue_rq() is slept, that said
blk_mq_submit_bio() queues bio in sync style. This way won't be very efficient.
So percpu_refcount should be good enough for blocking code path, but may not be
well enough for non-blocking case.

Not sure what you mean, the percpu_ref is taken exactly where rcu is
taken, not sure what is the difference.