Re: [PATCH] blk-mq: implement queue quiesce via percpu_ref for BLK_MQ_F_BLOCKING

Anyway, I think that for now we should place them together.

Then it may hurt non-blocking drivers.

Each hctx has only one run-work; if the hctx is blocked, no other request
can be dispatched from that hctx any more. That is basically a sync run
queue, so I am not sure good enough performance can be expected for blocking
drivers.
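
Roughly, what I mean by a sync run queue (illustration only; example_run_work_fn
is a made-up name and blk_mq_sched_dispatch_requests() is the block-layer-internal
dispatch helper):

#include <linux/blk-mq.h>
/* blk_mq_sched_dispatch_requests() is declared in block/blk-mq-sched.h */

static void example_run_work_fn(struct work_struct *work)
{
	struct blk_mq_hw_ctx *hctx =
		container_of(work, struct blk_mq_hw_ctx, run_work.work);

	/*
	 * All requests for this hctx are dispatched from this single work
	 * item; when BLK_MQ_F_BLOCKING lets ->queue_rq() sleep, everything
	 * else queued to the hctx waits behind it.
	 */
	blk_mq_sched_dispatch_requests(hctx);
}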

I don't think you should assume that a blocking driver will normally block;
it will only block rarely (very rarely).

If nvme-tcp only blocks rarely, I am just wondering why not switch to
non-blocking, which can be done simply with one driver-specific workqueue
work item? Then nvme-tcp can be aligned with the other nvme drivers.
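
To make the suggestion concrete, a minimal sketch of that non-blocking style
(all of the example_* names and the context struct are made up, not actual
nvme-tcp code): ->queue_rq() never sleeps, it just parks the request and kicks
a driver-owned workqueue, and the sleeping send path runs from the work handler.

#include <linux/blk-mq.h>
#include <linux/workqueue.h>
#include <linux/spinlock.h>
#include <linux/list.h>

struct example_queue {			/* hypothetical driver context */
	spinlock_t		lock;
	struct list_head	send_list;
	struct workqueue_struct	*wq;
	struct work_struct	io_work;
	int			io_cpu;
};

static blk_status_t example_queue_rq(struct blk_mq_hw_ctx *hctx,
				     const struct blk_mq_queue_data *bd)
{
	struct example_queue *queue = hctx->driver_data;

	blk_mq_start_request(bd->rq);

	/* never sleep here: just hand the request over to the driver's work */
	spin_lock(&queue->lock);
	list_add_tail(&bd->rq->queuelist, &queue->send_list);
	spin_unlock(&queue->lock);

	queue_work_on(queue->io_cpu, queue->wq, &queue->io_work);
	return BLK_STS_OK;
}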

It used to be this way (and still is that way today in some cases), but
some recent latency optimizations revealed that sending the request to
the wire from queue_rq (when some conditions are met), instead of
incurring a context switch, is a win in most cases where latency matters.

Once we call sendpage from queue_rq, we might_sleep, hence we must be
blocking. But in practice, sendpage with MSG_DONTWAIT will rarely
actually sleep.
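
For illustration, the pattern is roughly the following (made-up function name,
not the real nvme-tcp helper): push the payload onto the wire with MSG_DONTWAIT
straight from the ->queue_rq() path, and if the socket would block, hand the
remainder to the queue's io_work instead of sleeping.

#include <linux/net.h>
#include <linux/socket.h>

static int example_try_send_page(struct socket *sock, struct page *page,
				 int off, size_t len)
{
	int ret;

	/* MSG_DONTWAIT: don't wait for socket memory, so we rarely sleep */
	ret = kernel_sendpage(sock, page, off, len,
			      MSG_DONTWAIT | MSG_MORE);
	if (ret == -EAGAIN)
		return 0;	/* would block: let io_work retry later */

	return ret;
}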

So it may not be worth putting the added .dispatch_counter together
with .q_usage_counter.

I happen to think it would be. Not sure why you resist so much, given how
request_queue is arranged currently.

The reason is the same as in commit 073196787727 ("blk-mq: Reduce blk_mq_hw_ctx size").

percpu_ref is probably a quarter of the size of srcu; I'm not sure anyone
would have bothered to do that for percpu_ref. You're really
exaggerating, I think...

Non-blocking is the preferred style for blk-mq drivers, so we can just
focus on non-blocking wrt. performance improvement; as I mentioned, blocking
has the big problem of a sync run queue.

It may be contradictory to improve both. For example, if the
added .dispatch_counter is put together with .q_usage_counter, it will
be fetched into L1 unnecessarily, which is definitely not good for
non-blocking.
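
To make the layout concern concrete, a toy sketch (not the real struct
request_queue; the field grouping is illustrative only): if the counters sit
next to each other, every non-blocking submission that touches q_usage_counter
also drags dispatch_counter into L1, while ____cacheline_aligned_in_smp would
keep it off the submission-hot line.

#include <linux/percpu-refcount.h>
#include <linux/cache.h>

struct layout_shared {			/* counters likely share a line */
	struct percpu_ref	q_usage_counter;
	struct percpu_ref	dispatch_counter;
};

struct layout_split {			/* dispatch_counter kept away from
					 * the submission-hot line        */
	struct percpu_ref	q_usage_counter;
	struct percpu_ref	dispatch_counter ____cacheline_aligned_in_smp;
};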

I'll stop asking you for this, but your resistance is really unclear to me.
We can measure the penalty/gain later by realigning some items.

Let's stop wasting our time here...

Also maybe a better name is needed here since it's just
for blocking hctxs.

+	wait_queue_head_t	mq_quiesce_wq;
+
     	struct dentry		*debugfs_dir;
     #ifdef CONFIG_BLK_DEBUG_FS
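
For reference, this is roughly how I read the counter and wait queue fitting
together (a sketch, not the patch text, assuming dispatch_counter sits in
struct request_queue as proposed): dispatch paths hold a reference, quiesce
kills the ref and sleeps on mq_quiesce_wq until the last in-flight ->queue_rq()
drops it.

#include <linux/blkdev.h>
#include <linux/percpu-refcount.h>
#include <linux/wait.h>

static void example_quiesce_blocking(struct request_queue *q)
{
	/* stop new dispatch references and drain the existing ones */
	percpu_ref_kill(&q->dispatch_counter);
	wait_event(q->mq_quiesce_wq,
		   percpu_ref_is_zero(&q->dispatch_counter));
}

/*
 * percpu_ref release callback (registered via percpu_ref_init()):
 * the last dropper wakes the quiescer.
 */
static void example_dispatch_counter_release(struct percpu_ref *ref)
{
	struct request_queue *q =
		container_of(ref, struct request_queue, dispatch_counter);

	wake_up_all(&q->mq_quiesce_wq);
}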


What I think is needed here, at a minimum, is to test quiesce/unquiesce loops
during I/O. Code auditing is not enough; there may be driver assumptions
broken by this change (although I hope there aren't).

We have elevator switch / nr_requests update stress tests, and both rely
on quiesce/unquiesce, and I did run such tests with this patch.

Do you have a blktests test for this? If not, I strongly suggest that one is
added to validate the change, also moving forward.

There are lots of blktests cases doing that, such as block/005,
block/016, block/021, ...

Good, but I also wouldn't want to take this without making sure the async
quiesce works well on a large number of namespaces (the reason why this
was proposed in the first place). Not sure who is planning to do that...

That can be added when async quiesce is done.

Chao, are you looking into that? I'd really hate to find out we have an
issue there post conversion...


