Re: [PATCH v5 00/14] blk-mq: Reduce static requests memory footprint for shared sbitmap

John Garry <john.garry@xxxxxxxxxx> · Fri, 8 Oct 2021 11:17:35 +0100

On 07/10/2021 21:31, Kashyap Desai wrote:
Perf top data indicates lock contention in "blk_mq_find_and_get_req" call.

1.31%     1.31%  kworker/57:1H-k  [kernel.vmlinux]
      native_queued_spin_lock_slowpath
      ret_from_fork
      kthread
      worker_thread
      process_one_work
      blk_mq_timeout_work
      blk_mq_queue_tag_busy_iter
      bt_iter
      blk_mq_find_and_get_req
      _raw_spin_lock_irqsave
      native_queued_spin_lock_slowpath


Kernel v5.14 Data -

%Node1 :  8.4 us, 31.2 sy,  0.0 ni, 43.7 id,  0.0 wa,  0.0 hi, 16.8 si,  0.0 st
      4.46%  [kernel]       [k] complete_cmd_fusion
      3.69%  [kernel]       [k] megasas_build_and_issue_cmd_fusion
      2.97%  [kernel]       [k] blk_mq_find_and_get_req
      2.81%  [kernel]       [k] megasas_build_ldio_fusion
      2.62%  [kernel]       [k] syscall_return_via_sysret
      2.17%  [kernel]       [k] __entry_text_start
      2.01%  [kernel]       [k] io_submit_one
      1.87%  [kernel]       [k] scsi_queue_rq
      1.77%  [kernel]       [k] native_queued_spin_lock_slowpath
      1.76%  [kernel]       [k] scsi_complete
      1.66%  [kernel]       [k] llist_reverse_order
      1.63%  [kernel]       [k] _raw_spin_lock_irqsave
      1.61%  [kernel]       [k] llist_add_batch
      1.39%  [kernel]       [k] aio_complete_rw
      1.37%  [kernel]       [k] read_tsc
      1.07%  [kernel]       [k] blk_complete_reqs
      1.07%  [kernel]       [k] native_irq_return_iret
      1.04%  [kernel]       [k] __x86_indirect_thunk_rax
      1.03%  fio            [.] __fio_gettime
      1.00%  [kernel]       [k] flush_smp_call_function_queue


Test #2: Three VDs (each VD consists of 8 SAS SSDs).
(numactl -N 1 fio
3vd.fio --rw=randread --bs=4k --iodepth=32 --numjobs=8
--ioscheduler=none/mq-deadline)

There is a performance regression but it is not due to this patch set.
Kernel v5.11 gives 2.1M IOPs on mq-deadline but 5.15 (without this patchset)
gives 1.8M IOPs.
In this test I did not notice the CPU issue mentioned in Test-1.

In general, I noticed host_busy is incorrect once I apply this patchset. It
should not be more than can_queue, but sysfs host_busy value is very high
when IOs are running. This issue appears only after applying this patchset.

Does this patch set only change the behavior of <shared_host_tag> enabled
drivers? Will there be any impact on the mpi3mr driver? I can test that as
well.

I can see where the high value of host_busy is coming from in this
series - we incorrectly iterate over the same shared tags once per hw
queue in blk_mq_tagset_busy_iter() - d'oh.

Please try the below patch. I have looked at other places where we may 
have similar problems in looping the hw queue count for tagset->tags[], 
and they look ok. But I will double-check. I think that 
blk_mq_queue_tag_busy_iter() should be fine - Ming?
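
To illustrate the over-count, here is a rough userspace sketch, not kernel
code. It assumes that with shared tags every tagset->tags[i] points at the
same tags structure. All sim_* names and SIM_NR_TAGS are hypothetical
stand-ins invented for this example; only the loop shape mirrors
blk_mq_tagset_busy_iter().

```c
#include <assert.h>
#include <stdbool.h>

#define SIM_NR_TAGS 8	/* hypothetical tag depth, chosen only for this sketch */

/* Stand-in for struct blk_mq_tags: one "in flight" flag per tag. */
struct sim_tags {
	bool busy[SIM_NR_TAGS];
};

/* Stand-in for struct blk_mq_tag_set (only the fields the sketch needs). */
struct sim_tagset {
	unsigned int nr_hw_queues;
	bool shared;		/* models blk_mq_is_shared_tags(tagset->flags) */
	struct sim_tags **tags;	/* one pointer per hw queue */
};

/* Count the busy tags in one tags structure. */
static unsigned int sim_count_busy(const struct sim_tags *tags)
{
	unsigned int i, busy = 0;

	for (i = 0; i < SIM_NR_TAGS; i++)
		if (tags->busy[i])
			busy++;
	return busy;
}

/* Pre-fix loop shape: visit tags[i] for every hw queue. */
unsigned int sim_busy_iter_old(const struct sim_tagset *set)
{
	unsigned int i, busy = 0;

	for (i = 0; i < set->nr_hw_queues; i++)
		if (set->tags && set->tags[i])
			busy += sim_count_busy(set->tags[i]);
	return busy;
}

/* Post-fix loop shape: a shared set of tags is visited exactly once. */
unsigned int sim_busy_iter_fixed(const struct sim_tagset *set)
{
	unsigned int nr_hw_queues = set->shared ? 1 : set->nr_hw_queues;
	unsigned int i, busy = 0;

	for (i = 0; i < nr_hw_queues; i++)
		if (set->tags && set->tags[i])
			busy += sim_count_busy(set->tags[i]);
	return busy;
}

/* Example: 4 hw queues aliasing one set of tags with 3 requests in flight. */
void sim_demo(void)
{
	struct sim_tags shared = { .busy = { true, true, true } };
	struct sim_tags *tags[] = { &shared, &shared, &shared, &shared };
	struct sim_tagset set = {
		.nr_hw_queues = 4, .shared = true, .tags = tags,
	};

	assert(sim_busy_iter_old(&set) == 12);	/* 3 busy tags counted 4 times */
	assert(sim_busy_iter_fixed(&set) == 3);	/* each busy tag counted once */
}
```

In this model the old loop reports the busy count multiplied by the number
of hw queues, which would be consistent with host_busy exceeding can_queue.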

--->8----

From e6ecaa6d624ebb903fa773ca2a2035300b4c55c5 Mon Sep 17 00:00:00 2001
From: John Garry <john.garry@xxxxxxxxxx>
Date: Fri, 8 Oct 2021 10:55:11 +0100
Subject: [PATCH] blk-mq: Fix blk_mq_tagset_busy_iter() for shared tags

Since it is now possible for a tagset to share a single set of tags, the
iter function should not re-iter the tags for the count of hw queues in
that case. Rather it should just iter once.

Signed-off-by: John Garry <john.garry@xxxxxxxxxx>

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 72a2724a4eee..ef888aab81b3 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -378,9 +378,15 @@ void blk_mq_all_tag_iter(struct blk_mq_tags *tags, busy_tag_iter_fn *fn,
 void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
 		busy_tag_iter_fn *fn, void *priv)
 {
+	int nr_hw_queues;
 	int i;

-	for (i = 0; i < tagset->nr_hw_queues; i++) {
+	if (blk_mq_is_shared_tags(tagset->flags))
+		nr_hw_queues = 1;
+	else
+		nr_hw_queues = tagset->nr_hw_queues;
+
+	for (i = 0; i < nr_hw_queues; i++) {
 		if (tagset->tags && tagset->tags[i])
 			__blk_mq_all_tag_iter(tagset->tags[i], fn, priv,
 					      BT_TAG_ITER_STARTED);

----8<----

Thanks,
john