On 24/09/2021 11:39, John Garry wrote:
+ Kashyap
On 24/09/2021 11:23, Hannes Reinecke wrote:
On 9/24/21 10:28 AM, John Garry wrote:
Currently we use separate sbitmap pairs and active_queues atomic_t for
shared sbitmap support.
However, a full set of static requests is used per HW queue, which is
quite wasteful, considering that the total number of requests usable at
any given time across all HW queues is limited by the shared sbitmap
depth.
As such, it is considerably more memory efficient in the case of shared
sbitmap to allocate a set of static rqs per tag set or request queue,
and not per HW queue.
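As a rough illustration of the scale of the saving (the numbers below are
made up and not taken from any real config), compare the two schemes:

#include <stdio.h>

int main(void)
{
	unsigned long nr_hw_queues = 16;	/* HW queues exposed by the HBA */
	unsigned long depth = 1024;		/* shared sbitmap depth */
	unsigned long rq_bytes = 512;		/* request + driver payload, rough */

	unsigned long per_hctx_total = nr_hw_queues * depth * rq_bytes;
	unsigned long shared_total = depth * rq_bytes;

	printf("static rqs, per-HW-queue sets: %lu KB\n", per_hctx_total >> 10);
	printf("static rqs, one shared set:    %lu KB\n", shared_total >> 10);
	return 0;
}

With those example figures, the per-HW-queue scheme pins ~8 MB of static
requests while only ~512 KB worth can ever be in flight at once.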
So replace the sbitmap pairs and active_queues atomic_t with a shared
tags per tagset and request queue, which will hold a set of shared
static rqs.
Since there is now no valid HW queue index to be passed to the
blk_mq_ops .init_request and .exit_request callbacks, pass an invalid
index token. This
changes the semantics of the APIs, such that the callback would need to
validate the HW queue index before using it. Currently no user of shared
sbitmap actually uses the HW queue index (as would be expected).
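For illustration, a driver's .init_request could cope with this along the
following lines (a minimal sketch; the foo_* names are made up and the
invalid-index value is simply whatever token the core now passes):

#include <linux/blk-mq.h>

/* Hypothetical driver-private structures, for the sketch only. */
struct foo_dev {
	void *regs;
};

struct foo_cmd {
	struct foo_dev *fdev;
	unsigned int reply_queue;
};

static int foo_init_request(struct blk_mq_tag_set *set, struct request *rq,
			    unsigned int hctx_idx, unsigned int numa_node)
{
	struct foo_dev *fdev = set->driver_data;
	struct foo_cmd *cmd = blk_mq_rq_to_pdu(rq);

	/*
	 * With shared static rqs there may be no real HW queue behind
	 * hctx_idx, so range-check it before using it.
	 */
	if (hctx_idx < set->nr_hw_queues)
		cmd->reply_queue = hctx_idx;
	else
		cmd->reply_queue = 0;	/* fall back; fix up at dispatch time */

	cmd->fdev = fdev;
	return 0;
}

The only point being that hctx_idx is validated before any per-HW-queue
state is derived from it.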
Continue to use the term "shared sbitmap" for now, as its meaning is known.
Signed-off-by: John Garry <john.garry@xxxxxxxxxx>
---
block/blk-mq-sched.c | 82 ++++++++++++++++++-------------------
block/blk-mq-tag.c | 61 ++++++++++------------------
block/blk-mq-tag.h | 6 +--
block/blk-mq.c | 91 +++++++++++++++++++++++-------------------
block/blk-mq.h | 5 ++-
include/linux/blk-mq.h | 15 ++++---
include/linux/blkdev.h | 3 +-
7 files changed, 125 insertions(+), 138 deletions(-)
The overall idea behind keeping the full request allocation per queue was
to ensure memory locality for the requests themselves.
When moving to a shared request structure we obviously lose that
feature.
But I'm not sure if that matters here; the performance impact might be
too small to be measurable, seeing that we'll be most likely bound by
hardware latencies anyway.
Nevertheless: have you tested for performance regressions with this
patchset?
I have tested relatively low rates, like ~450K IOPS, without any
noticeable regression.
I'm especially thinking of Kashyap's high-IOPS megaraid setup; if there
is a performance impact, that'll likely be the scenario where we can
measure it.
I can test higher rates, like 2M IOPS, when I get access to the HW.
@Kashyap, Any chance you can help test performance here?
But even if there is a performance impact this patchset might be
worthwhile, seeing that it'll reduce the memory footprint massively.
Sure, I don't think that a minor performance improvement can justify the
excessive memory use.
JFYI, with 6x SAS SSDs on my arm64 board, I see:
Before (5.15-rc2 baseline):
none: 445K IOPS, mq-deadline: 418K IOPS (fio read)
After:
none: 442K IOPS, mq-deadline: 407K IOPS (fio read)
So only a marginal drop there for mq-deadline.
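For reference, the kind of fio read job behind numbers like these looks
roughly as follows (the options shown are only indicative, not necessarily
the exact ones used):

fio --name=readtest --filename=/dev/sdX --direct=1 --rw=read --bs=4k \
    --ioengine=libaio --iodepth=32 --numjobs=1 --runtime=30 --time_based \
    --group_reporting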
I'll try my 12x SAS SSD setup when I get a chance. Kashyap is kindly
also testing.
Thanks