On 3/28/24 11:38, Sagi Grimberg wrote:
On 26/03/2024 17:35, Hannes Reinecke wrote:
Hi all,
there had been several attempts to implement a latency-based I/O
scheduler for native nvme multipath, all of which had their issues.
So time to start afresh, this time using the QoS framework
already present in the block layer.
It consists of two parts:
- a new 'blk-nodelat' QoS module, which is just a simple per-node
latency tracker
- a 'latency' nvme I/O policy
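At its core the per-node tracker is just an exponentially weighted
moving average, where 'decay' controls how much weight a new
completion sample gets. As a rough userspace-style sketch (names and
details are illustrative only, not the actual blk-nodelat code):

#include <stdint.h>

/*
 * Rough sketch of the per-node tracking idea: an exponentially
 * weighted moving average.  Illustrative only; names do not match
 * the actual blk-nodelat code.
 */
struct node_lat {
        int64_t avg_ns;         /* smoothed latency estimate */
        uint64_t samples;       /* completions seen so far */
        int decay;              /* e.g. 10, as in the numbers below */
};

void node_lat_update(struct node_lat *nl, int64_t lat_ns)
{
        if (!nl->samples)
                nl->avg_ns = lat_ns;    /* first sample seeds the average */
        else
                nl->avg_ns += (lat_ns - nl->avg_ns) / nl->decay;
        nl->samples++;
}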
Using the 'tiobench' fio script I'm getting:
WRITE: bw=531MiB/s (556MB/s), 33.2MiB/s-52.4MiB/s
(34.8MB/s-54.9MB/s), io=4096MiB (4295MB), run=4888-7718msec
WRITE: bw=539MiB/s (566MB/s), 33.7MiB/s-50.9MiB/s
(35.3MB/s-53.3MB/s), io=4096MiB (4295MB), run=5033-7594msec
READ: bw=898MiB/s (942MB/s), 56.1MiB/s-75.4MiB/s
(58.9MB/s-79.0MB/s), io=4096MiB (4295MB), run=3397-4560msec
READ: bw=1023MiB/s (1072MB/s), 63.9MiB/s-75.1MiB/s
(67.0MB/s-78.8MB/s), io=4096MiB (4295MB), run=3408-4005msec
for 'round-robin' and
WRITE: bw=574MiB/s (601MB/s), 35.8MiB/s-45.5MiB/s
(37.6MB/s-47.7MB/s), io=4096MiB (4295MB), run=5629-7142msec
WRITE: bw=639MiB/s (670MB/s), 39.9MiB/s-47.5MiB/s
(41.9MB/s-49.8MB/s), io=4096MiB (4295MB), run=5388-6408msec
READ: bw=1024MiB/s (1074MB/s), 64.0MiB/s-73.7MiB/s
(67.1MB/s-77.2MB/s), io=4096MiB (4295MB), run=3475-4000msec
READ: bw=1013MiB/s (1063MB/s), 63.3MiB/s-72.6MiB/s
(66.4MB/s-76.2MB/s), io=4096MiB (4295MB), run=3524-4042msec
for 'latency' with 'decay' set to 10.
That's on a 32G FC testbed running against a brd target,
with fio running with 16 threads.
Can you quantify the improvement? Also, the name 'latency' suggests
that latency should be improved, no?
'latency' refers to a 'latency-based' I/O scheduler, i.e. it selects
the path with the least latency. It does not necessarily _improve_
latency; e.g. on truly symmetric fabrics it doesn't.
It _does_ improve matters when running on asymmetric fabrics
(e.g. on a two-socket system with two PCI HBAs, each connected to one
socket, or, as in the example above, with one path via 'loop' and the
other via 'tcp' and address '127.0.0.1').
And, of course, it should help on congested fabrics, where it should
be able to direct I/O to the least congested path.
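Conceptually the path selector then just walks the usable paths and
picks the one with the lowest smoothed latency, roughly like this
(again an illustrative sketch reusing the node_lat struct from above,
not the actual nvme-multipath code):

#include <stddef.h>

/* Illustrative path representation; not the actual nvme code. */
struct lat_path {
        struct node_lat lat;    /* tracker sketched above */
        int usable;             /* path is live and not failed */
};

struct lat_path *select_lowest_latency(struct lat_path *paths, int nr_paths)
{
        struct lat_path *best = NULL;
        int i;

        for (i = 0; i < nr_paths; i++) {
                struct lat_path *p = &paths[i];

                if (!p->usable)
                        continue;
                /* prefer a path we have no samples for yet */
                if (!p->lat.samples)
                        return p;
                if (!best || p->lat.avg_ns < best->lat.avg_ns)
                        best = p;
        }
        return best;
}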
But I'll see about extracting the latency numbers, too.
What I really wanted to show is that we _can_ track latency without
harming performance.
Cheers,
Hannes