Re: [PATCH rfc] nvme: support io stats on the mpath device

On 9/30/22 03:08, Jens Axboe wrote:
> On 9/29/22 10:25 AM, Sagi Grimberg wrote:

>>>>> 3. Do you have some performance numbers (we're touching the fast path here)?

>>>> This is pretty light-weight, accounting is per-cpu and only wrapped by
>>>> preemption disable. This is a very small price to pay for what we gain.
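Just to spell out what "light-weight" means here: the per-request cost is
essentially the part_stat pattern from include/linux/part_stat.h, i.e.
per-cpu counters bumped under preempt_disable(). Roughly like this
(a simplified sketch for illustration, not the code in the patch;
sketch_account_done is a made-up name):

#include <linux/part_stat.h>

/*
 * part_stat_lock() is just preempt_disable(), and the counters behind
 * part_stat_inc()/part_stat_add() are per-cpu, so the fast path does
 * not take a lock or bounce a shared cacheline.
 */
static void sketch_account_done(struct block_device *part, int sgrp,
                                unsigned int sectors, u64 duration_ns)
{
        part_stat_lock();                       /* preempt_disable() */
        part_stat_inc(part, ios[sgrp]);         /* per-cpu increment */
        part_stat_add(part, sectors[sgrp], sectors);
        part_stat_add(part, nsecs[sgrp], duration_ns);
        part_stat_unlock();                     /* preempt_enable() */
}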

>>> Is it? Enabling IO stats for normal devices has a very noticeable impact
>>> on performance at the higher end of the scale.

>> Interesting, I didn't think this would be that noticeable. How much
>> would you quantify the impact in terms of %?

> If we take it to the extreme - my usual peak benchmark, which is drive
> limited at 122M IOPS, runs at 113M IOPS if I have iostats enabled. If I
> lower the queue depth (128 -> 16), then peak goes from 46M to 44M. Not
> as dramatic, but still quite noticeable. This is just using a single
> thread on a single CPU core per drive, so not throwing tons of CPU at
> it.

> Now, I have no idea how well nvme multipath currently scales or works.

Should be pretty scalable and efficient. There is no bio cloning and
the only shared state is an srcu wrapping the submission path and path
lookup.
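
To make that concrete, the submission side has roughly the following
shape (a condensed sketch of nvme_ns_head_submit_bio() in
drivers/nvme/host/multipath.c, with the failure and requeue-kick
details trimmed; struct nvme_ns_head and nvme_find_path() come from
the driver's nvme.h):

/*
 * The only shared state touched per I/O is head->srcu, which protects
 * the path (namespace) lookup; the bio is not cloned, just re-targeted
 * to the chosen path's disk.
 */
static void sketch_ns_head_submit_bio(struct bio *bio)
{
        struct nvme_ns_head *head = bio->bi_bdev->bd_disk->private_data;
        struct nvme_ns *ns;
        int srcu_idx;

        srcu_idx = srcu_read_lock(&head->srcu);
        ns = nvme_find_path(head);              /* cached per-node path */
        if (likely(ns)) {
                bio_set_dev(bio, ns->disk->part0);
                submit_bio_noacct(bio);         /* no bio cloning */
        } else {
                /* no usable path: park the bio for later requeue */
                spin_lock_irq(&head->requeue_lock);
                bio_list_add(&head->requeue_list, bio);
                spin_unlock_irq(&head->requeue_lock);
        }
        srcu_read_unlock(&head->srcu, srcu_idx);
}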

> Would be interesting to test that separately. But if you were to double
> (or more, I guess 3x if you're doing the exposed device and then adding
> stats to at least two below?) the overhead, that'd certainly not be
> free.

It is not 3x. In the patch, nvme-multipath accounts separately from the
bottom devices, so each request is accounted once for the bottom device
and once for the upper device.

But again, my working assumption is that IO stats must be exposed for
an nvme-multipath device (unless the user disabled them). So it is a
matter of whether we take a simple approach, where nvme-multipath does
"double" accounting, or we come up with a scheme that allows the driver
to collect stats on behalf of the block layer and then add non-trivial
logic to combine stats like iops/bw/latency accurately from the bottom
devices.

My vote would be to go with the former.
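
To be explicit about what "double" accounting buys us in simplicity:
the bottom namespaces keep the normal blk-mq request accounting they
already have, and the mpath node just uses the bio-based helpers the
block layer already exposes. Something along these lines (the sketch_*
wrappers are illustrative only, not the names in the patch):

#include <linux/blkdev.h>

/* Start accounting on the mpath node while bio->bi_bdev still points
 * at it; stash the returned start time per-I/O. */
static unsigned long sketch_mpath_start_io_acct(struct bio *bio)
{
        return bio_start_io_acct(bio);
}

/* By completion the bio has been re-targeted to the bottom namespace,
 * so complete the accounting against the original mpath bdev. */
static void sketch_mpath_end_io_acct(struct bio *bio, unsigned long start,
                                     struct block_device *mpath_bdev)
{
        bio_end_io_acct_remapped(bio, start, mpath_bdev);
}

The bottom devices are untouched; blk-mq accounts their requests as it
does for any other namespace.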

>> I don't have any insight on this for blk-mq, probably because I've never
>> seen any user turn IO stats off (or at least don't remember).

> Most people don't care, but some certainly do. As per the above, it's
> noticeable enough that it makes a difference if you're chasing latencies
> or peak performance.

>> My (very limited) testing did not show any noticeable differences for
>> nvme-loop. All I'm saying is that we need to have IO stats for the mpath
>> device node. If there is a clever way to collect this from the hidden
>> devices just for nvme, great, but we need to expose these stats.

> From a previous message, sounds like that's just some qemu setup? Hard
> to measure anything there with precision in my experience, and it's not
> really peak performance territory either.

It's not qemu, it is null_blk exported over nvme-loop (an nvmet loop
device). So it is faster, but definitely not something that can provide
insight into the realm of real HW.


