Re: [PATCH 2/2] virtio-blk: set NUMA affinity for a tagset

Max Gurtovoy <mgurtovoy@xxxxxxxxxx> · Wed, 29 Sep 2021 12:48:10 +0300

On 9/29/2021 9:50 AM, Leon Romanovsky wrote:
On Wed, Sep 29, 2021 at 02:28:08AM +0300, Max Gurtovoy wrote:
On 9/28/2021 7:27 PM, Leon Romanovsky wrote:
On Tue, Sep 28, 2021 at 06:59:15PM +0300, Max Gurtovoy wrote:
On 9/27/2021 9:23 PM, Leon Romanovsky wrote:
On Mon, Sep 27, 2021 at 08:25:09PM +0300, Max Gurtovoy wrote:
On 9/27/2021 2:34 PM, Leon Romanovsky wrote:
On Sun, Sep 26, 2021 at 05:55:18PM +0300, Max Gurtovoy wrote:
To optimize performance, set the affinity of the block device tagset
according to the virtio device affinity.

Signed-off-by: Max Gurtovoy <mgurtovoy@xxxxxxxxxx>
---
     drivers/block/virtio_blk.c | 2 +-
     1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 9b3bd083b411..1c68c3e0ebf9 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -774,7 +774,7 @@ static int virtblk_probe(struct virtio_device *vdev)
     	memset(&vblk->tag_set, 0, sizeof(vblk->tag_set));
     	vblk->tag_set.ops = &virtio_mq_ops;
     	vblk->tag_set.queue_depth = queue_depth;
-	vblk->tag_set.numa_node = NUMA_NO_NODE;
+	vblk->tag_set.numa_node = virtio_dev_to_node(vdev);
I'm afraid that by doing this, you will increase the chances of seeing OOM,
because with NUMA_NO_NODE, MM will try to allocate memory from the whole system,
while in the latter mode only from a specific NUMA node, which can be depleted.
This is a common methodology we use in the block layer and in the NVMe subsystem,
and we are not afraid of the OOM issue you raised.
There are many reasons for that, but we are talking about virtio here
and not about NVMe.
OK, what reasons?
For example, NVMe devices are physical devices that rely on DMA operations,
PCI connectivity, etc. to operate. Such systems indeed can benefit from
NUMA locality hints. At the end, these devices are physically connected
to that NUMA node.
FYI, virtio devices are also physical devices that have a PCI interface and
rely on DMA operations.

From the virtio spec: "Virtio devices use normal bus mechanisms of interrupts
and DMA which should be familiar to any device driver author".
Yes, this is how a bus in Linux is implemented, there is nothing new here.

So why did you say that virtio is not a PCI device with DMA capabilities?


Also we develop virtio HW at NVIDIA for blk and net devices with our SNAP
technology.

These devices are connected via PCI bus to the host.
How is all this related to the general virtio-blk implementation?

They use the same driver.

We develop HW virtio devices for bare metal cloud and also for
virtualized cloud that uses the SRIOV feature of the PF (real PF).


We also support SRIOV.

The same is true for paravirt devices that are emulated by QEMU but that
the guest still sees as PCI devices.
Yes, the key word here - "emulated".

It doesn't matter. The guest kernel doesn't know if it's a paravirt 
device or real NVIDIA HW virtio SNAP device.

And FYI, a guest can also have 2 NUMA nodes and can benefit from this patch.


In our case, virtio-blk is a software interface that doesn't have all
these limitations. On the contrary, the virtio-blk can be created on one
CPU and moved later to be close to the QEMU which can run on another NUMA
node.
Not at all. virtio is a HW interface.
Virtio devices are para-virtualized devices that are represented as HW interfaces
in the guest OS. They do not need to be real devices in the hypervisor,
which is my (and probably most of the world's) use case.

Again, the kernel doesn't care or know if it's a paravirt device or not.
And it shouldn't care.

This patch is for the kernel driver and not QEMU.


My QEMU command line contains something like this: "-drive file=IMAGE.img,if=virtio"

This is one option.

For an NVIDIA HW device, you pass a virtio device exactly as you pass an
mlx5 device - using vfio + vfio_pci.



I don't understand what you are saying here.

Also, this patch increases the chances of getting OOM by a factor of the number of NUMA nodes.
This is common practice in Linux for storage drivers. Why does it bother
you at all?
Do I need a reason to ask for a clarification of a publicly posted patch
on an open mailing list?

I use virtio and care about it.

I meant, why don't you want to change the entire block layer and NVMe
subsystem?

Why does only this patch bother you?


I already decreased the memory footprint for virtio-blk devices.
As I wrote before, you decreased it by several KB, but with this patch you
limited the available memory by orders of magnitude.


Before your patch, virtio_blk can allocate from X memory; after your
patch it will be X/NUM_NUMA_NODES.
So go ahead and change all the block layer if it bothers you so much.

Also please change the NVMe subsystem when you do it.
I suggest a less radical approach - don't take patches without proven
benefit.

We are in 2021, let's rely on NUMA node policy.

I'm trying to add NUMA policy here. Exactly.



And let's see what the community will say.
Stefan asked you for performance data too. I'm not alone here.


I said I'll have a V2.

I would also like to hear the opinion of the block maintainers like Jens
and Christoph regarding NUMA affinity for block drivers.

In addition, it has every chance of even hurting performance.

So yes, post v2, but as Stefan and I asked, please provide supporting
performance results, because what was done for another subsystem doesn't
mean that it will be applicable here.
I will measure the perf, but even if we won't see an improvement since it
might not be the bottleneck, this change should be merged since this is the
way the block layer is optimized.
That is not an acceptance criterion for merging patches.

This is a micro-optimization that is commonly used in other subsystems as well. And
none of your above reasons (PCI, SW device, DMA) is true.
Every subsystem is different, in some it makes sense, in others it doesn't.

But you were wrong in saying that a virtio device is not a PCI HW device that
uses DMA.

Do you understand the solution now?


We (RDMA) had a very long discussion (together with perf data) and a heavily tailored
test to measure the influence of per-node allocations, and guess what? We didn't see
any performance advantage.

https://lore.kernel.org/linux-rdma/c34a864803f9bbd33d3f856a6ba2dd595ab708a7.1620729033.git.leonro@xxxxxxxxxx/

So go ahead and change all the kernel or the block layer.

As you said, for the RDMA subsystem it might not be a good idea. I don't
want to discuss RDMA considerations in this thread.

Let's talk storage and virtio.


A virtio-blk device is, in 99% of cases, a PCI device (paravirt or real HW) exactly like
any other PCI device you are familiar with.

It's connected physically to some slot, it has a BAR, MMIO, configuration
space, etc.
In the general case, it is far from being true.

It's exactly true.

But let's give MST and Stefan a chance to comment.


Thanks.

Thanks