On 9/28/2021 9:47 AM, Stefan Hajnoczi wrote:
On Mon, Sep 27, 2021 at 08:39:30PM +0300, Max Gurtovoy wrote:
On 9/27/2021 11:09 AM, Stefan Hajnoczi wrote:
On Sun, Sep 26, 2021 at 05:55:18PM +0300, Max Gurtovoy wrote:
To optimize performance, set the affinity of the block device tagset
according to the virtio device affinity.
Signed-off-by: Max Gurtovoy <mgurtovoy@xxxxxxxxxx>
---
drivers/block/virtio_blk.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 9b3bd083b411..1c68c3e0ebf9 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -774,7 +774,7 @@ static int virtblk_probe(struct virtio_device *vdev)
 	memset(&vblk->tag_set, 0, sizeof(vblk->tag_set));
 	vblk->tag_set.ops = &virtio_mq_ops;
 	vblk->tag_set.queue_depth = queue_depth;
-	vblk->tag_set.numa_node = NUMA_NO_NODE;
+	vblk->tag_set.numa_node = virtio_dev_to_node(vdev);
 	vblk->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
 	vblk->tag_set.cmd_size =
 		sizeof(struct virtblk_req) +
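The point of the change is that blk-mq consults tag_set.numa_node when
it allocates the per-hardware-queue tag and request memory, so naming
the device's node keeps that memory local to the device. A minimal
sketch of the effect (illustration only, not the exact kernel code;
alloc_queue_state is a made-up name):

    /* blk-mq style node-aware allocation: with a real node id the
     * memory lands on that node; with NUMA_NO_NODE any node is used.
     */
    static void *alloc_queue_state(struct blk_mq_tag_set *set, size_t size)
    {
    	return kzalloc_node(size, GFP_KERNEL, set->numa_node);
    }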
I implemented NUMA affinity in the past and could not demonstrate a
performance improvement:
https://lists.linuxfoundation.org/pipermail/virtualization/2020-June/048248.html
The pathological case is when a guest with vNUMA has the virtio-blk-pci
device on the "wrong" host NUMA node, so memory accesses have to cross
NUMA nodes. Still, it didn't seem to matter.
I think the reason you didn't see any improvement is that you didn't use
the right device for the node query. See my patch 1/2, sketched below.
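(Patch 1/2 is not quoted in this thread; a rough sketch of such a
helper, assuming it reports the node of the virtio device's parent
transport device, would be:)

    /* Sketch only: return the NUMA node of the underlying transport
     * device (e.g. the PCI function) instead of the virtio child.
     */
    static inline int virtio_dev_to_node(struct virtio_device *vdev)
    {
    	struct device *parent = vdev->dev.parent;

    	return parent ? dev_to_node(parent) : NUMA_NO_NODE;
    }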
That doesn't seem to be the case. Please see
drivers/base/core.c:device_add():
	/* use parent numa_node */
	if (parent && (dev_to_node(dev) == NUMA_NO_NODE))
		set_dev_node(dev, dev_to_node(parent));
IMO it's cleaner to use dev_to_node(&vdev->dev) than to directly access
the parent.
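If device_add() has already copied the parent's node, the two queries
should agree once the device is registered, e.g. (illustration only):

    /* after device registration, both should name the same node */
    int a = dev_to_node(&vdev->dev);        /* node inherited by the child */
    int b = dev_to_node(vdev->dev.parent);  /* node of the parent device   */
    WARN_ON(a != b);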
Have I missed something?
But dev_to_node(dev) is 0 IMO.
Who sets it to NUMA_NO_NODE?
I can try integrating these patches into my series and fixing it.
BTW, we might not see a big improvement because of other bottlenecks,
but this is a known perf optimization that we often use in block storage
drivers.
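For instance, nvme-pci wires its tag set to the controller's node in
the same way (paraphrased from drivers/nvme/host/pci.c):

    /* nvme-pci: allocate blk-mq state on the controller's node */
    dev->tagset.numa_node = dev_to_node(dev->dev);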
Let's see benchmark results. Otherwise this is just dead code that adds
complexity.
Stefan