On Mon, Jun 29, 2020 at 10:26:46AM +0100, Stefan Hajnoczi wrote:
> On Sun, Jun 28, 2020 at 02:34:37PM +0800, Jason Wang wrote:
> >
> > On 2020/6/25 9:57 PM, Stefan Hajnoczi wrote:
> > > These patches are not ready to be merged because I was unable to measure a
> > > performance improvement. I'm publishing them so they are archived in case
> > > someone picks up this work again in the future.
> > >
> > > The goal of these patches is to allocate virtqueues and driver state from the
> > > device's NUMA node for optimal memory access latency. Only guests with a vNUMA
> > > topology and virtio devices spread across vNUMA nodes benefit from this. In
> > > other cases the memory placement is fine and we don't need to take NUMA into
> > > account inside the guest.
> > >
> > > These patches could be extended to virtio_net.ko and other devices in the
> > > future. I only tested virtio_blk.ko.
> > >
> > > The benchmark configuration was designed to trigger worst-case NUMA placement:
> > > * Physical NVMe storage controller on host NUMA node 0

It's possible that NUMA is not such a big deal for NVMe.
It's also possible that the BIOS misconfigures ACPI and reports NUMA
placement incorrectly.
I think the best thing to try is a ramdisk on a specific NUMA node.

> > > * IOThread pinned to host NUMA node 0
> > > * virtio-blk-pci device in vNUMA node 1
> > > * vCPU 0 on host NUMA node 1 and vCPU 1 on host NUMA node 0
> > > * vCPU 0 in vNUMA node 0 and vCPU 1 in vNUMA node 1
> > >
> > > The intent is to have .probe() code run on vCPU 0 in vNUMA node 0 (host NUMA
> > > node 1) so that memory is in the wrong NUMA node for the virtio-blk-pci device.
> > > Applying these patches fixes memory placement so that virtqueues and driver
> > > state is allocated in vNUMA node 1 where the virtio-blk-pci device is located.
> > >
> > > The fio 4KB randread benchmark results do not show a significant improvement:
> > >
> > > Name                  IOPS   Error
> > > virtio-blk        42373.79 ± 0.54%
> > > virtio-blk-numa   42517.07 ± 0.79%
> > >
> >
> > I remember I did something similar in vhost by using page_to_nid() for
> > descriptor ring. And I get little improvement as shown here.
> >
> > Michael reminds that it was probably because all data were cached. So I
> > doubt if the test lacks sufficient stress on the cache ...
>
> Yes, that sounds likely. If there's no real-world performance
> improvement then I'm happy to leave these patches unmerged.
>
> Stefan

Well that was for vhost though. This is virtio, which is different.

Doesn't some benchmark put pressure on the CPU cache?

I kind of feel there should be a difference, and the fact there isn't
means there's some other bottleneck somewhere. Might be worth
figuring out.

-- 
MST
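
For reference, the "no significant improvement" reading of the fio numbers quoted above can be checked with a few lines of arithmetic. This is only a rough sketch: it assumes the "Error" column is a symmetric relative interval around the mean IOPS, which the thread does not actually specify, and the helper names (interval, overlaps) are made up for illustration.

```python
# Rough check of the quoted fio results.
# Assumption: "Error" is a symmetric relative interval around the mean IOPS.

def interval(mean, rel_err):
    """Return (low, high) for mean +/- rel_err * mean."""
    half = mean * rel_err
    return (mean - half, mean + half)

def overlaps(a, b):
    """True if the closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

BASE_IOPS = 42373.79   # virtio-blk
NUMA_IOPS = 42517.07   # virtio-blk-numa

baseline = interval(BASE_IOPS, 0.0054)
numa = interval(NUMA_IOPS, 0.0079)

# Relative gain of the NUMA-aware run over the baseline run.
rel_gain = (NUMA_IOPS - BASE_IOPS) / BASE_IOPS

print(f"relative gain: {rel_gain:.2%}")
print("intervals overlap:", overlaps(baseline, numa))
```

Under that assumption the gain is about 0.34%, smaller than either run's reported error, and the two intervals overlap, which is consistent with the result being read as noise.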