On 09/08/2010 12:22 PM, Krishna Kumar2 wrote:
Avi Kivity<avi@xxxxxxxxxx> wrote on 09/08/2010 01:17:34 PM:
On 09/08/2010 10:28 AM, Krishna Kumar wrote:
The following patches implement transmit multiqueue (mq) in virtio-net. Also
included are the userspace qemu changes.
1. This feature was first implemented with a single vhost.
Testing showed a 3-8% performance gain for up to 8 netperf
sessions (and sometimes 16), but BW dropped with more
sessions. However, implementing per-txq vhost improved
BW significantly all the way to 128 sessions.
Why were vhost kernel changes required? Can't you just instantiate more
vhost queues?
I did try using a single host thread to process packets from multiple
vqs, but the BW dropped beyond a certain number of
sessions.
Oh - so the interface has not changed (which can be seen from the
patch). That was my concern; I remembered that we had planned for vhost-net
to be multiqueue-ready.
The new guest and qemu code work with old vhost-net, just with reduced
performance, yes?
I don't have the code and performance numbers for that
right now since it is a bit ancient, but I can try to resuscitate
it if you want.
No need.
Guest interrupts for a 4 TXQ device after a 5 min test:
# egrep "virtio0|CPU" /proc/interrupts
        CPU0    CPU1    CPU2    CPU3
 40:       0       0       0       0  PCI-MSI-edge  virtio0-config
 41:  126955  126912  126505  126940  PCI-MSI-edge  virtio0-input
 42:  108583  107787  107853  107716  PCI-MSI-edge  virtio0-output.0
 43:  300278  297653  299378  300554  PCI-MSI-edge  virtio0-output.1
 44:  372607  374884  371092  372011  PCI-MSI-edge  virtio0-output.2
 45:  162042  162261  163623  162923  PCI-MSI-edge  virtio0-output.3
How are vhost threads and host interrupts distributed? We need to move
vhost queue threads to be colocated with the related vcpu threads (if no
extra cores are available) or on the same socket (if extra cores are
available). Similarly, move device interrupts to the same core as the
vhost thread.
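Concretely, a minimal sketch of that kind of pinning (the vhost thread id,
host NIC IRQ number and CPU numbers below are made up for illustration, and
irqbalance would need to be off or it may rewrite the affinity mask):

# see which CPU each vhost worker last ran on
ps -eLo tid,psr,comm | grep vhost
# pin one vhost worker (hypothetical tid 4321) next to its vcpu on CPU 2
taskset -p -c 2 4321
# route the host NIC interrupt (hypothetical IRQ 67) to the same core (mask 0x4 = CPU 2)
echo 4 > /proc/irq/67/smp_affinity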
All my testing was done without any tuning, not even binding netperf &
netserver (irqbalance is also off). I assume (maybe wrongly) that
the above might give better results?
I hope so!
Are you suggesting this combination:

IRQ on guest:
    40: CPU0
    41: CPU1
    42: CPU2
    43: CPU3    (all CPUs are on socket #0)

vhost:
    thread #0: CPU0
    thread #1: CPU1
    thread #2: CPU2
    thread #3: CPU3

qemu:
    thread #0: CPU4
    thread #1: CPU5
    thread #2: CPU6
    thread #3: CPU7    (all CPUs are on socket #1)
It may be better to put the vcpu threads and vhost threads on the same socket.
We also need to affine the host interrupts.
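As a rough sketch of that layout (the vcpu thread ids, vhost thread ids and
NIC IRQ number below are placeholders; vcpu thread ids can be found under
/proc/<qemu-pid>/task and vhost workers via ps), assuming CPUs 0-3 are socket #0:

# keep vcpu threads and vhost workers together on socket #0 (CPUs 0-3, assumed)
for tid in <vcpu-tids> <vhost-tids>; do
    taskset -p -c 0-3 $tid
done
# keep the host NIC interrupt on the same socket (mask 0xf = CPUs 0-3)
echo f > /proc/irq/<nic-irq>/smp_affinity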
netperf/netserver:
Run on CPUs 0-4 on both sides
The reason I did not optimize anything from userspace is that I felt
it was important to show that the defaults work reasonably well.
Definitely. Heavy tuning is not a useful path for general end users.
We need to make sure the scheduler is able to arrive at the optimal
layout without pinning (but perhaps with hints).
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.