Re: [RFC PATCH 0/4] Implement multiqueue virtio-net

On 09/08/2010 12:22 PM, Krishna Kumar2 wrote:
> Avi Kivity <avi@xxxxxxxxxx> wrote on 09/08/2010 01:17:34 PM:
>
>> On 09/08/2010 10:28 AM, Krishna Kumar wrote:
>>> The following patches implement transmit multiqueue (mq) in
>>> virtio-net.  Also included are the corresponding qemu userspace
>>> changes.
>>>
>>> 1. This feature was first implemented with a single vhost.
>>>    Testing showed a 3-8% performance gain for up to 8 netperf
>>>    sessions (and sometimes 16), but BW dropped with more
>>>    sessions.  However, implementing per-txq vhost improved
>>>    BW significantly all the way to 128 sessions.
>>
>> Why were vhost kernel changes required?  Can't you just instantiate
>> more vhost queues?
>
> I did try using a single thread to process packets from multiple
> vqs on the host, but BW dropped beyond a certain number of
> sessions.

Oh - so the interface has not changed (which can be seen from the patch).
That was my concern; I remembered that we had planned for vhost-net to be
multiqueue-ready.

The new guest and qemu code work with old vhost-net, just with reduced
performance, yes?

> I don't have the code and performance numbers for that right now
> since it is a bit ancient; I can try to resuscitate it if you want.

No need.
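
As an aside, with per-txq vhost each queue gets its own kernel thread on
the host, so the thread count and current CPU placement can be checked
with something like this (assuming the usual "vhost-<owner pid>" thread
naming):

  # list the vhost worker threads and the CPU each one last ran on
  ps -eLo pid,lwp,comm,psr | grep vhost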

>>> Guest interrupts for a 4 TXQ device after a 5 min test:
>>>
>>> # egrep "virtio0|CPU" /proc/interrupts
>>>        CPU0     CPU1     CPU2     CPU3
>>> 40:   0        0        0        0        PCI-MSI-edge  virtio0-config
>>> 41:   126955   126912   126505   126940   PCI-MSI-edge  virtio0-input
>>> 42:   108583   107787   107853   107716   PCI-MSI-edge  virtio0-output.0
>>> 43:   300278   297653   299378   300554   PCI-MSI-edge  virtio0-output.1
>>> 44:   372607   374884   371092   372011   PCI-MSI-edge  virtio0-output.2
>>> 45:   162042   162261   163623   162923   PCI-MSI-edge  virtio0-output.3
>>
>> How are vhost threads and host interrupts distributed?  We need to move
>> vhost queue threads to be colocated with the related vcpu threads (if no
>> extra cores are available) or on the same socket (if extra cores are
>> available).  Similarly, move device interrupts to the same core as the
>> vhost thread.
>
> All my testing was without any tuning, including binding netperf &
> netserver (irqbalance is also off).  I assume (maybe wrongly) that
> the above might give better results?

I hope so!
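
The kind of placement described above can be scripted by hand on the host
with taskset and the IRQ affinity masks; the thread IDs, IRQ number and
CPU choices below are only placeholders:

  # find the vhost worker threads and the qemu vcpu threads
  ps -eLo lwp,comm,psr | egrep 'vhost|qemu'

  # pin each vhost worker next to (or on the same socket as) the vcpu
  # thread it serves -- TIDs and CPU numbers here are examples
  taskset -pc 0 <vhost_tid_0>
  taskset -pc 1 <vhost_tid_1>

  # steer the relevant host device interrupt to the same core
  # (hex CPU bitmask: 0x1 = CPU0)
  echo 1 > /proc/irq/<irq>/smp_affinity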

> Are you suggesting this combination:
>
>     IRQ on guest:
>         40: CPU0
>         41: CPU1
>         42: CPU2
>         43: CPU3 (all CPUs are on socket #0)
>     vhost:
>         thread #0:  CPU0
>         thread #1:  CPU1
>         thread #2:  CPU2
>         thread #3:  CPU3
>     qemu:
>         thread #0:  CPU4
>         thread #1:  CPU5
>         thread #2:  CPU6
>         thread #3:  CPU7 (all CPUs are on socket #1)

Maybe better to put the vcpu threads and vhost threads on the same socket.

Also need to affine host interrupts.

>     netperf/netserver:
>         Run on CPUs 0-4 on both sides
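
Concretely, the guest half of such a layout would look something like the
following, using the virtio0-output vector numbers from the listing above
(42-45) and CPUs 0-3 purely as an example:

  # guest: pin each virtio0-output vector to one vcpu (hex CPU bitmasks)
  echo 1 > /proc/irq/42/smp_affinity   # output.0 -> CPU0
  echo 2 > /proc/irq/43/smp_affinity   # output.1 -> CPU1
  echo 4 > /proc/irq/44/smp_affinity   # output.2 -> CPU2
  echo 8 > /proc/irq/45/smp_affinity   # output.3 -> CPU3

  # bind the netperf clients to the same set of CPUs
  taskset -c 0-3 netperf -H <netserver_ip> -l 60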

> The reason I did not optimize anything from user space is that I felt
> it was important to show that the default works reasonably well.

Definitely.  Heavy tuning is not a useful path for general end users.  We
need to make sure the scheduler is able to arrive at the optimal layout
without pinning (but perhaps with hints).

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


