Re: VirtIO vs Emulation Netperf benchmark results

Sorry for the repost, I pressed reply instead of reply-to-all.

On Thu, Oct 11, 2012 at 11:55 AM, Alexander Graf <agraf@xxxxxxx> wrote:


On 11.10.2012, at 11:46, Marc Zyngier <marc.zyngier@xxxxxxx> wrote:

> On 10/10/12 19:58, Alexander Graf wrote:
>>
>>
>> On 10.10.2012, at 20:52, Christoffer Dall <c.dall@xxxxxxxxxxxxxxxxxxxxxx> wrote:
>>
>>> On Wed, Oct 10, 2012 at 2:50 PM, Alexander Graf <agraf@xxxxxxx> wrote:
>>>>
>>>>
>>>> On 10.10.2012, at 20:39, Alexander Spyridakis
>>>> <a.spyridakis@xxxxxxxxxxxxxxxxxxxxxx> wrote:
>>>>
>>>> For your information, with the latest VirtIO-related developments I ran
>>>> netperf a couple of times to see where network performance stands on
>>>> the guests.
>>>>
>>>> The test was to run netperf -H "ip of LAN node", which measures TCP
>>>> throughput for 10 seconds.
>>>>
>>>> x86       - x86:  ~96 Mbps - reference between two different computers
>>>> ARM Host  - x86:  ~80 Mbps
>>>> ARM Guest - x86:   ~2 Mbps - emulation
>>>> ARM Guest - x86:  ~74 Mbps - VirtIO
>>>>
>>>> From these results we conclude that:
>>>>
>>>> - As expected, x86 to x86 communication can reach the limit of the
>>>>   100 Mbps LAN.
>>>> - The ARM board itself does not seem able to saturate the LAN.
>>>> - Network emulation in QEMU is more than just slow (expected).
>>>>
>>>>
>>>> Why is this expected? This performance drop is quite terrifying.
>>>>
>>>
>>> I think he means expected as in, we already know we have this
>>> terrifying problem. I'm looking into this right now, and I believe
>>> Marc is also on this.
>>
>> Ah, good :). Since you are on a dual-core machine with lots of traffic, you should get almost no vmexits for virtio queue processing.
>>
>> Since we know that this is a fast case, the big difference from emulated devices is the exits. So I'd look there :).
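
(As an aside, here is a rough sketch of the mechanism Alex refers to. This
is not the actual Linux virtio driver code, just an illustration of how
notification suppression lets a busy queue avoid kicking the host, and
therefore exiting, for every buffer.)

#include <stdint.h>
#include <stdbool.h>

#define VRING_USED_F_NO_NOTIFY 1

struct vring_used {
    uint16_t flags;             /* written by the device/host side */
    uint16_t idx;
    /* used ring entries follow */
};

struct virtqueue {
    struct vring_used *used;
    void (*notify)(struct virtqueue *vq);  /* MMIO/PIO write -> vm exit */
};

static bool virtqueue_kick_needed(const struct virtqueue *vq)
{
    /* The host sets NO_NOTIFY while it is already draining the queue,
     * so under heavy traffic almost no kick is actually required. */
    return !(vq->used->flags & VRING_USED_F_NO_NOTIFY);
}

static void virtqueue_add_and_kick(struct virtqueue *vq)
{
    /* ... publish buffers to the avail ring, then a memory barrier ... */
    if (virtqueue_kick_needed(vq))
        vq->notify(vq);         /* only this path costs a vm exit */
}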
>
> There's a number of things we're aware of:
>
> - The emulated device is pure PIO. Using this kind of device is always
> going to suck, and even more so on KVM. We could use a "less braindead"
> model (some DMA-capable device), but if we are going to depart from the
> real VE board anyway, I'd rather go virtio all the way.
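
(For illustration, a hypothetical sketch of why a pure programmed-I/O model
hurts so much under KVM. The register names and offsets below are made up
and are not those of the real VE board's NIC; the point is only that every
word of every frame goes through an emulated register.)

#include <stdint.h>
#include <stddef.h>

#define NIC_TX_FIFO  0x20   /* hypothetical data FIFO register offset */
#define NIC_TX_CMD   0x24   /* hypothetical "start transmission" register */

/* Mapped device window; set up by the driver's probe/ioremap in reality. */
static volatile uint32_t *nic_regs;

static void nic_send_frame(const uint32_t *frame, size_t words)
{
    /* Every store below is an access to emulated hardware, i.e. one trap
     * out of the guest plus a round trip through QEMU. A 1500-byte frame
     * is on the order of 375 exits, versus roughly one notification per
     * batch of frames with virtio. */
    for (size_t i = 0; i < words; i++)
        nic_regs[NIC_TX_FIFO / 4] = frame[i];

    nic_regs[NIC_TX_CMD / 4] = 1;   /* tell the model to transmit */
}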

Well, you should try to get comparable performance numbers. If that means exposing that braindead device on an x86 vm and turning off coalesced mmio, so be it.

The alternative is to expose PCI into the guest, even when it's only half-working. It's not meant for production, but to get performance comparison data that you can sanity check against x86 to see if (and what) you're doing wrong.

>
> - Our exit path is painfully long. We could maybe make it more efficient
> by being lazier and delaying the switch of some of the state until we
> get preempted (VFP state, for example). Not sure how much of an
> improvement this would make, though.

A lazy FP switch bought me quite a significant speedup on PPC. It won't help you here though: user space exits need to restore that state regardless, unless the guest hasn't used FP, in which case you save yourself both directions of the FP state switch.


VFP switches are already done lazily, however: we switch only when the guest actually uses an FP or Advanced SIMD instruction, not on every entry. In fact, when we lazily switch the VFP registers we return directly from the Hyp mode trap context to the guest, without really giving the host a chance to do much; we do not go all the way back to the ioctl loop.

However, on the next vm exit we switch back to the host state regardless of whether the host is going to use VFP, but I don't think optimizing that would offer any big benefit, especially for I/O.

Of course things could always be improved. For example, we could try handling the VFP/NEON control registers separately and emulating them, instead of doing a complete switch every time the guest does something simple, e.g. merely checking whether VFP is enabled. But we would need some numbers to know whether this makes things better or worse, since it implies another exit.
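
To make the above concrete, here is a rough sketch of the lazy switch as
described. This is not the actual KVM/ARM code; the low-level helpers are
stand-ins for the real assembly routines and trap-control register writes.

struct vfp_state {
    unsigned long long dregs[32];    /* d0-d31 */
    unsigned int fpexc, fpscr;
};

struct vcpu_ctx {
    struct vfp_state host_vfp;
    struct vfp_state guest_vfp;
    int guest_vfp_loaded;            /* 0: host regs live, 1: guest regs live */
};

/* Stand-ins for the real save/restore and trap-control primitives. */
static void save_vfp(struct vfp_state *s)    { (void)s; /* vstm/vmrs */ }
static void restore_vfp(struct vfp_state *s) { (void)s; /* vldm/vmsr */ }
static void enable_fp_traps(void)  { /* e.g. set the HCPTR cp10/cp11 traps */ }
static void disable_fp_traps(void) { /* e.g. clear the HCPTR cp10/cp11 traps */ }

/* Hyp trap handler path: the guest touched VFP/Advanced SIMD. */
static void handle_guest_fp_trap(struct vcpu_ctx *ctx)
{
    save_vfp(&ctx->host_vfp);
    restore_vfp(&ctx->guest_vfp);
    ctx->guest_vfp_loaded = 1;
    disable_fp_traps();      /* guest now runs FP code at full speed */
    /* return straight to the guest, no trip back to the ioctl loop */
}

/* On the next full vm exit we unconditionally hand the state back. */
static void vcpu_put_fp(struct vcpu_ctx *ctx)
{
    if (ctx->guest_vfp_loaded) {
        save_vfp(&ctx->guest_vfp);
        restore_vfp(&ctx->host_vfp);
        ctx->guest_vfp_loaded = 0;
        enable_fp_traps();   /* re-arm the lazy switch for the next entry */
    }
}

The idea floated above would keep trapping but only emulate accesses to the
control registers (e.g. a read of FPEXC) instead of doing the full switch in
handle_guest_fp_trap, which is why it implies an extra exit per access.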

Best regards,
Antonios
 
>
> - Our memory bandwidth is reduced by the number of TLB entries we waste
> by using 4kB pages instead of section mappings. Running hackbench in a
> guest shows quite a slowdown that should mostly go away if/when
> userspace switches to huge pages as backing store. I expect virtio to
> suffer from the same problem.
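
(For reference, a minimal sketch of what "huge pages as backing store" means
on the userspace side, using the standard KVM memslot ioctl. The size and
the guest RAM base below are made-up example values, and error handling is
omitted.)

#define _GNU_SOURCE
#include <sys/mman.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>
#include <stdint.h>

#define GUEST_RAM_SIZE (256UL << 20)     /* 256 MB, example value */
#define GUEST_RAM_BASE 0x80000000UL      /* example guest physical base */

static void *alloc_guest_ram(void)
{
    /* MAP_HUGETLB backs the region with huge pages, so stage-2 can use
     * section mappings instead of thousands of 4kB entries. */
    return mmap(NULL, GUEST_RAM_SIZE, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
}

static int register_guest_ram(int vm_fd, void *ram)
{
    struct kvm_userspace_memory_region region = {
        .slot            = 0,
        .guest_phys_addr = GUEST_RAM_BASE,
        .memory_size     = GUEST_RAM_SIZE,
        .userspace_addr  = (uintptr_t)ram,
    };
    return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
}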

That one should be in a completely different ballpark. I'd be very surprised if you got more than a 10% slowdown even in TLB-miss-intensive workloads. Definitely not as low-hanging a fruit as what we see here.

Alex

>
> Once we've addressed these points, I expect the I/O performance to get
> better, at least by some margin.
>
>    M.
> --
> Jazz is not dead. It just smells funny...
>

_______________________________________________
kvmarm mailing list
kvmarm@xxxxxxxxxxxxxxxxxxxxx
https://lists.cs.columbia.edu/cucslists/listinfo/kvmarm
