On Thu, Oct 11, 2012 at 6:12 AM, Antonios Motakis
<a.motakis@xxxxxxxxxxxxxxxxxxxxxx> wrote:
> Sorry for a repost, pressed reply instead of reply to all.
>
> On Thu, Oct 11, 2012 at 11:55 AM, Alexander Graf <agraf@xxxxxxx> wrote:
>>
>> On 11.10.2012, at 11:46, Marc Zyngier <marc.zyngier@xxxxxxx> wrote:
>>
>> > On 10/10/12 19:58, Alexander Graf wrote:
>> >>
>> >> On 10.10.2012, at 20:52, Christoffer Dall
>> >> <c.dall@xxxxxxxxxxxxxxxxxxxxxx> wrote:
>> >>
>> >>> On Wed, Oct 10, 2012 at 2:50 PM, Alexander Graf <agraf@xxxxxxx> wrote:
>> >>>>
>> >>>> On 10.10.2012, at 20:39, Alexander Spyridakis
>> >>>> <a.spyridakis@xxxxxxxxxxxxxxxxxxxxxx> wrote:
>> >>>>
>> >>>> For your information, with the latest developments related to VirtIO,
>> >>>> I ran netperf a couple of times to see where network performance on
>> >>>> the guests currently stands.
>> >>>>
>> >>>> The test was to run netperf -H "ip of LAN node", which measures TCP
>> >>>> throughput for 10 seconds.
>> >>>>
>> >>>> x86 - x86: ~96 Mbps - reference between two different computers
>> >>>> ARM Host - x86: ~80 Mbps
>> >>>> ARM Guest - x86: ~ 2 Mbps - emulation
>> >>>> ARM Guest - x86: ~74 Mbps - VirtIO
>> >>>>
>> >>>> From these we conclude that:
>> >>>>
>> >>>> As expected, x86 to x86 communication can reach the limit of the
>> >>>> 100 Mbps LAN.
>> >>>> The ARM board itself does not seem able to saturate the LAN.
>> >>>> Network emulation in QEMU is more than just slow (expected).
>> >>>>
>> >>>> Why is this expected? This performance drop is quite terrifying.
>> >>>>
>> >>> I think he means expected as in, we already know we have this
>> >>> terrifying problem. I'm looking into this right now, and I believe
>> >>> Marc is also on this.
>> >>
>> >> Ah, good :). Since you are on a dual-core machine with lots of traffic,
>> >> you should get almost no vmexits for virtio queue processing.
>> >>
>> >> Since we know that this is a fast case, the big difference to emulated
>> >> devices is the exits. So I'd search there :).
>> >
>> > There's a number of things we're aware of:
>> >
>> > - The emulated device is pure PIO. Using this kind of device is always
>> > going to suck, and even more so on KVM. We could use a "less braindead"
>> > model (some DMA-capable device), but as we depart from the real VE
>> > board, I'd rather go virtio all the way.
>>
>> Well, you should try to get comparable performance numbers. If that means
>> exposing that braindead device on an x86 vm and turning off coalesced
>> mmio, so be it.
>>
>> The alternative is to expose PCI into the guest, even when it's only
>> half-working. It's not meant for production, but to get performance
>> comparison data that you can sanity-check against x86 to see if (and
>> what) you're doing wrong.
>>
>> >
>> > - Our exit path is painfully long. We could maybe make it more efficient
>> > by being more lazy, and delay the switch of some of the state until we
>> > get preempted (VFP state, for example). Not sure how much of an
>> > improvement this would make, though.
>>
>> Lazy FP switching bought me quite a significant speedup on ppc. It won't
>> help you here though. User space exits need to restore that state
>> regardless, unless the guest hasn't used FP. Then you can save yourself
>> both directions of the FP state switch.
>>
>
> VFP switches are already being done lazily, however: the switch happens
> only when the guest actually uses some FP or Advanced SIMD instructions,
> and not on entry. In fact, when we lazy-switch the VFP registers, we
> return directly from Hyp mode interrupt context to the guest, without
> really giving the host the chance to do much. We do not go all the way
> back to the ioctl loop.
>
> However, on the next vm exit we will switch back to the host state
> regardless of whether the host is going to use VFP or not, but I don't
> think optimizing that would offer any big benefits, especially for I/O.
>
> Of course things could always be improved; for example, we could try
> handling the VFP/NEON control registers separately and emulate them,
> instead of doing a complete switch every time the guest does something
> simple, e.g. only to check whether VFP is enabled. But we would need some
> numbers to know whether this makes things better or worse, since it
> implies another exit.
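For illustration, here is a rough sketch in plain C of the split described
in the paragraph above: guest accesses to a VFP control register are
emulated against a shadow copy, and the full lazy switch of the register
bank only happens when the guest executes a real FP/NEON data instruction.
All of the type, field and helper names below are invented for the example
(this is not the actual arch/arm/kvm code), and the sketch deliberately
ignores the cost of the extra trap that Antonios mentions.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical per-vcpu FP state; not the real kvm/arm layout. */
struct vfp_bank {
    uint64_t d[32];              /* D0-D31 */
    uint32_t fpexc, fpscr;
};

struct vcpu_fp {
    struct vfp_bank guest;       /* saved copy of the guest's VFP/NEON bank */
    struct vfp_bank host;        /* host bank, saved while the guest owns the HW */
    uint32_t shadow_fpexc;       /* emulated FPEXC value the guest observes */
    bool guest_bank_loaded;      /* is the guest bank currently in hardware? */
};

/* Stubs standing in for the low-level save/restore assembly helpers. */
static void hw_save_bank(struct vfp_bank *b)    { (void)b; }
static void hw_restore_bank(struct vfp_bank *b) { (void)b; }

/* Guest read/write of FPEXC: emulate against the shadow copy, no bank switch. */
static uint32_t trap_fpexc_read(struct vcpu_fp *fp)
{
    return fp->shadow_fpexc;
}

static void trap_fpexc_write(struct vcpu_fp *fp, uint32_t val)
{
    fp->shadow_fpexc = val;
}

/* Guest executed a real VFP/NEON data instruction: do the lazy switch now. */
static void trap_fp_data_insn(struct vcpu_fp *fp)
{
    if (!fp->guest_bank_loaded) {
        hw_save_bank(&fp->host);
        hw_restore_bank(&fp->guest);
        fp->guest_bank_loaded = true;
    }
    /* From here on the guest uses the hardware registers directly. */
}

int main(void)
{
    struct vcpu_fp fp = { .shadow_fpexc = 0, .guest_bank_loaded = false };

    trap_fpexc_write(&fp, 1u << 30);   /* guest sets FPEXC.EN: cheap, no switch */
    printf("FPEXC read back: 0x%08x\n", (unsigned)trap_fpexc_read(&fp));

    trap_fp_data_insn(&fp);            /* first real FP use: full bank switch */
    printf("guest bank loaded: %d\n", fp.guest_bank_loaded);
    return 0;
}

Whether this wins anything in practice depends on how often guests poke
FPEXC without doing real FP work, which is exactly the number the thread
says is still missing.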
>
> Best regards,
> Antonios
>
>>
>> >
>> > - Our memory bandwidth is reduced by the number of TLB entries we waste
>> > by not using section mappings instead of 4kB pages. Running hackbench on
>> > a guest shows quite a slowdown that should mostly go away if/when
>> > userspace switches to huge pages as backing store. I expect virtio to
>> > suffer from the same problem.
>>
>> That one should be in a completely different ballpark. I'd be very
>> surprised if you get more than 10% slowdowns in TLB miss intensive
>> workloads. Definitely not as low of a hanging fruit as we see here.
>>
>> Alex
>>
>> >
>> > Once we've addressed these points, I expect the IO performance to become
>> > better. At least by some margin.
>> >

I ran perf a bit yesterday and it seems we spend approx. 5% of the vcpu
thread's time on vgic save/restore. I don't know if this can be optimized
at all though.

-Christoffer
_______________________________________________
kvmarm mailing list
kvmarm@xxxxxxxxxxxxxxxxxxxxx
https://lists.cs.columbia.edu/cucslists/listinfo/kvmarm
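Going back to Marc's point above about huge pages as backing store: below is
a minimal sketch, in plain C, of the userspace side of that idea, i.e.
asking the kernel to back an anonymous "guest RAM" region with transparent
huge pages via madvise(MADV_HUGEPAGE). The region size and all names are
made up for the example; this is not the actual QEMU code, and whether the
hypervisor can then map the region with larger stage-2 entries is a separate
kernel-side question.

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    size_t ram_size = 256UL << 20;          /* hypothetical 256 MB of guest RAM */

    /* Anonymous mapping standing in for the guest memory backing store. */
    void *ram = mmap(NULL, ram_size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (ram == MAP_FAILED) {
        perror("mmap");
        return EXIT_FAILURE;
    }

    /* Hint that this range should be backed by transparent huge pages. */
    if (madvise(ram, ram_size, MADV_HUGEPAGE) != 0)
        perror("madvise(MADV_HUGEPAGE)");   /* non-fatal: falls back to 4kB pages */

    memset(ram, 0, ram_size);               /* touch the range so it gets populated */

    munmap(ram, ram_size);
    return EXIT_SUCCESS;
}

An explicit hugetlbfs mapping (MAP_HUGETLB) would be the stricter
alternative, guaranteeing the large page size instead of merely hinting
at it.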