On Thu, Oct 11, 2012 at 6:12 AM, Antonios Motakis
<a.motakis@xxxxxxxxxxxxxxxxxxxxxx> wrote:
> Sorry for a repost, pressed reply instead of reply to all.
>
> On Thu, Oct 11, 2012 at 11:55 AM, Alexander Graf <agraf@xxxxxxx> wrote:
>>
>> On 11.10.2012, at 11:46, Marc Zyngier <marc.zyngier@xxxxxxx> wrote:
>>
>> > On 10/10/12 19:58, Alexander Graf wrote:
>> >>
>> >> On 10.10.2012, at 20:52, Christoffer Dall
>> >> <c.dall@xxxxxxxxxxxxxxxxxxxxxx> wrote:
>> >>
>> >>> On Wed, Oct 10, 2012 at 2:50 PM, Alexander Graf <agraf@xxxxxxx> wrote:
>> >>>>
>> >>>> On 10.10.2012, at 20:39, Alexander Spyridakis
>> >>>> <a.spyridakis@xxxxxxxxxxxxxxxxxxxxxx> wrote:
>> >>>>
>> >>>> For your information, with the latest developments related to VirtIO,
>> >>>> I ran netperf a couple of times to see where network performance on
>> >>>> the guests currently stands.
>> >>>>
>> >>>> The test was to run netperf -H "ip of LAN node", which measures TCP
>> >>>> throughput for 10 seconds.
>> >>>>
>> >>>> x86 - x86: ~96 Mbps - reference between two different computers
>> >>>> ARM Host - x86: ~80 Mbps
>> >>>> ARM Guest - x86: ~ 2 Mbps - emulation
>> >>>> ARM Guest - x86: ~74 Mbps - VirtIO
>> >>>>
>> >>>> From these we conclude that:
>> >>>>
>> >>>> As expected, x86 to x86 communication can reach the limit of the
>> >>>> 100 Mbps LAN.
>> >>>> The ARM board itself does not seem able to saturate the LAN.
>> >>>> Network emulation in QEMU is more than just slow (expected).
>> >>>>
>> >>>> Why is this expected? This performance drop is quite terrifying.
>> >>>>
>> >>> I think he means expected as in, we already know we have this
>> >>> terrifying problem. I'm looking into this right now, and I believe
>> >>> Marc is also on this.
>> >>
>> >> Ah, good :). Since you are on a dual-core machine with lots of traffic,
>> >> you should get almost no vmexits for virtio queue processing.
>> >>
>> >> Since we know that this is a fast case, the big difference to emulated
>> >> devices is the exits. So I'd search there :).
>> >
>> > There's a number of things we're aware of:
>> >
>> > - The emulated device is pure PIO. Using this kind of device is always
>> > going to suck, and even more so on KVM. We could use a "less braindead"
>> > model (some DMA-capable device), but as we depart from the real VE
>> > board, I'd rather go virtio all the way.
>>
>> Well, you should try to get comparable performance numbers. If that means
>> exposing that braindead device on an x86 vm and turning off coalesced
>> mmio, so be it.
>>
>> The alternative is to expose PCI into the guest, even when it's only
>> half-working. It's not meant for production, but to get performance
>> comparison data that you can sanity-check against x86 to see if (and
>> what) you're doing wrong.
>>
>> >
>> > - Our exit path is painfully long. We could maybe make it more efficient
>> > by being more lazy, and delay the switch of some of the state until we
>> > get preempted (VFP state, for example). Not sure how much of an
>> > improvement this would make, though.
>>
>> Lazy FP switching bought me quite a significant speedup on ppc. It won't
>> help you here though. User space exits need to restore that state
>> regardless, unless the guest hasn't used FP. Then you can save yourself
>> both directions of the FP state switch.
>>
>
> VFP switches are already being done lazily, however: the switch happens
> only when the guest actually uses some FP or Advanced SIMD instructions,
> and not on entry. In fact, when we lazy-switch the VFP registers, we
> return directly from Hyp mode interrupt context to the guest, without
> really giving the host the chance to do much. We do not go all the way
> back to the ioctl loop.
>
> However, on the next vm exit we will switch back to the host state
> regardless of whether the host is going to use VFP or not, but I don't
> think optimizing that would offer any big benefits, especially for I/O.
>
> Of course things could always be improved; for example, we could try
> handling the VFP/NEON control registers separately and emulate them,
> instead of doing a complete switch every time the guest does something
> simple, e.g. only to check whether VFP is enabled. But we would need some
> numbers to know whether this makes things better or worse, since it
> implies another exit.
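For illustration, here is a rough sketch in plain C of the split described
in the paragraph above: guest accesses to a VFP control register are
emulated against a shadow copy, and the full lazy switch of the register
bank only happens when the guest executes a real FP/NEON data instruction.
All of the type, field and helper names below are invented for the example
(this is not the actual arch/arm/kvm code), and the sketch deliberately
ignores the cost of the extra trap that Antonios mentions.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical per-vcpu FP state; not the real kvm/arm layout. */
struct vfp_bank {
    uint64_t d[32];              /* D0-D31 */
    uint32_t fpexc, fpscr;
};

struct vcpu_fp {
    struct vfp_bank guest;       /* saved copy of the guest's VFP/NEON bank */
    struct vfp_bank host;        /* host bank, saved while the guest owns the HW */
    uint32_t shadow_fpexc;       /* emulated FPEXC value the guest observes */
    bool guest_bank_loaded;      /* is the guest bank currently in hardware? */
};

/* Stubs standing in for the low-level save/restore assembly helpers. */
static void hw_save_bank(struct vfp_bank *b)    { (void)b; }
static void hw_restore_bank(struct vfp_bank *b) { (void)b; }

/* Guest read/write of FPEXC: emulate against the shadow copy, no bank switch. */
static uint32_t trap_fpexc_read(struct vcpu_fp *fp)
{
    return fp->shadow_fpexc;
}

static void trap_fpexc_write(struct vcpu_fp *fp, uint32_t val)
{
    fp->shadow_fpexc = val;
}

/* Guest executed a real VFP/NEON data instruction: do the lazy switch now. */
static void trap_fp_data_insn(struct vcpu_fp *fp)
{
    if (!fp->guest_bank_loaded) {
        hw_save_bank(&fp->host);
        hw_restore_bank(&fp->guest);
        fp->guest_bank_loaded = true;
    }
    /* From here on the guest uses the hardware registers directly. */
}

int main(void)
{
    struct vcpu_fp fp = { .shadow_fpexc = 0, .guest_bank_loaded = false };

    trap_fpexc_write(&fp, 1u << 30);   /* guest sets FPEXC.EN: cheap, no switch */
    printf("FPEXC read back: 0x%08x\n", (unsigned)trap_fpexc_read(&fp));

    trap_fp_data_insn(&fp);            /* first real FP use: full bank switch */
    printf("guest bank loaded: %d\n", fp.guest_bank_loaded);
    return 0;
}

Whether this wins anything in practice depends on how often guests poke
FPEXC without doing real FP work, which is exactly the number the thread
says is still missing.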
>
> Best regards,
> Antonios
>
>>
>> >
>> > - Our memory bandwidth is reduced by the number of TLB entries we waste
>> > by not using section mappings instead of 4kB pages. Running hackbench on
>> > a guest shows quite a slowdown that should mostly go away if/when
>> > userspace switches to huge pages as backing store. I expect virtio to
>> > suffer from the same problem.
>>
>> That one should be in a completely different ballpark. I'd be very
>> surprised if you get more than 10% slowdowns in TLB miss intensive
>> workloads. Definitely not as low of a hanging fruit as we see here.
>>
>> Alex
>>
>> >
>> > Once we've addressed these points, I expect the IO performance to become
>> > better. At least by some margin.
>> >

I ran perf a bit yesterday and it seems we spend approx. 5% of the vcpu
thread's time on vgic save/restore. I don't know if this can be optimized
at all though.

-Christoffer
_______________________________________________
kvmarm mailing list
kvmarm@xxxxxxxxxxxxxxxxxxxxx
https://lists.cs.columbia.edu/cucslists/listinfo/kvmarm
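Going back to Marc's point above about huge pages as backing store: below is
a minimal sketch, in plain C, of the userspace side of that idea, i.e.
asking the kernel to back an anonymous "guest RAM" region with transparent
huge pages via madvise(MADV_HUGEPAGE). The region size and all names are
made up for the example; this is not the actual QEMU code, and whether the
hypervisor can then map the region with larger stage-2 entries is a separate
kernel-side question.

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    size_t ram_size = 256UL << 20;          /* hypothetical 256 MB of guest RAM */

    /* Anonymous mapping standing in for the guest memory backing store. */
    void *ram = mmap(NULL, ram_size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (ram == MAP_FAILED) {
        perror("mmap");
        return EXIT_FAILURE;
    }

    /* Hint that this range should be backed by transparent huge pages. */
    if (madvise(ram, ram_size, MADV_HUGEPAGE) != 0)
        perror("madvise(MADV_HUGEPAGE)");   /* non-fatal: falls back to 4kB pages */

    memset(ram, 0, ram_size);               /* touch the range so it gets populated */

    munmap(ram, ram_size);
    return EXIT_SUCCESS;
}

An explicit hugetlbfs mapping (MAP_HUGETLB) would be the stricter
alternative, guaranteeing the large page size instead of merely hinting
at it.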