Re: [RFC PATCH 0/3] generic hypercall support

Anthony Liguori <anthony@xxxxxxxxxxxxx> · Mon, 11 May 2009 12:31:03 -0500

Gregory Haskins wrote:
I specifically generalized my statement above because #1 I assume
everyone here is smart enough to convert that nice round unit into the
relevant figure.  And #2, there are multiple potential latency sources
at play which we need to factor in when looking at the big picture.  For
instance, the difference between PF exit, and an IO exit (2.58us on x86,
to be precise).  Or whether you need to take a heavy-weight exit.  Or a
context switch to qemu, then the kernel, back to qemu, and back to the
vcpu.  Or acquire a mutex.  Or get head-of-lined on the VGA model's IO.
I know you wish that this whole discussion would just go away, but these
little "300ns here, 1600ns there" really add up in aggregate despite
your dismissive attitude towards them.  And it doesn't take much to
affect the results in a measurable way.  As stated, each 1us costs ~4%. 
My motivation is to reduce as many of these sources as possible.

So, yes, the delta from PIO to HC is 350ns.  Yes, this is a ~1.4%
improvement.  So what?  It's still an improvement.  If that improvement
were for free, would you object?  And we all know that this change isn't
"free" because we have to change some code (+128/-0, to be exact).  But
what is it specifically you are objecting to in the first place?  Adding
hypercall support as a pv_ops primitive isn't exactly hard or complex,
or even very much code.

Where does 25us come from?  The numbers you post below are 33us and
66us.  This is part of what's frustrating me in this thread.  Things are 
way too theoretical.  Saying that "if packet latency was 25us, then it 
would be a 1.4% improvement" is close to misleading.  The numbers you've 
posted are also measuring on-box speeds.  What really matters are
off-box latencies, and those are larger, which shrinks the relative gain
further.

IIUC, if you switched vbus to using PIO today, you would go from 66us
to 65.65us, which you'd round to 66us for on-box latencies.  Even if you
didn't round, it's a 0.5% improvement in latency.
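
To make the arithmetic above concrete, here is a minimal sketch that computes what fraction of a round trip a fixed 350ns saving represents.  The 350ns delta and the 25us, 33us and 66us round-trip times are the figures quoted in this thread; the function name is only illustrative.

```python
# Relative improvement from shaving a fixed delta off a round-trip time.
DELTA_NS = 350  # PIO-to-hypercall delta quoted in this thread


def improvement_pct(rtt_us, delta_ns=DELTA_NS):
    """Percentage of the round trip saved by removing delta_ns."""
    return 100.0 * (delta_ns / 1000.0) / rtt_us


# 25us is the hypothetical packet latency; 33us and 66us are the posted RTTs.
for rtt_us in (25, 33, 66):
    print(f"{rtt_us}us rtt: {improvement_pct(rtt_us):.2f}% saved")
```

The same 350ns is about 1.4% of a 25us round trip but only about 0.5% of the 66us actually measured, which is the point being made here.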

Adding hypercall support as a pv_ops primitive is adding a fair bit of 
complexity.  You need a hypercall fd mechanism to plumb this down to
userspace; otherwise, you can't support migration from in-kernel backend
to non in-kernel backend.  You need some way to allocate hypercalls to 
particular devices which so far, has been completely ignored.  I've 
already mentioned why hypercalls are also unfortunate from a guest 
perspective.  They require kernel patching and this is almost certainly 
going to break at least Vista as a guest.  Certainly Windows 7.

So it's not at all fair to trivialize the complexity introduced here.
I'm simply asking for justification to introduce this complexity.  I 
don't see why this is unfair for me to ask.

As a more general observation, we need numbers to justify an
optimization, not to justify not including an optimization.

In other words, the burden is on you to present a scenario where this
optimization would result in a measurable improvement in a real world
work load.

I have already done this.  You seem to have chosen to ignore my
statements and results, but if you insist on rehashing:

I started this project by analyzing system traces and finding some of
the various bottlenecks in comparison to a native host.  Throughput was
already pretty decent, but latency was pretty bad (and recently got
*really* bad, but I know you already have a handle on what's causing
that).  I digress...one of the conclusions of the research was that I
wanted to focus on building an IO subsystem designed to minimize the
quantity of exits, minimize the cost of each exit, and shorten the
end-to-end signaling path to achieve optimal performance.  I also wanted
to build a system that was extensible enough to work with a variety of
client types, on a variety of architectures, etc, so we would only need
to solve these problems "once".  The end result was vbus, and the first
working example was venet.  The measured performance data of this work
was as follows:

802.x network, 9000 byte MTU,  2 8-core x86_64s connected back to back
with Chelsio T3 10GE via crossover.

Bare metal       : tput = 9717Mb/s, round-trip = 30396pps (33us rtt)
Virtio-net (PCI) : tput = 4578Mb/s, round-trip =   249pps (4016us rtt)
Venet (VBUS)     : tput = 5802Mb/s, round-trip = 15127pps (66us rtt)

For more details:  http://lkml.org/lkml/2009/4/21/408
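
The rtt column in the results above follows from the round-trip rate.  A minimal sketch, assuming the pps figure counts completed round trips per second with a single request outstanding at a time (an assumption, the test methodology is not spelled out here):

```python
def rtt_us(round_trips_per_sec):
    """Average round-trip time in microseconds for a given round-trip rate."""
    return 1_000_000 / round_trips_per_sec


# Round-trip rates taken from the results quoted above.
results = {
    "Bare metal": 30396,
    "Virtio-net (PCI)": 249,
    "Venet (VBUS)": 15127,
}

for name, pps in results.items():
    print(f"{name}: {rtt_us(pps):.0f}us rtt")
```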

Sending out a massive infrastructure change that does things wildly 
differently from how they're done today without any indication of why 
those changes were necessary is disruptive.

If you could characterize all of the changes that vbus makes that are 
different from virtio, demonstrating at each stage why the change 
mattered and what benefit it brought, then we'd be having a completely 
different discussion.  I have no problem throwing away virtio today if 
there's something else better.

That's not what you've done though.  You wrote a bunch of code without 
understanding why virtio does things the way it does and then dropped it 
all on the list.  This isn't necessarily a bad exercise, but there's a 
ton of work necessary to determine which things vbus does differently 
actually matter.  I'm not saying that you shouldn't have done vbus, but 
I'm saying there's a bunch of analysis work that you haven't done that 
needs to be done before we start making any changes in upstream code.

I've been trying to demonstrate why I don't think hypercalls are an
important part of vbus from a performance perspective.  The frustration I
have with this series is that you seem unwilling to compromise on any
aspect of vbus design.  I understand you've made your decisions in vbus
for some reasons and you think the way you've done things is better, but
that's not enough.  We have virtio today, it provides greater 
functionality than vbus does, it supports multiple guest types, and it's 
gotten quite a lot of testing.  It has its warts, but most things that 
have been around for some time do.

Now I know you have been quick in the past to dismiss my efforts, and to
claim you can get the same results without needing the various tricks
and optimizations I uncovered.  But quite frankly, until you post some
patches for community review and comparison (as I have done), it's just
meaningless talk.

I can just as easily say that until you post a full series that covers 
all of the functionality that virtio has today, vbus is just meaningless 
talk.  But I'm trying not to be dismissive in all of this because I do 
want to see you contribute to the KVM paravirtual IO infrastructure.  
Clearly, you have useful ideas.

We can't just go rewriting things without a clear understanding of why 
something's better.  What's missing is a detailed analysis of what 
virtio-net does today and what vbus does so that it's possible to draw 
some conclusions.

For instance, this could look like:

For a single packet delivery:

150ns are spent from PIO operation
320ns are spent in heavy-weight exit handler
150ns are spent transitioning to userspace
5us are spent contending on qemu_mutex
30us are spent copying data in tun/tap driver
40us are spent waiting for RX
...

For vbus, it would look like:

130ns are spent from HC instruction
100ns are spent signaling TX thread
...
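
As a rough sketch of how such a breakdown could be tabulated, the snippet below sums the per-stage costs and reports each stage's share.  The figures are the illustrative placeholders listed above, not measurements, and both lists are truncated ("...") in the message, so the totals cover only the stages that are named.  The `report` helper is hypothetical.

```python
# Illustrative per-stage costs in nanoseconds, copied from the lists above.
# These are placeholders from the message, not measured data.
virtio_net_ns = [
    ("PIO operation", 150),
    ("heavy-weight exit handler", 320),
    ("transition to userspace", 150),
    ("contending on qemu_mutex", 5_000),
    ("copying data in tun/tap driver", 30_000),
    ("waiting for RX", 40_000),
]
vbus_ns = [
    ("HC instruction", 130),
    ("signaling TX thread", 100),
]


def report(name, stages):
    """Print each stage's share of the listed total and return the total."""
    total = sum(ns for _, ns in stages)
    print(f"{name}: {total}ns across listed stages")
    # Largest contributors first, so the dominant cost is obvious.
    for stage, ns in sorted(stages, key=lambda s: s[1], reverse=True):
        print(f"  {stage:32s} {ns:>6d}ns  {100.0 * ns / total:5.1f}%")
    return total


report("virtio-net", virtio_net_ns)
report("vbus", vbus_ns)
```

Sorting by cost puts the dominant stages first, which is the kind of view that shows whether a few hundred nanoseconds on the exit path matters next to tens of microseconds elsewhere.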

But single packet delivery is just one part of the puzzle.  Bulk 
transfers are also important.  CPU consumption is important.  How we 
address things like live migration, non-privileged user initialization, 
and userspace plumbing are all also important.

Right now, the whole discussion around this series is wildly speculative
and quite frankly, counterproductive.  A few RTT benchmarks are not
sufficient to make any kind of forward progress here.  I certainly like 
rewriting things as much as anyone else, but you need a substantial 
amount of justification for it that so far hasn't been presented.

Do you understand what my concerns are and why I don't want to just 
switch to a new large infrastructure?

Do you feel like you understand what sort of data I'm looking for to 
justify the changes vbus is proposing to make?  Is this something you're
willing to do?  IMHO this is a prerequisite for any sort of merge
consideration.  The analysis of the virtio-net side of things is just as
important as the vbus side of things.

I've tried to explain this to you a number of times now and so far it 
doesn't seem like I've been successful.  If it isn't clear, please let 
me know.

Regards,

Anthony Liguori