Anthony Liguori wrote:
> Gregory Haskins wrote:
>> Anthony Liguori wrote:
>>
>>> I'm surprised so much effort is going into this, is there any
>>> indication that this is even close to a bottleneck in any circumstance?
>>
>> Yes.  Each 1us of overhead is a 4% regression in something as trivial as
>> a 25us UDP/ICMP rtt "ping".
>
> It wasn't 1us, it was 350ns or something around there (i.e ~1%).

I wasn't referring to "it".  I chose my words carefully.  Let me rephrase
for your clarity: *each* 1us of overhead introduced into the signaling path
is a ~4% latency regression for a round trip on a high-speed network (note
that this can also affect throughput at some level, too).  I believe this
point has been lost on you from the very beginning of the vbus discussions.

I specifically generalized my statement above because #1, I assume everyone
here is smart enough to convert that nice round unit into the relevant
figure, and #2, there are multiple potential latency sources at play which
we need to factor in when looking at the big picture.  For instance, the
difference between a PF exit and an IO exit (2.58us on x86, to be precise).
Or whether you need to take a heavy-weight exit.  Or a context switch to
qemu, then the kernel, back to qemu, and back to the vcpu.  Or acquiring a
mutex.  Or getting head-of-lined behind the VGA model's IO.

I know you wish that this whole discussion would just go away, but these
little "300ns here, 1600ns there" costs really add up in aggregate despite
your dismissive attitude towards them, and it doesn't take much to affect
the results in a measurable way.  As stated, each 1us costs ~4%.  My
motivation is to reduce as many of these sources as possible.

So, yes, the delta from PIO to HC is 350ns.  Yes, this is a ~1.4%
improvement.  So what?  It's still an improvement.  If that improvement
were free, would you object?  And we all know that this change isn't "free"
because we have to change some code (+128/-0, to be exact).  But what is it
specifically you are objecting to in the first place?  Adding hypercall
support as a pv_ops primitive isn't exactly hard or complex, or even very
much code (a rough sketch of what I mean is below).

Besides, I've already clearly stated multiple times (including in this very
thread) that I agree I am not yet sure whether the 350ns/1.4% improvement
alone is enough to justify a change.  So if you are somehow trying to make
me feel silly by pointing out the "~1%" above, you are being ridiculous.
Rather, I was simply answering your question as to whether these latency
sources are a real issue.  The answer is "yes" (assuming you care about
latency), and I gave you a specific example and a method to quantify the
impact.

It is duly noted that you do not care about this type of performance, but
you also need to realize that your "blessing" or acknowledgment/denial of
the problem domain has _zero_ bearing on whether the domain exists, or
whether there are others out there who do care about it.  Sorry.
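To put the "not very much code" comment in concrete terms, here is roughly
the shape of a pv_ops-style hypercall hook.  This is only an illustrative
sketch with made-up names (io_notify_ops, signal_host, the fixed vector);
it is not the actual patch, which would also have to handle AMD's vmmcall
and a PIO/MMIO fallback for hypervisors with no hypercall instruction:

/* Illustrative sketch only -- not the patch under discussion.  The guest
 * installs a hypervisor-specific "notify" routine once at boot, and the
 * fast path calls it without caring whether the transport is vmcall,
 * vmmcall, PIO, or something else entirely.
 */
struct io_notify_ops {
	long (*notify)(unsigned long nr, unsigned long arg);
};

/* KVM/VT-x flavour: a single vmcall exit, no PIO/MMIO decode pass.
 * (Only meaningful when actually running inside a guest.) */
static long kvm_vmcall_notify(unsigned long nr, unsigned long arg)
{
	long ret;

	asm volatile("vmcall"
		     : "=a" (ret)
		     : "a" (nr), "b" (arg)
		     : "memory");
	return ret;
}

static struct io_notify_ops notify_ops = {
	.notify = kvm_vmcall_notify,	/* chosen at init in a real system */
};

/* What a virtual NIC's tx path would do to kick the host. */
static inline long signal_host(unsigned long queue_token)
{
	return notify_ops.notify(1 /* made-up vector */, queue_token);
}

The point is simply that the fast path boils down to one indirect call plus
one trapping instruction, and everything hypervisor-specific is confined to
whichever .notify gets installed at init time.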
>
>> for request-response, this is generally for *every* packet since you
>> cannot exploit buffering/deferring.
>>
>> Can you back up your claim that PPC has no difference in performance
>> with an MMIO exit and a "hypercall" (yes, I understand PPC has no "VT"
>> like instructions, but clearly there are ways to cause a trap, so
>> presumably we can measure the difference between a PF exit and something
>> more explicit)?
>
> First, the PPC that KVM supports performs very poorly relatively
> speaking because it receives no hardware assistance.

So wouldn't that be making the case that it could use as much help as
possible?

> this is not the right place to focus wrt optimizations.

Odd choice of words.  I am advocating the opposite (a broad solution for
many arches and many platforms, i.e. hypervisors), and therefore I am not
"focused" on it (or really on any one arch) per se.  I am _worried_,
however, that we could be overlooking PPC (as an example) if we ignore the
disparity between MMIO and HC, since other higher-performance options like
PIO are not available there.  The goal on this particular thread is to come
up with an IO interface that works reasonably well across as many
hypervisors as possible, and MMIO/PIO do not appear to fit that bill (at
least not without tunneling them over HCs).

If I am guilty of focusing anywhere too much, it would be x86, since that
is the only development platform I have readily available.

>
> And because there's no hardware assistance, there simply isn't a
> hypercall instruction.  Are PFs the fastest type of exits?  Probably
> not but I honestly have no idea.  I'm sure Hollis does though.
>
> Page faults are going to have tremendously different performance
> characteristics on PPC too because it's a software managed TLB.
> There's no page table lookup like there is on x86.

The difference between MMIO and "HC", and whether it is cause for concern,
will continue to be pure speculation until we can find someone with a PPC
box willing to run some numbers.  I will point out that we both seem to
theorize that PFs will perform worse than the alternatives, so it would
seem you are actually making my point for me.

>
> As a more general observation, we need numbers to justify an
> optimization, not to justify not including an optimization.
>
> In other words, the burden is on you to present a scenario where this
> optimization would result in a measurable improvement in a real world
> work load.

I have already done this.  You seem to have chosen to ignore my statements
and results, but if you insist on rehashing:

I started this project by analyzing system traces and finding some of the
various bottlenecks in comparison to a native host.  Throughput was already
pretty decent, but latency was pretty bad (and recently got *really* bad,
but I know you already have a handle on what's causing that).  I digress...
one of the conclusions of that research was that I wanted to focus on
building an IO subsystem designed to minimize the quantity of exits,
minimize the cost of each exit, and shorten the end-to-end signaling path
to achieve optimal performance.  I also wanted to build a system that was
extensible enough to work with a variety of client types, on a variety of
architectures, etc., so we would only need to solve these problems "once".

The end result was vbus, and the first working example was venet.  The
measured performance data for this work was as follows (802.x network,
9000 byte MTU, two 8-core x86_64s connected back to back with Chelsio T3
10GE via crossover):

Bare metal       : tput = 9717Mb/s, round-trip = 30396pps (33us rtt)
Virtio-net (PCI) : tput = 4578Mb/s, round-trip = 249pps   (4016us rtt)
Venet (VBUS)     : tput = 5802Mb/s, round-trip = 15127pps (66us rtt)

For more details: http://lkml.org/lkml/2009/4/21/408

You can download this today and run it, review it, compare it.  Whatever
you want.

As part of that work, I measured IO performance in KVM and found HCs to be
the superior performer.  You can find these results here:
http://developer.novell.com/wiki/index.php/WhyHypercalls
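Since the percentages and the rtt/pps figures above are just different
views of the same simple arithmetic, here is a quick back-of-envelope
sketch.  Nothing below is measured; it only re-derives the numbers already
quoted in this thread (25us ping, 350ns PIO->HC delta, 2.58us PF-vs-IO exit
delta, 66us venet rtt):

/* Back-of-envelope only: cost of a fixed per-exit overhead as a fraction
 * of one request/response round trip, and the rtt -> pps conversion when
 * exactly one request is in flight. */
#include <stdio.h>

int main(void)
{
	const double rtt_us = 25.0;	/* high-speed UDP/ICMP ping */
	const double overhead_ns[] = { 350.0, 1000.0, 2580.0 };
	size_t i;

	for (i = 0; i < sizeof(overhead_ns) / sizeof(overhead_ns[0]); i++)
		printf("%7.0f ns overhead on a %.0f us rtt -> %.1f%% regression\n",
		       overhead_ns[i], rtt_us,
		       100.0 * overhead_ns[i] / (rtt_us * 1000.0));

	/* one request in flight: pps is simply 1e6 / rtt_us; the 66us venet
	 * rtt above lands right around the ~15k pps that was measured */
	printf("66 us rtt -> %.0f round trips/sec\n", 1e6 / 66.0);

	return 0;
}

Which is exactly why I keep saying that a few hundred nanoseconds per exit
stops being noise once you are down in the tens-of-microseconds rtt range.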
Without having access to platforms other than x86, but with an
understanding of computer architecture, I speculate that the difference
should be even more profound everywhere else, given the absence of a PIO
primitive.  And even on the platform that should yield the least benefit
(x86), the gain (~1.4%) is not huge, but it's not zero either.  Therefore,
my data and findings suggest that this is not a bad optimization to
consider, IMO.  My final results above do not indicate to me that I was
completely wrong in my analysis.

Now, I know you have been quick in the past to dismiss my efforts, and to
claim you can get the same results without needing the various tricks and
optimizations I uncovered.  But quite frankly, until you post some patches
for community review and comparison (as I have done), it's just meaningless
talk.

Perhaps you are truly unimpressed with my results and will continue to
insist that my work, including my final results, is "virtually
meaningless".  Or perhaps you have an agenda.  You can keep working against
me and trying to block anything I suggest by coming up with what appears to
be any excuse you can find, making rude replies on email threads and snide
comments on IRC, etc.  It's simply not necessary.

Alternatively, you can work _with_ me to help try to improve KVM and Linux
(e.g. I still need someone to implement a virtio-net backend, and who knows
it better than you).  The choice is yours.  But let's cut the BS, because
it's counterproductive and, frankly, getting old.

Regards,
-Greg