On Thu, Dec 19, 2013 at 3:48 PM, Michael S. Tsirkin <mst@xxxxxxxxxx> wrote:
> On Thu, Dec 19, 2013 at 02:56:10PM +0200, Abel Gordon wrote:
>> On Thu, Dec 19, 2013 at 1:37 PM, Michael S. Tsirkin <mst@xxxxxxxxxx> wrote:
>> > On Thu, Dec 19, 2013 at 12:36:30PM +0200, Abel Gordon wrote:
>> >> On Thu, Dec 19, 2013 at 12:13 PM, Michael S. Tsirkin <mst@xxxxxxxxxx> wrote:
>> >> > On Thu, Dec 19, 2013 at 08:40:44AM +0200, Abel Gordon wrote:
>> >> >> On Wed, Dec 18, 2013 at 12:43 PM, Michael S. Tsirkin <mst@xxxxxxxxxx> wrote:
>> >> >> > On Tue, Dec 17, 2013 at 12:04:42PM +0200, Razya Ladelsky wrote:
>> >> >> >> Hi,
>> >> >> >>
>> >> >> >> Thank you all for your comments.
>> >> >> >> I'm sorry for taking this long to reply, I was away on vacation.
>> >> >> >>
>> >> >> >> It was a good, long discussion; many issues were raised, which we'd like
>> >> >> >> to address with the following proposed roadmap for the Elvis patches.
>> >> >> >> In general, we believe it would be best to start with patches that are
>> >> >> >> as simple as possible, providing the basic Elvis functionality,
>> >> >> >> and attend to the more complicated issues in subsequent patches.
>> >> >> >>
>> >> >> >> Here's the roadmap for the Elvis patches:
>> >> >> >
>> >> >> > Thanks for the follow-up. Some suggestions below.
>> >> >> > Please note the suggestions below merely represent thoughts on merging
>> >> >> > upstream. If as the first step you are content with keeping this work
>> >> >> > as out-of-tree patches, in order to have the freedom to experiment with
>> >> >> > interfaces and performance, please feel free to ignore them.
>> >> >> >
>> >> >> >> 1. Shared vhost thread for multiple devices.
>> >> >> >>
>> >> >> >> The way to go here, we believe, is to start with a patch having a shared
>> >> >> >> vhost thread for multiple devices of the SAME VM.
>> >> >> >> The next step/patch may be handling VMs belonging to the same cgroup.
>> >> >> >>
>> >> >> >> Finally, we need to extend the functionality so that the shared vhost
>> >> >> >> thread serves multiple VMs (not necessarily belonging to the same cgroup).
>> >> >> >>
>> >> >> >> There was a lot of discussion about the way to address the enforcement
>> >> >> >> of cgroup policies, and we will consider the various solutions in a
>> >> >> >> future patch.
>> >> >> >
>> >> >> > With respect to the upstream kernel,
>> >> >> > I'm not sure a bunch of changes just for the sake of guests with
>> >> >> > multiple virtual NIC cards makes sense.
>> >> >> > And I wonder how this step, in isolation, will affect e.g.
>> >> >> > multiqueue workloads.
>> >> >> > But I guess if the numbers are convincing, this can be mergeable.
>> >> >>
>> >> >> Even if you have a single multiqueue device, this change allows you to
>> >> >> create one vhost thread for all the queues, one vhost thread per queue,
>> >> >> or any other combination. I guess that depending on the workload and on
>> >> >> the system utilization (free cycles/cores, density) you would prefer to
>> >> >> use one or more vhost threads.
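(To make this concrete: conceptually, the shared worker is just one kthread
draining a single work list that any number of devices enqueue to. The
sketch below is illustrative C only - not the actual Elvis code, and the
elvis_worker names are made up - though the loop deliberately mirrors
vhost's existing per-device worker:

struct elvis_worker {
	spinlock_t lock;
	struct list_head work_list;	/* vhost_work items from all attached devices */
	struct task_struct *task;
};

static int elvis_worker_fn(void *data)
{
	struct elvis_worker *w = data;
	struct vhost_work *work;

	for (;;) {
		set_current_state(TASK_INTERRUPTIBLE);

		spin_lock_irq(&w->lock);
		work = list_first_entry_or_null(&w->work_list,
						struct vhost_work, node);
		if (work)
			list_del_init(&work->node);
		spin_unlock_irq(&w->lock);

		if (work) {
			__set_current_state(TASK_RUNNING);
			work->fn(work);		/* may belong to any attached device */
		} else if (kthread_should_stop()) {
			__set_current_state(TASK_RUNNING);
			break;
		} else {
			schedule();		/* sleep until some device queues work */
		}
	}
	return 0;
}

Thread-per-device, thread-per-queue or fully shared then just becomes a
question of which worker each virtqueue is attached to.)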
>> >> >
>> >> > That is already controllable from the guest though, which likely has a
>> >> > better idea about the workload.
>> >>
>> >> But the guest has no idea about what's going on in the host system
>> >> (e.g. other VMs' I/O, cpu utilization of the host cores...).
>> >
>> > But again, you want to do things per VM now so you will have no idea
>> > about other VMs, right?

Host cpu utilization could be a useful input.

>> Razya shared a roadmap. The first step was to support sharing a thread
>> for a single VM, but the goal is to later extend the mechanism to
>> support multiple VMs and cgroups.
>
> Yes, I got that. What I'm not sure of is whether this is just a
> development roadmap, or whether you expect to be able to merge things
> upstream in this order as well.
> If the latter, all I'm saying is that I think you are doing this in the
> wrong order: we'll likely have to merge first 4, then 3, then possibly
> 1+2 together - but maybe 1+2 will have to wait until cgroups are sorted
> out. That's just a hunch of course until you actually try to do it.
> If the former, most of my comments don't really apply.
>
>> > for some heuristics, I agree, but nothing prevents us from sending
>> > this info to the guest agent and controlling multiqueue based on that
>> > (kind of like balloon).
>>
>> IMHO, we should never share host-internal information (e.g. resource
>> utilization) with the guest. That's supposed to be confidential
>> information :)
>> The balloon is a bit different... kvm asks the guest OS to give (if
>> possible) some pages, but kvm never sends the balloon information about
>> the memory utilization of the host. If the guest wishes to send
>> information about its own memory consumption (like it does for MOM),
>> that's OK.
>>
>> So, the guest can share information with the host, but the host should
>> be the one to make the decisions. KVM should never share host
>> information with the guest.
>
> It's also easy to just tell the guest agent to turn multiqueue on/off
> if you have a mind to.
>
>> >> >> >> 2. Creation of vhost threads
>> >> >> >>
>> >> >> >> We suggested two ways of controlling the creation and removal of vhost
>> >> >> >> threads:
>> >> >> >> - statically determining the maximum number of virtio devices per
>> >> >> >>   worker via a kernel module parameter
>> >> >> >> - dynamically: a sysfs mechanism to add and remove vhost threads
>> >> >> >>
>> >> >> >> It seems that it would be simplest to take the static approach as a
>> >> >> >> first stage. At a second stage (next patch), we'll advance to
>> >> >> >> dynamically changing the number of vhost threads, using the static
>> >> >> >> module parameter only as a default value.
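(For the static stage, the knob could be as small as one module parameter.
A sketch only - the parameter name and default are my assumptions, not
taken from the patches:

static unsigned int devs_per_worker = 1;	/* 1 keeps today's thread-per-device behavior */
module_param(devs_per_worker, uint, 0444);
MODULE_PARM_DESC(devs_per_worker,
	"Maximum number of virtio devices served by one vhost worker");

It would be set at load time, e.g. "modprobe vhost devs_per_worker=4"
(module name depending on where the worker code lives), and later kept
only as a default once the dynamic sysfs interface exists.)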
>> >> >> >
>> >> >> > I'm not sure how independent this is from 1.
>> >> >> > With respect to the upstream kernel, introducing interfaces (which
>> >> >> > we'll have to maintain forever) just for the sake of guests with
>> >> >> > multiple virtual NIC cards does not look like a good tradeoff.
>> >> >> >
>> >> >> > So I'm unlikely to merge this upstream without making it useful
>> >> >> > cross-VM, and yes, this means isolation and accounting with cgroups
>> >> >> > need to work properly.
>> >> >>
>> >> >> Agree, but even if you use a single multiqueue device, having the
>> >> >> ability to use one thread to serve all the queues, or multiple threads
>> >> >> to serve all the queues, looks like a useful feature.
>> >> >
>> >> > Could be. At the moment, multiqueue is off by default because it causes
>> >> > regressions for some workloads as compared to a single queue.
>> >> > If we have heuristics in vhost that fix this by auto-tuning threading,
>> >> > that would be nice. But if you need to tune it manually anyway, then
>> >> > from an upstream perspective it does not seem to be worth it - you can
>> >> > just turn multiqueue on/off in the guest.
>> >>
>> >> I see. But we are again mixing up the policy and the mechanism.
>> >> We first need a mechanism to control the system, and then we need to
>> >> implement the policy to orchestrate it (whether it is implemented in
>> >> the kernel as part of vhost or outside in user-space).
>> >> I don't see why we should wait for a policy before upstreaming the
>> >> mechanism. If we upstream the mechanism in a manner where the defaults
>> >> do not affect today's vhost behavior, then it will be possible to play
>> >> with the policies and upstream them later.
>> >
>> > Well, it all hinges on whether it's in userspace actually.
>> > Interfaces to userspace must be maintained forever; there's no way to
>> > know they are right and make sense before they are used, so we must not
>> > merge the interface before there is userspace using it correctly.
>> >
>> > If it's in-kernel, we can merge a bit of unused code into the kernel
>> > just to make it easier for you to make progress, provided it's well
>> > isolated and doesn't make life much harder for others working on the
>> > same code-base.
>>
>> I see your concerns. We can also expose the mechanism as exported
>> symbols and implement the policy in a different kernel module.
>
> Just in case I didn't make this clear previously - my gut feeling is
> that the split between mechanism and policy is in the wrong place here.
> Policy is what to do, mechanism is how to do it.
> What is the policy in aid of? What is the "what to do" here? "Process
> packets more efficiently", right? Whether to poll and which threads
> process which packets sounds more like how to do it - a mechanism.
>
>> Is that OK? Or are you also concerned about exporting symbols to other
>> kernel modules because they may change as well?
>
> Sorry, I do not understand what is suggested here.

Export a set of control functions from the vhost kernel module so a new
kernel module can call these functions to orchestrate/manage vhost. In
other words, keep the control interface internal to the kernel so another
kernel module can use it to implement the policy.
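Roughly, the split could look like this - all names below are hypothetical,
just to show the shape, with the mechanism exported by vhost and the policy
in a separate module:

/* mechanism side, exported by vhost: */
int vhost_worker_attach_vq(struct vhost_worker *worker,
			   struct vhost_virtqueue *vq)
{
	/* detach vq from its current worker and serve it from @worker */
	mutex_lock(&vq->mutex);
	vq->worker = worker;		/* hypothetical field */
	mutex_unlock(&vq->mutex);
	return 0;
}
EXPORT_SYMBOL_GPL(vhost_worker_attach_vq);

/* policy side, a separate module importing the symbol: */
static int __init vhost_policy_init(void)
{
	/* e.g. start a timer that samples load and rebalances virtqueues
	 * across workers by calling vhost_worker_attach_vq() */
	return 0;
}
module_init(vhost_policy_init);
MODULE_LICENSE("GPL");

This keeps the user-visible interface surface at zero until a policy
actually exists, and exported symbols can still change between kernel
versions.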
>> >> >> >> Regarding cmwq, it is an interesting mechanism, which we need to
>> >> >> >> explore further. At the moment we prefer not to change the vhost
>> >> >> >> model to use cmwq, as some of the issues that were discussed, such as
>> >> >> >> cgroups, are not supported by cmwq, and this adds more complexity.
>> >> >> >> However, we'll look further into it, and consider it at a later stage.
>> >> >> >
>> >> >> > Hmm, that's still assuming some smart management tool configuring this
>> >> >> > correctly. Can't this be determined automatically depending on the
>> >> >> > workload? This is what the cmwq suggestion was really about: detect
>> >> >> > that we need more threads and spawn them.
>> >> >> > It's less about sharing the implementation with workqueues - that
>> >> >> > would be very nice but not a must.
>> >> >>
>> >> >> But how can cmwq handle cgroup accounting?
>> >> >
>> >> > I think cmwq is just a replacement for our own thread pool.
>> >> > It doesn't make cgroup accounting easier or harder.
>> >>
>> >> cmwq doesn't solve the cgroup issue, that's the problem. IMHO, it will
>> >> be much simpler to handle cgroups with our own threading model.
>> >
>> > Bandan here arrived at the same conclusion.
>>
>> Same conclusion means no cmwq?
>
> Yes.
>
>> >> >> In any case, IMHO, the kernel should first provide the "mechanism" so
>> >> >> that later on a user-space management application (the "policy") can
>> >> >> orchestrate it.
>> >> >
>> >> > I think policy would be something coarse-grained, like setting priority.
>> >> > Making detailed scheduling decisions in userspace seems wrong somehow:
>> >> > what does the management application know that the kernel doesn't?
>> >>
>> >> The code of the management application can be
>> >> changed/modified/customized... also, the management application can
>> >> consider additional inputs (like per-VM SLAs) that are not available
>> >> to the kernel.
>> >
>> > An alternative is to express these in terms of cgroups.
>>
>> True, but there is a semantic gap between cgroups (low-level system
>> metrics) and SLAs (high-level application performance metrics). I don't
>> believe everything can be expressed as cgroups...
>
> Not as existing cgroups, but if there's another resource we need to
> partition, it can be expressed as another kind of cgroup, I think.
>
>> >> Think about MOM (memory over-commit manager) but for vhost (a vhost
>> >> I/O manager? VHIOM?)
>> >
>> > I worry that CPU and network load is much more dynamic than memory
>> > load, and that in real life userspace simply won't be able to react to
>> > changes fast enough.
>> > Also, this will interact with the scheduler in strange ways, e.g. the
>> > manager wants to free up the CPU and moves some vhost threads off it,
>> > the scheduler sees there's free CPU and moves some more jobs there.
>>
>> I believe we can make very good decisions in user-space, but obviously
>> this requires some research and experimentation. I think Eyal Moscovici
>> is working on this.
>>
>> > After all it might be the best we can do for over-committed systems,
>> > but it seems that it will have to be inter-VM to be really useful.
>>
>> Agree 100%. The goal is to have inter-VM support. Single-VM is the
>> first step towards implementing inter-VM.
>
> So I have my doubts about this being mergeable at the single-VM stage.
> But we'll see.
>
>> >
>> > Overall, one of the main points of vhost is a single interface that
>> > lets us access all kinds of kernel APIs in a uniform way: we tell the
>> > kernel what we want it to do, and let it go about executing that in
>> > the most efficient manner possible.
>> > If you insist on managing everything from userspace you lose some of
>> > these advantages. That's why I keep suggesting we try to give the
>> > scheduler more hints about our expected behaviour.
>>
>> >> >> >> 3. Adding polling mode to vhost
>> >> >> >>
>> >> >> >> It is a good idea to make polling adaptive based on various factors
>> >> >> >> such as the I/O rate, the guest kick overhead (which is the tradeoff
>> >> >> >> of polling), or the amount of wasted cycles (cycles we kept polling
>> >> >> >> but no new work was added).
>> >> >> >> However, as a first polling patch, we would prefer having a naive
>> >> >> >> polling approach, which could be tuned with later patches.
>> >> >> >
>> >> >> > While any polling approach would still need a lot of testing to prove
>> >> >> > we don't, for example, steal CPU from a guest which could be doing
>> >> >> > other useful work, given that an exit is at least 1.5K cycles, in
>> >> >> > theory it seems like something that can improve performance. I'm not
>> >> >> > sure how naive we can be without introducing regressions for some
>> >> >> > workloads.
>> >> >> > For example, if we are on the same host CPU, there's no chance busy
>> >> >> > waiting will help us make progress.
>> >> >> > How about detecting that the VCPU thread that kicked us is currently
>> >> >> > running on another CPU, and only polling in this case?
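For reference, the naive polling under discussion amounts to something like
the sketch below: disable guest kicks and spin on the avail ring for a
bounded budget before falling back to notifications. The helper name
vhost_vq_avail_empty() and the crude single-CPU guard are assumptions for
illustration, not the posted patch:

static bool vhost_try_poll(struct vhost_virtqueue *vq, unsigned int budget)
{
	unsigned int i;

	/* Polling only helps if the kicking VCPU can actually run
	 * concurrently; a single-CPU host is the trivial counter-example.
	 * (A real patch would check which CPU the VCPU thread is on,
	 * as suggested above.) */
	if (num_online_cpus() == 1)
		return false;

	vhost_disable_notify(vq->dev, vq);	/* we poll, the guest need not kick */
	for (i = 0; i < budget; i++) {
		if (!vhost_vq_avail_empty(vq->dev, vq))
			return true;		/* found new buffers to process */
		if (need_resched())
			break;			/* don't starve co-scheduled threads */
		cpu_relax();
	}
	vhost_enable_notify(vq->dev, vq);	/* give up, fall back to kicks;
						 * real code must recheck the ring
						 * here to close the race */
	return false;
}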
>> >> >> >> 4. vhost statistics
>> >> >> >>
>> >> >> >> The issue that was raised for the vhost statistics was using ftrace
>> >> >> >> instead of the debugfs mechanism.
>> >> >> >> However, looking further into the kvm stat mechanism, we learned that
>> >> >> >> ftrace didn't replace the plain debugfs mechanism, but was used in
>> >> >> >> addition to it.
>> >> >> >>
>> >> >> >> We propose to continue using debugfs for statistics, in a manner
>> >> >> >> similar to kvm, and at some point in the future ftrace can be added
>> >> >> >> to vhost as well.
>> >> >> >
>> >> >> > IMHO, while kvm stat is a useful script, the best tool for perf stats
>> >> >> > is still perf. So I would try to integrate with that.
>> >> >> > How it works internally is IMHO less important.
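(For context, the kvm-style debugfs approach amounts to exposing plain
counters under /sys/kernel/debug. A minimal sketch - the directory name
and the counters are illustrative, not from the patches:

#include <linux/debugfs.h>

static struct dentry *vhost_debugfs_dir;
static u64 vhost_stat_kicks;	/* guest-to-host notifications handled */
static u64 vhost_stat_polls;	/* work found by polling, without a kick */

static int __init vhost_stats_init(void)
{
	vhost_debugfs_dir = debugfs_create_dir("vhost", NULL);
	debugfs_create_u64("kicks", 0444, vhost_debugfs_dir, &vhost_stat_kicks);
	debugfs_create_u64("polls", 0444, vhost_debugfs_dir, &vhost_stat_polls);
	return 0;
}

A userspace script would then read e.g. /sys/kernel/debug/vhost/kicks the
same way kvm_stat reads the kvm counters.)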
>> >> >> >> Does this plan look o.k.?
>> >> >> >> If there are no further comments, I'll start preparing the patches
>> >> >> >> according to what we've agreed on thus far.
>> >> >> >> Thank you,
>> >> >> >> Razya
>> >> >> >
>> >> >> > I think a good place to try to start merging upstream would be 3 and 4.
>> >> >> > So if you want to make it easier to merge things upstream, try to keep
>> >> >> > 3 and 4 independent from 1 and 2.
>> >> >>
>> >> >> Note that -1- and -3- are strongly related. If you have a thread that
>> >> >> serves multiple queues (whether they belong to the same device/VM or
>> >> >> not), then this thread will be polling multiple queues at the same
>> >> >> time. This increases the chances you will find pending work to do in
>> >> >> some queue. In other words, you reduce the cycles wasted on polling.
>> >> >> On the other hand, if you run multiple threads and these threads do
>> >> >> polling simultaneously, then the threads may starve each other and
>> >> >> reduce performance (if they are scheduled to run on the same core).
>> >> >> In addition, a shared thread can decide when it should stop processing
>> >> >> a given queue and switch to another queue, because by polling the
>> >> >> thread knows when new requests were added to a queue (this is what we
>> >> >> called fine-grained I/O scheduling heuristics).
>> >> >>
>> >> >> So it seems like polling makes more sense when you serve multiple
>> >> >> queues with the same thread.
>> >> >>
>> >> >> Abel.
>> >> >
>> >> > A combination might bring gains in more workloads, but it should work
>> >> > on its own too. It's quite possible that only a single VM is active and
>> >> > the others are idle. So polling should either handle that well or be
>> >> > smart enough to turn itself off in this case.
>> >>
>> >> Agree, but again we are discussing the policy. We first need the
>> >> mechanism :)
>> >>
>> >> Abel.
>> >
>> > So you want to control this from a management application as well?
>> > That's a very low-level detail to expose to management.
>> >
>> > Why do we ever want to poll? It burns CPU uselessly.
>> > There are several reasons for polling to be good:
>> >
>> > 1. Scheduler overhead is too large; we are thrashing between multiple
>> >    threads running on the same CPU.
>> >    This is the one you are thinking of, I guess: it can be controlled
>> >    from the management application, so here some management interface
>> >    might make some sense.
>> >    But to know how long to poll you would need to know what the
>> >    scheduler overhead is, then poll for a fraction of that.
>> >
>> > 2. KVM exit overhead is too large; we are wasting cycles on
>> >    exit/reentry.
>> >    Userspace doesn't easily know about exits though. Again, we'd need
>> >    to measure how long an exit takes, then poll for a fraction of that.
>> >    This one is easier to measure from the guest, so maybe we should let
>> >    the guest control this.
>> >
>> > 3. If we halt the CPU it might take too long to wake up.
>> >    If we expect to be woken up soon, we'd be better off busy waiting.
>> >    This looks like it would be better addressed by an in-kernel API
>> >    that says "I expect to be woken up in X microseconds". This way, if
>> >    some other event occurs instead, it will be handled instead of
>> >    waiting for us to finish polling.
>> >
>> > As you see, there are many factors in play here. That's why we can't
>> > just add a userspace interface "poll for X cycles": to support item 1,
>> > maintaining this interface would block support for items 2 and 3.
>>
>> It doesn't need to be pure user-space. Some parts (like "when" to do
>> polling) can be controlled in the kernel, but other parts (like the
>> number of threads) can be controlled by user-space. Also, some of the
>> polling settings (completely disable polling, min/max polling
>> limitations...) can be controlled by user-space.
>
> So the above concern would apply to min polling as well.
> Max polling could be expressed as some kind of latency/responsiveness
> requirement, and in that case it might start to make some sense.
>
>> >
>> >> >
>> >> >> >
>> >> >> > Thanks again,
>> >> >> >
>> >> >> > --
>> >> >> > MST
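P.S. regarding item 3 above: just to make the idea concrete, a very rough
shape for such an in-kernel hint could be the following - everything here
is hypothetical, no such API exists today:

#include <linux/ktime.h>
#include <linux/sched.h>

/* Caller expects to be woken within @us microseconds; busy-poll for
 * that window before really going to sleep. */
static void wait_hint_us(unsigned long us, bool (*done)(void *), void *arg)
{
	ktime_t deadline = ktime_add_us(ktime_get(), us);

	while (ktime_before(ktime_get(), deadline)) {
		if (done(arg))
			return;		/* the event arrived while polling */
		if (need_resched())
			break;		/* someone else needs the CPU: stop polling */
		cpu_relax();
	}
	/* fall back to the normal sleep/wakeup path here */
}

The need_resched() check is what lets other events win over the polling,
per the point above.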