Re: Updated Elvis Upstreaming Roadmap

On Thu, Dec 19, 2013 at 3:48 PM, Michael S. Tsirkin <mst@xxxxxxxxxx> wrote:
> On Thu, Dec 19, 2013 at 02:56:10PM +0200, Abel Gordon wrote:
>> On Thu, Dec 19, 2013 at 1:37 PM, Michael S. Tsirkin <mst@xxxxxxxxxx> wrote:
>> > On Thu, Dec 19, 2013 at 12:36:30PM +0200, Abel Gordon wrote:
>> >> On Thu, Dec 19, 2013 at 12:13 PM, Michael S. Tsirkin <mst@xxxxxxxxxx> wrote:
>> >> > On Thu, Dec 19, 2013 at 08:40:44AM +0200, Abel Gordon wrote:
>> >> >> On Wed, Dec 18, 2013 at 12:43 PM, Michael S. Tsirkin <mst@xxxxxxxxxx> wrote:
>> >> >> > On Tue, Dec 17, 2013 at 12:04:42PM +0200, Razya Ladelsky wrote:
>> >> >> >> Hi,
>> >> >> >>
>> >> >> >> Thank you all for your comments.
>> >> >> >> I'm sorry for taking this long to reply, I was away on vacation..
>> >> >> >>
>> >> >> >> It was a good, long discussion, many issues were raised, which we'd like
>> >> >> >> to address with the following proposed roadmap for Elvis patches.
>> >> >> >> In general, we believe it would be best to start with patches that are
>> >> >> >> as simple as possible, providing the basic Elvis functionality,
>> >> >> >> and attend to the more complicated issues in subsequent patches.
>> >> >> >>
>> >> >> >> Here's the road map for Elvis patches:
>> >> >> >
>> >> >> > Thanks for the follow up. Some suggestions below.
>> >> >> > Please note the suggestions below merely represent
>> >> >> > thoughts on merging upstream.
>> >> >> > If as the first step you are content with keeping this
>> >> >> > work as out of tree patches, in order to have
>> >> >> > the freedom to experiment with interfaces and
>> >> >> > performance, please feel free to ignore them.
>> >> >> >
>> >> >> >> 1. Shared vhost thread for multiple devices.
>> >> >> >>
>> >> >> >> The way to go here, we believe, is to start with a patch having a shared
>> >> >> >> vhost thread for multiple devices of the SAME vm.
>> >> >> >> The next step/patch may be handling vms belonging to the same cgroup.
>> >> >> >>
>> >> >> >> Finally, we need to extend the functionality so that the shared
>> >> >> >> vhost thread serves multiple vms (not necessarily belonging to the
>> >> >> >> same cgroup).
>> >> >> >>
>> >> >> >> There was a lot of discussion about the way to address the
>> >> >> >> enforcement of cgroup policies, and we will consider the various
>> >> >> >> solutions with a future patch.
>> >> >> >
>> >> >> > With respect to the upstream kernel,
>> >> >> > I'm not sure a bunch of changes just for the sake of guests with
>> >> >> > multiple virtual NIC cards makes sense.
>> >> >> > And I wonder how this step, in isolation, will affect e.g.
>> >> >> > multiqueue workloads.
>> >> >> > But I guess if the numbers are convincing, this can be mergeable.
>> >> >>
>> >> >> Even if you have a single multiqueue device, this change allows
>> >> >> creating one vhost thread for all the queues, one vhost thread per
>> >> >> queue, or any other combination. I guess that depending on the
>> >> >> workload and on the system utilization (free cycles/cores, density)
>> >> >> you would prefer to use one or more vhost threads.
>> >> >
>> >> > That is already controllable from the guest though, which likely has a better
>> >> > idea about the workload.
>> >>
>> >> But the guest has no idea what's going on in the host system
>> >> (e.g. other VMs' I/O, cpu utilization of the host cores...)
>> >
>> > But again, you want to do things per VM now so you will have no idea
>> > about other VMs, right? Host cpu utilization could be a useful input
>>
>> Razya shared a roadmap. The first step was to support sharing a thread
>> for a single VM but the goal is to later on extend the mechanism to
>> support multiple VMs and cgroups
>
> Yes, I got that. What I'm not sure of is whether this is just a
> development roadmap, or do you expect to be able to merge things
> upstream in this order as well.
> If the latter, all I'm saying is that I think you are doing this in the wrong
> order: we'll likely have to merge first 4 then 3 then possibly 1+2 together -
> but maybe 1+2 will have to wait until cgroups are sorted out.
> That's just a hunch of course until you actually try to do it.
> If the former, most of my comments don't really apply.
>
>> > for some heuristics, I agree, but nothing prevents us from sending
>> > this info to guest agent and controlling multiqueue based on that
>> > (kind of like balloon).
>>
>> IMHO, we should never share host internal information (e.g. resource
>> utilization) with the guest. That's supposed to be confidential
>> information :)
>> The balloon is a bit different... kvm asks the guest OS to give (if
>> possible) some pages, but kvm never sends the balloon information
>> about the memory utilization of the host. If the guest wishes to send
>> information about its own memory consumption (like it does for MOM),
>> that's OK.
>>
>> So, the guest can share information with the host but the host should
>> be the one to make the decisions. KVM should never share host
>> information with the guest.
>
> It's also easy to just tell guest agent to turn multiqueue on/off if you have
> a mind to.
>
>
>> >
>> >> >
>> >> >> >
>> >> >> >>
>> >> >> >> 2. Creation of vhost threads
>> >> >> >>
>> >> >> >> We suggested two ways of controlling the creation and removal of vhost
>> >> >> >> threads:
>> >> >> >> - statically determining the maximum number of virtio devices per worker
>> >> >> >> via a kernel module parameter
>> >> >> >> - dynamically: Sysfs mechanism to add and remove vhost threads
>> >> >> >>
>> >> >> >> It seems that it would be simplest to take the static approach as
>> >> >> >> a first stage. At a second stage (next patch), we'll advance to
>> >> >> >> dynamically changing the number of vhost threads, using the static
>> >> >> >> module parameter only as a default value.
>> >> >> >
>> >> >> > I'm not sure how independent this is from 1.
>> >> >> > With respect to the upstream kernel,
>> >> >> > Introducing interfaces (which we'll have to maintain
>> >> >> > forever) just for the sake of guests with
>> >> >> > multiple virtual NIC cards does not look like a good tradeoff.
>> >> >> >
>> >> >> > So I'm unlikely to merge this upstream without making it useful cross-VM,
>> >> >> > and yes this means isolation and accounting with cgroups need to
>> >> >> > work properly.
>> >> >>
>> >> >> Agree, but even if you use a single multiqueue device, having the
>> >> >> ability to serve all the queues with one thread or with multiple
>> >> >> threads looks like a useful feature.
>> >> >
>> >> > Could be.  At the moment, multiqueue is off by default because it causes
>> >> > regressions for some workloads as compared to a single queue.
>> >> > If we have heuristics in vhost that fix this by auto-tuning threading, that
>> >> > would be nice.  But if you need to tune it manually anyway,
>> >> > then from upstream perspective it does not seem to be worth it - you can just
>> >> > turn multiqueue on/off in the guest.
>> >>
>> >> I see. But we are mixing again between the policy and the mechanism.
>> >> We first need a mechanism to control the system and then we need to
>> >> implement the policy to orchestrate it (whether it will be
>> >> implemented in the kernel as part of vhost or outside in user-space).
>> >> I don't see why to wait to have a policy to upstream the mechanism. If
>> >> we upstream the mechanism in a manner that the defaults do not affect
>> >> today's vhost behavior, then it will be possible to play with the
>> >> policies and upstream them later.
>> >
>> > Well it all hinges on whether it's in userspace actually.
>> > Interfaces to userspace must be maintained forever, there's no way to know
>> > they are right and make sense before they are used,  so we must not merge
>> > the interface before there is userspace using it correctly.
>> >
>> > If it's in-kernel, we can merge a bit of unused code into kernel
>> > just to make it easier for you to make progress, provided it's
>> > well isolated and doesn't make life much harder for others
>> > working on same code-base.
>>
>> I see your concerns. We can also expose the mechanism as exported
>> symbols and implement the policy in a different kernel module.
>
> Just in case I didn't make this clear previously - my gut feeling is
> that the split between mechanism and policy is in the wrong place
> here.  Policy is what to do, mechanism is how to do it.
> What is the policy in aid of? What is the "what to do" here? "process packets
> more efficiently" right?  Whether to poll and which threads process which
> packets sounds more like how to do it - a mechanism.
>
>> Is that OK? Or are you also concerned about exporting symbols to
>> other kernel modules because they may change as well?
>
> Sorry, I do not understand what is suggested here.

Export a set of control functions from the vhost kernel module so a
new kernel module can call these functions to orchestrate/manage
vhost.
In other words, keep the control interface internal to the kernel so
other kernel modules can use it to implement the policy.

>
>> >
>> >> >
>> >> >
>> >> >> >
>> >> >> >> Regarding cwmq, it is an interesting mechanism, which we need to explore
>> >> >> >> further.
>> >> >> >> At the moment we prefer not to change the vhost model to use cwmq, as some
>> >> >> >> of the issues that were discussed, such as cgroups, are not supported by
>> >> >> >> cwmq, and this is adding more complexity.
>> >> >> >> However, we'll look further into it, and consider it at a later stage.
>> >> >> >
>> >> >> > Hmm that's still assuming some smart management tool configuring
>> >> >> > this correctly.  Can't this be determined automatically depending
>> >> >> > on the workload?
>> >> >> > This is what the cwmq suggestion was really about: detect
>> >> >> > that we need more threads and spawn them.
>> >> >> > It's less about sharing the implementation with workqueues -
>> >> >> > would be very nice but not a must.
>> >> >>
>> >> >> But how cwmq can consider cgroup accounting ?
>> >> >
>> >> > I think cwmq is just a replacement for our own thread pool.
>> >> > It doesn't make cgroup accounting easier or harder.
>> >>
>> >> cwmq doesn't solve the cgroup issue, that's the problem. IMHO, it
>> >> will be much simpler to consider cgroups with our own threading model.
>> >
>> > Bandan here arrived at the same conclusion.
>>
>> Same conclusion means no cwmq?
>
> Yes.
>
>> >
>> >> >
>> >> >> In any case, IMHO, the kernel should first provide the "mechanism" so
>> >> >> later on a user-space management application (the "policy") can
>> >> >> orchestrate it.
>> >> >
>> >> > I think policy would be something coarse-grained, like setting priority.
>> >> > Making detailed scheduling decisions in userspace seems wrong somehow:
>> >> > what does management application know that kernel doesn't?
>> >>
>> >> The code of the management application can be
>> >> changed/modified/customized... also the management application can
>> >> consider additional inputs (like per VM SLAs) that are not available
>> >> to the kernel.
>> >
>> > An alternative is to express these in terms of cgroups.
>>
>> True but there is a semantic gap between cgroups (low level system
>> metrics) and SLAs (high level application performance metrics). I
>> don't believe everything can be expressed as cgroups...
>
> Not with existing cgroups, but if there's another resource we need to
> partition, it can be expressed as another kind of cgroup, I think.
>
>> >
>> >> Think about MOM (memory over-commit manager) but for vhost (a vhost
>> >> I/O manager? VHIOM?)
>> >
>> > I worry that CPU and network load is much more dynamic than
>> > memory load, and that in real life userspace simply won't be able to
>> > react to changes fast enough.
>> > Also that this will interact with scheduler in strange ways, e.g.
>> > manager wants to free up the CPU and moves some vhost threads
>> > off it, scheduler sees there's free CPU and moves some more jobs there.
>>
>> I believe we can make very good decisions in user-space but obviously
>> this require some research and experimentation. I think Eyal Moscovici
>> is working on this.
>>
>> > After all it might be the best we can do for over-committed systems,
>> > but it seems that it will have to be inter-VM to be really useful.
>>
>> Agree 100%. The goal is to have inter-VM support. Single VM is the
>> first step to implement inter-VM.
>
> So I have my doubts about this being mergeable at the single VM stage.
> But we'll see.
>
>
>> >
>> >
>> > Overall one of the main points of vhost is a single interface
>> > that let us access all kind of kernel APIs in a uniform way:
>> > we tell kernel what we want it to do, and let it get about
>> > executing that in the most efficient manner possible.
>> > If you insist on managing everything from userspace you lose
>> > some of these advantages.
>> > That's why I keep suggesting we try to give scheduler more hints
>> > about our expected behaviour.
>> >
>> >> >
>> >> >
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> >> 3. Adding polling mode to vhost
>> >> >> >>
>> >> >> >> It is a good idea to make polling adaptive based on various
>> >> >> >> factors such as the I/O rate, the guest kick overhead (which is
>> >> >> >> the tradeoff of polling), or the amount of wasted cycles (cycles
>> >> >> >> we kept polling but no new work was added).
>> >> >> >> However, as an initial polling patch, we would prefer a naive
>> >> >> >> polling approach, which could be tuned by later patches.
>> >> >> >>
>> >> >> >
>> >> >> > While any polling approach would still need a lot of testing to prove we
>> >> >> > don't for example steal CPU from a guest which could be doing other useful
>> >> >> > work, given that an exit costs at least 1.5K cycles, in theory it
>> >> >> > seems like something that can improve performance.  I'm not sure how
>> >> >> > naive we can be without introducing regressions for some workloads.
>> >> >> > For example, if we are on the same host CPU, there's no
>> >> >> > chance busy waiting will help us make progress.
>> >> >> > How about detecting that the VCPU thread that kicked us
>> >> >> > is currently running on another CPU, and only polling in
>> >> >> > this case?
>> >> >> >
>> >> >> >> 4. vhost statistics
>> >> >> >>
>> >> >> >> The issue that was raised for the vhost statistics was using ftrace
>> >> >> >> instead of the debugfs mechanism.
>> >> >> >> However, looking further into the kvm stat mechanism, we learned that
>> >> >> >> ftrace didn't replace the plain debugfs mechanism, but was used in
>> >> >> >> addition to it.
>> >> >> >>
>> >> >> >> We propose to continue using debugfs for statistics, in a manner
>> >> >> >> similar to kvm, and at some point in the future ftrace can be
>> >> >> >> added to vhost as well.
>> >> >> >
>> >> >> > IMHO, while kvm stat is a useful script, the best tool
>> >> >> > for perf stats is still perf. So I would try to integrate with that.
>> >> >> > How it works internally is IMHO less important.
>> >> >> >
>> >> >> >> Does this plan look o.k.?
>> >> >> >> If there are no further comments, I'll start preparing the patches
>> >> >> >> according to what we've agreed on thus far.
>> >> >> >> Thank you,
>> >> >> >> Razya
>> >> >> >
>> >> >> > I think a good place to try to start merging upstream would be 3 and 4.
>> >> >> > So if you want to make it easier to merge things upstream, try to keep 3
>> >> >> > and 4 independent from 1 and 2.
>> >> >>
>> >> >> Note -1- and -3- are strongly related. If you have a thread that
>> >> >> serves multiple queues (whether they belong to the same device/vm or
>> >> >> not) then this thread will be polling multiple queues at the same
>> >> >> time. This increases the chances you will find pending work to do in
>> >> >> some queue. In other words, you reduce the cycles wasted for polling.
>> >> >> On the other hand, if you run multiple threads and these threads do
>> >> >> polling simultaneously then the threads may starve each other and
>> >> >> reduce performance (if they are scheduled to run on the same core). In
>> >> >> addition, a shared thread can decide when it should stop processing a
>> >> >> given queue and switch to another queue, because by polling the thread
>> >> >> knows when new requests were added to a queue (this is what we called
>> >> >> fine-grained I/O scheduling heuristics)
>> >> >>
>> >> >> So, it seems polling makes more sense when you serve multiple
>> >> >> queues with the same thread.
>> >> >>
>> >> >> Abel.
>> >> >
>> >> > A combination might bring gains in more workloads, but it should work on its
>> >> > own too. It's quite possible that only a single VM is active, others are
>> >> > idle. So either polling should handle that well or be smart enough to turn
>> >> > itself off in this case.
>> >>
>> >> Agree, but we are again discussing the policy. We first need the
>> >> mechanism :)
>> >>
>> >> Abel.
>> >
>> > So you want to control this from a management application as well?
>> > That's a very low level detail to expose to management.
>> >
>> > Why do we ever want to poll? It burns CPU uselessly.
>> > There are several reasons for polling to be good:
>> >
>> >         1. scheduler overhead is too large, we are thrashing
>> >           between multiple threads running on the same CPU.
>> >           This is the one you are thinking of I guess: it can
>> >           be controlled from the management application.
>> >           So here some management interface might make some sense.
>> >           But to know how long to poll you would need to know
>> >           what the scheduler overhead is, then poll for
>> >           a fraction of that.
>> >
>> >         2. KVM exit overhead is too large, we are wasting cycles
>> >           on exit/reentry.
>> >           Userspace doesn't easily know about exits though.
>> >           Again, we'd need to measure how long does an exit take,
>> >           then poll for a fraction of that.
>> >           This one is easier to measure from guest, so maybe
>> >           we should let guest control this.
>> >
>> >         3. If we halt the CPU it might take too long to wake up.
>> >           If we expect to be woken up soon we'd be better off
>> >           busy waiting.
>> >           This looks like it would be better addressed by an in-kernel API
>> >           that says "I expect to be woken up in X microseconds".
>> >           This way if some other event occurs instead, it will be
>> >           handled instead of waiting for us to finish polling.
>> >
>> > As you see there are many factors in play here.
>> > That's why we can't just add userspace interface "poll for X cycles":
>> > to support item 1.  maintaining this interface would block
>> > support for items 2 and 3.
>>
>> It doesn't need to be pure user-space. Some parts (like "when" to do
>> polling) can be controlled in the kernel while other parts (like the
>> number of threads) can be controlled by user-space. Also, some of the
>> polling settings (completely disable polling, min/max polling
>> limitations...) can be controlled by user-space
>
> So above concern would apply to min polling as well.
> max polling could be expressed as some kind of latency/responsiveness
> requirement and in that case it might start to make some sense.
>
>> >
>> >
>> >> >
>> >> >> >
>> >> >> > Thanks again,
>> >> >> >
>> >> >> > --
>> >> >> > MST