Re: Updated Elvis Upstreaming Roadmap

"Michael S. Tsirkin" <mst@xxxxxxxxxx> · Thu, 19 Dec 2013 12:13:28 +0200

On Thu, Dec 19, 2013 at 08:40:44AM +0200, Abel Gordon wrote:
> On Wed, Dec 18, 2013 at 12:43 PM, Michael S. Tsirkin <mst@xxxxxxxxxx> wrote:
> > On Tue, Dec 17, 2013 at 12:04:42PM +0200, Razya Ladelsky wrote:
> >> Hi,
> >>
> >> Thank you all for your comments.
> >> I'm sorry for taking this long to reply, I was away on vacation..
> >>
> >> It was a good, long discussion, many issues were raised, which we'd like
> >> to address with the following proposed roadmap for Elvis patches.
> >> In general, we believe it would be best to start with patches that are
> >> as simple as possible, providing the basic Elvis functionality,
> >> and attend to the more complicated issues in subsequent patches.
> >>
> >> Here's the road map for Elvis patches:
> >
> > Thanks for the follow up. Some suggestions below.
> > Please note they suggestions below merely represent
> > thoughts on merging upstream.
> > If as the first step you are content with keeping this
> > work as out of tree patches, in order to have
> > the freedom to experiment with interfaces and
> > performance, please feel free to ignore them.
> >
> >> 1. Shared vhost thread for multiple devices.
> >>
> >> The way to go here, we believe, is to start with a patch having a shared
> >> vhost thread for multiple devices of the SAME vm.
> >> The next step/patch may be handling vms belonging to the same cgroup.
> >>
> >> Finally, we need to extend the functionality so that the shared vhost
> >> thread
> >> serves multiple vms (not necessarily belonging to the same cgroup).
> >>
> >> There was a lot of discussion about the way to address the enforcement
> >> of cgroup policies, and we will consider the various solutions with a
> >> future
> >> patch.
> >
> > With respect to the upstream kernel,
> > I'm not sure a bunch of changes just for the sake of guests with
> > multiple virtual NIC cards makes sense.
> > And I wonder how this step, in isolation, will affect e.g.
> > multiqueue workloads.
> > But I guess if the numbers are convincing, this can be mergeable.
> 
> Even if you have a single multiqueue device this change allows
> to create one vhost thread for all the queues, one vhost thread per
> queue or any other combination. I guess that depending on the workload
> and depending on the system utilization (free cycles/cores, density)
> you would prefer
> to use one or more vhost threads.

That is already controllable from the guest though, which likely has a better
idea about the workload.

> >
> >>
> >> 2. Creation of vhost threads
> >>
> >> We suggested two ways of controlling the creation and removal of vhost
> >> threads:
> >> - statically determining the maximum number of virtio devices per worker
> >> via a kernel module parameter
> >> - dynamically: Sysfs mechanism to add and remove vhost threads
> >>
> >> It seems that it would be simplest to take the static approach as
> >> a first stage. At a second stage (next patch), we'll advance to
> >> dynamically
> >> changing the number of vhost threads, using the static module parameter
> >> only as a default value.
> >
> > I'm not sure how independent this is from 1.
> > With respect to the upstream kernel,
> > Introducing interfaces (which we'll have to maintain
> > forever) just for the sake of guests with
> > multiple virtual NIC cards does not look like a good tradeoff.
> >
> > So I'm unlikely to merge this upstream without making it useful cross-VM,
> > and yes this means isolation and accounting with cgroups need to
> > work properly.
> 
> Agree, but even if you use a single multiqueue device having the
> ability to use 1 thread to serve all the queues or multiple threads to
> serve all the queues looks like a useful feature.

Could be.  At the moment, multiqueue is off by default because it causes
regressions for some workloads as compared to a single queue.
If we have heuristics in vhost that fix this by auto-tuning threading, that
would be nice.  But if you need to tune it manually anyway,
then from upstream perspective it does not seem to be worth it - you can just
turn multiqueue on/off in the guest.

> >
> >> Regarding cwmq, it is an interesting mechanism, which we need to explore
> >> further.
> >> At the moment we prefer not to change the vhost model to use cwmq, as some
> >> of the issues that were discussed, such as cgroups, are not supported by
> >> cwmq, and this is adding more complexity.
> >> However, we'll look further into it, and consider it at a later stage.
> >
> > Hmm that's still assuming some smart management tool configuring
> > this correctly.  Can't this be determined automatically depending
> > on the workload?
> > This is what the cwmq suggestion was really about: detect
> > that we need more threads and spawn them.
> > It's less about sharing the implementation with workqueues -
> > would be very nice but not a must.
> 
> But how cwmq can consider cgroup accounting ?

I think cwmq is just a replacement for our own thread pool.
It doesn't make cgroup accounting easier or harder.

> In any case, IMHO, the kernel should first provide the "mechanism" so
> later on a user-space management application (the "policy") can
> orchestrate it.

I think policy would be something coarse-grained, like setting priority.
Making detailed scheduling decisions in userspace seems wrong somehow:
what does management application know that kernel doesn't?

> >
> >
> >
> >> 3. Adding polling mode to vhost
> >>
> >> It is a good idea making polling adaptive based on various factors such as
> >> the I/O rate, the guest kick overhead(which is the tradeoff of polling),
> >> or the amount of wasted cycles (cycles we kept polling but no new work was
> >> added).
> >> However, as a beginning polling patch, we would prefer having a naive
> >> polling approach, which could be tuned with later patches.
> >>
> >
> > While any polling approach would still need a lot of testing to prove we
> > don't for example steal CPU from guest which could be doing other useful
> > work, given that an exit is at least 1.5K cycles at least in theory it
> > seems like something that can improve performance.  I'm not sure how
> > naive we can be without introducing regressions  for some workloads.
> > For example, if we are on the same host CPU, there's no
> > chance busy waiting will help us make progress.
> > How about detecting that the VCPU thread that kicked us
> > is currently running on another CPU, and only polling in
> > this case?
> >
> >> 4. vhost statistics
> >>
> >> The issue that was raised for the vhost statistics was using ftrace
> >> instead of the debugfs mechanism.
> >> However, looking further into the kvm stat mechanism, we learned that
> >> ftrace didn't replace the plain debugfs mechanism, but was used in
> >> addition to it.
> >>
> >> We propose to continue using debugfs for statistics, in a manner similar
> >> to kvm,
> >> and at some point in the future ftrace can be added to vhost as well.
> >
> > IMHO which kvm stat is a useful script, the best tool
> > for perf stats is still perf. So I would try to integrate with that.
> > How it works internally is IMHO less important.
> >
> >> Does this plan look o.k.?
> >> If there are no further comments, I'll start preparing the patches
> >> according to what we've agreed on thus far.
> >> Thank you,
> >> Razya
> >
> > I think a good place to try to start merging upstream would be 3 and 4.
> > So if you want to make it easier to merge things upstream, try to keep 3
> > and 4 independent from 1 and 2.
> 
> Note -1- and -3- are strongly related. If you have a thread that
> serves multiple queues (whenever they belong to the same device/vm or
> not) then this thread will be polling multiple queues at the same
> time. This increases the chances you will find pending work to do in
> some queue. In other words, you reduce the cycles wasted for polling.
> In the other hand,  if you run multiple threads and these threads do
> polling simultaneously then the threads may starve each other and
> reduce performance (if they are scheduled to run in the same core). In
> addition, a shared thread can decide when it should stop processing a
> given queue and switch to other queue because by polling the thread
> knows when new requests were added to a queue (this is what we called
> fine-grained I/O scheduled heuristics)
> 
> So, seems like polling makes more sense when you serve multiple queues
> with the same thread.
> 
> Abel.

A combination might bring gains in more workloads, but it should work on its
own too. It's quite possible that only a single VM is active, others are
idle. So either polling should handle that well or be smart enough to turn
itself off in this case.

> >
> > Thanks again,
> >
> > --
> > MST
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html