On Thu, Dec 19, 2013 at 08:40:44AM +0200, Abel Gordon wrote:
> On Wed, Dec 18, 2013 at 12:43 PM, Michael S. Tsirkin <mst@xxxxxxxxxx> wrote:
> > On Tue, Dec 17, 2013 at 12:04:42PM +0200, Razya Ladelsky wrote:
> >> Hi,
> >>
> >> Thank you all for your comments.
> >> I'm sorry for taking this long to reply, I was away on vacation.
> >>
> >> It was a good, long discussion; many issues were raised, which we'd like
> >> to address with the following proposed roadmap for the Elvis patches.
> >> In general, we believe it would be best to start with patches that are
> >> as simple as possible, providing the basic Elvis functionality,
> >> and attend to the more complicated issues in subsequent patches.
> >>
> >> Here's the roadmap for the Elvis patches:
> >
> > Thanks for the follow-up. Some suggestions below.
> > Please note that the suggestions below merely represent
> > thoughts on merging upstream.
> > If, as a first step, you are content with keeping this
> > work as out-of-tree patches, in order to have
> > the freedom to experiment with interfaces and
> > performance, please feel free to ignore them.
> >
> >> 1. Shared vhost thread for multiple devices.
> >>
> >> The way to go here, we believe, is to start with a patch having a shared
> >> vhost thread for multiple devices of the SAME vm.
> >> The next step/patch may be handling vms belonging to the same cgroup.
> >>
> >> Finally, we need to extend the functionality so that the shared vhost
> >> thread serves multiple vms (not necessarily belonging to the same cgroup).
> >>
> >> There was a lot of discussion about the way to address the enforcement
> >> of cgroup policies, and we will consider the various solutions in a
> >> future patch.
> >
> > With respect to the upstream kernel,
> > I'm not sure a bunch of changes just for the sake of guests with
> > multiple virtual NIC cards makes sense.
> > And I wonder how this step, in isolation, will affect e.g.
> > multiqueue workloads.
> > But I guess if the numbers are convincing, this can be mergeable.
>
> Even if you have a single multiqueue device, this change allows you
> to create one vhost thread for all the queues, one vhost thread per
> queue, or any other combination. I guess that depending on the workload
> and on the system utilization (free cycles/cores, density)
> you would prefer to use one or more vhost threads.

That is already controllable from the guest though, which likely has a
better idea about the workload.

> >
> >>
> >> 2. Creation of vhost threads
> >>
> >> We suggested two ways of controlling the creation and removal of vhost
> >> threads:
> >> - statically determining the maximum number of virtio devices per worker
> >>   via a kernel module parameter
> >> - dynamically: a sysfs mechanism to add and remove vhost threads
> >>
> >> It seems that it would be simplest to take the static approach as
> >> a first stage. At a second stage (next patch), we'll advance to
> >> dynamically changing the number of vhost threads, using the static
> >> module parameter only as a default value.
> >
> > I'm not sure how independent this is from 1.
> > With respect to the upstream kernel,
> > introducing interfaces (which we'll have to maintain
> > forever) just for the sake of guests with
> > multiple virtual NIC cards does not look like a good tradeoff.
> >
> > So I'm unlikely to merge this upstream without making it useful cross-VM,
> > and yes, this means isolation and accounting with cgroups need to
> > work properly.
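
(Purely for illustration, a minimal sketch of what the static knob from
item 2 might look like. The names devs_per_worker, struct vhost_worker and
pick_worker(), and the global workers list, are invented for this sketch;
this is not the actual Elvis code.)

#include <linux/module.h>
#include <linux/list.h>

/* Cap on how many virtio devices a single shared vhost worker may serve;
 * read-only at runtime, set when the module is loaded. */
static int devs_per_worker = 4;
module_param(devs_per_worker, int, 0444);
MODULE_PARM_DESC(devs_per_worker,
                 "Maximum number of virtio devices served by one vhost worker");

struct vhost_worker {
        struct list_head node;  /* link in the global workers list */
        int num_devs;           /* devices currently served by this worker */
};

static LIST_HEAD(workers);

/* Reuse an existing shared worker until it reaches the cap; return NULL
 * so the caller spawns a fresh worker thread and adds it to the list. */
static struct vhost_worker *pick_worker(void)
{
        struct vhost_worker *w;

        list_for_each_entry(w, &workers, node) {
                if (w->num_devs < devs_per_worker) {
                        w->num_devs++;
                        return w;
                }
        }
        return NULL;
}

The sysfs interface mentioned above would then only need to adjust this
value (or per-worker assignments) at runtime instead of at load time.
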
>
> Agree, but even if you use a single multiqueue device, having the
> ability to use one thread to serve all the queues, or multiple threads
> to serve all the queues, looks like a useful feature.

Could be. At the moment, multiqueue is off by default because it causes
regressions for some workloads as compared to a single queue.
If we have heuristics in vhost that fix this by auto-tuning threading,
that would be nice. But if you need to tune it manually anyway, then from
an upstream perspective it does not seem to be worth it - you can just
turn multiqueue on/off in the guest.

> >
> >> Regarding cwmq, it is an interesting mechanism, which we need to explore
> >> further.
> >> At the moment we prefer not to change the vhost model to use cwmq, as some
> >> of the issues that were discussed, such as cgroups, are not supported by
> >> cwmq, and this would add more complexity.
> >> However, we'll look further into it, and consider it at a later stage.
> >
> > Hmm, that's still assuming some smart management tool configuring
> > this correctly. Can't this be determined automatically depending
> > on the workload?
> > This is what the cwmq suggestion was really about: detect
> > that we need more threads and spawn them.
> > It's less about sharing the implementation with workqueues -
> > that would be very nice but not a must.
>
> But how can cwmq consider cgroup accounting?

I think cwmq is just a replacement for our own thread pool.
It doesn't make cgroup accounting easier or harder.

> In any case, IMHO, the kernel should first provide the "mechanism" so
> that later on a user-space management application (the "policy") can
> orchestrate it.

I think policy would be something coarse-grained, like setting a priority.
Making detailed scheduling decisions in userspace seems wrong somehow:
what does the management application know that the kernel doesn't?

> >
> >
> >> 3. Adding polling mode to vhost
> >>
> >> It is a good idea to make polling adaptive based on various factors such as
> >> the I/O rate, the guest kick overhead (which is the tradeoff of polling),
> >> or the amount of wasted cycles (cycles we kept polling but no new work was
> >> added).
> >> However, as a first polling patch, we would prefer a naive
> >> polling approach, which could be tuned with later patches.
> >>
> >
> > While any polling approach would still need a lot of testing to prove we
> > don't, for example, steal CPU from a guest which could be doing other useful
> > work, given that an exit is at least 1.5K cycles, at least in theory it
> > seems like something that can improve performance. I'm not sure how
> > naive we can be without introducing regressions for some workloads.
> > For example, if we are on the same host CPU, there's no
> > chance busy waiting will help us make progress.
> > How about detecting that the VCPU thread that kicked us
> > is currently running on another CPU, and only polling in
> > this case?
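
(Again only a sketch, not the proposed patch: a naive bounded polling loop
combined with the check suggested above. vhost_vq_has_work(),
vhost_vq_kick_task() and POLL_BUDGET are invented names, and the real
virtqueue layout in drivers/vhost/vhost.h is different.)

#include <linux/sched.h>        /* task_curr(), struct task_struct */
#include <asm/processor.h>      /* cpu_relax() */

#define POLL_BUDGET 1000        /* loop iterations before giving up */

/* Hypothetical helpers, declared here only to make the sketch readable. */
struct vhost_virtqueue;
extern bool vhost_vq_has_work(struct vhost_virtqueue *vq);     /* new avail descriptors? */
extern struct task_struct *vhost_vq_kick_task(struct vhost_virtqueue *vq);

/* Returns true if polling found work, so the caller can process the ring
 * without waiting for a guest kick (and the exit that goes with it). */
static bool vhost_naive_poll(struct vhost_virtqueue *vq)
{
        struct task_struct *vcpu = vhost_vq_kick_task(vq);
        unsigned int i;

        /* Only busy-wait if the vCPU that would kick us is running right
         * now (necessarily on another CPU, since we occupy this one);
         * otherwise we would just burn the cycles it needs to produce work. */
        if (vcpu && !task_curr(vcpu))
                return false;

        for (i = 0; i < POLL_BUDGET; i++) {
                if (vhost_vq_has_work(vq))
                        return true;
                cpu_relax();
        }
        return false;   /* fall back to the eventfd kick path */
}

The adaptive version discussed above would replace the fixed POLL_BUDGET
with something driven by the observed I/O rate and wasted polling cycles.
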
> >
> >> 4. vhost statistics
> >>
> >> The issue that was raised for the vhost statistics was using ftrace
> >> instead of the debugfs mechanism.
> >> However, looking further into the kvm stat mechanism, we learned that
> >> ftrace didn't replace the plain debugfs mechanism, but was used in
> >> addition to it.
> >>
> >> We propose to continue using debugfs for statistics, in a manner similar
> >> to kvm, and at some point in the future ftrace can be added to vhost as well.
> >
> > IMHO, while kvm stat is a useful script, the best tool
> > for perf stats is still perf. So I would try to integrate with that.
> > How it works internally is IMHO less important.
> >
> >> Does this plan look o.k.?
> >> If there are no further comments, I'll start preparing the patches
> >> according to what we've agreed on thus far.
> >> Thank you,
> >> Razya
> >
> > I think a good place to try to start merging upstream would be 3 and 4.
> > So if you want to make it easier to merge things upstream, try to keep 3
> > and 4 independent from 1 and 2.
>
> Note -1- and -3- are strongly related. If you have a thread that
> serves multiple queues (whether they belong to the same device/vm or
> not), then this thread will be polling multiple queues at the same
> time. This increases the chances that you will find pending work to do
> in some queue. In other words, you reduce the cycles wasted on polling.
> On the other hand, if you run multiple threads and these threads do
> polling simultaneously, then the threads may starve each other and
> reduce performance (if they are scheduled to run on the same core). In
> addition, a shared thread can decide when it should stop processing a
> given queue and switch to another queue, because by polling the thread
> knows when new requests were added to a queue (this is what we called
> fine-grained I/O scheduling heuristics).
>
> So it seems like polling makes more sense when you serve multiple queues
> with the same thread.
>
> Abel.

A combination might bring gains in more workloads, but it should work
on its own too. It's quite possible that only a single VM is active and
the others are idle. So either polling should handle that well, or it
should be smart enough to turn itself off in this case.

> >
> > Thanks again,
> >
> > --
> > MST