Stefan Hajnoczi <stefanha@xxxxxxxxx> wrote on 27/11/2013 05:00:53 PM:

> From: Stefan Hajnoczi <stefanha@xxxxxxxxx>
> To: Joel Nider/Haifa/IBM@IBMIL,
> Cc: "Michael S. Tsirkin" <mst@xxxxxxxxxx>, Abel Gordon/Haifa/IBM@IBMIL,
> abel.gordon@xxxxxxxxx, Anthony Liguori <anthony@xxxxxxxxxxxxx>,
> asias@xxxxxxxxxx, digitaleric@xxxxxxxxxx, Eran Raichstein/Haifa/IBM@IBMIL,
> gleb@xxxxxxxxxx, jasowang@xxxxxxxxxx, kvm@xxxxxxxxxxxxxxx,
> pbonzini@xxxxxxxxxx, Razya Ladelsky/Haifa/IBM@IBMIL
> Date: 27/11/2013 05:00 PM
> Subject: Re: Elvis upstreaming plan
>
> On Wed, Nov 27, 2013 at 09:43:33AM +0200, Joel Nider wrote:
> > Hi,
> >
> > Razya is out for a few days, so I will try to answer the questions as
> > well as I can:
> >
> > "Michael S. Tsirkin" <mst@xxxxxxxxxx> wrote on 26/11/2013 11:11:57 PM:
> >
> > > From: "Michael S. Tsirkin" <mst@xxxxxxxxxx>
> > > To: Abel Gordon/Haifa/IBM@IBMIL,
> > > Cc: Anthony Liguori <anthony@xxxxxxxxxxxxx>, abel.gordon@xxxxxxxxx,
> > > asias@xxxxxxxxxx, digitaleric@xxxxxxxxxx, Eran Raichstein/Haifa/IBM@IBMIL,
> > > gleb@xxxxxxxxxx, jasowang@xxxxxxxxxx, Joel Nider/Haifa/IBM@IBMIL,
> > > kvm@xxxxxxxxxxxxxxx, pbonzini@xxxxxxxxxx, Razya Ladelsky/Haifa/IBM@IBMIL
> > > Date: 27/11/2013 01:08 AM
> > > Subject: Re: Elvis upstreaming plan
> > >
> > > On Tue, Nov 26, 2013 at 08:53:47PM +0200, Abel Gordon wrote:
> > > >
> > > > Anthony Liguori <anthony@xxxxxxxxxxxxx> wrote on 26/11/2013 08:05:00 PM:
> > > > >
> > > > > Razya Ladelsky <RAZYA@xxxxxxxxxx> writes:
> > > > >
> > > > <edit>
> > > >
> > > > That's why we are proposing to implement a mechanism that will enable
> > > > the management stack to configure 1 thread per I/O device (as it is
> > > > today) or 1 thread for many I/O devices (belonging to the same VM).
> > > >
> > > > > Once you are scheduling multiple guests in a single vhost device,
> > > > > you now create a whole new class of DoS attacks in the best case
> > > > > scenario.
> > > >
> > > > Again, we are NOT proposing to schedule multiple guests in a single
> > > > vhost thread. We are proposing to schedule multiple devices belonging
> > > > to the same guest in a single vhost thread (or multiple vhost threads).
> > >
> > > I guess a question then becomes why have multiple devices?
> >
> > If you mean "why serve multiple devices from a single thread", the
> > answer is that we cannot rely on the Linux scheduler, which has no
> > knowledge of I/O queues, to do a decent job of scheduling I/O. The idea
> > is to take over the I/O scheduling responsibilities from the kernel's
> > thread scheduler with a more efficient I/O scheduler inside each vhost
> > thread. Combining all of the I/O devices from the same guest (disks,
> > network cards, etc.) in a single I/O thread allows us to provide better
> > scheduling because we have more knowledge of the nature of the work.
> > Instead of relying on the Linux scheduler to perform context switches
> > between multiple vhost threads, we have a single thread context in
> > which we can do the I/O scheduling more efficiently. We can closely
> > monitor the performance needs of each queue of each device inside the
> > vhost thread, which gives us much more information than the kernel's
> > thread scheduler has.
>
> And now there are 2 performance-critical pieces that need to be
> optimized/tuned instead of just 1:
>
> 1. The kernel infrastructure that QEMU and vhost use today but that you
> decided to bypass.
> 2. The new ELVIS code, which only affects vhost devices in the same VM.
>
> If you split the code paths it results in more effort in the long run
> and the benefit seems quite limited once you acknowledge that isolation
> is important.

Yes, you are correct that there are now two performance-critical pieces
of code. However, what we are proposing is just proper module
decoupling. I believe you will be hard pressed to make a good case that
all of this logic could be integrated into the Linux thread scheduler
more efficiently. Think of this as an I/O scheduler for virtualized
guests. I don't believe anyone would try to integrate the Linux I/O
schedulers into the Linux thread scheduler, even though they are both
performance-critical modules. Even if we were to take the route of using
these principles to improve the existing scheduler, I have to ask: which
scheduler? If we spent this effort on CFS (the Completely Fair
Scheduler) and someone then switched their thread scheduler to O(1) or
some other scheduler, all of our advantage would be lost. We would then
have to reimplement the logic for every possible thread scheduler.
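To make the "I/O scheduler for virtualized guests" idea a bit more
concrete, here is a rough toy sketch of the kind of loop I mean. To be
clear, this is not the ELVIS patch set and not vhost code; the structure
names, the policy and the numbers are made up purely for illustration.
The point is only that one worker thread owns all of a guest's
virtqueues and decides locally which one to service next:

/*
 * Toy userspace model: one worker thread serves every virtqueue of one
 * guest and picks the next queue to service itself, instead of leaving
 * that decision to the kernel's thread scheduler.  Illustration only.
 */
#include <stdio.h>

#define MAX_QUEUES 4
#define BATCH      8            /* requests handled before re-evaluating */

struct vq {
    const char *name;           /* e.g. "net-tx", "net-rx", "blk0" */
    int pending;                /* requests the guest has queued */
};

/* Trivial policy: pick the queue with the most pending work. */
static struct vq *pick_next(struct vq *qs, int n)
{
    struct vq *best = NULL;
    for (int i = 0; i < n; i++)
        if (qs[i].pending > 0 && (!best || qs[i].pending > best->pending))
            best = &qs[i];
    return best;
}

/* One guest's worker loop: all devices, one thread, local scheduling. */
static void worker(struct vq *qs, int n)
{
    struct vq *q;
    while ((q = pick_next(qs, n)) != NULL) {
        int todo = q->pending < BATCH ? q->pending : BATCH;
        printf("servicing %-6s: %d request(s)\n", q->name, todo);
        q->pending -= todo;     /* "handle" a batch, then re-evaluate */
    }
}

int main(void)
{
    struct vq qs[MAX_QUEUES] = {
        { "net-tx", 20 }, { "net-rx", 3 }, { "blk0", 11 }, { "blk1", 0 },
    };
    worker(qs, MAX_QUEUES);
    return 0;
}

The real policy obviously has to worry about latency, fairness between
devices and batching against notification cost, but the decision is
taken inside one thread with full knowledge of the queues, rather than
by the thread scheduler.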
I don't agree that we are losing isolation, even if you go with the
"full ELVIS" that was originally proposed. But that is a discussion for
another day. For now, let's agree that in this "reduced ELVIS" solution
no isolation is lost, since each vhost thread is only dealing with I/O
from the same guest.

As for "more effort" - effort for whom? Development time? Maintenance
effort? CPU time? I would say all of those are actually lower in the
long run. Dividing responsibility between modules with well-defined
interfaces reduces both development and maintenance effort. If we were
to modify the thread scheduler, we would introduce many corner cases
and interactions that would take time to work out. By moving the
responsibility into a separate module, we avoid having to modify a very
critical, central piece of code to add functionality for a special
case. It also reduces CPU time, since there are fewer threads to
schedule and the scheduling algorithm itself does not become more
complicated by having to track I/O queue lengths, waiting times,
priorities and so on. In the optimal case, the vhost threads would run
on dedicated cores with little or no contention, rather than being
interleaved with VCPU threads or other Linux process threads.

Joel

> Isn't the sane thing to do to take the lessons from ELVIS and improve
> the existing pieces instead of bypassing them? That way both single-VM
> and host-wide performance improve. And as a bonus, non-virtualization
> use cases may also benefit.
>
> Stefan
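P.S. To make the "dedicated cores" point above less hand-wavy, this is
roughly what it means operationally: pin the vhost worker thread to a
core that the VCPU threads are kept off. The snippet below is only an
illustration and is not from our setup; it takes a thread id and a core
number on the command line and uses the standard sched_setaffinity(2)
call:

/*
 * Sketch: pin a thread (e.g. a "vhost-<pid>" worker) to one core.
 * Usage: ./pin <tid> <core>   (both values are placeholders here).
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <tid> <core>\n", argv[0]);
        return 1;
    }
    pid_t tid = (pid_t)atoi(argv[1]);   /* tid of the worker thread */
    int core = atoi(argv[2]);           /* core reserved for I/O work */

    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(core, &mask);

    /* Restrict the thread to exactly this one CPU. */
    if (sched_setaffinity(tid, sizeof(mask), &mask) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    printf("pinned tid %d to core %d\n", (int)tid, core);
    return 0;
}

The same effect can of course be had with taskset or a cpuset; the code
is only meant to show that nothing exotic is required.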