Stefan Hajnoczi <stefanha@xxxxxxxxx> wrote on 27/11/2013 05:00:53 PM:

> From: Stefan Hajnoczi <stefanha@xxxxxxxxx>
> To: Joel Nider/Haifa/IBM@IBMIL,
> Cc: "Michael S. Tsirkin" <mst@xxxxxxxxxx>, Abel Gordon/Haifa/IBM@IBMIL,
> abel.gordon@xxxxxxxxx, Anthony Liguori <anthony@xxxxxxxxxxxxx>,
> asias@xxxxxxxxxx, digitaleric@xxxxxxxxxx, Eran Raichstein/Haifa/IBM@IBMIL,
> gleb@xxxxxxxxxx, jasowang@xxxxxxxxxx, kvm@xxxxxxxxxxxxxxx,
> pbonzini@xxxxxxxxxx, Razya Ladelsky/Haifa/IBM@IBMIL
> Date: 27/11/2013 05:00 PM
> Subject: Re: Elvis upstreaming plan
>
> On Wed, Nov 27, 2013 at 09:43:33AM +0200, Joel Nider wrote:
> > Hi,
> >
> > Razya is out for a few days, so I will try to answer the questions as
> > well as I can:
> >
> > "Michael S. Tsirkin" <mst@xxxxxxxxxx> wrote on 26/11/2013 11:11:57 PM:
> >
> > > From: "Michael S. Tsirkin" <mst@xxxxxxxxxx>
> > > To: Abel Gordon/Haifa/IBM@IBMIL,
> > > Cc: Anthony Liguori <anthony@xxxxxxxxxxxxx>, abel.gordon@xxxxxxxxx,
> > > asias@xxxxxxxxxx, digitaleric@xxxxxxxxxx, Eran Raichstein/Haifa/IBM@IBMIL,
> > > gleb@xxxxxxxxxx, jasowang@xxxxxxxxxx, Joel Nider/Haifa/IBM@IBMIL,
> > > kvm@xxxxxxxxxxxxxxx, pbonzini@xxxxxxxxxx, Razya Ladelsky/Haifa/IBM@IBMIL
> > > Date: 27/11/2013 01:08 AM
> > > Subject: Re: Elvis upstreaming plan
> > >
> > > On Tue, Nov 26, 2013 at 08:53:47PM +0200, Abel Gordon wrote:
> > > >
> > > > Anthony Liguori <anthony@xxxxxxxxxxxxx> wrote on 26/11/2013 08:05:00 PM:
> > > > >
> > > > > Razya Ladelsky <RAZYA@xxxxxxxxxx> writes:
> > > > >
> > > > <edit>
> > > >
> > > > That's why we are proposing to implement a mechanism that will enable
> > > > the management stack to configure 1 thread per I/O device (as it is
> > > > today) or 1 thread for many I/O devices (belonging to the same VM).
> > > >
> > > > > Once you are scheduling multiple guests in a single vhost device,
> > > > > you now create a whole new class of DoS attacks in the best case
> > > > > scenario.
> > > >
> > > > Again, we are NOT proposing to schedule multiple guests in a single
> > > > vhost thread. We are proposing to schedule multiple devices belonging
> > > > to the same guest in a single vhost thread (or multiple vhost threads).
> > >
> > > I guess a question then becomes why have multiple devices?
> >
> > If you mean "why serve multiple devices from a single thread", the
> > answer is that we cannot rely on the Linux scheduler, which has no
> > knowledge of I/O queues, to do a decent job of scheduling I/O. The idea
> > is to take over the I/O scheduling responsibilities from the kernel's
> > thread scheduler with a more efficient I/O scheduler inside each vhost
> > thread. Combining all of the I/O devices from the same guest (disks,
> > network cards, etc.) in a single I/O thread allows us to provide better
> > scheduling because we have more knowledge of the nature of the work.
> > Instead of relying on the Linux scheduler to perform context switches
> > between multiple vhost threads, we have a single thread context in
> > which we can do the I/O scheduling more efficiently. We can closely
> > monitor the performance needs of each queue of each device inside the
> > vhost thread, which gives us much more information than the kernel's
> > thread scheduler has.
>
> And now there are 2 performance-critical pieces that need to be
> optimized/tuned instead of just 1:
>
> 1. The kernel infrastructure that QEMU and vhost use today but that you
> decided to bypass.
> 2. The new ELVIS code, which only affects vhost devices in the same VM.
>
> If you split the code paths it results in more effort in the long run
> and the benefit seems quite limited once you acknowledge that isolation
> is important.

Yes, you are correct that there are now two performance-critical pieces
of code. However, what we are proposing is just proper module
decoupling. I believe you will be hard pressed to make a good case that
all of this logic could be integrated into the Linux thread scheduler
more efficiently. Think of this as an I/O scheduler for virtualized
guests. I don't believe anyone would try to integrate the Linux I/O
schedulers into the Linux thread scheduler, even though they are both
performance-critical modules. Even if we were to take the route of using
these principles to improve the existing scheduler, I have to ask: which
scheduler? If we spent this effort on CFS (the Completely Fair
Scheduler) and someone then switched their thread scheduler to O(1) or
some other scheduler, all of our advantage would be lost. We would then
have to reimplement the logic for every possible thread scheduler.
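To make the "I/O scheduler for virtualized guests" idea a bit more
concrete, here is a rough toy sketch of the kind of loop I mean. To be
clear, this is not the ELVIS patch set and not vhost code; the structure
names, the policy and the numbers are made up purely for illustration.
The point is only that one worker thread owns all of a guest's
virtqueues and decides locally which one to service next:

/*
 * Toy userspace model: one worker thread serves every virtqueue of one
 * guest and picks the next queue to service itself, instead of leaving
 * that decision to the kernel's thread scheduler.  Illustration only.
 */
#include <stdio.h>

#define MAX_QUEUES 4
#define BATCH      8            /* requests handled before re-evaluating */

struct vq {
    const char *name;           /* e.g. "net-tx", "net-rx", "blk0" */
    int pending;                /* requests the guest has queued */
};

/* Trivial policy: pick the queue with the most pending work. */
static struct vq *pick_next(struct vq *qs, int n)
{
    struct vq *best = NULL;
    for (int i = 0; i < n; i++)
        if (qs[i].pending > 0 && (!best || qs[i].pending > best->pending))
            best = &qs[i];
    return best;
}

/* One guest's worker loop: all devices, one thread, local scheduling. */
static void worker(struct vq *qs, int n)
{
    struct vq *q;
    while ((q = pick_next(qs, n)) != NULL) {
        int todo = q->pending < BATCH ? q->pending : BATCH;
        printf("servicing %-6s: %d request(s)\n", q->name, todo);
        q->pending -= todo;     /* "handle" a batch, then re-evaluate */
    }
}

int main(void)
{
    struct vq qs[MAX_QUEUES] = {
        { "net-tx", 20 }, { "net-rx", 3 }, { "blk0", 11 }, { "blk1", 0 },
    };
    worker(qs, MAX_QUEUES);
    return 0;
}

The real policy obviously has to worry about latency, fairness between
devices and batching against notification cost, but the decision is
taken inside one thread with full knowledge of the queues, rather than
by the thread scheduler.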
I don't agree that we are losing isolation, even if you go with the
"full ELVIS" that was originally proposed. But that is a discussion for
another day. For now, let's agree that in this "reduced ELVIS" solution
no isolation is lost, since each vhost thread is only dealing with I/O
from the same guest.

As for "more effort" - effort for whom? Development time? Maintenance
effort? CPU time? I would say all of those are actually lower in the
long run. Dividing responsibility between modules with well-defined
interfaces reduces both development and maintenance effort. If we were
to modify the thread scheduler, we would introduce many corner cases
and interactions that would take time to work out. By moving the
responsibility into a separate module, we avoid having to modify a very
critical, central piece of code to add functionality for a special
case. It also reduces CPU time, since there are fewer threads to
schedule and the scheduling algorithm itself does not become more
complicated by having to track I/O queue lengths, waiting times,
priorities and so on. In the optimal case, the vhost threads would run
on dedicated cores with little or no contention, rather than being
interleaved with VCPU threads or other Linux process threads.

Joel

> Isn't the sane thing to do to take the lessons from ELVIS and improve
> the existing pieces instead of bypassing them? That way both single-VM
> and host-wide performance improve. And as a bonus, non-virtualization
> use cases may also benefit.
>
> Stefan
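P.S. To make the "dedicated cores" point above less hand-wavy, this is
roughly what it means operationally: pin the vhost worker thread to a
core that the VCPU threads are kept off. The snippet below is only an
illustration and is not from our setup; it takes a thread id and a core
number on the command line and uses the standard sched_setaffinity(2)
call:

/*
 * Sketch: pin a thread (e.g. a "vhost-<pid>" worker) to one core.
 * Usage: ./pin <tid> <core>   (both values are placeholders here).
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <tid> <core>\n", argv[0]);
        return 1;
    }
    pid_t tid = (pid_t)atoi(argv[1]);   /* tid of the worker thread */
    int core = atoi(argv[2]);           /* core reserved for I/O work */

    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(core, &mask);

    /* Restrict the thread to exactly this one CPU. */
    if (sched_setaffinity(tid, sizeof(mask), &mask) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    printf("pinned tid %d to core %d\n", (int)tid, core);
    return 0;
}

The same effect can of course be had with taskset or a cpuset; the code
is only meant to show that nothing exotic is required.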