Re: [PATCH 0/6] Save state error handling (kill off no_migrate)

"Michael S. Tsirkin" <mst@xxxxxxxxxx> · Tue, 9 Nov 2010 17:07:09 +0200

On Tue, Nov 09, 2010 at 07:58:23AM -0700, Alex Williamson wrote:
> On Tue, 2010-11-09 at 14:00 +0200, Michael S. Tsirkin wrote:
> > On Mon, Nov 08, 2010 at 02:23:37PM -0700, Alex Williamson wrote:
> > > On Mon, 2010-11-08 at 22:59 +0200, Michael S. Tsirkin wrote:
> > > > On Mon, Nov 08, 2010 at 10:20:46AM -0700, Alex Williamson wrote:
> > > > > On Mon, 2010-11-08 at 18:54 +0200, Michael S. Tsirkin wrote:
> > > > > > On Mon, Nov 08, 2010 at 07:59:57AM -0700, Alex Williamson wrote:
> > > > > > > On Mon, 2010-11-08 at 13:40 +0200, Michael S. Tsirkin wrote:
> > > > > > > > On Wed, Oct 06, 2010 at 02:58:57PM -0600, Alex Williamson wrote:
> > > > > > > > > Our code paths for saving or migrating a VM are full of functions that
> > > > > > > > > return void, leaving no opportunity for a device to cancel a migration,
> > > > > > > > > either from error or incompatibility.  The ivshmem driver attempted to
> > > > > > > > > solve this with a no_migrate flag on the save state entry.  I think the
> > > > > > > > > more generic and flexible way to solve this is to allow driver save
> > > > > > > > > functions to fail.  This series implements that and converts ivshmem
> > > > > > > > > to use a set_params function to NAK migration much earlier in the
> > > > > > > > > process.  This touches a lot of files, but the bulk of those changes are
> > > > > > > > > simply s/void/int/ and tacking a "return 0" to the end of functions.
> > > > > > > > > Thanks,
> > > > > > > > > 
> > > > > > > > > Alex
> > > > > > > > 
> > > > > > > > Well error handling is always tricky: it seems easier to
> > > > > > > > require save handlers to never fail.
> > > > > > > 
> > > > > > > Sure it's easier, but does that make it robust?
> > > > > > 
> > > > > > More robust in the face of what kind of failure?
> > > > > 
> > > > > I really don't understand why we're having a discussion about whether
> > > > > providing a means to return an error is a good thing or not.  These
> > > > > patches touch a lot of files, but the change is dead simple.
> > > > 
> > > > I just don't see the motivation. Presumably your patches are
> > > > there to achieve some kind of goal, right? I am trying to
> > > > figure out what that goal is.
> > > 
> > > My goal is that I want to be able to NAK a migration when devices are
> > > assigned, and I think we can do it more generically than the no_migrate
> > > flag so that it supports this application and any other reason that
> > > saves might fail in the future.
> > 
> > More generically but harder to understand and debug, IMO.
> 
> How is returning an error condition hard to understand?  Debugging seems
> easier to me, especially if drivers follow the precedent set in the last
> patch and fprintf the reason for the failure.  Ideally this would be
> some kind of push out to qmp, but it still seems easier than figuring
> out which driver called register_device_unmigratable().
> 
> > > > Currently savevm callbacks never fail. So they
> > > > return void. Why is returning 0 and adding a bunch of code to test the
> > > > condition that never happens a good idea?  It just seems to create more
> > > > ways for devices to shoot themselves in the foot.
> > > 
> > > And more ways to indicate something bad happened and keep running.  We
> > > already have far too many abort() calls in the code.
> > 
> > If you can keep running why can't you migrate?
> 
> Well, as you know device assignment is tied to the hardware, so can't
> migrate, but can always keep running.  The ivshmem driver has a peer
> role, where it's tied to the host memory, so can't migrate, but can keep
> running.

Right. All of these are covered well enough by the no_migrate flag.
Their inability to migrate does not change at runtime.
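
To make the point concrete, here is a minimal sketch of what a fixed
flag buys us. This is not the actual savevm code: DemoStateEntry and
demo_find_unmigratable() are hypothetical names used only for
illustration. The flag is set once at registration, so the check is a
plain scan that does not need to touch device state.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a save state entry; not the real QEMU struct. */
typedef struct DemoStateEntry {
    const char *idstr;  /* device name, useful for error reporting */
    bool no_migrate;    /* set at registration, never changes at runtime */
} DemoStateEntry;

/*
 * Return the first entry that blocks migration, or NULL if none does.
 * Nothing here calls into a device, so the scan can run at any time.
 */
static const DemoStateEntry *
demo_find_unmigratable(const DemoStateEntry *entries, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (entries[i].no_migrate) {
            return &entries[i];
        }
    }
    return NULL;
}
```

A caller could run this scan before starting the migration and report
the idstr of the entry it returns.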

> > > > > > > > So there's a bunch of code here but what exactly is the benefit?
> > > > > > > > Since save handlers have no idea what the remote does,
> > > > > > > > what is the compatibility you mention?
> > > > > > > 
> > > > > > > There are two users I currently have in mind.  ivshmem currently makes
> > > > > > > use of the register_device_unmigratable() because it makes use of host
> > > > > > > specific resources and connections (aiui).  This sets the no_migrate
> > > > > > > flag, which is not dynamic and a bit of a band-aid.
> > > > > > >  The other is
> > > > > > > device assignment, which needs a way to NAK a migration since physical
> > > > > > > devices are never migratable.
> > > > > > 
> > > > > > Well since all these can't be migrated ever, a fixed property actually seems
> > > > > > a good match.  Sure it's not dynamic but all the easier to debug.
> > > > > > 
> > > > > > >  I imagine we could at some point have
> > > > > > > devices with state tied to other features that can't always be detached
> > > > > > > from the host, this tries to provide the infrastructure for that to
> > > > > > > happen.
> > > > > > > 
> > > > > > > Alex
> > > > > > 
> > > > > > Let guest control whether you can migrate?
> > > > > > Sounds like something that is more likely to be abused
> > > > > > than used constructively. 
> > > > > 
> > > > > s/guest/device/  So you would rather the migration failed on the
> > > > > incoming side where it may not be detected
> > > > 
> > > > And incoming migration handlers *must* validate the input, anyway.
> > > > We should not plaster over this with checks on outgoing side.
> > > 
> > > I'm not in any way suggesting incoming shouldn't do validation.
> > 
> > So that's enough to detect the problem.
> 
> No.  Let's say I have a migration source with an assigned device
> (rombar=0 to even avoid ramblock migration issues), the migration target
> is identical except it doesn't include the assigned device.  pci-assign
> on the source can't NAK a migration because save doesn't currently allow
> error returns.  The target doesn't even have the driver loaded, so
> there's nothing to NAK the load... migration happens and the device
> disappeared, wait for crash.  Maybe we could assume that the user did
> something sane and used pci-assign on the target to match the source,
> then we could NAK the load, but only after we wait for the entire memory
> state of the guest to be transferred.

So set the no_migrate flag and that should be enough.

> > > > > or it may be detected too
> > > > > late to stop the migration?
> > > > > 
> > > > > Alex
> > > > 
> > > > So there's a bug and device is in an unexpected state.
> > > > What can we do? Assert, print an error, notify guest - all these
> > > > come to mind. But stop migration? Seems arbitrary.
> > > 
> > > Perhaps the problem is that either an assert or an fprintf are the first
> > > things that come to mind.  We shouldn't have guests randomly blowing up
> > > or telling users to go scan through their log files to find errors.
> > > It's not very hard to allow simple error handling, so why shouldn't our
> > > first plan of attack be to return an error so that the human/qmp monitor
> > > can detect it and inform the user.  For the current candidates for this
> > > interface, there's no point notifying the guest, it's the interface
> > > attempting to do the migration that needs to know there's something
> > > blocking it.
> > > 
> > > Alex
> > 
> > I still don't understand, I am sorry.  When will migration fail?
> > Assigned devices always fail migration so it's not a good example.
> 
> Seems like the perfect example, especially in the scenario above where
> load failure is insufficient.  This is why the no_migrate flag was
> introduced and why ivshmem makes use of it today.  This series starts
> from the assumption that we need a way to NAK a migration, can we do it
> better than the no_migrate flag, generically, and as early as possible
> in the process.
> 
> Alex

no_migrate seems better in that we can check it at any point,
unlike tying the check to the save callback, which can only be invoked
with the VM stopped.
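
For comparison, a rough sketch of the save-callback variant as I
understand the proposal. It is not taken from the patches:
DemoSaveHandler, demo_save_all() and the two sample handlers are
hypothetical. The error is only discovered once the save loop actually
runs, which is the point where the VM has to be stopped.

```c
#include <stddef.h>

/* Hypothetical save callback that may fail: 0 on success, negative on error. */
typedef int (*DemoSaveFn)(void *opaque);

typedef struct DemoSaveHandler {
    const char *idstr;  /* device name, reported on failure */
    DemoSaveFn save;
    void *opaque;
} DemoSaveHandler;

/*
 * Call each handler in order and stop at the first failure.
 * On failure, *failed (if non-NULL) names the device that refused.
 * On success, *failed is set to NULL and 0 is returned.
 */
static int demo_save_all(const DemoSaveHandler *h, size_t n,
                         const char **failed)
{
    for (size_t i = 0; i < n; i++) {
        int ret = h[i].save(h[i].opaque);
        if (ret < 0) {
            if (failed) {
                *failed = h[i].idstr;
            }
            return ret;
        }
    }
    if (failed) {
        *failed = NULL;
    }
    return 0;
}

/* Sample handlers: one that always succeeds, one that always NAKs. */
static int demo_save_ok(void *opaque) { (void)opaque; return 0; }
static int demo_save_nak(void *opaque) { (void)opaque; return -1; }
```

Handlers after the failing one are never called, so the caller learns
about the refusal only after earlier devices have already been saved.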
