On Tue, 2010-11-09 at 18:15 +0200, Michael S. Tsirkin wrote:
> On Tue, Nov 09, 2010 at 08:47:00AM -0700, Alex Williamson wrote:
> > > > But it could.  What if ivshmem is acting in a peer role, but has no
> > > > clients, could it migrate?  What if ivshmem is migratable when the
> > > > migration begins, but while the migration continues, a connection is
> > > > setup and it becomes unmigratable.
> > >
> > > Sounds like something we should work to prevent, not support :)
> >
> > s/:)/:(/  why?
>
> It will just confuse everyone. Also if it happens after sending
> all of memory, it's pretty painful.

It happens after sending all of memory with no_migrate too, and I think
pushing that check earlier might introduce some races around when
register_device_unmigratable() can be called.

> > > > Using this series, ivshmem would
> > > > have multiple options for how to support this.  It could a) NAK the
> > > > migration, b) drop connections and prevent new connections until the
> > > > migration finishes, or c) detect that new connections have happened
> > > > since the migration started and cancel.  And probably more.  no_migrate
> > > > can only do a).  And in fact, we can only test no_migrate after the VM
> > > > is stopped (after all memory is migrated) because otherwise it could
> > > > race with devices setting no_migrate during migration.
> > >
> > > We really want no_migrate to be static.  Changing it is abusing
> > > the infrastructure.
> >
> > You call it abusing, I call it making use of the infrastructure.  Why
> > unnecessarily restrict ourselves?  Is returning 0/-1 really that scary,
> > unmaintainable, undebuggable?  I don't understand the resistance.
> >
> > Alex
>
> Management really does not know how to handle unexpected
> migration failures.  They must be avoided.
>
> There are some very special cases that fail migration.  They are
> currently easy to find with grep register_device_unmigratable.
> I prefer to keep it that way.
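For illustration, the kind of per-device "return 0/-1" hook being argued
about could look something like the sketch below.  All names here
(IvshmemState, ivshmem_pre_save, ivshmem_post_copy_check) are made up
for the example; this is not qemu's actual savevm interface, just a
model of options a) and c) above:

```c
#include <stdio.h>

/* Hypothetical device state: a peer-mode ivshmem tracking its
 * shared-memory connections.  Field names are invented here. */
typedef struct IvshmemState {
    int nr_peers;        /* currently connected peers */
    int peers_at_start;  /* snapshot taken when migration began */
} IvshmemState;

/* Option a): NAK the migration outright while peers are connected.
 * Returning -1 fails the save cleanly instead of relying on a static
 * no_migrate flag checked only after memory has been sent. */
static int ivshmem_pre_save(IvshmemState *s)
{
    if (s->nr_peers > 0) {
        fprintf(stderr, "ivshmem: %d peer(s) connected, NAKing migration\n",
                s->nr_peers);
        return -1;
    }
    s->peers_at_start = s->nr_peers;
    return 0;
}

/* Option c): after the copy phase, cancel if a connection appeared
 * while the migration was in flight. */
static int ivshmem_post_copy_check(IvshmemState *s)
{
    return s->nr_peers > s->peers_at_start ? -1 : 0;
}
```

The point of the sketch is only that a dynamic error return lets the
device choose its policy at each phase, where a static no_migrate flag
can express nothing but option a).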
How can management tools be improved to better handle unexpected
migration failures when the only way for qemu to fail is an abort?  We
need the infrastructure to at least return an error first.  Do we just
need to add some fprintfs to the save core to print the id string of
the device that failed to save?  I just can't buy "the code is easier
to grep" as an argument against adding better error handling to the
save code path.  Anyone else want to chime in?

Alex