Re: [RFC] Proposed API to support block device streaming

On 11/15/2010 07:05 AM, Daniel P. Berrange wrote:
>>> Do these calls need to be run before the QEMU process is started,
>>> or after QEMU is already running ?
>>
>> Streaming requires a running domain and runs concurrently.
>
> What if you have a disk image and want to activate streaming
> without running a VM ?  e.g. so you can ensure the image is
> fully downloaded to the host and thus avoid a runtime problem
> which would result in an I/O error for the guest.

I hadn't considered offline streaming as a use-case. Is this more of a theoretical consideration or something you would like to see as part of the libvirt API?

I'm struggling to understand the usefulness of it. If you care about streaming offline, you can just do a normal image copy. It seems like this would really only apply to a use-case where you started out wanting online streaming, could not complete it, and then wanted to do offline streaming instead of resuming online streaming.

It doesn't seem that practical to me.

>>> If we're streaming the whole disk, is there a way to cancel/abort
>>> it early ?
>>
>> I was thinking of adding another mode flag for this:
>> VIR_STREAM_DISK_CANCEL
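
As a rough illustration of how a client might drive that, here's a minimal
sketch.  The virDomainStreamDisk() prototype and the VIR_STREAM_DISK_START
value are hypothetical placeholders made up for the example; only the
VIR_STREAM_DISK_CANCEL name comes from this thread.

/* Hedged sketch: cancelling an in-progress stream with the proposed flag.
 * virDomainStreamDisk() and VIR_STREAM_DISK_START are hypothetical
 * placeholders; only VIR_STREAM_DISK_CANCEL is named in this discussion. */
#include <libvirt/libvirt.h>
#include <stdio.h>

int virDomainStreamDisk(virDomainPtr dom, const char *path, unsigned int flags);

#define VIR_STREAM_DISK_START  1   /* hypothetical */
#define VIR_STREAM_DISK_CANCEL 2   /* flag proposed above */

static int abort_streaming(virDomainPtr dom, const char *disk)
{
    /* Ask the hypervisor to abort streaming of this disk early; the
     * copy-on-write image keeps its backing-file dependency. */
    if (virDomainStreamDisk(dom, disk, VIR_STREAM_DISK_CANCEL) < 0) {
        fprintf(stderr, "failed to cancel streaming of %s\n", disk);
        return -1;
    }
    return 0;
}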

>>> What happens if qemu-nbd dies before streaming is complete ?
>>
>> Bad things.  Same as if you deleted a qcow2 backing file.
>
> So a migration lifecycle based on this design has a pretty
> dangerous failure mode. The guest can lose access to the
> NBD server before the disk copy is complete, and we'd be
> unable to switch back to the original QEMU instance since
> the target has already started dirtying memory, which has
> invalidated the source.

Separate out the live migration use-case from the streaming use-case. This patch series is just about image streaming. Here's the expected use-case:

I'm a cloud provider and I want to deploy new guests rapidly based on template images. I want the deployed image to reside on local storage for the deployed node to avoid excessive network traffic (with high node density, the network becomes the bottleneck).

My options today are:

1) Copy the image to the new node. This incurs a huge upfront cost in time. In a cloud environment, rapid provisioning is very important, so this is a major issue.

2) Use shared storage for the template images and then create a copy-on-write image on local storage. This enables rapid provisioning but still uses the network for data reads. This also requires that the template images stay around forever or that you have complicated management support for tracking which template images are still in use.

With image streaming, you get rapid provisioning as in (2), but you also get to satisfy reads from local storage, eliminating pressure on the network. Since streaming gives you a deterministic period during which the copy-on-write image depends on the template image, it also simplifies template image tracking.
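
To make that workflow concrete, here's a minimal sketch of what a deployment
script might do.  The virDomainStreamDisk() prototype and VIR_STREAM_DISK_START
flag are hypothetical placeholders for the proposed API, and the image paths
and domain name are made up; the qemu-img command and the libvirt connection
calls around them are real.

/* Hedged sketch of the provisioning flow above. */
#include <libvirt/libvirt.h>
#include <stdio.h>
#include <stdlib.h>

int virDomainStreamDisk(virDomainPtr dom, const char *path, unsigned int flags);
#define VIR_STREAM_DISK_START 1   /* hypothetical */

int main(void)
{
    /* 1. Rapid provisioning: create a local copy-on-write overlay backed by
     *    the template image that lives on shared storage. */
    if (system("qemu-img create -f qcow2"
               " -b /mnt/templates/fedora.qcow2"
               " /var/lib/libvirt/images/guest01.qcow2") != 0)
        return EXIT_FAILURE;

    /* 2. The guest is defined with its disk pointing at the local overlay
     *    and started; look it up once it is running. */
    virConnectPtr conn = virConnectOpen("qemu:///system");
    if (conn == NULL)
        return EXIT_FAILURE;
    virDomainPtr dom = virDomainLookupByName(conn, "guest01");
    if (dom == NULL) {
        virConnectClose(conn);
        return EXIT_FAILURE;
    }

    /* 3. Kick off streaming so the overlay is populated from the template in
     *    the background; reads are served locally once the blocks arrive. */
    if (virDomainStreamDisk(dom, "/var/lib/libvirt/images/guest01.qcow2",
                            VIR_STREAM_DISK_START) < 0)
        fprintf(stderr, "failed to start streaming\n");

    virDomainFree(dom);
    virConnectClose(conn);
    return EXIT_SUCCESS;
}

Once streaming finishes, the overlay no longer depends on the template, so the
template can be retired without tracking which deployed guests still reference it.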

In terms of points of failure, image streaming is a bit better than (2): it has two points of failure (local storage plus the shared template) only for a deterministic period of time, rather than for the life of the guest.

>> This would be for the block-migration workflow...  I can't see any
>> particular problem with running qemu-nbd as a regular user.  That's how
>> I do it when testing.
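
For reference, an unprivileged test export might look roughly like the
following sketch; the image path and port are made up for the example, and
only the standard qemu-nbd read-only (-r), persistent (-t) and port (-p)
switches are used.

/* Hedged sketch: serving a template image over NBD as a regular user.
 * qemu-nbd only needs read access to the image file; no root required. */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    execlp("qemu-nbd", "qemu-nbd",
           "-r",          /* read-only: the shared template is never written */
           "-t",          /* persistent: keep serving across client reconnects */
           "-p", "10809", /* standard NBD port; >1024, so no privileges needed */
           "/home/user/images/template.qcow2",
           (char *)NULL);
    perror("execlp qemu-nbd");  /* reached only if exec fails */
    return 1;
}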
> These last few points are my biggest concern with the API. If we
> iteratively add a bunch of APIs for each piece of functionality
> involved here, then we'll end up with a migration lifecycle that
> requires the app to know about invoking tens of different API
> calls in a perfect sequence. This seems like a very complex and
> fragile design for apps to have to deal with.

Migration is a totally different API. This particular API is focused entirely on streaming. It shouldn't be recommended as a way to enable live migration (even though that's technically possible).

For live migration, I think we really have to look more carefully at the libvirt API. To support post-copy migration in a robust fashion, we need to figure out how we want to tunnel the traffic, provide an interface to select which devices to migrate, etc.

> If we want to be able to use this functionality without requiring
> apps to have a direct shell into the host, then we need a set of
> APIs for managing NBD server instances for migration, which is
> another level of complexity.

> A simpler architecture would be to have the NBD server embedded
> inside the source QEMU VM, and tunnel the NBD protocol over the
> existing migration socket. So QEMU would do a normal migration
> of RAM, and when that completes the source QEMU CPUs are stopped,
> but QEMU is left running to continue serving the disk data.
> This avoids any extra network connections, and avoids having to
> add any new APIs to manage NBD servers, and avoids all the
> security driver & lock manager integration problems that the latter
> will involve.  If it is critical to free up RAM on the source
> host, then the main VM RAM area can be munmap()d on the source
> once the main migration completes, since it's not required for the
> ongoing NBD data stream.  This kind of architecture means that
> apps would need near zero knowledge of disk streaming to make
> use of it. The existing virDomainMigrate() would be sufficient,
> with an extra flag to request post-migration streaming. There
> would still be a probable need for your suggested API to force
> immediate streaming of a disk, instead of relying on NBD, but
> most apps wouldn't have to care about that if they didn't want
> to.
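
As a sketch of the app-facing call being described there: virDomainMigrate()
and VIR_MIGRATE_LIVE are the real libvirt API, while the post-migration disk
streaming flag below is purely hypothetical, standing in for the "extra flag"
suggested above.

/* Hedged sketch: one migration call that also requests post-migration disk
 * streaming.  VIR_MIGRATE_POSTCOPY_DISK is a made-up flag illustrating the
 * "extra flag" idea; virDomainMigrate() and VIR_MIGRATE_LIVE are real. */
#include <libvirt/libvirt.h>
#include <stdio.h>

#define VIR_MIGRATE_POSTCOPY_DISK (1 << 16)   /* hypothetical, not in libvirt */

static int migrate_with_disk_streaming(virDomainPtr dom, virConnectPtr dest)
{
    virDomainPtr migrated;

    /* Live-migrate RAM; the hypothetical flag would tell the source QEMU to
     * keep serving disk blocks over the migration channel afterwards. */
    migrated = virDomainMigrate(dom, dest,
                                VIR_MIGRATE_LIVE | VIR_MIGRATE_POSTCOPY_DISK,
                                NULL, NULL, 0);
    if (migrated == NULL) {
        fprintf(stderr, "migration failed\n");
        return -1;
    }
    virDomainFree(migrated);
    return 0;
}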

> In summary though, I'm not inclined to proceed with adding ad-hoc
> APIs for disk streaming to libvirt, without fully considering
> the design of a full migration + disk streaming architecture.

Migration is an orthogonal discussion.

In the streaming model, the typical way to support a base image is not NBD but NFS.

Streaming is a very different type of functionality from migration, and trying to lump the two together would create an awful lot of user confusion, IMHO.

Regards,

Anthony Liguori

> Regards,
> Daniel

--
libvir-list mailing list
libvir-list@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/libvir-list

