On Wed, Aug 11, 2010 at 05:59:55PM +0100, Daniel P. Berrange wrote:
> On Tue, Aug 10, 2010 at 12:44:06PM -0400, David Teigland wrote:
> > Hi,
> >
> > We've been working on a program called sync_manager that implements
> > shared-storage-based leases to protect shared resources. One way we'd like
> > to use it is to protect vm images that reside on shared storage,
> > i.e. preventing two vm's on two hosts from using the same image at once.
>
> There's two different, but related problems here:
>
>  - Preventing 2 different VMs using the same disk
>  - Preventing the same VM running on 2 hosts at once
>
> The first requires that there is a lease per configured disk (since
> a guest can have multiple disks). The latter requires a lease per
> VM and can ignore specifics of what disks are configured.
>
> IIUC, sync-manager is aiming for the latter.

The present integration effort is aiming for the latter. sync_manager
itself aims to be agnostic about what it's managing.

> > It's functional, and the next big step is using it through libvirt.
> >
> > sync_manager "wraps" a process, i.e. acquires a lease, forks&execs a
> > process, renews the lease while the process runs, and releases the lease
> > when the process exits. While the process runs, it has exclusive access
> > to whatever resource was named in the lease that was acquired.
>
> There are complications around migration we need to consider too.
> During migration, you actually need QEMU running on two hosts at
> once. IIRC the idea is that before starting the migration operation,
> we'd have to tell sync-manager to mark the lease as shared with a
> specific host. The destination QEMU would have to startup in shared
> mode, and upgrade this to an exclusive lock when migration completes,
> or quit when migration fails.

sync_manager leases can only be exclusive, so it's a matter of
transferring ownership of the exclusive lock from source host to
destination host.
We have not yet added lease transfer capabilities to sync_manager, but
it might look something like this:

S = source host, sm-S = sync_manager on S, ...
D = destination host, sm-D = sync_manager on D, ...

1. sm-S holds the lease, and is monitoring qemu
2. migration begins from S to D
3. libvirt-D runs sm-D: sync_manager -c qemu with the addition of a new
   sync_manager option --receive-lease
4. sm-D writes its hostid D to the lease area signaling sm-S that it wants
   to be the lease owner when S is done with it
5. sm-D begins monitoring the lease owner on disk (which is still S)
6. sm-D forks qemu-D
7. sm-S sees that D wants the lease
8. qemu-S exits with success
9. sm-S sees qemu-S exit with success
10. sm-S writes D as the lease owner into the lease area and exits
    (in the non-migration/transfer case, sm-S writes owner=LEASE_FREE)
11. sm-D (still monitoring the lease owner) sees that it has become the
    owner, and begins renewing the lease
12. qemu-D runs fully

I don't know enough (anything) about qemu migration yet to say if those
steps work correctly or safely. One concern is that qemu-D should not
enter a state where it can write until we are certain that D has been
written as the lease's owner.

> > sync_manager daemon -i <host_id> -n <vm_id> -l <lease> -c <command> <args>
> >
> > <lease> defines the shared storage area that sync_manager should
> > use for performing the disk-paxos based synchronization.
> > It consists of <resource_name>:<path>:<offset>, where
> > <resource_name> is likely to be the vm name/uuid (or the
> > name of the vm's disk image), and <path>:<offset> is an
> > area of shared storage that has been allocated for
> > sync_manager to use (a separate area for each resource/vm).
>
> Can you give some real examples of the lease arg ? I guess <path> must
> exclude the ':' character, or have some defined escaping scheme.
-l vm0:/dev/vg/lease_area:0

(exclude : from paths)

Manually setting up, initializing and keeping track of lease areas would
be a pain, so we'll definitely be looking at adding that to higher level
tools.

> The <host_id> obviously needs to be in /etc/libvirt/sync-manager.conf
> since that's a per-host config. I assume the shared storage area is per
> host too ?
>
> That leaves just the VM name/uuid as a per-VM config option, and we
> obviously already have that in XML. Is there actually any extra
> attribute we need to track per-guest in the XML ? If not this will
> simplify life, because we won't have to track sync-manager specific
> attributes

With the plugin style hooks you describe below, it seems all the
sync_manager config could be kept separate from the libvirt config.

> In terms of integration with libvirt, I think it is desirable that we keep
> libvirt and sync-manager loosely coupled. ie We don't want to hardcode
> libvirt using sync-manager, nor do we want to hardcode sync-manager only
> working with libvirt.
>
> This says to me that we need to provide a well defined plugin system for
> providing a 'supervisor process' for QEMU guests. Essentially a dlopen()
> module that provides a handful (< 10) callbacks which are triggered in
> appropriate codepaths. At minimum I expect we need
>
>  - A callback at ARGV building, to let extra sync-manager ARGV be injected
>  - A callback at VM startup. Not needed for sync-manager, but to allow for
>    alternate impls that aren't based around supervising.
>  - A callback at VM shutdown. Just to cleanup resources
>  - A callback in the VM destroy method, in case we need to do something
>    different other than just kill($PID) the QEMU $PID. (eg to perhaps
>    tell sync-manager to kill QEMU instead of killing it ourselves)
>  - Several callbacks at various stages of migration to deal with
>    lock downgrade/upgrade

sounds good

> The one further complication is with the security drivers.
> IIUC, we will absolutely not want QEMU to have any access to the shared
> storage lease area. The problem is that if we just inject the wrapper
> process as is, sync-manager will end up running with exact same
> privileges as QEMU. ie same UID:GID, and same selinux context. I'm really
> not at all sure how to deal with this problem, because our core design is
> that the thing we spawn inherits the privileges we setup at fork() time.
> We don't want to delegate the security setup to sync-manager, because it
> introduces a huge variable condition in the security system. We need
> guaranteed consistent security setup for QEMU, regardless of supervisor
> process in use.

It might not be a big problem for qemu to write to its own lease area,
but writing to another's probably would (e.g. at a different offset on
the same lv). That implies a separate lease lv per qemu; I'll have to
find out how close that gets to lvm scalability limits.

Dave
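
P.S. To make the ordering concern in the proposed transfer steps concrete, here is a rough Python sketch of the handoff, using an in-memory object as a stand-in for the on-disk lease area. Everything in it is hypothetical: LeaseArea, request_transfer, release and may_write are made-up names for illustration, not sync_manager interfaces, and a real implementation would go through disk-paxos style reads and writes on shared storage rather than a Python object.

```python
from dataclasses import dataclass
from typing import Optional

LEASE_FREE = None  # hypothetical marker meaning "no owner"


@dataclass
class LeaseArea:
    """In-memory stand-in for the on-disk lease area (an assumption)."""
    owner: Optional[str] = LEASE_FREE
    wants: Optional[str] = None  # host id that asked to receive the lease


def request_transfer(lease: LeaseArea, dest: str) -> None:
    # Step 4: sm-D records that it wants the lease once S is done with it.
    lease.wants = dest


def release(lease: LeaseArea, holder: str) -> None:
    # Step 10: after a clean qemu exit, sm-S hands the lease to the waiting
    # host, or writes LEASE_FREE when nobody asked for it.
    if lease.owner != holder:
        raise RuntimeError(f"{holder} does not hold the lease")
    lease.owner = lease.wants if lease.wants is not None else LEASE_FREE
    lease.wants = None


def may_write(lease: LeaseArea, host: str) -> bool:
    # Safety condition: a host may write only while it is the recorded owner.
    return lease.owner == host


lease = LeaseArea(owner="S")      # step 1: sm-S holds the lease
request_transfer(lease, "D")      # steps 3-5: sm-D asks, then watches the owner
d_before = may_write(lease, "D")  # qemu-D must not be able to write yet
release(lease, "S")               # steps 8-10: qemu-S exits, sm-S writes owner=D
d_after = may_write(lease, "D")   # step 11: sm-D sees it has become the owner
s_after = may_write(lease, "S")
print(d_before, d_after, s_after)  # prints: False True False
```

The point of may_write is only that qemu-D's ability to write is gated on the owner field already naming D. How qemu would actually be held back until then is exactly the open question about migration.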