On Tue, Nov 20, 2012 at 10:17:11AM +0000, Daniel P. Berrange wrote:
> On Mon, Nov 19, 2012 at 05:30:11PM +0800, Osier Yang wrote:
> > Hi,
> >
> > This proposal is trying to figure out a solution for migration
> > of domain which uses LUN behind vHBA as disk device (QEMU
> > emulated disk only at this stage). And other related NPIV
> > improvements which are not related with migration. I'm not
> > luck to get a environment to test if the thoughts are workable,
> > but I'd like see if guys have good idea/suggestions earlier.
> >
> > 1) Persistent vHBA support
> >
> > This is the useful stuff missed for long time. Assuming
> > that one created a vHBA, did masking/zoning, everything works
> > as expected. However, after a system rebooting, everything is
> > just lost. If the user wants to get things back, he has to
> > find out the previous WWNN & WWPN, and create the vHBA again.
> >
> > On the other hand, Persistent vHBA support is actually required
> > for domain which uses LUN behind a vHBA. Otherwise the domain
> > could fail to start after a system rebooting.
> >
> > To support the persistent vHBA, new APIs like virNodeDeviceDefineXML,
> > virNodeDeviceUndefine is required. Also it's useful to introduce
> > "autostart" for vHBA, so that the vHBA could be started automatically
> > after system rebooting.
> >
> > Proposed APIs:
> >
> > virNodeDevicePtr
> > virNodeDeviceDefineXML(virConnectPtr conn,
> >                        const char *xml,
> >                        unsigned int flags);
> >
> > int
> > virNodeDeviceUndefine(virConnectPtr conn,
> >                       virNodeDevicePtr dev,
> >                       unsigned int flags);
> >
> > int
> > virNodeDeviceSetAutostart(virNodeDevicePtr dev,
> >                           int autostart,
> >                           unsigned int flags);
> >
> > int
> > virNodeDeviceGetAutostart(virNodeDevicePtr dev,
> >                           int *autostart,
> >                           unsigned int flags);
>
> I don't really much like this approach. IMHO, this should
> all be done via the virStoragePool APIs instead.
> Adding define/undefine/autostart to virNodeDevice is really just
> duplicating the storage pool functionality.

I like the idea of making vHBAs persist as part of pools; how do you
envision it should work? Extend the scsi pools to take a vHBA
descriptor and then instantiate the vHBA as part of starting the
pool, or something else?

> > 2) Associate vHBA with domain XML
> >
> > There are two ways to attach a LUN to a domain: as an QEMU emulated
> > device; or passthrough. Since passthrough a LUN is not supported in
> > libvirt yet, let's focus on the emulated LUN at this stage.
> >
> > New attributes "wwnn" and "wwpn" are introduced to indicate the
> > LUN behind the vHBA. E.g.
> >
> > <disk type='block' device='disk'>
> >   <driver name='qemu' type='raw'/>
> >   <source wwnn="2001001b32a9da4e" wwpn="2101001b32a90004"/>
>
> If you change the schema of the <source> element, then you must
> also create a new type='XXX' attribute to identify it, not just
> re-use type='block'
>
> >   <target dev='vda' bus='virtio'/>
> >   <address type='pci' domain='0x0000' bus='0x00' slot='0x07'
> >            function='0x0'/>
> > </disk>
> >
> > Before the domain starting, we have to check if there is LUN
> > assigned to the vHBA, error out if not.
> >
> > Using the stable path of LUN also works, e.g.
> >
> > <source dev="/dev/disk/by-path/pci-0000\:00\:07.0-scsi-0\:0\:0\:0"/>
> >
> > But the disadvantage is the user have to figure out the stable
> > path himself; And we have to do checking of every stable path to
> > see if it's behind a vHBA in migration "Begin" stage. Or an new
> > XML tag for element "source" to indicate that it's behind a vHBA?
> > such as:
> >
> > <source dev="disk-by-path" model="vport"/>
>
> I don't much like the idea of mapping vHBA to <disk> elements,
> because you have a cardinality mis-match. A <disk> is equivalent
> of a single LUN, but a vHBA is something that provides multiple
> LUNs.
>
> If you want to directly associate a vHBA with a virtual guest,
> then this is really in the realm of SCSI HBA passthrough, not
> <disk> devices.
>
>
> If you want something mapped to the <disk> device, then the
> approach should be to map to a storage pool volume - something
> we've long talked about as broadly useful for all storage types,
> not just NPIV.

+1, we really should take this as an opportunity to add storage volumes
as <disk> devices.

> > 3) Migration with vHBA
> >
> > One possible solution for migration with vHBA is to use one pair
> > of WWNN & WWPN on source host, one is using for domain, one is
> > reserved for migration purpose. It requires the storage admin maps
> > the same LUN to the two vHBAs when doing the masking and zoning.
> >
> > One of the two vHBA is called "Primary vHBA", another is called
> > "secondary vHBA". To maintain the relationship between these two
> > vHBAs, we have to introduce new XMLs to vHBA. E.g.
> >
> > In XML of primary vHBA:
> >
> > <secondary wwpn="2101001b32a90004"/>
> >
> > In XML of secondary vHBA:
> >
> > <primary wwpn="2101001b32a90002"/>
> >
> > Primary vHBA is going to be guaranteed not used by any domain which
> > is driven by libvirt (we do some checking earlier before the domain
> > starting). And it's also guaranteed that the LUN can't be used by
> > other domain with sVirt or Sanlock. So it's safe to have two vHBAs
> > on source host too.
> >
> > To prevent one using the LUN by creating vHBA using the same WWNN &
> > WWPN on another host, we must create the secondary vHBA on source
> > host, even it's not being used.
> >
> > Both primary and secondary vHBA must be defined and marked as
> > "autostart" so that the domain could be started after system
> > rebooting.
> >
> > When do migration, we have to bake a bigger cookie with secondary
> > vHBA's info (basically it's WWNN and WWPN) in migration "Begin"
> > stage, and eat that in migration "Prepare" stage on target host.
> >
> > In "Begin" stage, the XMLs represents the secondary vHBA is
> > constructed. And the secondary vHBA is destroyed on source host,
> > not undefined though.
> >
> > In "Prepare" stage, a new vHBA is created (define and start)
> > on target host with the same WWNN & WWPN as secondary vHBA on
> > source host. The LUN then should be visible to target host
> > automatically? and thus migration can be performed. After migration
> > is finished on target host, the primary vHBA on source host is
> > destroyed, not undefined.
> >
> > If migration fails, the new vHBA created on target host will
> > be destroyed and undefined. And both primary and secondary
> > vHBA on source host will be started, so that the domain could
> > be resumed.
> >
> > Finally if migration succeeds, primary vHBA on source host
> > will be transferred to target host as secondary vHBA (defined).
> > And both primary and secondary vHBA on source host will be
> > undefined.
>
> If we do the mapping of HBAs to guest domains using storage
> pools, then at a guest level, migration requires zero work.
>
> It is simply up to the management app to create the storage
> pool on the destination host with the same Name + UUID, but
> with the secondary WWNN/WWPN. The nice thing about this, is
> that you don't need to hardcode details of a secondary
> WWNN/WWPN up-front. The management app can just decide on
> those at the time it performs the migration, so 99% of the
> time there will only need to be a single vHBA setup on the
> SAN. During migration the mgmt app can setup a second
> vHBA for the target host, and once complete, delete the
> original vHBA entirely.

Agreed, although there will of course need to be some degree of up-front
coordination between the management app and the SAN administrators to
avoid having to involve them to migrate a VM.

> > 4) Enrich HBA's XML
> >
> > It's hard to know the vHBAs created from a HBA with current
> > implementation.
> > One have to dump XML of each (v)HBAs and find
> > out the clue with element "parent" of vHBAs. It's good to introduce
> > new element for HBA like "vports", so that one can easily know
> > what (how many) vHBAs are created from the HBA?
> >
> > And also it's good to have the maximum vports the HBA supports.
> >
> > Except these, other useful information should be exposed too,
> > such as the vendor name, the HBA state, PCI address, etc.
> >
> > The new XMLs should be like:
> >
> > <vports num='2' max='64'>
> >   <vport name="scsi_host40" wwpn="2101001b32a90004"/>
> >   <vport name="scsi_host40" wwpn="2101001b32a90005"/>
> > </vports>
> > <online/>
> > <vendor>QLogic</vendor>
> > <address type="pci" domain="0" bus="0" slot="5" function="0"/>
> >
> > "online", "vendor", "address" make sense to vHBA too.
>
> I'm trying to remember how we modelled the parent/child relationship
> for SR-IOV PCI cards. NPIV is a very similar concept, so we should
> ideally seek to model the parent/child relationship in the same
> manner.
Physical function:

<device>
  <name>pci_0000_01_00_0</name>
  <parent>pci_0000_00_01_0</parent>
  <driver>
    <name>igb</name>
  </driver>
  <capability type='pci'>
    <domain>0</domain>
    <bus>1</bus>
    <slot>0</slot>
    <function>0</function>
    <product id='0x10c9'>82576 Gigabit Network Connection</product>
    <vendor id='0x8086'>Intel Corporation</vendor>
    <capability type='virt_functions'>
      <address domain='0x0000' bus='0x01' slot='0x10' function='0x0'/>
      <address domain='0x0000' bus='0x01' slot='0x10' function='0x2'/>
      <address domain='0x0000' bus='0x01' slot='0x10' function='0x4'/>
      <address domain='0x0000' bus='0x01' slot='0x10' function='0x6'/>
      <address domain='0x0000' bus='0x01' slot='0x11' function='0x0'/>
      <address domain='0x0000' bus='0x01' slot='0x11' function='0x2'/>
      <address domain='0x0000' bus='0x01' slot='0x11' function='0x4'/>
    </capability>
  </capability>
</device>

Virtual function:

<device>
  <name>pci_0000_01_10_0</name>
  <parent>pci_0000_00_01_0</parent>
  <driver>
    <name>igbvf</name>
  </driver>
  <capability type='pci'>
    <domain>0</domain>
    <bus>1</bus>
    <slot>16</slot>
    <function>0</function>
    <product id='0x10ca'>82576 Virtual Function</product>
    <vendor id='0x8086'>Intel Corporation</vendor>
    <capability type='phys_function'>
      <address domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
    </capability>
    <capability type='virt_functions'>
    </capability>
  </capability>
</device>

Interestingly, I think there's a bug there; the VF should not be showing
<capability type='virt_functions'> but that's unrelated to the present
discussion.
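
For what it's worth, here is a rough sketch of how a management app
might walk that parent/child relationship from the node-device XML,
which is roughly what I'd expect an HBA/vHBA equivalent to allow. It is
only an illustration: the XML is a trimmed copy of the dumps above, the
helper names (addresses, node_device_name) are made up for this mail,
and the device naming scheme is inferred from the names shown in those
dumps rather than taken from any documented guarantee.

```python
import xml.etree.ElementTree as ET

# Trimmed copies of the physical/virtual function XML quoted above.
PF_XML = """
<device>
  <name>pci_0000_01_00_0</name>
  <parent>pci_0000_00_01_0</parent>
  <capability type='pci'>
    <capability type='virt_functions'>
      <address domain='0x0000' bus='0x01' slot='0x10' function='0x0'/>
      <address domain='0x0000' bus='0x01' slot='0x10' function='0x2'/>
    </capability>
  </capability>
</device>
"""

VF_XML = """
<device>
  <name>pci_0000_01_10_0</name>
  <parent>pci_0000_00_01_0</parent>
  <capability type='pci'>
    <capability type='phys_function'>
      <address domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
    </capability>
    <capability type='virt_functions'>
    </capability>
  </capability>
</device>
"""


def addresses(device_xml, cap_type):
    """Hypothetical helper: PCI addresses listed under the nested
    capability of the given type, as (domain, bus, slot, function)."""
    root = ET.fromstring(device_xml)
    path = "./capability[@type='pci']/capability[@type='%s']/address" % cap_type
    keys = ("domain", "bus", "slot", "function")
    # Attribute values are hex strings such as '0x10'.
    return [tuple(int(a.get(k), 16) for k in keys) for a in root.findall(path)]


def node_device_name(addr):
    """Hypothetical helper: build a pci_DDDD_BB_SS_F style name, the
    pattern the device names in the dumps above appear to follow."""
    return "pci_%04x_%02x_%02x_%x" % addr


# Children of the physical function.
for addr in addresses(PF_XML, "virt_functions"):
    print(node_device_name(addr))

# Parent of the virtual function.
print(node_device_name(addresses(VF_XML, "phys_function")[0]))
```

The point is just that both directions of the relationship can be
followed from one device's XML without dumping every device on the
host, which is the problem Osier describes for vHBAs today.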

Dave

> Daniel
> --
> |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
> |: http://libvirt.org -o- http://virt-manager.org :|
> |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
> |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
>
> --
> libvir-list mailing list
> libvir-list@xxxxxxxxxx
> https://www.redhat.com/mailman/listinfo/libvir-list