On 09/02/2016 06:05 AM, Paolo Bonzini wrote:
>
>
> On 02/09/2016 07:21, Kirti Wankhede wrote:
>> On 9/2/2016 10:18 AM, Michal Privoznik wrote:
>>> Okay, maybe I'm misunderstanding something. I just thought that users
>>> will consult libvirt's nodedev driver (e.g. virsh nodedev-list && virsh
>>> nodedev-dumpxml $id) to fetch vGPU capabilities and then use that info
>>> to construct domain XML.
>>
>> I'm not familiar with libvirt code, curious how libvirt's nodedev driver
>> enumerates devices in the system?
>
> It looks at sysfs and/or the udev database and transforms what it finds
> there to XML.

Caveat: I started writing this in the morning... Of course the email
thread has evolved even more since then...

If you have libvirt installed, use 'virsh nodedev-list --tree' to get a
tree format of what libvirt "finds".

But to answer the question, it's mostly a brute-force method of perusing
the sysfs trees that libvirt cares about and storing away the data in
nodedev driver objects.

As/when new devices are found, there's a udev create-device event that
libvirtd follows in order to generate a new nodedev object for devices
that libvirt cares about. Similarly, there's a udev delete-device event
to remove devices.

FWIW: Some examples of nodedev output can be found at:

http://libvirt.org/formatnode.html

> I think people would consult the nodedev driver to fetch vGPU
> capabilities, use "virsh nodedev-create" to create the vGPU device on
> the host, and then somehow refer to the nodedev in the domain XML.
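To make the "brute force perusal" concrete, here's a minimal sketch of
the idea - a directory walk that collects each device's attributes. It
runs against a mock tree built in a temp directory so it works anywhere
(the real nodedev code walks /sys itself; the paths and attribute names
below are illustrative, not libvirt's actual code):

```shell
# Build a mock sysfs-like tree (stand-in for /sys/class/scsi_host).
sysfs=$(mktemp -d)
mkdir -p "$sysfs/class/scsi_host/host0" "$sysfs/class/scsi_host/host1"
echo Emulex > "$sysfs/class/scsi_host/host0/model_name"
echo QLogic > "$sysfs/class/scsi_host/host1/model_name"

# Peruse the tree and stash what we find, roughly the shape of what the
# nodedev driver does when building its in-memory device objects.
found=
for dev in "$sysfs"/class/scsi_host/host*; do
    found="$found$(basename "$dev"):$(cat "$dev/model_name") "
done
printf '%s\n' "$found"
rm -rf "$sysfs"
```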
>
> There isn't very much documentation on nodedev-create, but it's used
> mostly for NPIV (virtual fibre channel adapter) and the XML looks like
> this:
>
>   <device>
>     <name>scsi_host6</name>
>     <parent>scsi_host5</parent>
>     <capability type='scsi_host'>
>       <capability type='fc_host'>
>         <wwnn>2001001b32a9da5e</wwnn>
>         <wwpn>2101001b32a9da5e</wwpn>
>       </capability>
>     </capability>
>   </device>
>

The above is the nodedev-dumpxml of the created NPIV (a/k/a vHBA) node
device - although there's also a "<fabric_wwn>" element now too.

One can also look at http://wiki.libvirt.org/page/NPIV_in_libvirt to get
a practical example of vHBA creation. The libvirt wiki data was more
elegantly transposed into the RHEL7 docs at:

https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/Virtualization_Deployment_and_Administration_Guide/sect-NPIV_storage.html

The sole purpose of nodedev-create is vHBA creation - the API was
introduced in 0.6.5 (commit id '81d0ffbc'). Without going into a lot of
detail, the API is WWNN/WWPN centric and relies on udev create-device
events (via udevEventHandleCallback) to add the scsi_hostM vHBA with the
WWNN/WWPN.

NB: There's a systemd/udev "lag" issue to make note of - the add event
is generated before all the sysfs files are populated with correct
values (https://bugzilla.redhat.com/show_bug.cgi?id=1210832). To work
around that, the nodedev-create logic scans the scsi_host devices to
find the matching scsi_hostM.

> so I suppose for vGPU it would look like this:
>
>   <device>
>     <name>my-vgpu</name>
>     <parent>pci_0000_86_00_0</parent>
>     <capability type='mdev'>
>       <type id='11'/>
>       <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid>
>     </capability>
>   </device>

So one question would be "where" does one find the value for the <uuid>
field? From the initial libvirt RFC it seems as though a generated UUID
is fine, but I figured I'd ask just to be sure I'm not making any
assumptions.
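If a generated UUID really is fine, then the create path could simply
mint one whenever the input XML omits <uuid>, much like the wwnn/wwpn
generation for vHBA. A sketch (the fallback chain here is illustrative,
not what libvirt would actually do):

```shell
# Mint a UUID for the mdev when the user didn't supply one.
# /proc/sys/kernel/random/uuid is the kernel's generator; fall back to
# uuidgen if /proc isn't available.
uuid=$(cat /proc/sys/kernel/random/uuid 2>/dev/null || uuidgen)
echo "$uuid"
```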
Based on how the email thread is going, figuring out the input format to
mdev_create needs to be agreed upon... Once that's done, figuring out
how to generate XML that can be used for the input should be simpler.

In the end, so far I've assumed there would be one vGPU referenced by a
$UUID and perhaps a name... I have no idea what udev creates when
mdev_create is called - is it only the /sys/bus/mdev/devices/$UUID? Or
is there some new /sys/bus/pci/devices/$PCIADDR as well?

FWIW: Hopefully it'll help to give the vHBA comparison. The minimal
equivalent *pre* vHBA XML looks like:

  <device>
    <parent>scsi_host5</parent>
    <capability type='scsi_host'>
      <capability type='fc_host'>
      </capability>
    </capability>
  </device>

This is fed into 'virsh nodedev-create $XMLFILE' and the result is the
vHBA XML (e.g. the scsi_host6 output above). Providing a wwnn/wwpn is
not necessary - if not provided, they are generated.

The wwnn/wwpn pair is fed to "vport_create" (via echo "wwpn:wwnn" >
vport_create), then udev takes over and creates a new scsi_hostM device
(in the /sys/class/scsi_host directory, just like the HBA) with a parent
using the wwnn, wwpn. The nodedev-create code doesn't do the nodedev
object creation - that's done automagically via udev add-event
processing. Once udev creates the device, it sends an event which the
nodedev driver handles.

Note that for nodedev-create, the <name> field is ignored. The reason
it's ignored is because the logic knows udev will create one for us,
e.g. scsi_host6 in the above XML, based on running the vport_create from
the parent HBA.

In order to determine the <parent> field, one uses "virsh nodedev-list
--caps vports" and chooses from the output one of the scsi_hostN's
provided. That capability is determined during libvirtd node device db
initialization by finding "/sys/class/fc_host/hostN/vport_create" files
and setting a bit from which future searches can use the capability
string.
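The creation handoff described above can be sketched as follows -
generate a wwnn/wwpn pair when none is supplied (the prefixes and
layout below are illustrative, not libvirt's exact generator) and feed
"wwpn:wwnn" to the parent HBA's vport_create file. The vport_create
file is mocked with a temp file here so the sketch runs anywhere:

```shell
# Generate a random 12-hex-digit tail shared by both names; the real
# generator differs, this just shows the shape of the data.
rand12=$(od -An -N6 -tx1 /dev/urandom | tr -d ' \n')
wwnn="2001$rand12"    # node name
wwpn="2101$rand12"    # port name

# Stand-in for /sys/class/fc_host/host5/vport_create; writing to the
# real file is what prompts the kernel/udev to create scsi_hostM.
vport_create=$(mktemp)
echo "$wwpn:$wwnn" > "$vport_create"
written=$(cat "$vport_create")
rm -f "$vport_create"
echo "$written"
```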
The resulting vHBA can be fed into the XML for a 'scsi' storage pool,
and the LUNs for the vHBA will be listed once the pool is started via
'virsh vol-list $POOLNAME'. Those LUNs can then be fed into guest XML as
a 'disk' or passthrough 'lun'. The format is on the wiki page.

> while the parent would have:
>
>   <device>
>     <name>pci_0000_86_00_0</name>
>     <capability type='pci'>
>       <domain>0</domain>
>       <bus>134</bus>
>       <slot>0</slot>
>       <function>0</function>
>       <capability type='mdev'>
>         <!-- one type element per sysfs directory -->
>         <type id='11'>
>           <!-- one element per sysfs file roughly -->
>           <name>GRID M60-0B</name>
>           <attribute name='num_heads'>2</attribute>
>           <attribute name='frl_config'>45</attribute>
>           <attribute name='framebuffer'>524288</attribute>
>           <attribute name='hres'>2560</attribute>
>           <attribute name='vres'>1600</attribute>
>         </type>
>       </capability>
>       <product id='...'>GRID M60</product>
>       <vendor id='0x10de'>NVIDIA</vendor>
>     </capability>
>   </device>
>

I would consider this to be the starting point (GPU) that's needed to
create vGPUs for libvirt. In order to find this needle in the haystack
of PCI devices, code would need to be added to find the
"/sys/bus/pci/devices/$PCIADDR/mdev_create" files during initial sysfs
tree parsing, where $PCIADDR in this case is "0000:86:00.0". Someone
doing this should search on VPORTS and VPORT_OPS in the libvirt code.

Once a new capability flag is added, it'll be easy to use "virsh
nodedev-list mdevs" in order to get a list of pci_* devices which can
support vGPU. From that list, the above XML would be generated via
"virsh nodedev-dumpxml pci_0000_86_00_0" (for example). Whatever one
finds in that output I would expect to be used to feed into the XML
that would need to be created to generate a vGPU via nodedev-create,
and thus become parameters to "mdev_create". Once the mdev_create is
done, watching /sys/bus/mdev/devices/ for the UUID would mimic how vHBA
does things.
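The "needle in the haystack" scan would have the same shape as the
existing vport_create detection - walk the device directories looking
for the capability file. A sketch against a mock tree (the real walk
would be over /sys/bus/pci/devices, and the file name assumes the
mdev_create naming from this thread):

```shell
# Mock /sys/bus/pci/devices with one mdev-capable device (the "GPU")
# and one ordinary device.
pci=$(mktemp -d)
mkdir -p "$pci/0000:86:00.0" "$pci/0000:00:1f.2"
touch "$pci/0000:86:00.0/mdev_create"

# Scan for the capability file and record which devices have it -
# the bit that a "virsh nodedev-list mdevs" flag would key off.
mdev_capable=
for dev in "$pci"/*; do
    if [ -e "$dev/mdev_create" ]; then
        mdev_capable="$mdev_capable$(basename "$dev") "
    fi
done
echo "$mdev_capable"
rm -rf "$pci"
```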
So we got this far, but how do we ensure that subsequent reboots create
the same vGPUs for guests? The vHBA code achieves this by creating a
storage pool that creates the vHBA when the storage pool starts. That
way, when the guest starts, it can reference the storage pool and unit.

We don't have such a pool for GPUs (yet) - although I suppose they
could just become a class of storage pools.

The issue being that nodedev device objects are not saved between
reboots - they are generated on the fly. Hence the "nodedev-create"
API; notice there's no "nodedev-define" API, although I suppose one
could be created. It's just more work to get this all to work properly.

> After creating the vGPU, if required by the host driver, all the other
> type ids would disappear from "virsh nodedev-dumpxml pci_0000_86_00_0"
> too.

Not wanting to make assumptions, but this reads as if I create one type
11 vGPU, then I can create no others on the host. Maybe I'm reading it
wrong - it's been a long week.

> When dumping the mdev with nodedev-dumpxml, it could show more complete
> info, again taken from sysfs:
>
>   <device>
>     <name>my-vgpu</name>
>     <parent>pci_0000_86_00_0</parent>
>     <capability type='mdev'>
>       <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid>
>       <!-- only the chosen type -->
>       <type id='11'>
>         <name>GRID M60-0B</name>
>         <attribute name='num_heads'>2</attribute>
>         <attribute name='frl_config'>45</attribute>
>         <attribute name='framebuffer'>524288</attribute>
>         <attribute name='hres'>2560</attribute>
>         <attribute name='vres'>1600</attribute>
>       </type>
>       <capability type='pci'>
>         <!-- no domain/bus/slot/function of course -->
>         <!-- could show whatever PCI IDs are seen by the guest: -->
>         <product id='...'>...</product>
>         <vendor id='0x10de'>NVIDIA</vendor>
>       </capability>
>     </capability>
>   </device>
>
> Notice how the parent has mdev inside pci; the vGPU, if it has to have
> pci at all, would have it inside mdev.
> This represents the difference between the mdev provider and the mdev
> device.
>
> Random proposal for the domain XML too:
>
>   <hostdev mode='subsystem' type='pci'>
>     <source type='mdev'>
>       <!-- possible alternative to uuid: <name>my-vgpu</name> ?!? -->
>       <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid>
>     </source>
>     <address type='pci' bus='0' slot='2' function='0'/>
>   </hostdev>

PCI devices have the "managed='yes|no'" attribute as well. That's what
determines whether the device is to be detached from the host or not.
That's been something very painful to manage for vfio and, well,
libvirt!

John
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html