On 9/3/2016 1:59 AM, John Ferlan wrote:
>
>
> On 09/02/2016 02:33 PM, Kirti Wankhede wrote:
>>
>> On 9/2/2016 10:55 PM, Paolo Bonzini wrote:
>>>
>>>
>>> On 02/09/2016 19:15, Kirti Wankhede wrote:
>>>> On 9/2/2016 3:35 PM, Paolo Bonzini wrote:
>>>>> <device>
>>>>>   <name>my-vgpu</name>
>>>>>   <parent>pci_0000_86_00_0</parent>
>>>>>   <capability type='mdev'>
>>>>>     <type id='11'/>
>>>>>     <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid>
>>>>>   </capability>
>>>>> </device>
>>>>>
>>>>> After creating the vGPU, if required by the host driver, all the other
>>>>> type ids would disappear from "virsh nodedev-dumpxml pci_0000_86_00_0" too.
>>>>
>>>> Thanks Paolo for details.
>>>> 'nodedev-create' parse the xml file and accordingly write to 'create'
>>>> file in sysfs to create mdev device. Right?
>>>> At this moment, does libvirt know which VM this device would be
>>>> associated with?
>>>
>>> No, the VM will associate to the nodedev through the UUID. The nodedev
>>> is created separately from the VM.
>>>
>>>>> When dumping the mdev with nodedev-dumpxml, it could show more complete
>>>>> info, again taken from sysfs:
>>>>>
>>>>> <device>
>>>>>   <name>my-vgpu</name>
>>>>>   <parent>pci_0000_86_00_0</parent>
>>>>>   <capability type='mdev'>
>>>>>     <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid>
>>>>>     <!-- only the chosen type -->
>>>>>     <type id='11'>
>>>>>       <!-- ... snip ... -->
>>>>>     </type>
>>>>>     <capability type='pci'>
>>>>>       <!-- no domain/bus/slot/function of course -->
>>>>>       <!-- could show whatever PCI IDs are seen by the guest: -->
>>>>>       <product id='...'>...</product>
>>>>>       <vendor id='0x10de'>NVIDIA</vendor>
>>>>>     </capability>
>>>>>   </capability>
>>>>> </device>
>>>>>
>>>>> Notice how the parent has mdev inside pci; the vGPU, if it has to have
>>>>> pci at all, would have it inside mdev. This represents the difference
>>>>> between the mdev provider and the mdev device.
>>>>
>>>> Parent of mdev device might not always be a PCI device.
>>>> I think we shouldn't consider it as PCI capability.
>>>
>>> The <capability type='pci'> in the vGPU means that it _will_ be exposed
>>> as a PCI device by VFIO.
>>>
>>> The <capability type='pci'> in the physical GPU means that the GPU is a
>>> PCI device.
>>>
>>
>> Ok. Got that.
>>
>>>>> Random proposal for the domain XML too:
>>>>>
>>>>> <hostdev mode='subsystem' type='pci'>
>>>>>   <source type='mdev'>
>>>>>     <!-- possible alternative to uuid: <name>my-vgpu</name> ?!? -->
>>>>>     <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid>
>>>>>   </source>
>>>>>   <address type='pci' bus='0' slot='2' function='0'/>
>>>>> </hostdev>
>>>>>
>>>>
>>>> When user wants to assign two mdev devices to one VM, user have to add
>>>> such two entries or group the two devices in one entry?
>>>
>>> Two entries, one per UUID, each with its own PCI address in the guest.
>>>
>>>> On other mail thread with same subject we are thinking of creating group
>>>> of mdev devices to assign multiple mdev devices to one VM.
>>>
>>> What is the advantage in managing mdev groups? (Sorry didn't follow the
>>> other thread).
>>>
>>
>> When mdev device is created, resources from physical device is assigned
>> to this device. But resources are committed only when device goes
>> 'online' ('start' in v6 patch)
>> In case of multiple vGPUs in a VM for Nvidia vGPU solution, resources
>> for all vGPU devices in a VM are committed at one place. So we need to
>> know the vGPUs assigned to a VM before QEMU starts.
>>
>> Grouping would help here as Alex suggested in that mail. Pulling only
>> that part of discussion here:
>>
>> <Alex> It seems then that the grouping needs to affect the iommu group
>> so that
>>> you know that there's only a single owner for all the mdev devices
>>> within the group. IIRC, the bus drivers don't have any visibility
>>> to opening and releasing of the group itself to trigger the
>>> online/offline, but they can track opening of the device file
>>> descriptors within the group.
>>> Within the VFIO API the user cannot
>>> access the device without the device file descriptor, so a "first
>>> device opened" and "last device closed" trigger would provide the
>>> trigger points you need. Some sort of new sysfs interface would need
>>> to be invented to allow this sort of manipulation.
>>> Also we should probably keep sight of whether we feel this is
>>> sufficiently necessary for the complexity. If we can get by with only
>>> doing this grouping at creation time then we could define the "create"
>>> interface in various ways. For example:
>>>
>>>     echo $UUID0 > create
>>>
>>> would create a single mdev named $UUID0 in it's own group.
>>>
>>>     echo {$UUID0,$UUID1} > create
>>>
>>> could create mdev devices $UUID0 and $UUID1 grouped together.
>>>
>> </Alex>
>>
>> <Kirti>
>> I think this would create mdev device of same type on same parent
>> device. We need to consider the case of multiple mdev devices of
>> different types and with different parents to be grouped together.
>> </Kirti>
>>
>> <Alex> We could even do:
>>>
>>>     echo $UUID1:$GROUPA > create
>>>
>>> where $GROUPA is the group ID of a previously created mdev device into
>>> which $UUID1 is to be created and added to the same group.
>> </Alex>
>>
>> <Kirti>
>> I was thinking about:
>>
>>     echo $UUID0 > create
>>
>> would create mdev device
>>
>>     echo $UUID0 > /sys/class/mdev/create_group
>>
>> would add created device to group.
>>
>> For multiple devices case:
>>     echo $UUID0 > create
>>     echo $UUID1 > create
>>
>> would create mdev devices which could be of different types and
>> different parents.
>>     echo $UUID0, $UUID1 > /sys/class/mdev/create_group
>>
>> would add devices in a group.
>> Mdev core module would create a new group with unique number. On mdev
>> device 'destroy' that mdev device would be removed from the group. When
>> there are no devices left in the group, group would be deleted.
>> With this "first device opened" and "last device closed" trigger can be used
>> to commit resources.
>> Then libvirt use mdev device path to pass as argument to QEMU, same as
>> it does for VFIO. Libvirt don't have to care about group number.
>> </Kirti>
>>
>
> The more complicated one makes this, the more difficult it is for the
> customer to configure and the more difficult it is and the longer it
> takes to get something out. I didn't follow the details of groups...
>
> What gets created from a pass through some *mdev/create_group?

My proposal here is that

    echo $UUID1, $UUID2 > /sys/class/mdev/create_group

would create a group in the mdev core driver, and that group should be
internal to the mdev core module. In the mdev core module, a unique group
number would be saved in the mdev_device structure of each device
belonging to that group.

> Does
> some new udev device get create that then is fed to the guest?

No, a group is not a device. It would act as an identifier that the
vendor driver uses to identify the devices in a group.

> Seems
> painful to make two distinct/async passes through systemd/udev. I
> foresee testing nightmares with creating 3 vGPU's, processing a group
> request, while some other process/thread is deleting a vGPU... How do
> the vGPU's get marked so that the delete cannot happen.
>

How is the same case handled for a directly assigned device? I mean, a
device is unbound from its vendor driver and bound to vfio_pci. How is it
guaranteed to remain assigned to the vfio_pci module? Couldn't some other
process/thread unbind it from the vfio_pci module?

> If a vendor wants to create their own utility to group vHBA's together
> and manage that grouping, then have at it... Doesn't seem to be
> something libvirt needs to be or should be managing... As I go running
> for cover...
>
> If having multiple types generated for a single vGPU, then consider the
> following XML:
>
> <capability type='mdev'>
>   <type id='11' [other attributes]/>
>   <type id='11' [other attributes]/>
>   <type id='12' [other attributes]/>
>   [<uuid>...</uuid>]
> </capability>
>
> then perhaps building the mdev_create input would be a comma separated
> list of type's to be added... "$UUID:11,11,12". Just a thought...
>

In that case the vGPUs are all created on the same physical GPU. Consider
the case where two vGPUs on different physical devices need to be
assigned to a VM. Those would need two separate create commands:

    echo $UUID0 > /sys/../<bdf1>/mdev_create
    echo $UUID1 > /sys/../<bdf2>/mdev_create

Kirti.

>
> John
>
>> Thanks,
>> Kirti
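
For reference, the group bookkeeping proposed above (a unique group number per group, removal of a device from its group on 'destroy', and deletion of the group once it is empty) can be sketched in a scratch directory. This is only an illustration under assumptions: nothing below touches real sysfs, the directory layout and the function names mdev_create, mdev_create_group and mdev_destroy are hypothetical, and uuid-0/uuid-1 are placeholders. It does not model the "first device opened" / "last device closed" trigger.

```sh
#!/bin/sh
# Sketch only: models the proposed group bookkeeping in a scratch directory.
# Nothing here touches real sysfs; every path and function name is hypothetical.
set -eu

MDEV_ROOT=$(mktemp -d)                      # stand-in for the sysfs tree
mkdir -p "$MDEV_ROOT/devices" "$MDEV_ROOT/groups"
echo 0 > "$MDEV_ROOT/last_group"            # last group number handed out

# Models "echo $UUID > create".
mdev_create() {
    mkdir "$MDEV_ROOT/devices/$1"
}

# Models "echo $UUID0, $UUID1 > create_group": allocate a unique group
# number and record it for every listed device. Prints the new number.
mdev_create_group() {
    gid=$(( $(cat "$MDEV_ROOT/last_group") + 1 ))
    echo "$gid" > "$MDEV_ROOT/last_group"
    mkdir "$MDEV_ROOT/groups/$gid"
    for uuid in "$@"; do
        echo "$gid" > "$MDEV_ROOT/devices/$uuid/group"
        : > "$MDEV_ROOT/groups/$gid/$uuid"
    done
    echo "$gid"
}

# Models 'destroy': drop the device from its group, and delete the group
# once no devices are left in it.
mdev_destroy() {
    uuid=$1
    if [ -f "$MDEV_ROOT/devices/$uuid/group" ]; then
        gid=$(cat "$MDEV_ROOT/devices/$uuid/group")
        rm "$MDEV_ROOT/groups/$gid/$uuid"
        # rmdir only succeeds on an empty directory, i.e. an empty group.
        rmdir "$MDEV_ROOT/groups/$gid" 2>/dev/null || true
    fi
    rm -r "$MDEV_ROOT/devices/$uuid"
}

# Two placeholder devices grouped together, then destroyed one by one.
mdev_create uuid-0
mdev_create uuid-1
group=$(mdev_create_group uuid-0 uuid-1)
mdev_destroy uuid-0     # group still holds uuid-1
mdev_destroy uuid-1     # group is now empty and gets removed
```

The point of keeping the group number inside the model (and, in the proposal, inside the mdev core module) is that the caller only ever passes UUIDs; libvirt would not need to track group numbers.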