On Sat, 3 Sep 2016 22:01:13 +0530
Kirti Wankhede <kwankhede@xxxxxxxxxx> wrote:

> On 9/3/2016 1:59 AM, John Ferlan wrote:
> >
> >
> > On 09/02/2016 02:33 PM, Kirti Wankhede wrote:
> >>
> >> On 9/2/2016 10:55 PM, Paolo Bonzini wrote:
> >>>
> >>>
> >>> On 02/09/2016 19:15, Kirti Wankhede wrote:
> >>>> On 9/2/2016 3:35 PM, Paolo Bonzini wrote:
> >>>>> <device>
> >>>>>   <name>my-vgpu</name>
> >>>>>   <parent>pci_0000_86_00_0</parent>
> >>>>>   <capability type='mdev'>
> >>>>>     <type id='11'/>
> >>>>>     <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid>
> >>>>>   </capability>
> >>>>> </device>
> >>>>>
> >>>>> After creating the vGPU, if required by the host driver, all the other
> >>>>> type ids would disappear from "virsh nodedev-dumpxml pci_0000_86_00_0" too.
> >>>>
> >>>> Thanks Paolo for details.
> >>>> 'nodedev-create' parse the xml file and accordingly write to 'create'
> >>>> file in sysfs to create mdev device. Right?
> >>>> At this moment, does libvirt know which VM this device would be
> >>>> associated with?
> >>>
> >>> No, the VM will associate to the nodedev through the UUID. The nodedev
> >>> is created separately from the VM.
> >>>
> >>>>> When dumping the mdev with nodedev-dumpxml, it could show more complete
> >>>>> info, again taken from sysfs:
> >>>>>
> >>>>> <device>
> >>>>>   <name>my-vgpu</name>
> >>>>>   <parent>pci_0000_86_00_0</parent>
> >>>>>   <capability type='mdev'>
> >>>>>     <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid>
> >>>>>     <!-- only the chosen type -->
> >>>>>     <type id='11'>
> >>>>>       <!-- ... snip ... -->
> >>>>>     </type>
> >>>>>     <capability type='pci'>
> >>>>>       <!-- no domain/bus/slot/function of course -->
> >>>>>       <!-- could show whatever PCI IDs are seen by the guest: -->
> >>>>>       <product id='...'>...</product>
> >>>>>       <vendor id='0x10de'>NVIDIA</vendor>
> >>>>>     </capability>
> >>>>>   </capability>
> >>>>> </device>
> >>>>>
> >>>>> Notice how the parent has mdev inside pci; the vGPU, if it has to have
> >>>>> pci at all, would have it inside mdev.
> >>>>> This represents the difference between the mdev provider and the
> >>>>> mdev device.
> >>>>
> >>>> Parent of mdev device might not always be a PCI device. I think we
> >>>> shouldn't consider it as PCI capability.
> >>>
> >>> The <capability type='pci'> in the vGPU means that it _will_ be exposed
> >>> as a PCI device by VFIO.
> >>>
> >>> The <capability type='pci'> in the physical GPU means that the GPU is a
> >>> PCI device.
> >>>
> >>
> >> Ok. Got that.
> >>
> >>>>> Random proposal for the domain XML too:
> >>>>>
> >>>>> <hostdev mode='subsystem' type='pci'>
> >>>>>   <source type='mdev'>
> >>>>>     <!-- possible alternative to uuid: <name>my-vgpu</name> ?!? -->
> >>>>>     <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid>
> >>>>>   </source>
> >>>>>   <address type='pci' bus='0' slot='2' function='0'/>
> >>>>> </hostdev>
> >>>>>
> >>>>
> >>>> When user wants to assign two mdev devices to one VM, user have to add
> >>>> such two entries or group the two devices in one entry?
> >>>
> >>> Two entries, one per UUID, each with its own PCI address in the guest.
> >>>
> >>>> On other mail thread with same subject we are thinking of creating group
> >>>> of mdev devices to assign multiple mdev devices to one VM.
> >>>
> >>> What is the advantage in managing mdev groups? (Sorry didn't follow the
> >>> other thread).
> >>>
> >>
> >> When mdev device is created, resources from physical device is assigned
> >> to this device. But resources are committed only when device goes
> >> 'online' ('start' in v6 patch)
> >> In case of multiple vGPUs in a VM for Nvidia vGPU solution, resources
> >> for all vGPU devices in a VM are committed at one place. So we need to
> >> know the vGPUs assigned to a VM before QEMU starts.
> >>
> >> Grouping would help here as Alex suggested in that mail.
> >> Pulling only that part of discussion here:
> >>
> >> <Alex> It seems then that the grouping needs to affect the iommu group
> >> so that
> >>> you know that there's only a single owner for all the mdev devices
> >>> within the group. IIRC, the bus drivers don't have any visibility
> >>> to opening and releasing of the group itself to trigger the
> >>> online/offline, but they can track opening of the device file
> >>> descriptors within the group. Within the VFIO API the user cannot
> >>> access the device without the device file descriptor, so a "first
> >>> device opened" and "last device closed" trigger would provide the
> >>> trigger points you need. Some sort of new sysfs interface would need
> >>> to be invented to allow this sort of manipulation.
> >>> Also we should probably keep sight of whether we feel this is
> >>> sufficiently necessary for the complexity. If we can get by with only
> >>> doing this grouping at creation time then we could define the "create"
> >>> interface in various ways. For example:
> >>>
> >>>     echo $UUID0 > create
> >>>
> >>> would create a single mdev named $UUID0 in it's own group.
> >>>
> >>>     echo {$UUID0,$UUID1} > create
> >>>
> >>> could create mdev devices $UUID0 and $UUID1 grouped together.
> >>>
> >> </Alex>
> >>
> >> <Kirti>
> >> I think this would create mdev device of same type on same parent
> >> device. We need to consider the case of multiple mdev devices of
> >> different types and with different parents to be grouped together.
> >> </Kirti>
> >>
> >> <Alex> We could even do:
> >>>
> >>>     echo $UUID1:$GROUPA > create
> >>>
> >>> where $GROUPA is the group ID of a previously created mdev device into
> >>> which $UUID1 is to be created and added to the same group.
> >> </Alex>
> >>
> >> <Kirti>
> >> I was thinking about:
> >>
> >>     echo $UUID0 > create
> >>
> >> would create mdev device
> >>
> >>     echo $UUID0 > /sys/class/mdev/create_group
> >>
> >> would add created device to group.
> >>
> >> For multiple devices case:
> >>     echo $UUID0 > create
> >>     echo $UUID1 > create
> >>
> >> would create mdev devices which could be of different types and
> >> different parents.
> >>     echo $UUID0, $UUID1 > /sys/class/mdev/create_group
> >>
> >> would add devices in a group.
> >> Mdev core module would create a new group with unique number. On mdev
> >> device 'destroy' that mdev device would be removed from the group. When
> >> there are no devices left in the group, group would be deleted. With
> >> this "first device opened" and "last device closed" trigger can be used
> >> to commit resources.
> >> Then libvirt use mdev device path to pass as argument to QEMU, same as
> >> it does for VFIO. Libvirt don't have to care about group number.
> >> </Kirti>
> >>
> >
> > The more complicated one makes this, the more difficult it is for the
> > customer to configure and the more difficult it is and the longer it
> > takes to get something out. I didn't follow the details of groups...
> >
> > What gets created from a pass through some *mdev/create_group?
>
> My proposal here is, on
>     echo $UUID1, $UUID2 > /sys/class/mdev/create_group
> would create a group in mdev core driver, which should be internal to
> mdev core module. In mdev core module, a unique group number would be
> saved in mdev_device structure for each device belonging to a that group.

See my reply to the other thread, the group is an iommu group because
that's the unit of ownership vfio uses. We're not going to impose an
mdev specific layer of grouping on vfio. iommu group IDs are allocated
by the iommu-core, we don't get to specify them. Also note the
complication I've discovered with all devices within a group requiring
the same iommu context, which maps poorly to the multiple device iommu
contexts required to support a guest iommu. That's certainly not
something we'd want to impose on mdev devices in the general case.

> > Does some new udev device get create that then is fed to the guest?
>
> No, group is not a device. It will be like a identifier for the use of
> vendor driver to identify devices in a group.
>
> > Seems
> > painful to make two distinct/async passes through systemd/udev. I
> > foresee testing nightmares with creating 3 vGPU's, processing a group
> > request, while some other process/thread is deleting a vGPU... How do
> > the vGPU's get marked so that the delete cannot happen.
> >
>
> How is the same case handled for direct assigned device? I mean a device
> is unbound from its vendors driver, bound to vfio_pci device. How is it
> guaranteed to be assigned to vfio_pci module? some other process/thread
> might unbound it from vfio_pci module?

Yeah, I don't really see the problem here. Once an mdev device is
bound to the mdev driver and opened by the user, the mdev driver
release callback would be required in order to do the unbind. If we're
concerned about multiple entities playing in sysfs at the same time
creating and deleting devices and stepping on each other, well that's
why we're using uuids for the device names and why we'd get group
numbers from the iommu-core so that we have unique devices/groups and
why we establish the parent-child relationship between mdev device and
parent so we can't have orphan devices. Thanks,

Alex