On Wed, 12 Apr 2023 10:09:32 +0000 "Liu, Yi L" <yi.l.liu@xxxxxxxxx> wrote: > > From: Jason Gunthorpe <jgg@xxxxxxxxxx> > > Sent: Wednesday, April 12, 2023 8:01 AM > > > > On Tue, Apr 11, 2023 at 03:58:27PM -0600, Alex Williamson wrote: > > > > > > Management tools already need to understand dev_set if they want to > > > > offer reliable reset support to the VMs. Same as today. > > > > > > I don't think that's true. Our primary hot-reset use case is GPUs and > > > subordinate functions, where the isolation and reset scope are often > > > sufficiently similar to make hot-reset possible, regardless whether > > > all the functions are assigned to a VM. I don't think you'll find any > > > management tools that takes reset scope into account otherwise. > > > > When I think of "reliable reset support" I think of the management > > tool offering a checkbox that says "ensure PCI function reset > > availability" and if checked it will not launch the VM without a > > working reset. > > > > If the user configures a set of VFIO devices and then hopes they get > > working reset, that is fine, but doesn't require any reporting of > > reset groups, or iommu groups to the management layer to work. > > > > > > > As I understand the proposal, QEMU now gets to attempt to > > > > > claim ownership of the dev_set, so it opportunistically extends its > > > > > ownership and may block other users from the affected devices. > > > > > > > > We can decide the policy for the kernel to accept a claim. I suggested > > > > below "same as today" - it must hold all the groups within the > > > > iommufd_ctx. > > > > > > It must hold all the groups [that the user doesn't know about because > > > it's not a formal part of the cdev API] within the iommufd_ctx? > > > > You keep going back to this, but I maintain userspace doesn't > > care. qemu is given a list of VFIO devices to use, all it wants to > > know is if it is allowed to use reset or not. Why should it need to > > know groups and group_ids to get that binary signal out of the kernel? > > > > > > The simplest option for no-iommu is to require it to pass in every > > > > device fd to the reset ioctl. > > > > > > Which ironically is exactly how it ends up working today, each no-iommu > > > device has a fake IOMMU group, so every affected device (group) needs > > > to be provided. > > > > Sure, that is probably the way forward for no-iommu. Not that anyone > > uses it.. > > > > The kicker is we don't force the user to generate a de-duplicated list > > of devices FDs, one per group, just because. > > > > > > I want to re-focus on the basics of what cdev is supposed to be doing, > > > > because several of the idea you suggested seem against this direction: > > > > > > > > - cdev does not have, and cannot rely on vfio_groups. We enforce this > > > > by compiling all the vfio_group infrastructure out. iommu_groups > > > > continue to exist. > > > > > > > > So converting a cdev to a vfio_group is not an allowed operation. > > > > > > My only statements in this respect were towards the notion that IOMMU > > > groups continue to exist. I'm well aware of the desire to deprecate > > > and remove vfio groups. > > > > Yes > > > > > > - no-iommu should not have iommu_groups. We enforce this by compiling > > > > out all the no-iommu vfio_group infrastructure. > > > > > > This is not logically inferred from the above if IOMMU groups continue > > > to exist and continue to be a basis for describing DMA ownership as > > > well as "reset groups" > > > > It is not ment to flow out of the above, it is a seperate statement. I > > want the iommu_group mechanism to stop being abused outside the iommu > > core code. The only thing that should be creating groups is an > > attached iommu driver operating under ops->device_group(). > > > > VFIO needed this to support mdev and no-iommu. We already have mdev > > free of iommu_groups, I would like no-iommu to also be free of it too, > > we are very close. > > > > That would leave POWER as the only abuser of the > > iommu_group_add_device() API, and it is only doing it because it > > hasn't got a proper iommu driver implementation yet. It turns out > > their abuse is mislocked and maybe racy to boot :( > > > > > > - cdev APIs should ideally not require the user to know the group_id, > > > > we should try hard to design APIs to avoid this. > > > > > > This is a nuance, group_id vs group, where it's been previously > > > discussed that users will need to continue to know the boundaries of a > > > group for the purpose of DMA isolation and potentially IOAS > > > independence should cdev/iommufd choose to tackle those topics. > > > > Yes, group_id is a value we have no specific use for and would require > > userspace to keep seperate track of. I'd prefer to rely on dev_id as > > much as possible instead. > > > > > What is the actual proposal here? > > > > I don't know anymore, you don't seem to like this direction either... > > > > > You've said that hot-reset works if the iommufd_ctx has > > > representation from each affected group, the INFO ioctl remains as > > > it is, which suggests that it's reporting group ID and BDF, yet only > > > sysfs tells the user the relation between a vfio cdev and a group > > > and we're trying to enable a pass-by-fd model for cdev where the > > > user has no reference to a sysfs node for the device. Show me how > > > these pieces fit together. > > > > I prefer the version where INFO2 returns the dev_id, but info can work > > if we do the BDF cap like you suggested to Yi > > > > > OTOH, if we say IOMMU groups continue to exist [agreed], every vfio > > > device has an IOMMU group > > > > I don't desire every VFIO device to have an iommu_group. I want VFIO > > devices with real IOMMU drivers to have an iommu_group. mdev and > > no-iommu should not. I don't want to add them back into the design > > just so INFO has a value to return. > > > > I'd rather give no-iommu a dummy dev_id in iommufdctx then give it an > > iommu_group... > > > > I see this problem as a few basic requirements from a qemu-like > > application: > > > > 1) Does the configuration I was given support reset right now? > > 2) Will the configuration I was given support reset for the duration > > of my execution? > > 3) What groups of the devices I already have open does the reset > > effect? > > 4) For debugging, report to the user the full list of devices in the > > reset group, in a way that relates back to sysfs. > > 5) Away to trigger a reset on a group of devices > > > > #1/#2 is the API I suggested here. Ask the kernel if the current > > configuration works, and ask it to keep it working. > > > > #3 is either INFO and a CAP for BDF or INFO2 reporting dev_id > > > > #4 is either INFO and print the BDFs or INFO2 reporting the struct > > vfio_device IDR # (eg /sys/class/vfio/vfioXXX/). > > I hope we can have a clear statement on the _INFO or INFO2 usage. > Today, per QEMU's implementation, the output of _INFO is used to: > > 1) do a self-check to see if all the affected groups are opened by the > current user before it can invoke hot-reset. > 2) figure out the devices that are already opened by the user. QEMU > needs to save the state of such devices as the device may already > been in use. If so, its state should be saved and restored prior/post > the hot-reset. > > Seems like we are relaxing the self-check as it may be done by locking > the reset group. is it? I hope not. Locking the reset group suggests the user is able to extend their ownership. IMO we should not allow that. Thanks, Alex