> From: Liu, Yi L <yi.l.liu@xxxxxxxxx>
> Sent: Tuesday, April 26, 2022 5:55 PM
>
> On 2022/4/22 22:58, Jason Gunthorpe wrote:
> > On Thu, Apr 14, 2022 at 03:47:07AM -0700, Yi Liu wrote:
> >
> >> +
> >> +    /* try to attach to an existing container in this space */
> >> +    QLIST_FOREACH(bcontainer, &space->containers, next) {
> >> +        if (!object_dynamic_cast(OBJECT(bcontainer),
> >> +                                 TYPE_VFIO_IOMMUFD_CONTAINER)) {
> >> +            continue;
> >> +        }
> >> +        container = container_of(bcontainer, VFIOIOMMUFDContainer, obj);
> >> +        if (vfio_device_attach_container(vbasedev, container, &err)) {
> >> +            const char *msg = error_get_pretty(err);
> >> +
> >> +            trace_vfio_iommufd_fail_attach_existing_container(msg);
> >> +            error_free(err);
> >> +            err = NULL;
> >> +        } else {
> >> +            ret = vfio_ram_block_discard_disable(true);
> >> +            if (ret) {
> >> +                vfio_device_detach_container(vbasedev, container, &err);
> >> +                error_propagate(errp, err);
> >> +                vfio_put_address_space(space);
> >> +                close(vbasedev->fd);
> >> +                error_prepend(errp,
> >> +                              "Cannot set discarding of RAM broken (%d)", ret);
> >> +                return ret;
> >> +            }
> >> +            goto out;
> >> +        }
> >> +    }
> >
> > ?? this logic shouldn't be necessary, a single ioas always supports
> > all devices, userspace should never need to juggle multiple ioas's
> > unless it wants to have different address maps.
>
> The legacy vfio flow needs to allocate multiple containers in some cases.
> Say a device is attached to a container and some IOVAs have been mapped
> in it. When we try to attach another device to this container, the attach
> can fail because the existing DMA mappings conflict with the reserved
> IOVA regions of the new device. For such a case, legacy vfio creates a
> new container and attaches the group to it instead. Hotplug is a typical
> scenario where this happens.
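
To make that legacy flow a bit more concrete, here is a rough userspace
sketch using only the type1 UAPI from <linux/vfio.h>. It is not the actual
QEMU code path: the helper name and the container_fds[] array are invented
for illustration, and error handling and the replay of existing mappings
are omitted.

#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/vfio.h>

/* Returns the container fd the group ended up in, or -1 on failure. */
static int legacy_attach_group(int group_fd, int *container_fds, int nr)
{
    /* First try every existing container, as in the loop quoted above. */
    for (int i = 0; i < nr; i++) {
        if (!ioctl(group_fd, VFIO_GROUP_SET_CONTAINER, &container_fds[i])) {
            return container_fds[i];   /* reuse an existing container */
        }
        /* Attach failed (e.g. reserved IOVA conflict); try the next one. */
    }

    /* No existing container accepted the group: allocate a new one. */
    int fd = open("/dev/vfio/vfio", O_RDWR);
    if (fd < 0) {
        return -1;
    }
    if (ioctl(group_fd, VFIO_GROUP_SET_CONTAINER, &fd) < 0 ||
        ioctl(fd, VFIO_SET_IOMMU, VFIO_TYPE1v2_IOMMU) < 0) {
        close(fd);
        return -1;
    }

    /*
     * QEMU would now replay the AddressSpace into the new container via
     * its MemoryListener, skipping mappings that are allowed to fail
     * (see Alex's explanation below).
     */
    return fd;
}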

Alex provided a clear rationale when we chatted with him on the same topic.
I simply copied it here instead of trying to further translate it.
(Alex, please chime in if you want to add more words. 😊)

Q: Why does the existing VFIOAddressSpace keep a VFIOContainer list? Is it
because one device uses type1 and another uses no_iommu?

A: That's one case of incompatibility, but the IOMMU attach group callback
can fail in a variety of ways. One that we've seen, and that is not
uncommon, is that we might have an mdev container with various mappings to
other devices. None of those mappings are validated until the mdev driver
tries to pin something, and it's generally unlikely that it would pin those
particular mappings. Then QEMU hot-adds a regular IOMMU-backed device, we
allocate a domain for the device and replay the mappings from the
container, but now they get validated and potentially fail. The kernel
returns a failure for the SET_IOMMU ioctl, QEMU creates a new container and
fills it from the same AddressSpace, where now QEMU can determine which
mappings can be safely skipped.

Q: I didn't get why some mappings are valid for one device while they can
be skipped for another device under the same address space. Can you
elaborate a bit? If the skipped mappings are redundant and won't be used
for DMA, why does userspace request them in the first place? I'm a bit
lost here...

A: QEMU sets up a MemoryListener for the device AddressSpace and attempts
to map anything that triggers that listener, which includes not only VM
RAM, which is our primary mapping goal, but also miscellaneous devices,
unaligned regions, and other device regions, ex. BARs. Some of these we
filter out in QEMU with broad generalizations that unaligned ranges aren't
anything we can deal with, but "other device regions" covers anything
that's mmap'd in QEMU, ie. it has an associated KVM memory slot. IIRC, in
the case I'm thinking of, the mapping that triggered the replay failure was
the BAR for an mdev device. No attempt was made to use gup or PFNMAP to
resolve the mapping when only the mdev device was present, and the mdev
host driver didn't attempt to pin pages within its own BAR, but neither of
these methods worked for the replay (I don't recall further specifics).
QEMU always attempts to create p2p mappings for devices, but this is a case
where we don't halt the VM if such a mapping cannot be created, so a new
container would replay the AddressSpace, see the fault, and skip the
region.

Q: If there is a conflict between the reserved regions of a newly plugged
device and the existing mappings of a VFIOAddressSpace, shouldn't the
device simply be rejected from attaching to the address space, instead of
creating another container under that address space?

A: From a kernel perspective, yes, and that's what we do. That doesn't
preclude the user from instantiating a new container and determining for
themselves whether the reserved region conflict is critical. Note that just
because containers are in the same AddressSpace doesn't mean that there
aren't rules that allow certain mapping failures to be non-fatal.

Thanks
Kevin
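
For reference, here is a rough illustration of the non-fatal mapping
behaviour described above: a replay that must map VM RAM but is allowed to
skip p2p mappings of other devices' BARs. This is not QEMU's actual
MemoryListener code; the region struct and the is_device_ram flag are
invented for the example, and only VFIO_IOMMU_MAP_DMA and its flags come
from the real type1 UAPI.

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

struct region {
    uint64_t iova;
    uint64_t size;
    void *vaddr;
    int is_device_ram;              /* e.g. another device's mmap'd BAR */
};

/* Replay an address space into a container; returns 0 on success. */
static int replay_address_space(int container_fd,
                                const struct region *regions, int nr)
{
    for (int i = 0; i < nr; i++) {
        struct vfio_iommu_type1_dma_map map = {
            .argsz = sizeof(map),
            .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
            .vaddr = (uintptr_t)regions[i].vaddr,
            .iova  = regions[i].iova,
            .size  = regions[i].size,
        };

        if (!ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map)) {
            continue;               /* mapped successfully */
        }
        if (regions[i].is_device_ram) {
            /* A failed p2p mapping is not fatal: note it and keep going. */
            fprintf(stderr, "skipping p2p mapping at iova 0x%" PRIx64 "\n",
                    regions[i].iova);
            continue;
        }
        return -1;                  /* failing to map VM RAM is fatal */
    }
    return 0;
}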