On Tue, Apr 26, 2022 at 10:41:01AM +0000, Tian, Kevin wrote:

> That's one case of incompatibility, but the IOMMU attach group callback
> can fail in a variety of ways. One that we've seen that is not
> uncommon is that we might have an mdev container with various mappings
> to other devices. None of those mappings are validated until the mdev
> driver tries to pin something, where it's generally unlikely that
> they'd pin those particular mappings. Then QEMU hot-adds a regular
> IOMMU backed device, we allocate a domain for the device and replay the
> mappings from the container, but now they get validated and potentially
> fail. The kernel returns a failure for the SET_IOMMU ioctl, QEMU
> creates a new container and fills it from the same AddressSpace, where
> now QEMU can determine which mappings can be safely skipped.

I think it is strange that the DMA a guest is allowed to do depends on
the order in which devices are plugged into the guest, and varies from
device to device?

IMHO it would be nicer if qemu were able to read the new reserved
regions and unmap the conflicts before hot-plugging the new device. We
don't have a kernel API to do this, maybe we should have one?

> A:
> QEMU sets up a MemoryListener for the device AddressSpace and attempts
> to map anything that triggers that listener, which includes not only VM
> RAM which is our primary mapping goal, but also miscellaneous devices,
> unaligned regions, and other device regions, ex. BARs. Some of these
> we filter out in QEMU with broad generalizations that unaligned ranges
> aren't anything we can deal with, but other device regions covers
> anything that's mmap'd in QEMU, ie. it has an associated KVM memory
> slot. IIRC, in the case I'm thinking of, the mapping that triggered
> the replay failure was the BAR for an mdev device. No attempt was made
> to use gup or PFNMAP to resolve the mapping when only the mdev device
> was present and the mdev host driver didn't attempt to pin pages within
> its own BAR, but neither of these methods worked for the replay (I
> don't recall further specifics).

This feels sort of like a bug in iommufd, or perhaps qemu..

With iommufd only normal GUP'able memory should be passed to map.
Special memory will have to go through some other API. This is
different from vfio containers.

We could possibly check the VMAs in iommufd during map to enforce
normal memory.. However I'm also a bit surprised that qemu can't ID
the underlying memory source and avoid this?

eg currently I see log messages that it is passing P2P BAR memory into
iommufd map; this should be prevented inside qemu, because iommufd
cannot currently be relied on to reject it correctly.

IMHO multi-container should be avoided because it forces creating
multiple iommu_domains, which has a memory/performance cost.

Though, this is not important enough to be urgent (and copy makes it
work better anyhow), so qemu can stay as it is.

Jason
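
A note on the "read the new reserved regions and unmap the conflicts"
idea above: the per-group sysfs file
/sys/kernel/iommu_groups/<N>/reserved_regions already exposes the
regions, although not in a form tied to the container, so it does not
by itself solve the ordering problem described in the thread. A minimal
userspace sketch of reading it (the group number and the conflict
handling are placeholders, not anything QEMU does today):

/* Sketch only: parse /sys/kernel/iommu_groups/<N>/reserved_regions,
 * whose lines look like "0x00000000fee00000 0x00000000feefffff msi".
 * What to do about a conflicting existing mapping is exactly the
 * open question raised above.
 */
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

static int read_reserved_regions(int group)
{
    char path[128], type[32];
    uint64_t start, end;
    FILE *f;

    snprintf(path, sizeof(path),
             "/sys/kernel/iommu_groups/%d/reserved_regions", group);
    f = fopen(path, "r");
    if (!f)
        return -1;

    while (fscanf(f, "%" SCNx64 " %" SCNx64 " %31s",
                  &start, &end, type) == 3) {
        /* Here userspace would check its existing DMA maps for overlap
         * and unmap (or refuse the hotplug) before attaching the group. */
        printf("reserved: 0x%" PRIx64 "-0x%" PRIx64 " (%s)\n",
               start, end, type);
    }
    fclose(f);
    return 0;
}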
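
For the "check the VMAs in iommufd during map" thought, a rough
kernel-side sketch (not actual iommufd code; it uses the vma list
walk of kernels current at the time of this thread) might look like
the following, rejecting special memory up front instead of letting
the pin fail later:

/* Sketch: refuse "special" memory (VM_IO/VM_PFNMAP, e.g. an mmap'd PCI
 * BAR) at map time. Holes in the range are left for GUP itself to catch.
 */
#include <linux/mm.h>

static int range_is_gupable(unsigned long start, unsigned long length)
{
	struct mm_struct *mm = current->mm;
	unsigned long end = start + length;
	struct vm_area_struct *vma;
	int ret = 0;

	mmap_read_lock(mm);
	for (vma = find_vma(mm, start); vma && vma->vm_start < end;
	     vma = vma->vm_next) {
		if (vma->vm_flags & (VM_IO | VM_PFNMAP)) {
			ret = -EINVAL;
			break;
		}
	}
	mmap_read_unlock(mm);
	return ret;
}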
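
And on the point that qemu could ID the underlying memory source
itself: the listener already knows whether a section is backed by
ordinary RAM or by a ram_device region (an mmap'd BAR, which is where
the P2P memory comes from), so a filter along these lines (the helper
name is made up; this is not the actual hw/vfio listener code) would
keep that memory out of the iommufd map path:

/* Sketch of a MemoryListener section filter for the map path. */
#include "qemu/osdep.h"
#include "exec/memory.h"

static bool section_is_plain_ram(MemoryRegionSection *section)
{
    MemoryRegion *mr = section->mr;

    if (memory_region_is_iommu(mr)) {
        return false;            /* handled via IOMMU notifiers instead */
    }
    if (!memory_region_is_ram(mr)) {
        return false;            /* MMIO and other non-RAM regions */
    }
    if (memory_region_is_ram_device(mr)) {
        return false;            /* mmap'd device BAR, incl. P2P memory */
    }
    return true;
}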