Hi,

This is a draft solution for supporting multiple vSMMU instances in a
qemu VM.

Based on discussions/suggestions received for a previous RFC by Nicolin
here[0], the association of vSMMUs to VFIO devices in the VM PCIe topology
should be moved out of qemu into libvirt. In addition, the nested SMMU
nodes should be passed to qemu as pluggable devices. To address these
changes, this patch series introduces a new "nestedSmmuv3" IOMMU model
and a "nestedSmmuv3" device type. Upon specifying the nestedSmmuv3 IOMMU
model, nestedSmmuv3 devices will be auto-added to the VM definition based
on the available SMMU nodes in the host's sysfs. The nestedSmmuv3 devices
will each be attached to a separate PXB controller, and VFIO devices will
be routed to PXBs based on their association with host SMMU nodes.

This maintains a VM PCIe topology that allows for multiple nested SMMUs,
as in Nicolin's original qemu patch series [0], and follows Shameer's work
in [1], which removes VM topology changes from qemu and allows the nested
SMMUs to be specified as pluggable devices.

For instance, if we specify the nestedSmmuv3 IOMMU model and a hostdev
for passthrough:

<devices>
  <hostdev mode='subsystem' type='pci' managed='no'>
    <source>
      <address domain='0x0009' bus='0x01' slot='0x00' function='0x0'/>
    </source>
  </hostdev>
  <iommu model='nestedSmmuv3'/>
</devices>

Libvirt will scan sysfs and populate the VM definition with controllers
and nestedSmmuv3 devices based on the host config. So if
/sys/bus/pci/devices/0009:01:00.0/iommu is a symlink to the host SMMU
node represented by
/sys/devices/platform/arm-smmu-v3.8.auto/iommu/smmu3.0x0000000016000000
and there are 3 host SMMU nodes under /sys/class/iommu/, we'll see three
auto-added nestedSmmuv3 devices, each routed to a pcie-expander-bus
controller. The hostdev will then be routed to the PXB controller whose
associated host SMMU node matches its own:

<devices>
  ...
  <controller type='pci' index='1' model='pcie-expander-bus'>
    <model name='pxb-pcie'/>
    <target busNr='254'/>
    <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x0'/>
  </controller>
  <controller type='pci' index='2' model='pcie-expander-bus'>
    <model name='pxb-pcie'/>
    <target busNr='251'/>
    <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
  </controller>
  <controller type='pci' index='3' model='pcie-expander-bus'>
    <model name='pxb-pcie'/>
    <target busNr='249'/>
    <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
  </controller>
  <controller type='pci' index='4' model='pcie-root-port'>
    <model name='pcie-root-port'/>
    <target chassis='7' port='0x8'/>
    <address type='pci' domain='0x0000' bus='0x02' slot='0x01' function='0x0'/>
  </controller>
  <hostdev mode='subsystem' type='pci' managed='no'>
    <source>
      <address domain='0x0009' bus='0x01' slot='0x00' function='0x0'/>
    </source>
    <address type='pci' domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
  </hostdev>
  <iommu model='nestedSmmuv3'/>
  <nestedSmmuv3>
    <name>smmu3.0x0000000012000000</name>
    <address type='pci' domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
  </nestedSmmuv3>
  <nestedSmmuv3>
    <name>smmu3.0x0000000016000000</name>
    <address type='pci' domain='0x0000' bus='0x02' slot='0x00' function='0x0'/>
  </nestedSmmuv3>
  <nestedSmmuv3>
    <name>smmu3.0x0000000011000000</name>
    <address type='pci' domain='0x0000' bus='0x03' slot='0x00' function='0x0'/>
  </nestedSmmuv3>
  <iommu model='nestedSmmuv3'/>
</devices>

TODO:
- No DMA mapping can be found by UEFI when specifying multiple
  passthrough devices in the VM definition, and VM boot is subsequently
  blocked. We need to investigate this for the next revision, but we
  don't encounter this issue when passing through a single device. We'll
  include iommufd support in the next revision to narrow down whether
  the required fix would be outside of libvirt.
- Shameer's qemu branch specifies the nestedSmmuv3 bus number with
  "pci-bus" instead of "bus", so the libvirt compilation test args and
  the qemu args in qemuBuildPCINestedSmmuv3DevProps() need to be modified
  to match this revision of qemu. This will be reverted to "bus" in the
  next qemu revision.
- This patchset decrements the PXB busNr based on how many devices are
  attached downstream, and the libvirt documentation states we must
  reserve a busNr for the PXB itself in addition to any devices attached
  downstream. When I launch a VM where a PXB has a pcie-root-port and a
  hostdev attached downstream, busNrs 253, 252, and 251 are reserved.
  But the PXB itself already has a bus number assigned via its
  <address/> element, and in the VM I see 253 and 252 assigned to the
  hostdev and pcie-root-port but not 251. Should we decrement busNr as
  the libvirt documentation describes, or do we only need the two busNrs
  253 and 252 in the example here?

This series is on GitHub:
https://github.com/NathanChenNVIDIA/libvirt/tree/nested-smmuv3-12-05-24

Thanks,
Nathan

[0] https://lore.kernel.org/qemu-devel/cover.1719361174.git.nicolinc@xxxxxxxxxx/
[1] https://lore.kernel.org/qemu-devel/20241108125242.60136-1-shameerali.kolothum.thodi@xxxxxxxxxx/

Signed-off-by: Nathan Chen <nathanc@xxxxxxxxxx>

Nathan Chen (5):
  conf: Add a nestedSmmuv3 IOMMU model
  qemu: Implement and auto-add a nestedSmmuv3 device type
  qemu: Create PXBs and auto-assign VFIO devs and nested SMMUs
  qemu: Update PXB busNr for nestedSmmuv3 controllers
  qemu: Add test case for specifying multiple nested SMMUs

 docs/formatdomain.rst                        |  25 ++-
 src/ch/ch_domain.c                           |   1 +
 src/conf/domain_addr.c                       |  26 ++-
 src/conf/domain_addr.h                       |   4 +-
 src/conf/domain_conf.c                       | 188 +++++++++++++++++
 src/conf/domain_conf.h                       |  15 ++
 src/conf/domain_postparse.c                  |   1 +
 src/conf/domain_validate.c                   |  24 +++
 src/conf/schemas/domaincommon.rng            |  17 ++
 src/conf/virconftypes.h                      |   2 +
 src/libvirt_private.syms                     |   2 +
 src/lxc/lxc_driver.c                         |   6 +
 src/qemu/qemu_command.c                      |  64 +++++-
 src/qemu/qemu_command.h                      |   4 +
 src/qemu/qemu_domain.c                       |   2 +
 src/qemu/qemu_domain_address.c               | 193 ++++++++++++++++++
 src/qemu/qemu_driver.c                       |   3 +
 src/qemu/qemu_hotplug.c                      |   5 +
 src/qemu/qemu_postparse.c                    |   1 +
 src/qemu/qemu_validate.c                     |  16 ++
 src/test/test_driver.c                       |   4 +
 tests/meson.build                            |   1 +
 .../iommu-nestedsmmuv3.aarch64-latest.args   |  38 ++++
 .../iommu-nestedsmmuv3.aarch64-latest.xml    |  61 ++++++
 tests/qemuxmlconfdata/iommu-nestedsmmuv3.xml |  29 +++
 tests/qemuxmlconftest.c                      |   4 +-
 tests/schemas/device.rng.in                  |   1 +
 tests/virnestedsmmuv3mock.c                  |  57 ++++++
 28 files changed, 788 insertions(+), 6 deletions(-)
 create mode 100644 tests/qemuxmlconfdata/iommu-nestedsmmuv3.aarch64-latest.args
 create mode 100644 tests/qemuxmlconfdata/iommu-nestedsmmuv3.aarch64-latest.xml
 create mode 100644 tests/qemuxmlconfdata/iommu-nestedsmmuv3.xml
 create mode 100644 tests/virnestedsmmuv3mock.c

-- 
2.34.1
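
P.S. For anyone who wants to check the host-side association by hand, here is a rough sketch of the sysfs lookup described in the cover letter: list the nodes under /sys/class/iommu/, resolve each PCI device's "iommu" symlink, and group devices by the SMMU node they sit behind. This is not the code in the series (which is C, in libvirt); the function names and the sysfs_root parameter are made up for illustration, and it assumes the sysfs layout shown in the example above.

```python
import os
from collections import defaultdict


def host_smmu_nodes(sysfs_root="/sys"):
    """List the IOMMU node names found under <sysfs_root>/class/iommu."""
    class_dir = os.path.join(sysfs_root, "class", "iommu")
    if not os.path.isdir(class_dir):
        return []
    return sorted(os.listdir(class_dir))


def smmu_for_pci_device(bdf, sysfs_root="/sys"):
    """Return the SMMU node name a PCI device is behind, or None.

    Follows <sysfs_root>/bus/pci/devices/<bdf>/iommu and takes the last
    path component of the resolved target, e.g. smmu3.0x0000000016000000.
    """
    link = os.path.join(sysfs_root, "bus", "pci", "devices", bdf, "iommu")
    if not os.path.islink(link):
        return None
    return os.path.basename(os.path.realpath(link))


def group_by_smmu(bdfs, sysfs_root="/sys"):
    """Group PCI addresses by host SMMU node (hypothetical helper)."""
    groups = defaultdict(list)
    for bdf in bdfs:
        node = smmu_for_pci_device(bdf, sysfs_root)
        if node is not None:
            groups[node].append(bdf)
    return dict(groups)


if __name__ == "__main__":
    print("host SMMU nodes:", host_smmu_nodes())
    print("grouping:", group_by_smmu(["0009:01:00.0"]))
```

Each key in the returned grouping would correspond to one nestedSmmuv3 device (and its PXB) in the generated VM definition, with the listed hostdevs routed below that PXB.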