Understood, thanks! Alex Williamson <alex.williamson@xxxxxxxxxx> 于2023年11月18日周六 06:27写道: > > On Wed, 15 Nov 2023 10:02:09 +0800 > yaozhenguo <yaozhenguo1@xxxxxxxxx> wrote: > > > From: Zhenguo Yao <yaozhenguo1@xxxxxxxxx> > > > > Groups will attach to one iommu_domain if ops and enforce_cache_coherency > > are equal. And all the iommu hardware share one pagetable by default. > > There are performance issue in some scenarios. For example: > > Host hardware topopy: > > > > node0 + PCIe RP0 ---+ GPU A100 > > | |---+ GPU A100 > > | |---+ NIC Mellanox CX6 > > | |---+ NIC Mellanox CX6 > > + PCIe RP1 ---+ GPU A100 > > |---+ GPU A100 > > |---+ NIC Mellanox CX6 > > |---+ NIC Mellanox CX6 > > node1 + PCIe RP0 ---+ GPU A100 > > | |---+ GPU A100 > > | |---+ NIC Mellanox CX6 > > | |---+ NIC Mellanox CX6 > > + PCIe RP1 ---+ GPU A100 > > |---+ GPU A100 > > |---+ NIC Mellanox CX6 > > |---+ NIC Mellanox CX6 > > > > We passthrough all NICs and GPU to VM, and emulate host hardware topopy. > > Mellanox CX6 ATS feature is enabled, GPU direct RDMA enabled. > > We test NCCL allreduce in VM at different cases. > > > > Case1: allreduce test use 4nic and 4GPU in numa0. > > Case2:allreduce test use 4nic and 4GPU in numa1. > > case3: allreduce test use 8nic and 8GPU. > > > > the result are below: > > > > | | algbw (GB/S) | > > | ------ | -------------| > > | case1 | 24 | > > | case2 | 32 | > > | case3 | 45 | > > > > We checked that IOMMU pagetable is allocated in numa1 when VM boot up. > > So, if IOTLB miss happan, IOMMU hardware in numa0 will access remote > > pagetable in numa1. This will drop performance. After apply this patch and > > attach_group_by_node is 1. Group in same node will attach to one domain. > > IOMMU will access there local pagetable. Performance is improved: > > > > | | algbw (GB/S) | > > | ------ | -------------| > > | case1 | 32 | > > | case2 | 32 | > > | case3 | 63 | > > > > Signed-off-by: Zhenguo Yao <yaozhenguo1@xxxxxxxxx> > > Co-developed-by: Wenchao Yao <yaowenchao@xxxxxx> > > Signed-off-by: Wenchao Yao <yaowenchao@xxxxxx> > > Co-developed-by: ZiHan Zhou <zhouzihan30@xxxxxx> > > Signed-off-by: ZiHan Zhou <zhouzihan30@xxxxxx> > > --- > > drivers/iommu/intel/iommu.c | 8 +++++++- > > drivers/vfio/vfio_iommu_type1.c | 33 +++++++++++++++++++++------------ > > include/linux/iommu.h | 1 + > > 3 files changed, 29 insertions(+), 13 deletions(-) > > > > diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c > > index 3531b95..2c6d8f0 100644 > > --- a/drivers/iommu/intel/iommu.c > > +++ b/drivers/iommu/intel/iommu.c > > @@ -569,8 +569,10 @@ void domain_update_iommu_cap(struct dmar_domain *domain) > > * If RHSA is missing, we should default to the device numa domain > > * as fall back. > > */ > > - if (domain->nid == NUMA_NO_NODE) > > + if (domain->nid == NUMA_NO_NODE) { > > domain->nid = domain_update_device_node(domain); > > + domain->domain.nid = domain->nid; > > + } > > > > /* > > * First-level translation restricts the input-address to a > > @@ -1767,6 +1769,7 @@ static struct dmar_domain *alloc_domain(unsigned int type) > > return NULL; > > > > domain->nid = NUMA_NO_NODE; > > + domain->domain.nid = NUMA_NO_NODE; > > if (first_level_by_default(type)) > > domain->use_first_level = true; > > domain->has_iotlb_device = false; > > @@ -1808,6 +1811,8 @@ int domain_attach_iommu(struct dmar_domain *domain, struct intel_iommu *iommu) > > info->refcnt = 1; > > info->did = num; > > info->iommu = iommu; > > + domain->nid = iommu->node; > > + domain->domain.nid = iommu->node; > > curr = xa_cmpxchg(&domain->iommu_array, iommu->seq_id, > > NULL, info, GFP_ATOMIC); > > if (curr) { > > @@ -1837,6 +1842,7 @@ void domain_detach_iommu(struct dmar_domain *domain, struct intel_iommu *iommu) > > clear_bit(info->did, iommu->domain_ids); > > xa_erase(&domain->iommu_array, iommu->seq_id); > > domain->nid = NUMA_NO_NODE; > > + domain->domain.nid = NUMA_NO_NODE; > > domain_update_iommu_cap(domain); > > kfree(info); > > } > > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c > > index eacd6ec..6a5641e 100644 > > --- a/drivers/vfio/vfio_iommu_type1.c > > +++ b/drivers/vfio/vfio_iommu_type1.c > > @@ -59,6 +59,11 @@ > > module_param_named(dma_entry_limit, dma_entry_limit, uint, 0644); > > MODULE_PARM_DESC(dma_entry_limit, > > "Maximum number of user DMA mappings per container (65535)."); > > +static uint attach_group_by_node; > > +module_param_named(attach_group_by_node, > > + attach_group_by_node, uint, 0644); > > +MODULE_PARM_DESC(attach_group_by_node, > > + "Attach group to domain when it's in same node"); > > > > struct vfio_iommu { > > struct list_head domain_list; > > @@ -2287,19 +2292,23 @@ static int vfio_iommu_type1_attach_group(void *iommu_data, > > if (d->domain->ops == domain->domain->ops && > > d->enforce_cache_coherency == > > domain->enforce_cache_coherency) { > > - iommu_detach_group(domain->domain, group->iommu_group); > > - if (!iommu_attach_group(d->domain, > > - group->iommu_group)) { > > - list_add(&group->next, &d->group_list); > > - iommu_domain_free(domain->domain); > > - kfree(domain); > > - goto done; > > - } > > + if ((attach_group_by_node == 1 && > > + d->domain->nid == domain->domain->nid) || > > + attach_group_by_node == 0) { > > + iommu_detach_group(domain->domain, group->iommu_group); > > + if (!iommu_attach_group(d->domain, > > + group->iommu_group)) { > > + list_add(&group->next, &d->group_list); > > + iommu_domain_free(domain->domain); > > + kfree(domain); > > + goto done; > > + } > > > > - ret = iommu_attach_group(domain->domain, > > - group->iommu_group); > > - if (ret) > > - goto out_domain; > > + ret = iommu_attach_group(domain->domain, > > + group->iommu_group); > > + if (ret) > > + goto out_domain; > > + } > > } > > } > > > > diff --git a/include/linux/iommu.h b/include/linux/iommu.h > > index ec289c1..c1330ed 100644 > > --- a/include/linux/iommu.h > > +++ b/include/linux/iommu.h > > @@ -123,6 +123,7 @@ struct iommu_domain { > > int users; > > }; > > }; > > + int nid; > > }; > > > > static inline bool iommu_is_dma_domain(struct iommu_domain *domain) > > As I understand what's being done here, we're duplicating > dmar_domain.nid to iommu_domain.nid, then when enabled by this new > module option, we'll use this node id as part of the match to determine > whether to create a new domain within the same container context or > re-use an existing domain, which may have non-favorably locality. > > If we're going to implement a node id on the iommu_domain, it should > replace the existing use of node id in the device specific structure > and not simply duplicate it. This should also account for non-VT-d use > cases as well, for example AMD IOMMU also has a nid field on their > protection_domain structure. Alternatively this might be implemented > through iommu_domain_ops so we could query the node association for a > domain. > > I question whether we need this solution at all though. AIUI the > initial domain is allocated in proximity to the initial group. The > problem comes when the user asks to add an additional group into the > same container. Another valid solution would be that the user > recognizes that these groups are not within the same locality and > creates a separate container for this group. In fact, if we're using > QEMU here and created a q35 VM with vIOMMU, each device would have a > separate address space and therefore a separate container and we'd > already avoid the issue this patch tries to solve. > > Separate containers per QEMU AddressSpace are a requirement, but QEMU > might also implement a policy to not re-use vfio containers between > virtual nodes such that if each locality were mapped to separate PXBs > with unique proximities, then simply reflecting the physical locality > into the VM would be sufficient to avoid this non-optimal domain > allocation placement. > > In any case, the type1 vfio IOMMU backend is in the early stages of > deprecation, so any choices we make here would also need to be reflected > in IOMMUFD, both in the compatibility and native interfaces. Thanks, > > Alex >