Re: [PATCH V1] vfio: add attach_group_by_node to control behavior of attaching group to domain

Zhenguo Yao <yaozhenguo1@xxxxxxxxx> · Wed, 22 Nov 2023 09:02:13 +0800



Understood, thanks！

Alex Williamson <alex.williamson@xxxxxxxxxx> 于2023年11月18日周六 06:27写道：
>
> On Wed, 15 Nov 2023 10:02:09 +0800
> yaozhenguo <yaozhenguo1@xxxxxxxxx> wrote:
>
> > From: Zhenguo Yao <yaozhenguo1@xxxxxxxxx>
> >
> > Groups will attach to one iommu_domain if ops and enforce_cache_coherency
> > are equal. And all the iommu hardware share one pagetable by default.
> > There are performance issue in some scenarios. For example:
> > Host hardware topopy:
> >
> > node0 + PCIe RP0 ---+ GPU A100
> >       |         |---+ GPU A100
> >       |               |---+ NIC Mellanox CX6
> >       |               |---+ NIC Mellanox CX6
> >       + PCIe RP1 ---+ GPU A100
> >                 |---+ GPU A100
> >                       |---+ NIC Mellanox CX6
> >                 |---+ NIC Mellanox CX6
> > node1 + PCIe RP0 ---+ GPU A100
> >       |         |---+ GPU A100
> >       |               |---+ NIC Mellanox CX6
> >       |               |---+ NIC Mellanox CX6
> >       + PCIe RP1 ---+ GPU A100
> >                 |---+ GPU A100
> >                       |---+ NIC Mellanox CX6
> >                 |---+ NIC Mellanox CX6
> >
> > We passthrough all NICs and GPU to VM, and emulate host hardware topopy.
> > Mellanox CX6 ATS feature is enabled, GPU direct RDMA enabled.
> > We test NCCL allreduce in VM at different cases.
> >
> > Case1: allreduce test use 4nic and 4GPU in numa0.
> > Case2：allreduce test use 4nic and 4GPU in numa1.
> > case3: allreduce test use 8nic and 8GPU.
> >
> > the result are below:
> >
> > |        | algbw (GB/S) |
> > | ------ | -------------|
> > | case1  | 24           |
> > | case2  | 32           |
> > | case3  | 45           |
> >
> > We checked that IOMMU pagetable is allocated in numa1 when VM boot up.
> > So, if IOTLB miss happan, IOMMU hardware in numa0 will access remote
> > pagetable in numa1. This will drop performance. After apply this patch and
> > attach_group_by_node is 1. Group in same node will attach to one domain.
> > IOMMU will access there local pagetable. Performance is improved:
> >
> > |        | algbw (GB/S) |
> > | ------ | -------------|
> > | case1  | 32           |
> > | case2  | 32           |
> > | case3  | 63           |
> >
> > Signed-off-by: Zhenguo Yao <yaozhenguo1@xxxxxxxxx>
> > Co-developed-by: Wenchao Yao <yaowenchao@xxxxxx>
> > Signed-off-by: Wenchao Yao <yaowenchao@xxxxxx>
> > Co-developed-by: ZiHan Zhou <zhouzihan30@xxxxxx>
> > Signed-off-by: ZiHan Zhou <zhouzihan30@xxxxxx>
> > ---
> >  drivers/iommu/intel/iommu.c     |  8 +++++++-
> >  drivers/vfio/vfio_iommu_type1.c | 33 +++++++++++++++++++++------------
> >  include/linux/iommu.h           |  1 +
> >  3 files changed, 29 insertions(+), 13 deletions(-)
> >
> > diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> > index 3531b95..2c6d8f0 100644
> > --- a/drivers/iommu/intel/iommu.c
> > +++ b/drivers/iommu/intel/iommu.c
> > @@ -569,8 +569,10 @@ void domain_update_iommu_cap(struct dmar_domain *domain)
> >        * If RHSA is missing, we should default to the device numa domain
> >        * as fall back.
> >        */
> > -     if (domain->nid == NUMA_NO_NODE)
> > +     if (domain->nid == NUMA_NO_NODE) {
> >               domain->nid = domain_update_device_node(domain);
> > +             domain->domain.nid = domain->nid;
> > +     }
> >
> >       /*
> >        * First-level translation restricts the input-address to a
> > @@ -1767,6 +1769,7 @@ static struct dmar_domain *alloc_domain(unsigned int type)
> >               return NULL;
> >
> >       domain->nid = NUMA_NO_NODE;
> > +     domain->domain.nid = NUMA_NO_NODE;
> >       if (first_level_by_default(type))
> >               domain->use_first_level = true;
> >       domain->has_iotlb_device = false;
> > @@ -1808,6 +1811,8 @@ int domain_attach_iommu(struct dmar_domain *domain, struct intel_iommu *iommu)
> >       info->refcnt    = 1;
> >       info->did       = num;
> >       info->iommu     = iommu;
> > +     domain->nid     = iommu->node;
> > +     domain->domain.nid     = iommu->node;
> >       curr = xa_cmpxchg(&domain->iommu_array, iommu->seq_id,
> >                         NULL, info, GFP_ATOMIC);
> >       if (curr) {
> > @@ -1837,6 +1842,7 @@ void domain_detach_iommu(struct dmar_domain *domain, struct intel_iommu *iommu)
> >               clear_bit(info->did, iommu->domain_ids);
> >               xa_erase(&domain->iommu_array, iommu->seq_id);
> >               domain->nid = NUMA_NO_NODE;
> > +             domain->domain.nid = NUMA_NO_NODE;
> >               domain_update_iommu_cap(domain);
> >               kfree(info);
> >       }
> > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > index eacd6ec..6a5641e 100644
> > --- a/drivers/vfio/vfio_iommu_type1.c
> > +++ b/drivers/vfio/vfio_iommu_type1.c
> > @@ -59,6 +59,11 @@
> >  module_param_named(dma_entry_limit, dma_entry_limit, uint, 0644);
> >  MODULE_PARM_DESC(dma_entry_limit,
> >                "Maximum number of user DMA mappings per container (65535).");
> > +static uint attach_group_by_node;
> > +module_param_named(attach_group_by_node,
> > +             attach_group_by_node, uint, 0644);
> > +MODULE_PARM_DESC(attach_group_by_node,
> > +              "Attach group to domain when it's in same node");
> >
> >  struct vfio_iommu {
> >       struct list_head        domain_list;
> > @@ -2287,19 +2292,23 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
> >               if (d->domain->ops == domain->domain->ops &&
> >                   d->enforce_cache_coherency ==
> >                           domain->enforce_cache_coherency) {
> > -                     iommu_detach_group(domain->domain, group->iommu_group);
> > -                     if (!iommu_attach_group(d->domain,
> > -                                             group->iommu_group)) {
> > -                             list_add(&group->next, &d->group_list);
> > -                             iommu_domain_free(domain->domain);
> > -                             kfree(domain);
> > -                             goto done;
> > -                     }
> > +                     if ((attach_group_by_node == 1 &&
> > +                             d->domain->nid == domain->domain->nid) ||
> > +                             attach_group_by_node == 0) {
> > +                             iommu_detach_group(domain->domain, group->iommu_group);
> > +                             if (!iommu_attach_group(d->domain,
> > +                                                     group->iommu_group)) {
> > +                                     list_add(&group->next, &d->group_list);
> > +                                     iommu_domain_free(domain->domain);
> > +                                     kfree(domain);
> > +                                     goto done;
> > +                             }
> >
> > -                     ret = iommu_attach_group(domain->domain,
> > -                                              group->iommu_group);
> > -                     if (ret)
> > -                             goto out_domain;
> > +                             ret = iommu_attach_group(domain->domain,
> > +                                             group->iommu_group);
> > +                             if (ret)
> > +                                     goto out_domain;
> > +                     }
> >               }
> >       }
> >
> > diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> > index ec289c1..c1330ed 100644
> > --- a/include/linux/iommu.h
> > +++ b/include/linux/iommu.h
> > @@ -123,6 +123,7 @@ struct iommu_domain {
> >                       int users;
> >               };
> >       };
> > +     int nid;
> >  };
> >
> >  static inline bool iommu_is_dma_domain(struct iommu_domain *domain)
>
> As I understand what's being done here, we're duplicating
> dmar_domain.nid to iommu_domain.nid, then when enabled by this new
> module option, we'll use this node id as part of the match to determine
> whether to create a new domain within the same container context or
> re-use an existing domain, which may have non-favorably locality.
>
> If we're going to implement a node id on the iommu_domain, it should
> replace the existing use of node id in the device specific structure
> and not simply duplicate it.  This should also account for non-VT-d use
> cases as well, for example AMD IOMMU also has a nid field on their
> protection_domain structure.  Alternatively this might be implemented
> through iommu_domain_ops so we could query the node association for a
> domain.
>
> I question whether we need this solution at all though.  AIUI the
> initial domain is allocated in proximity to the initial group.  The
> problem comes when the user asks to add an additional group into the
> same container.  Another valid solution would be that the user
> recognizes that these groups are not within the same locality and
> creates a separate container for this group.  In fact, if we're using
> QEMU here and created a q35 VM with vIOMMU, each device would have a
> separate address space and therefore a separate container and we'd
> already avoid the issue this patch tries to solve.
>
> Separate containers per QEMU AddressSpace are a requirement, but QEMU
> might also implement a policy to not re-use vfio containers between
> virtual nodes such that if each locality were mapped to separate PXBs
> with unique proximities, then simply reflecting the physical locality
> into the VM would be sufficient to avoid this non-optimal domain
> allocation placement.
>
> In any case, the type1 vfio IOMMU backend is in the early stages of
> deprecation, so any choices we make here would also need to be reflected
> in IOMMUFD, both in the compatibility and native interfaces.  Thanks,
>
> Alex
>