[Public]

Messed up James' email in Tested-by tag.  CC'ing James.

> -----Original Message-----
> From: Kim, Jonathan <Jonathan.Kim@xxxxxxx>
> Sent: Wednesday, October 16, 2024 11:59 AM
> To: amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> Cc: Kasiviswanathan, Harish <Harish.Kasiviswanathan@xxxxxxx>; Kuehling, Felix
> <Felix.Kuehling@xxxxxxx>; Kim, Jonathan <Jonathan.Kim@xxxxxxx>; Kim,
> Jonathan <Jonathan.Kim@xxxxxxx>; James Yao <yiqing@xxxxxxxxxxx>
> Subject: [PATCH] drm/amdkfd: sever xgmi io link if host driver has disable sharing
>
> From: Jonathan Kim <Jonathan.Kim@xxxxxxx>
>
> Host drivers can create partial hives per guest by disabling xgmi sharing
> between certain peers in the main hive.
> Typically, these partial hives are fully connected per guest session.
> In the event that the host makes a mistake by adding a non-shared node
> to a guest session, have the KFD reflect sharing disabled by severing
> the IO link.
>
> Signed-off-by: Jonathan Kim <jonathan.kim@xxxxxxx>
> Tested-by: James Yao <yiqing@xxxxxxxxxxx>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c | 17 +++++++++++++++++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h |  2 ++
>  drivers/gpu/drm/amd/amdkfd/kfd_crat.c    |  3 +++
>  3 files changed, 22 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
> index fcdbcff57632..1d50f327eb08 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
> @@ -801,6 +801,23 @@ int amdgpu_xgmi_get_num_links(struct amdgpu_device
> *adev,
>  	return -EINVAL;
>  }
>
> +bool amdgpu_xgmi_get_is_sharing_enabled(struct amdgpu_device *adev,
> +					struct amdgpu_device *peer_adev)
> +{
> +	struct psp_xgmi_topology_info *top = &adev->psp.xgmi_context.top_info;
> +	int i;
> +
> +	/* Sharing should always be enabled for non-SRIOV. */
> +	if (!amdgpu_sriov_vf(adev))
> +		return true;
> +
> +	for (i = 0 ; i < top->num_nodes; ++i)
> +		if (top->nodes[i].node_id == peer_adev->gmc.xgmi.node_id)
> +			return !!top->nodes[i].is_sharing_enabled;
> +
> +	return false;
> +}
> +
>  /*
>   * Devices that support extended data require the entire hive to initialize with
>   * the shared memory buffer flag set.
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h
> index 41d5f97fc77a..8cc7ab38db7c 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h
> @@ -66,6 +66,8 @@ int amdgpu_xgmi_get_hops_count(struct amdgpu_device
> *adev,
>  		struct amdgpu_device *peer_adev);
>  int amdgpu_xgmi_get_num_links(struct amdgpu_device *adev,
>  		struct amdgpu_device *peer_adev);
> +bool amdgpu_xgmi_get_is_sharing_enabled(struct amdgpu_device *adev,
> +					struct amdgpu_device *peer_adev);
>  uint64_t amdgpu_xgmi_get_relative_phy_addr(struct amdgpu_device *adev,
>  					   uint64_t addr);
>  static inline bool amdgpu_xgmi_same_hive(struct amdgpu_device *adev,
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_crat.c
> b/drivers/gpu/drm/amd/amdkfd/kfd_crat.c
> index 48caecf7e72e..723f1220e1cc 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_crat.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_crat.c
> @@ -28,6 +28,7 @@
>  #include "kfd_topology.h"
>  #include "amdgpu.h"
>  #include "amdgpu_amdkfd.h"
> +#include "amdgpu_xgmi.h"
>
>  /* GPU Processor ID base for dGPUs for which VCRAT needs to be created.
>   * GPU processor ID are expressed with Bit[31]=1.
> @@ -2329,6 +2330,8 @@ static int kfd_create_vcrat_image_gpu(void *pcrat_image,
>  				continue;
>  			if (peer_dev->gpu->kfd->hive_id != kdev->kfd->hive_id)
>  				continue;
> +			if (!amdgpu_xgmi_get_is_sharing_enabled(kdev->adev,
> peer_dev->gpu->adev))
> +				continue;
>  			sub_type_hdr = (typeof(sub_type_hdr))(
>  				(char *)sub_type_hdr +
>  				sizeof(struct crat_subtype_iolink));
> --
> 2.34.1
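
For anyone who wants to poke at the lookup outside the kernel, here is a rough user-space sketch of the check the patch adds. It is only an illustration, not the driver code: `struct node_info`, `struct topology_info`, `MAX_NODES`, `sharing_enabled()` and `count_io_links()` are hypothetical stand-ins, and the SR-IOV state is passed in as a plain flag rather than read via `amdgpu_sriov_vf()`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_NODES 8	/* hypothetical bound, chosen only for this sketch */

/* Simplified stand-in for one entry of the PSP xgmi topology table. */
struct node_info {
	uint64_t node_id;
	uint8_t is_sharing_enabled;
};

/* Simplified stand-in for struct psp_xgmi_topology_info. */
struct topology_info {
	uint32_t num_nodes;
	struct node_info nodes[MAX_NODES];
};

/*
 * Same decision as the patch: without SR-IOV, sharing is always on.
 * On a VF, the peer must appear in the topology with sharing enabled;
 * a peer that is not listed at all is treated as not shared.
 */
static bool sharing_enabled(const struct topology_info *top, bool is_sriov_vf,
			    uint64_t peer_node_id)
{
	uint32_t i;

	assert(top != NULL);

	if (!is_sriov_vf)
		return true;

	for (i = 0; i < top->num_nodes; ++i)
		if (top->nodes[i].node_id == peer_node_id)
			return !!top->nodes[i].is_sharing_enabled;

	return false;
}

/*
 * Models the effect on the VCRAT loop: peers that fail the check are
 * skipped, so no IO link entry is emitted for them.
 */
static unsigned int count_io_links(const struct topology_info *top,
				   bool is_sriov_vf, const uint64_t *peers,
				   unsigned int num_peers)
{
	unsigned int i, links = 0;

	for (i = 0; i < num_peers; ++i)
		if (sharing_enabled(top, is_sriov_vf, peers[i]))
			links++;

	return links;
}
```

With a two-node topology where only one node has sharing enabled, a guest would see a single IO link, while the same peer list on bare metal keeps every link.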