Thanks. You are right. I found a potential bug, and as I understand it, the code only applies to the Aldebaran GPU and I can not check the correctness of the code. I only test code on my navi 10 and run GPU stress tests. My knowledge of amdgpu is limited, and fixing potential bugs allows me to learn more about amdgpu code. But there are many that I still don't understand. In any case, we need to fix the code to eliminate problems in the future. Regards, Grigory. вт, 5 апр. 2022 г. в 20:00, Felix Kuehling <felix.kuehling@xxxxxxx>: > > Am 2022-04-04 um 18:21 schrieb Grigory Vasilyev: > > In the amdgpu_amdkfd_get_xgmi_bandwidth_mbytes function, > > the peer_adev pointer can be NULL and is passed to amdgpu_xgmi_get_num_links. > > > > In amdgpu_xgmi_get_num_links, peer_adev pointer is dereferenced > > without any checks: peer_adev->gmc.xgmi.node_id . > > What's worse, peer_adev is uninitialized with an undefined value if src > is NULL. So that code was definitely bogus. > > However, I think your patch will result in incorrect results. Currently > amdgpu_amdkfd_get_xgmi_bandwidth is always called with is_min=true if > src==NULL, and with is_min=false if src!=NULL. The intention is, that we > assume a single XGMI link in the case that src==NULL. That means the > is_min parameter is redundant. What we should do instead is, assume that > num_links==1 if src==NULL, and drop the is_min parameter. > > That would keep things working the way they do now, and prevent > potential problems in the future. > > Regards, > Felix > > > > > > Signed-off-by: Grigory Vasilyev <h0tc0d3@xxxxxxxxx> > > --- > > drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 5 ++--- > > 1 file changed, 2 insertions(+), 3 deletions(-) > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c > > index be1a55f8b8c5..1a1006b18016 100644 > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c > > @@ -541,11 +541,10 @@ int amdgpu_amdkfd_get_xgmi_bandwidth_mbytes(struct amdgpu_device *dst, > > struct amdgpu_device *adev = dst, *peer_adev; > > int num_links; > > > > - if (adev->asic_type != CHIP_ALDEBARAN) > > + if (!src || adev->asic_type != CHIP_ALDEBARAN) > > return 0; > > > > - if (src) > > - peer_adev = src; > > + peer_adev = src; > > > > /* num links returns 0 for indirect peers since indirect route is unknown. */ > > num_links = is_min ? 1 : amdgpu_xgmi_get_num_links(adev, peer_adev);