On 01/07/2019 12:03 PM, StDenis, Tom wrote: > On 2019-01-07 12:00 p.m., Grodzovsky, Andrey wrote: >> >> On 01/07/2019 11:53 AM, StDenis, Tom wrote: >>> On 2019-01-07 11:51 a.m., Grodzovsky, Andrey wrote: >>>> On 01/07/2019 11:36 AM, StDenis, Tom wrote: >>>>> On 2019-01-07 11:33 a.m., Grodzovsky, Andrey wrote: >>>>>> On 01/07/2019 11:16 AM, Liu, Shaoyun wrote: >>>>>>> I think it's reasonable to use the hive specific lock for hive specific functions. >>>>>>> The changes is acked-by Shaoyun.liu < Shaoyun.liu@xxxxxxx> >>>>>>> >>>>>>> -----Original Message----- >>>>>>> From: amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> On Behalf Of StDenis, Tom >>>>>>> Sent: Monday, January 7, 2019 10:16 AM >>>>>>> To: amd-gfx@xxxxxxxxxxxxxxxxxxxxx >>>>>>> Cc: StDenis, Tom <Tom.StDenis@xxxxxxx> >>>>>>> Subject: [PATCH] add missing mutex lock to amdgpu_get_xgmi_hive() (v2) >>>>>>> >>>>>>> v2: Move locks around in other functions so that this function can stand on its own. Also only hold the hive specific lock for add/remove device instead of the driver global lock so you can't add/remove devices in parallel from one hive. >>>>>>> >>>>>>> Signed-off-by: Tom St Denis <tom.stdenis@xxxxxxx> >>>>>>> --- >>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 +- >>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c | 36 ++++++++++++++-------- >>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h | 2 +- >>>>>>> 3 files changed, 25 insertions(+), 15 deletions(-) >>>>>>> >>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>>>>>> index 39d5d058b2c7..13d8e2ad2f7a 100644 >>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>>>>>> @@ -3525,7 +3525,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev, >>>>>>> * by different nodes. No point also since the one node already executing >>>>>>> * reset will also reset all the other nodes in the hive. >>>>>>> */ >>>>>>> - hive = amdgpu_get_xgmi_hive(adev); >>>>>>> + hive = amdgpu_get_xgmi_hive(adev, 0); >>>>>>> if (hive && adev->gmc.xgmi.num_physical_nodes > 1 && >>>>>>> !mutex_trylock(&hive->hive_lock)) >>>>>>> return 0; >>>>>> Let's say i have device 0 in hive A and it just got a gpu reset and at >>>>>> the same time device 1 is being added to same have though >>>>>> amdgpu_xgmi_add_device, hive->hive_lock is acquired by this new device >>>>>> being added and so gpu reset for device 0 bails out on >>>>>> '!mutex_trylock(&hive->hive_lock))' without completing the reset. >>>>>> Also in general i feel a bit uncomfortable about the confusing >>>>>> acquisition scheme in the function and the fact that you take the >>>>>> hive->hive_lock inside amdgpu_get_xgmi_hive but release is still outside >>>>>> of the function. >>>>> Is adding a device while resetting a device even a valid operation >>>>> anyways? >>>> In theory it's valid if you have hot pluggable devices >>>>> I think this means more so that the reset logic is broken. Instead >>>>> there should be a per-hive reset lock that is taken and that is tested >>>>> instead. >>>>> >>>>> Tom >>>> The hive->hive_lock was added exactly for this purpose and used only for >>>> that purpose. Maybe the naming i gave it wasn't reflective of it's >>>> purpose :) >>> But the add/remove should use per-hive locks not the global lock... :-) >>> >>> (I'm honestly not trying to bike shed I just thought the get_hive >>> function looked wrong :-)). >>> >>> Tom >> Totally agree with you, if Shayun (who origianlly added the global >> xgmi_mutex) is fine with switching to per hive mutex then me too, I just >> point out the problem with gpu reset and as you said we then need to >> rename the existing hive_lock into reset_lock and then and another per >> hive lock to do what you propose. Also - is there a way to not take the >> hive lock inside amdgpu_get_xgmi_hive but release it outside ? AFAIK >> it's an opening for problems where people use it but forget to call >> release. > I wanted to take the per-hive lock inside get_hive because it also takes > the global lock so that add/remove couldn't happen in parallel. > > For instance, deleting the last node while adding a new node means the > per-hive mutex could be in limbo (because remove will delete the lock). > > Adding a per-hive reset lock would fix the remaining issues no? > > Tom Looks like it. Andrey > _______________________________________________ > amd-gfx mailing list > amd-gfx@xxxxxxxxxxxxxxxxxxxxx > https://lists.freedesktop.org/mailman/listinfo/amd-gfx _______________________________________________ amd-gfx mailing list amd-gfx@xxxxxxxxxxxxxxxxxxxxx https://lists.freedesktop.org/mailman/listinfo/amd-gfx