Re: [PATCH] add missing mutex lock to amdgpu_get_xgmi_hive() (v2)

"Grodzovsky, Andrey" <Andrey.Grodzovsky@xxxxxxx> · Mon, 7 Jan 2019 17:20:28 +0000

On 01/07/2019 12:03 PM, StDenis, Tom wrote:
> On 2019-01-07 12:00 p.m., Grodzovsky, Andrey wrote:
>>
>> On 01/07/2019 11:53 AM, StDenis, Tom wrote:
>>> On 2019-01-07 11:51 a.m., Grodzovsky, Andrey wrote:
>>>> On 01/07/2019 11:36 AM, StDenis, Tom wrote:
>>>>> On 2019-01-07 11:33 a.m., Grodzovsky, Andrey wrote:
>>>>>> On 01/07/2019 11:16 AM, Liu, Shaoyun wrote:
>>>>>>> I think it's reasonable to use the hive  specific lock for hive  specific functions.
>>>>>>> The changes is acked-by  Shaoyun.liu < Shaoyun.liu@xxxxxxx>
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> On Behalf Of StDenis, Tom
>>>>>>> Sent: Monday, January 7, 2019 10:16 AM
>>>>>>> To: amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>>>>>> Cc: StDenis, Tom <Tom.StDenis@xxxxxxx>
>>>>>>> Subject: [PATCH] add missing mutex lock to amdgpu_get_xgmi_hive() (v2)
>>>>>>>
>>>>>>> v2: Move locks around in other functions so that this function can stand on its own.  Also only hold the hive specific lock for add/remove device instead of the driver global lock so you can't add/remove devices in parallel from one hive.
>>>>>>>
>>>>>>> Signed-off-by: Tom St Denis <tom.stdenis@xxxxxxx>
>>>>>>> ---
>>>>>>>         drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  2 +-
>>>>>>>         drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c   | 36 ++++++++++++++--------
>>>>>>>         drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h   |  2 +-
>>>>>>>         3 files changed, 25 insertions(+), 15 deletions(-)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>> index 39d5d058b2c7..13d8e2ad2f7a 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>> @@ -3525,7 +3525,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>>>>>         	 * by different nodes. No point also since the one node already executing
>>>>>>>         	 * reset will also reset all the other nodes in the hive.
>>>>>>>         	 */
>>>>>>> -	hive = amdgpu_get_xgmi_hive(adev);
>>>>>>> +	hive = amdgpu_get_xgmi_hive(adev, 0);
>>>>>>>         	if (hive && adev->gmc.xgmi.num_physical_nodes > 1 &&
>>>>>>>         	    !mutex_trylock(&hive->hive_lock))
>>>>>>>         		return 0;
>>>>>> Let's say i have device 0 in hive A and it just got a gpu reset and at
>>>>>> the same time device 1 is being added to same have though
>>>>>> amdgpu_xgmi_add_device, hive->hive_lock is acquired by this new device
>>>>>> being added and so gpu reset for device 0 bails out on
>>>>>> '!mutex_trylock(&hive->hive_lock))' without completing the reset.
>>>>>> Also in general i feel a bit uncomfortable about the confusing
>>>>>> acquisition scheme in the function and  the fact that you take the
>>>>>> hive->hive_lock inside amdgpu_get_xgmi_hive but release is still outside
>>>>>> of the function.
>>>>> Is adding a device while resetting a device even a valid operation
>>>>> anyways?
>>>> In theory it's valid if you have hot pluggable devices
>>>>> I think this means more so that the reset logic is broken.  Instead
>>>>> there should be a per-hive reset lock that is taken and that is tested
>>>>> instead.
>>>>>
>>>>> Tom
>>>> The hive->hive_lock was added exactly for this purpose and used only for
>>>> that purpose. Maybe the naming i gave it wasn't reflective of it's
>>>> purpose :)
>>> But the add/remove should use per-hive locks not the global lock... :-)
>>>
>>> (I'm honestly not trying to bike shed I just thought the get_hive
>>> function looked wrong :-)).
>>>
>>> Tom
>> Totally agree with you, if Shayun (who origianlly added the global
>> xgmi_mutex) is fine with switching to per hive mutex then me too, I just
>> point out the problem with gpu reset and as you said we then need to
>> rename the existing hive_lock into reset_lock and then and another per
>> hive lock to do what you propose. Also - is there a way to not take the
>> hive lock inside amdgpu_get_xgmi_hive but release it outside ? AFAIK
>> it's an opening for problems where people use it but forget to call
>> release.
> I wanted to take the per-hive lock inside get_hive because it also takes
> the global lock so that add/remove couldn't happen in parallel.
>
> For instance, deleting the last node while adding a new node means the
> per-hive mutex could be in limbo (because remove will delete the lock).
>
> Adding a per-hive reset lock would fix the remaining issues no?
>
> Tom

Looks like it.

Andrey

> _______________________________________________
> amd-gfx mailing list
> amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx

_______________________________________________
amd-gfx mailing list
amd-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/amd-gfx