A RAS error is triggered when the HW identifies a faulty physical page. The error comes in through an interrupt whose data payload contains information that can be translated into the bad page address. As a recovery measure we then reset the ASIC, reserve this bad page so it cannot be used by anyone else, and store its address to EEPROM so it can be reserved again after boot.

Andrey

On 9/12/19 11:15 AM, Christian König wrote:
> Well, I still hope to avoid VRAM loss in most of the cases, but that is
> really not guaranteed.
>
> What is the bad page and why do you need to reserve it?
>
> Christian.
>
> Am 12.09.19 um 16:09 schrieb Grodzovsky, Andrey:
>> I am not sure VRAM loss happens every time, but when it does I would
>> assume you would have to reserve them again as the page table contents
>> were lost. On the other hand, I do remember we keep shadow system memory
>> copies of all page tables, so maybe that's not an issue. So yes, just
>> try to allocate the bad page after reset, and if it's still reserved you
>> will fail.
>>
>> Andrey
>>
>> On 9/12/19 7:35 AM, Zhou1, Tao wrote:
>>> Hi Andrey:
>>>
>>> Are you sure of the VRAM content loss after GPU reset? I'm not very
>>> familiar with the details of GPU reset and I'll do an experiment to
>>> confirm the case you mentioned.
>>>
>>> Regards,
>>> Tao
>>>
>>>> -----Original Message-----
>>>> From: Grodzovsky, Andrey <Andrey.Grodzovsky@xxxxxxx>
>>>> Sent: 2019年9月12日 10:32
>>>> To: Chen, Guchun <Guchun.Chen@xxxxxxx>; Zhou1, Tao
>>>> <Tao.Zhou1@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>>> Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>
>>>> Subject: Re: [PATCH] drm/amdgpu: Fix mutex lock from atomic context.
>>>>
>>>> That's not what I meant. Let's say you handled one bad page interrupt
>>>> and as a result have one bad page reserved. Now an unrelated gfx ring
>>>> timeout happens which triggers GPU reset and VRAM loss. When you come
>>>> back from reset, amdgpu_ras_reserve_bad_pages will be called, but since
>>>> last_reserved == data->count the bad page will not be reserved again.
>>>> Maybe we should just set data->last_reserved to 0 again if VRAM was
>>>> lost during ASIC reset...
>>>>
>>>> Andrey
>>>>
>>>> ________________________________________
>>>> From: Chen, Guchun <Guchun.Chen@xxxxxxx>
>>>> Sent: 11 September 2019 21:53:03
>>>> To: Grodzovsky, Andrey; Zhou1, Tao; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>>> Cc: Deucher, Alexander
>>>> Subject: RE: [PATCH] drm/amdgpu: Fix mutex lock from atomic context.
>>>>
>>>> Comment inline.
>>>>
>>>> Regards,
>>>> Guchun
>>>>
>>>> -----Original Message-----
>>>> From: Grodzovsky, Andrey <Andrey.Grodzovsky@xxxxxxx>
>>>> Sent: Wednesday, September 11, 2019 10:41 PM
>>>> To: Zhou1, Tao <Tao.Zhou1@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>>> Cc: Chen, Guchun <Guchun.Chen@xxxxxxx>; Deucher, Alexander
>>>> <Alexander.Deucher@xxxxxxx>
>>>> Subject: Re: [PATCH] drm/amdgpu: Fix mutex lock from atomic context.
>>>>
>>>> On second thought, this will break reserving bad pages when resetting
>>>> the GPU for a non-RAS error reason such as manual reset, S3 or ring
>>>> timeout (amdgpu_ras_resume->amdgpu_ras_reset_gpu), so I will keep the
>>>> code as is.
>>>>
>>>> Another possible issue in the existing code - it looks like no
>>>> reservation will take place in those cases even now, as in
>>>> amdgpu_ras_reserve_bad_pages data->last_reserved will be equal to
>>>> data->count, no? Looks like for this case you need to add a flag to
>>>> FORCE reservation for all pages from 0 to data->count.
>>>> [Guchun] Yes, last_reserved is not updated any more unless we unload
>>>> our driver, so it may always be equal to data->count, and then no new
>>>> bad page will be reserved.
>>>> I see we have an EEPROM reset exposed to the user; can we put this
>>>> last_reserved clearing operation into that same user-triggered path as
>>>> well?
>>>>
>>>> Andrey
>>>>
>>>> On 9/11/19 10:19 AM, Andrey Grodzovsky wrote:
>>>>> I like this much more, I will relocate to
>>>>> amdgpu_umc_process_ras_data_cb and push.
>>>>>
>>>>> Andrey
>>>>>
>>>>> On 9/10/19 11:08 PM, Zhou1, Tao wrote:
>>>>>> amdgpu_ras_reserve_bad_pages is only used by the umc block, so
>>>>>> another approach is to move it into amdgpu_umc_process_ras_data_cb.
>>>>>> Anyway, either way is OK and the patch is:
>>>>>>
>>>>>> Reviewed-by: Tao Zhou <tao.zhou1@xxxxxxx>
>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Andrey Grodzovsky <andrey.grodzovsky@xxxxxxx>
>>>>>>> Sent: 2019年9月11日 3:41
>>>>>>> To: amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>>>>>> Cc: Chen, Guchun <Guchun.Chen@xxxxxxx>; Zhou1, Tao
>>>>>>> <Tao.Zhou1@xxxxxxx>; Deucher, Alexander <Alexander.Deucher@xxxxxxx>;
>>>>>>> Grodzovsky, Andrey <Andrey.Grodzovsky@xxxxxxx>
>>>>>>> Subject: [PATCH] drm/amdgpu: Fix mutex lock from atomic context.
>>>>>>>
>>>>>>> Problem:
>>>>>>> amdgpu_ras_reserve_bad_pages was moved to amdgpu_ras_reset_gpu
>>>>>>> because writing to EEPROM during ASIC reset was unstable.
>>>>>>> But for ERREVENT_ATHUB_INTERRUPT amdgpu_ras_reset_gpu is called
>>>>>>> directly from ISR context and so locking is not allowed. Also it's
>>>>>>> irrelevant for this particular interrupt as this is a generic RAS
>>>>>>> interrupt and not specific to memory errors.
>>>>>>>
>>>>>>> Fix:
>>>>>>> Avoid calling amdgpu_ras_reserve_bad_pages if not in task context.
>>>>>>>
>>>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@xxxxxxx>
>>>>>>> ---
>>>>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h | 4 +++-
>>>>>>>  1 file changed, 3 insertions(+), 1 deletion(-)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>>>>>>> index 012034d..dd5da3c 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>>>>>>> @@ -504,7 +504,9 @@ static inline int amdgpu_ras_reset_gpu(struct
>>>>>>> amdgpu_device *adev,
>>>>>>>  	/* save bad page to eeprom before gpu reset,
>>>>>>>  	 * i2c may be unstable in gpu reset
>>>>>>>  	 */
>>>>>>> -	amdgpu_ras_reserve_bad_pages(adev);
>>>>>>> +	if (in_task())
>>>>>>> +		amdgpu_ras_reserve_bad_pages(adev);
>>>>>>> +
>>>>>>>  	if (atomic_cmpxchg(&ras->in_recovery, 0, 1) == 0)
>>>>>>>  		schedule_work(&ras->recovery_work);
>>>>>>>  	return 0;
>>>>>>> --
>>>>>>> 2.7.4
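
For readers following the last_reserved/count bookkeeping debated above, below is a minimal, self-contained C sketch of that logic together with the suggested rewind of last_reserved to 0 after a reset that loses VRAM contents. It is an illustration only, not driver code: the bad_page_data type and the reserve_bad_pages()/handle_reset_done() helpers are hypothetical stand-ins, and only the last_reserved and count field names mirror the thread.

#include <stdio.h>

/* Illustrative stand-in for the RAS error-handler bookkeeping. */
struct bad_page_data {
	unsigned int count;         /* number of bad pages recorded            */
	unsigned int last_reserved; /* entries [0, last_reserved) are reserved */
};

/* Reserve only the entries not reserved yet, mirroring the
 * last_reserved vs. count comparison discussed in the thread. */
static void reserve_bad_pages(struct bad_page_data *data)
{
	for (unsigned int i = data->last_reserved; i < data->count; i++)
		printf("reserving bad page entry %u\n", i);
	data->last_reserved = data->count;
}

/* The suggested fix: if VRAM contents were lost during the ASIC reset,
 * rewind last_reserved so the next pass re-reserves every recorded page. */
static void handle_reset_done(struct bad_page_data *data, int vram_lost)
{
	if (vram_lost)
		data->last_reserved = 0;
}

int main(void)
{
	struct bad_page_data data = { .count = 2, .last_reserved = 0 };

	reserve_bad_pages(&data);    /* reserves entries 0 and 1          */
	handle_reset_done(&data, 1); /* GPU reset that lost VRAM contents */
	reserve_bad_pages(&data);    /* re-reserves entries 0 and 1       */
	return 0;
}

Note that the patch quoted above solves a different problem (avoiding amdgpu_ras_reserve_bad_pages in ISR context via in_task()); the rewind shown here corresponds to the separate suggestion of re-reserving already-handled pages after VRAM loss.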