RE: [PATCH V2 2/2] drm/amdgpu: timely save bad pages to eeprom after gpu ras reset is completed

"Zhou1, Tao" <Tao.Zhou1@xxxxxxx> · Tue, 9 Jul 2024 06:12:59 +0000

[AMD Official Use Only - AMD Internal Distribution Only]

The series is:

Reviewed-by: Tao Zhou <tao.zhou1@xxxxxxx>

> -----Original Message-----
> From: Chai, Thomas <YiPeng.Chai@xxxxxxx>
> Sent: Tuesday, July 9, 2024 1:56 PM
> To: amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> Cc: Zhang, Hawking <Hawking.Zhang@xxxxxxx>; Zhou1, Tao
> <Tao.Zhou1@xxxxxxx>; Li, Candice <Candice.Li@xxxxxxx>; Wang, Yang(Kevin)
> <KevinYang.Wang@xxxxxxx>; Yang, Stanley <Stanley.Yang@xxxxxxx>; Chai,
> Thomas <YiPeng.Chai@xxxxxxx>
> Subject: [PATCH V2 2/2] drm/amdgpu: timely save bad pages to eeprom after gpu
> ras reset is completed
>
> The problem case is as follows:
> 1. GPU A triggers a gpu ras reset, and GPU A drives
>    GPU B to also perform a gpu ras reset.
> 2. After gpu B ras reset started, gpu B queried a DE
>    data. Since the DE data was queried in the ras reset
>    thread instead of the page retirement thread, bad
>    page retirement work would not be triggered. Then
>    even if all gpu resets are completed, the bad pages
>    will be cached in RAM until GPU B's bad page retirement
>    work is triggered again and then saved to eeprom.
>
> This patch can save the bad pages to eeprom in time after gpu ras reset is
> completed.
>
> v2:
>   1. Add the above description to code comments.
>   2. Reuse existing function.
>
> Signed-off-by: YiPeng Chai <YiPeng.Chai@xxxxxxx>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c |  6 +++++-
> drivers/gpu/drm/amd/amdgpu/umc_v12_0.c  | 18 ++++++++++++++++++
>  2 files changed, 23 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index d923151af752..34226ae010c7 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -2864,8 +2864,12 @@ static void amdgpu_ras_do_page_retirement(struct
> work_struct *work)
>       struct ras_err_data err_data;
>       unsigned long err_cnt;
>
> -     if (amdgpu_in_reset(adev) || amdgpu_ras_in_recovery(adev))
> +     /* If gpu reset is ongoing, delay retiring the bad pages */
> +     if (amdgpu_in_reset(adev) || amdgpu_ras_in_recovery(adev)) {
> +             amdgpu_ras_schedule_retirement_dwork(con,
> +                             AMDGPU_RAS_RETIRE_PAGE_INTERVAL * 3);
>               return;
> +     }
>
>       amdgpu_ras_error_data_init(&err_data);
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/umc_v12_0.c
> b/drivers/gpu/drm/amd/amdgpu/umc_v12_0.c
> index 0faa21d8a7b4..9dbb13adb661 100644
> --- a/drivers/gpu/drm/amd/amdgpu/umc_v12_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/umc_v12_0.c
> @@ -29,6 +29,7 @@
>  #include "mp/mp_13_0_6_sh_mask.h"
>
>  #define MAX_ECC_NUM_PER_RETIREMENT  32
> +#define DELAYED_TIME_FOR_GPU_RESET  1000  //ms
>
>  static inline uint64_t get_umc_v12_0_reg_offset(struct amdgpu_device *adev,
>                                           uint32_t node_inst,
> @@ -568,6 +569,23 @@ static int umc_v12_0_update_ecc_status(struct
> amdgpu_device *adev,
>
>       con->umc_ecc_log.de_queried_count++;
>
> +     /* The problem case is as follows:
> +      * 1. GPU A triggers a gpu ras reset, and GPU A drives
> +      *    GPU B to also perform a gpu ras reset.
> +      * 2. After gpu B ras reset started, gpu B queried a DE
> +      *    data. Since the DE data was queried in the ras reset
> +      *    thread instead of the page retirement thread, bad
> +      *    page retirement work would not be triggered. Then
> +      *    even if all gpu resets are completed, the bad pages
> +      *    will be cached in RAM until GPU B's bad page retirement
> +      *    work is triggered again and then saved to eeprom.
> +      * Trigger delayed work to save the bad pages to eeprom in time
> +      * after gpu ras reset is completed.
> +      */
> +     if (amdgpu_ras_in_recovery(adev))
> +             schedule_delayed_work(&con->page_retirement_dwork,
> +                     msecs_to_jiffies(DELAYED_TIME_FOR_GPU_RESET));
> +
>       return 0;
>  }
>
> --
> 2.34.1