RE: [PATCH 02/14] drm/amdgpu: add ACA kernel hardware error log support

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



[AMD Official Use Only - General]

Additional:
This idx/count can help us to distinguish whether those logs are from the same query results.

If this information is removed, we cannot confirm from the log whether any log is missing.
e.g: [4/8], then we can know that there are 4 more that are not printed due to some reason.
 
Best Regards,
Kevin
 
_____________________________________________
From: Wang, Yang(Kevin)
Sent: Wednesday, January 3, 2024 5:51 PM
To: Zhang, Hawking <Hawking.Zhang@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
Cc: Zhou1, Tao <Tao.Zhou1@xxxxxxx>; Chai, Thomas <YiPeng.Chai@xxxxxxx>
Subject: RE: [PATCH 02/14] drm/amdgpu: add ACA kernel hardware error log support
 
_____________________________________________
From: Zhang, Hawking <Hawking.Zhang@xxxxxxx>
Sent: Wednesday, January 3, 2024 5:36 PM
To: Wang, Yang(Kevin) <KevinYang.Wang@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
Cc: Zhou1, Tao <Tao.Zhou1@xxxxxxx>; Chai, Thomas <YiPeng.Chai@xxxxxxx>
Subject: RE: [PATCH 02/14] drm/amdgpu: add ACA kernel hardware error log support
 
 
[AMD Official Use Only - General]

+       dev_info(adev->dev, "[Hardware error] Accelerator Check Architecture events logged\n");
+       /* plus 1 for output format, e.g: ACA[08/08]: xxxx */
+       for (i = 0; i < ARRAY_SIZE(aca_regs); i++)
+               dev_info(adev->dev, "[Hardware error] ACA[%02d/%02d].%s=0x%016llx\n",
+                        idx + 1, total, aca_regs[i].name, bank->regs[aca_regs[i].reg_idx]);
 
We should keep the ACA log format simple since there are tools like crash dumper that grab these logs.
 
How about formatting log as below
dev_info(adev->dev, "[Hardware error] Accelerator Check Architecture (ACA) events logged\n");
dev_info(adev→dev, "[Hardware error] ACA.%s=0x%016llx\n");
 
In general, if the idx doesn't convey useful information, then just replace it with ACA.Reg.
 
[Kevin]:
 
Agree, will update it in next version.
 
Best Regards,
Kevin
 
Thoughts?
 
Regards,
Hawking
 
-----Original Message-----
From: Wang, Yang(Kevin) <KevinYang.Wang@xxxxxxx>
Sent: Wednesday, January 3, 2024 16:02
To: amd-gfx@xxxxxxxxxxxxxxxxxxxxx
Cc: Zhang, Hawking <Hawking.Zhang@xxxxxxx>; Zhou1, Tao <Tao.Zhou1@xxxxxxx>; Chai, Thomas <YiPeng.Chai@xxxxxxx>; Wang, Yang(Kevin) <KevinYang.Wang@xxxxxxx>
Subject: [PATCH 02/14] drm/amdgpu: add ACA kernel hardware error log support
 
add ACA kernel hardware error log support.
 
Signed-off-by: Yang Wang <kevinyang.wang@xxxxxxx>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c | 29 +++++++++++++++++++++++++
1 file changed, 29 insertions(+)
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c
index 6a6f167b5380..cadeda64eded 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c
@@ -100,6 +100,33 @@ static int aca_smu_get_valid_aca_count(struct amdgpu_device *adev, enum aca_erro
         return smu_funcs->get_valid_aca_count(adev, type, count);  }
 
+static struct aca_regs_dump {
+       const char *name;
+       int reg_idx;
+} aca_regs[] = {
+       {"CONTROL",             ACA_REG_IDX_CTL},
+       {"STATUS",              ACA_REG_IDX_STATUS},
+       {"ADDR",                ACA_REG_IDX_ADDR},
+       {"MISC",                ACA_REG_IDX_MISC0},
+       {"CONFIG",              ACA_REG_IDX_CONFG},
+       {"IPID",                ACA_REG_IDX_IPID},
+       {"SYND",                ACA_REG_IDX_SYND},
+       {"DESTAT",              ACA_REG_IDX_DESTAT},
+       {"DEADDR",              ACA_REG_IDX_DEADDR},
+       {"CONTROL_MASK",        ACA_REG_IDX_CTL_MASK},
+};
+
+static void aca_smu_bank_dump(struct amdgpu_device *adev, int idx, int
+total, struct aca_bank *bank) {
+       int i;
+
+       dev_info(adev->dev, "[Hardware error] Accelerator Check Architecture events logged\n");
+       /* plus 1 for output format, e.g: ACA[08/08]: xxxx */
+       for (i = 0; i < ARRAY_SIZE(aca_regs); i++)
+               dev_info(adev->dev, "[Hardware error] ACA[%02d/%02d].%s=0x%016llx\n",
+                        idx + 1, total, aca_regs[i].name, bank->regs[aca_regs[i].reg_idx]);
+}
+
static int aca_smu_get_valid_aca_banks(struct amdgpu_device *adev, enum aca_error_type type,
                                        int start, int count,
                                        struct aca_banks *banks)
@@ -137,6 +164,8 @@ static int aca_smu_get_valid_aca_banks(struct amdgpu_device *adev, enum aca_erro
                 if (ret)
                         return ret;
 
+               aca_smu_bank_dump(adev, i, count, &bank);
+
                 ret = aca_banks_add_bank(banks, &bank);
                 if (ret)
                         return ret;
--
2.34.1
 

[Index of Archives]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux