Am 24.01.22 um 18:00 schrieb Sharma, Shashank:
On 1/24/2022 5:48 PM, Christian König wrote:
Am 24.01.22 um 17:45 schrieb Sharma, Shashank:
Hello Christian,
Thank for your comments, please fine mine inline:
On 1/24/2022 8:15 AM, Christian König wrote:
Am 21.01.22 um 21:34 schrieb Sharma, Shashank:
From 1c5c552eeddaffd9fb3e7d45ece1b2b28fccc575 Mon Sep 17 00:00:00
2001
From: Somalapuram Amaranath <Amaranath.Somalapuram@xxxxxxx>
Date: Fri, 21 Jan 2022 14:19:10 +0530
Subject: [PATCH 3/4] drm/amdgpu: add reset register trace dump
function for
gfx_v10_0
Implementation of register trace dump function on the AMD GPU resets
Signed-off-by: Somalapuram Amaranath <Amaranath.Somalapuram@xxxxxxx>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h | 8 ++++
drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c | 53
++++++++++++++++++++++-
drivers/gpu/drm/amd/include/amd_shared.h | 1 +
3 files changed, 60 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
b/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
index d855cb53c7e0..c97b53b54333 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
@@ -537,6 +537,14 @@ TRACE_EVENT(amdgpu_ib_pipe_sync,
__entry->seqno)
);
+TRACE_EVENT(gfx_v10_0_reset_reg_dumps,
+ TP_PROTO(char *reg_dumps),
+ TP_ARGS(reg_dumps),
+ TP_STRUCT__entry(__string(dumps, reg_dumps)),
+ TP_fast_assign(__assign_str(dumps, reg_dumps);),
+ TP_printk("amdgpu register dump {%s}", __get_str(dumps))
+);
+
#undef AMDGPU_JOB_GET_TIMELINE_NAME
#endif
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
index 16dbe593cba2..05974ed5416d 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
@@ -46,7 +46,7 @@
#include "v10_structs.h"
#include "gfx_v10_0.h"
#include "nbio_v2_3.h"
-
+#include "amdgpu_trace.h"
/*
* Navi10 has two graphic rings to share each graphic pipe.
* 1. Primary ring
@@ -188,6 +188,12 @@
#define RLCG_ERROR_REPORT_ENABLED(adev) \
(amdgpu_sriov_reg_indirect_mmhub(adev) ||
amdgpu_sriov_reg_indirect_gc(adev))
+#define N_REGS (17)
+#define DUMP_REG(addr) do { \
+ (dump)[i][0] = (addr); \
+ (dump)[i++][1] = RREG32_SOC15_IP(GC, addr); \
+ } while (0)
Those macros need an AMDGPU_ prefix and maybe some better names.
Agree, @Amar pls check this out.
From the design POV I'm really wondering if it wouldn't be better
if userspace defines the registers we want to dump in case of a crash.
I am not so sure about it. This means we have to communicate with
the userspace and get the list of registers it wishes to see, do I
smell a new IOCTL or DRM property ? I also feel that we have better
filtering tools for a trace event in userspace than in kernel.
What do you think ?
Just a writeable debugfs file should probably do it. We have a list
of known registers filled in during boot (just reg offsets) and
userspace can read/write the file to update it.
Ok, so in this case, what would be our state machine ?
1. gpu_reset happened
2. uevent_sent from work queue
3. userspace goes and writes into debugfs files
4. kernel collects the desired register values
5 (a). kernel send the trace_event with desired values
6 (a). kernel sends another uevent to indicate trace_event
Or
5 (a+b). kernel loads the reg values via debugfs itself.
In both of these case, wouldn't it be too late to collect register
values by the time userspace list arrives at kernel ?
Or do you have a better state machine in mind ?
1. system boots, kernel loads a default register list to dump.
2. Userspace optionally updates this register list.
....
3. GPU reset happens, we dump the registers into the trace log.
4. We send udev event to signal the GPU reset to umr or other tool to
save and store the trace log.
Regards,
Christian.