On 2/4/2022 6:02 PM, Lazar, Lijo wrote:
[Public]
The problem is the app doesn't know why the reset happened; it just receives a bunch of registers to be read. On what basis can an app filter this out?
Again, that is a contextual-analysis capability which needs to be
embedded in the reader app. Even if we filter out the S3/S4 resets in
the kernel, the situation remains the same, doesn't it?
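For illustration, here is a rough sketch of such contextual analysis on the reader side, assuming the app follows ftrace's trace_pipe with both the amdgpu reset event from this series and the standard power:suspend_resume tracepoint enabled (the amdgpu event name used here is an assumption, not a final ABI):

/*
 * Sketch: skip GPU reset events that arrive inside a suspend/resume
 * window (S3/S4) and report everything else. Needs tracefs/debugfs
 * access and the two events enabled under events/.
 */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
	FILE *tp = fopen("/sys/kernel/debug/tracing/trace_pipe", "r");
	char line[4096];
	bool in_suspend = false;

	if (!tp)
		return 1;

	while (fgets(line, sizeof(line), tp)) {
		/* power:suspend_resume emits "begin"/"end" markers */
		if (strstr(line, "suspend_resume:")) {
			in_suspend = strstr(line, "begin") != NULL;
			continue;
		}
		/* assumed name of the reset dump event */
		if (strstr(line, "amdgpu_reset_reg_dumps:") && !in_suspend)
			printf("runtime GPU reset: %s", line);
	}
	fclose(tp);
	return 0;
}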
- Shashank
Thanks,
Lijo
-----Original Message-----
From: Sharma, Shashank <Shashank.Sharma@xxxxxxx>
Sent: Friday, February 4, 2022 10:29 PM
To: Lazar, Lijo <Lijo.Lazar@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Somalapuram, Amaranath <Amaranath.Somalapuram@xxxxxxx>; Koenig, Christian <Christian.Koenig@xxxxxxx>
Subject: Re: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler
On 2/4/2022 5:50 PM, Lazar, Lijo wrote:
[AMD Official Use Only]
To explain more -
It's an unconditional reset done by the kernel on every suspend (S3/S4). In such a case, which process is going to receive the trace events?
The most likely use case would be related to GPU recovery. Triggering a trace on every reset doesn't look like a good idea.
If you observe carefully, we are just providing an infrastructure; the application's intention is unknown to us. In my opinion it is not a good idea to apply a filter in the kernel based on our own interpretation of that intention.
For example, if an app just wants to count how many resets happen due to S3/S4 transitions, this infrastructure becomes useless. It would be a better idea for the app itself to learn about and ignore the scenarios it is not interested in (see the counting sketch below).
This could eventually just be a difference in design philosophy, maybe :)
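To make that counting use case concrete, a variant of the reader sketch above could classify events instead of dropping them; nothing in the kernel needs to change. Again, the event name and counter names are made up for illustration:

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Illustrative counters; names are made up */
static unsigned long s3_s4_resets, runtime_resets;

/*
 * Classify one trace_pipe line, with in_suspend tracked from the
 * power:suspend_resume "begin"/"end" markers as in the reader sketch
 * above.
 */
static void count_reset(const char *line, bool in_suspend)
{
	if (!strstr(line, "amdgpu_reset_reg_dumps:"))
		return;
	if (in_suspend)
		s3_s4_resets++;		/* reset from an S3/S4 transition */
	else
		runtime_resets++;	/* recovery or other runtime reset */
	printf("S3/S4 resets: %lu, other resets: %lu\n",
	       s3_s4_resets, runtime_resets);
}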
- Shashank
Thanks,
Lijo
-----Original Message-----
From: Sharma, Shashank <Shashank.Sharma@xxxxxxx>
Sent: Friday, February 4, 2022 10:09 PM
To: Lazar, Lijo <Lijo.Lazar@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Somalapuram,
Amaranath <Amaranath.Somalapuram@xxxxxxx>; Koenig, Christian
<Christian.Koenig@xxxxxxx>
Subject: Re: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler
Hey Lijo,
I somehow missed responding to this comment; please find my reply inline:
Regards
Shashank
On 1/22/2022 7:42 AM, Lazar, Lijo wrote:
On 1/22/2022 2:04 AM, Sharma, Shashank wrote:
From 899ec6060eb7d8a3d4d56ab439e4e6cdd74190a4 Mon Sep 17 00:00:00 2001
From: Somalapuram Amaranath <Amaranath.Somalapuram@xxxxxxx>
Date: Fri, 21 Jan 2022 14:19:42 +0530
Subject: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler
This patch adds a GPU reset handler for the Navi ASIC family, which
typically dumps some of the registers and sends a trace event.
V2: Accommodated call to work function to send uevent
Signed-off-by: Somalapuram Amaranath <Amaranath.Somalapuram@xxxxxxx>
Signed-off-by: Shashank Sharma <shashank.sharma@xxxxxxx>
---
drivers/gpu/drm/amd/amdgpu/nv.c | 28 ++++++++++++++++++++++++++++
1 file changed, 28 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdgpu/nv.c b/drivers/gpu/drm/amd/amdgpu/nv.c
index 01efda4398e5..ada35d4c5245 100644
--- a/drivers/gpu/drm/amd/amdgpu/nv.c
+++ b/drivers/gpu/drm/amd/amdgpu/nv.c
@@ -528,10 +528,38 @@ nv_asic_reset_method(struct amdgpu_device *adev)
 	}
 }
+static void amdgpu_reset_dumps(struct amdgpu_device *adev)
+{
+	int r = 0, i;
+
+	/* original raven doesn't have full asic reset */
+	if ((adev->apu_flags & AMD_APU_IS_RAVEN) &&
+	    !(adev->apu_flags & AMD_APU_IS_RAVEN2))
+		return;
+	for (i = 0; i < adev->num_ip_blocks; i++) {
+		if (!adev->ip_blocks[i].status.valid)
+			continue;
+		if (!adev->ip_blocks[i].version->funcs->reset_reg_dumps)
+			continue;
+		r = adev->ip_blocks[i].version->funcs->reset_reg_dumps(adev);
+
+		if (r)
+			DRM_ERROR("reset_reg_dumps of IP block <%s> failed %d\n",
+				  adev->ip_blocks[i].version->funcs->name, r);
+	}
+
+	/* Schedule work to send uevent */
+	if (!queue_work(system_unbound_wq, &adev->gpu_reset_work))
+		DRM_ERROR("failed to add GPU reset work\n");
+
+	dump_stack();
+}
+
 static int nv_asic_reset(struct amdgpu_device *adev)
 {
 	int ret = 0;
+	amdgpu_reset_dumps(adev);
Had a comment on this before. There are different reasons (or even no
reason at all, as with a precautionary reset) to perform a reset. A user
would be interested in a trace only if the reason is valid.
To clarify why a work item shouldn't be scheduled on every reset, check
here -
https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c#L2188
In the example you pointed to, they have criteria to decide, in the kernel itself, what counts as a valid reset in their context, so they can choose whether to act on it or not.
But in our case, we want to send the trace event to userspace with some register values on every reset, and it is up to the profiling app to interpret them (along with deciding what it wants to call a GPU reset). So I don't think this causes considerable overhead.
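For reference, here is a minimal sketch of what such a trace event could look like on the kernel side. The real event is added earlier in this series, so the event name, the single string payload, and the header guards here are illustrative assumptions only:

/* SPDX-License-Identifier: GPL-2.0 */
#undef TRACE_SYSTEM
#define TRACE_SYSTEM amdgpu

#if !defined(_AMDGPU_RESET_TRACE_H_) || defined(TRACE_HEADER_MULTI_READ)
#define _AMDGPU_RESET_TRACE_H_

#include <linux/tracepoint.h>

/*
 * Carries raw register values only; there is deliberately no reset
 * "reason" field, so interpretation is left to the profiling app.
 */
TRACE_EVENT(amdgpu_reset_reg_dumps,
	    TP_PROTO(char *reg_dump),
	    TP_ARGS(reg_dump),
	    TP_STRUCT__entry(
		    __string(reg_dump, reg_dump)
	    ),
	    TP_fast_assign(
		    __assign_str(reg_dump, reg_dump);
	    ),
	    TP_printk("reg_dump=%s", __get_str(reg_dump))
);

#endif /* _AMDGPU_RESET_TRACE_H_ */

/* This part must be outside the include guard */
#include <trace/define_trace.h>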
- Shashank
Thanks,
Lijo
 	switch (nv_asic_reset_method(adev)) {
 	case AMD_RESET_METHOD_PCI:
 		dev_info(adev->dev, "PCI reset\n");