Re: [PATCH 2/2] drm/amdgpu: fix vcn sw init failed

Christian König <ckoenig.leichtzumerken@xxxxxxxxx> · Fri, 15 Nov 2024 13:34:45 +0100

Am 13.11.24 um 22:43 schrieb Alex Deucher:
On Wed, Nov 13, 2024 at 12:32 AM Lazar, Lijo <lijo.lazar@xxxxxxx> wrote:


On 11/13/2024 10:54 AM, Alex Deucher wrote:
On Wed, Nov 13, 2024 at 12:03 AM Lazar, Lijo <lijo.lazar@xxxxxxx> wrote:


On 11/13/2024 10:16 AM, Alex Deucher wrote:
On Tue, Nov 12, 2024 at 10:24 PM Lazar, Lijo <lijo.lazar@xxxxxxx> wrote:


On 11/13/2024 7:58 AM, Zhang, Jesse(Jie) wrote:
[AMD Official Use Only - AMD Internal Distribution Only]

Hi, Lijo,

-----Original Message-----
From: Lazar, Lijo <Lijo.Lazar@xxxxxxx>
Sent: Tuesday, November 12, 2024 10:54 PM
To: Zhang, Jesse(Jie) <Jesse.Zhang@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Koenig, Christian <Christian.Koenig@xxxxxxx>; Prosyak, Vitaly <Vitaly.Prosyak@xxxxxxx>; Huang, Tim <Tim.Huang@xxxxxxx>
Subject: Re: [PATCH 2/2] drm/amdgpu: fix vcn sw init failed



On 11/12/2024 8:00 PM, Jesse.zhang@xxxxxxx wrote:
[ 2875.870277] [drm:amdgpu_device_init [amdgpu]] *ERROR* sw_init of IP
block <vcn_v4_0_3> failed -22 [ 2875.880494] amdgpu 0000:01:00.0:
amdgpu: amdgpu_device_ip_init failed [ 2875.887689] amdgpu
0000:01:00.0: amdgpu: Fatal error during GPU init [ 2875.894791] amdgpu 0000:01:00.0: amdgpu: amdgpu: finishing device.

Add irqs with different IRQ source pointer for vcn0 and vcn1.

Signed-off-by: Jesse Zhang <jesse.zhang@xxxxxxx>
---
  drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c | 19 +++++++++++++------
  1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c
b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c
index ef3dfd44a022..82b90f1e6f33 100644
--- a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c
+++ b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c
@@ -83,6 +83,10 @@ static const struct amdgpu_hwip_reg_entry
vcn_reg_list_4_0_3[] = {

  #define NORMALIZE_VCN_REG_OFFSET(offset) \
               (offset & 0x1FFFF)
+static int amdgpu_ih_clientid_vcns[] = {
+     SOC15_IH_CLIENTID_VCN,
+     SOC15_IH_CLIENTID_VCN1
This is not valid for 4.0.3. It uses only the same client id, different node_id to distinguish. Also, there are max of 4 instances.

I would say that entire IP instance series was done in a haste without applying thought and breaks other things including ip block mask.

If the same client id is used, it returns -EINVAL (because of the following check) and sw init fails at step vcn_v4_0_3_sw_init / amdgpu_irq_add_id.

amdgpu_irq_add_id:
if (adev->irq.client[client_id].sources[src_id] != NULL)
         return -EINVAL;

We had some side discussions on IP block-per-instance approach.
Personally, I was not in favour of it as I thought allowing IP block to
handle its own instances is the better approach and that could handle
dependencies between instances. Turns out that there are more like
handling common things for all instances as in this example.
We tried that route as well before this one and it was ugly as well.

I would prefer to revert the patch series and consider all angles before
moving forward on the approach. Will leave this to Alex/Christian/Leo on
the final decsion.
Do the attached patches fix it?

This is kind of a piece-meal fix. This doesn't address the larger
problem of how to handle things common for all IP instances.
I think we'll need to handle them as we encounter them.  We can always
split common stuff out to helpers which can be used by multiple
instances.
I don't think so. It made a fundamental change. We changed the base
layer of considering IP as a single block. A common swinit or swfini is
no longer the case. Consider how a sysfs initialization like
enable_isolation could be handled if the same approach is taken for GFX IP.

I would still say that we broke the current foundation with this
approach and hoping that uppper layer fixes can help to hold things
together. Or, it needs a start-from-scratch approach.
This was the original intention when we first designed the driver and
the IP block structures.  Unfortunately all of the early chips only
had one instance of each IP block and since then we've just built up
technical debt with respect to the instance handling.  That said, I
think the rework of this level at this point is probably too much, in
digging through the current state, there are just too many corner
cases to fix up.  I agree we should probably forgo it at this point.

I would really like to keep the design as is. The fallout we see is 
basically just points out that we have some more broken concepts here.

For example sysfs file should never be initialized from IP specific 
functions in the first place. Why the heck do we do that?

We should probably rather keep the design for the VCN generation Boyuan 
actually tested and see what else we need to fix to get to that design.

Regards,
Christian.


Alex


Alex

Thanks,
Lijo

   But I think once we get past this refactoring it will put
us in a better place for dealing with multiple IP instances.  Consider
the case of a part with multiple blocks of the same type with
different IP versions.  Those would not easily be handled with a
single IP block handling multiple IP instances.

Alex

Thanks,
Lijo

Alex

Thanks,
Lijo

Regards
Jesse


Thanks,
Lijo

+};

  static int vcn_v4_0_3_start_sriov(struct amdgpu_device *adev);
static void vcn_v4_0_3_set_unified_ring_funcs(struct amdgpu_device
*adev, int inst); @@ -150,9 +154,9 @@ static int vcn_v4_0_3_sw_init(struct amdgpu_ip_block *ip_block)
       if (r)
               return r;

-     /* VCN DEC TRAP */
-     r = amdgpu_irq_add_id(adev, SOC15_IH_CLIENTID_VCN,
-             VCN_4_0__SRCID__UVD_ENC_GENERAL_PURPOSE, &adev->vcn.inst->irq);
+     /* VCN UNIFIED TRAP */
+     r = amdgpu_irq_add_id(adev, amdgpu_ih_clientid_vcns[inst],
+                     VCN_4_0__SRCID__UVD_ENC_GENERAL_PURPOSE,
+&adev->vcn.inst[inst].irq);
       if (r)
               return r;

@@ -174,7 +178,7 @@ static int vcn_v4_0_3_sw_init(struct
amdgpu_ip_block *ip_block)

       ring->vm_hub = AMDGPU_MMHUB0(adev->vcn.inst[inst].aid_id);
       sprintf(ring->name, "vcn_unified_%d", adev->vcn.inst[inst].aid_id);
-     r = amdgpu_ring_init(adev, ring, 512, &adev->vcn.inst->irq, 0,
+     r = amdgpu_ring_init(adev, ring, 512, &adev->vcn.inst[inst].irq, 0,
                                AMDGPU_RING_PRIO_DEFAULT,
                                &adev->vcn.inst[inst].sched_score);
       if (r)
@@ -1734,9 +1738,12 @@ static const struct amdgpu_irq_src_funcs vcn_v4_0_3_irq_funcs = {
   */
  static void vcn_v4_0_3_set_irq_funcs(struct amdgpu_device *adev, int
inst)  {
-     adev->vcn.inst->irq.num_types++;
+     if (adev->vcn.harvest_config & (1 << inst))
+             return;
+
+     adev->vcn.inst[inst].irq.num_types = adev->vcn.num_enc_rings + 1;

-     adev->vcn.inst->irq.funcs = &vcn_v4_0_3_irq_funcs;
+     adev->vcn.inst[inst].irq.funcs = &vcn_v4_0_3_irq_funcs;
  }

  static void vcn_v4_0_3_print_ip_state(struct amdgpu_ip_block
*ip_block, struct drm_printer *p)