Re: [PATCH 2/2] drm/amdgpu: fix vcn sw init failed

Alex Deucher <alexdeucher@xxxxxxxxx> · Tue, 12 Nov 2024 23:46:54 -0500

On Tue, Nov 12, 2024 at 10:24 PM Lazar, Lijo <lijo.lazar@xxxxxxx> wrote:
>
>
>
> On 11/13/2024 7:58 AM, Zhang, Jesse(Jie) wrote:
> > [AMD Official Use Only - AMD Internal Distribution Only]
> >
> > Hi, Lijo,
> >
> > -----Original Message-----
> > From: Lazar, Lijo <Lijo.Lazar@xxxxxxx>
> > Sent: Tuesday, November 12, 2024 10:54 PM
> > To: Zhang, Jesse(Jie) <Jesse.Zhang@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> > Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Koenig, Christian <Christian.Koenig@xxxxxxx>; Prosyak, Vitaly <Vitaly.Prosyak@xxxxxxx>; Huang, Tim <Tim.Huang@xxxxxxx>
> > Subject: Re: [PATCH 2/2] drm/amdgpu: fix vcn sw init failed
> >
> >
> >
> > On 11/12/2024 8:00 PM, Jesse.zhang@xxxxxxx wrote:
> >> [ 2875.870277] [drm:amdgpu_device_init [amdgpu]] *ERROR* sw_init of IP
> >> block <vcn_v4_0_3> failed -22 [ 2875.880494] amdgpu 0000:01:00.0:
> >> amdgpu: amdgpu_device_ip_init failed [ 2875.887689] amdgpu
> >> 0000:01:00.0: amdgpu: Fatal error during GPU init [ 2875.894791] amdgpu 0000:01:00.0: amdgpu: amdgpu: finishing device.
> >>
> >> Add irqs with different IRQ source pointer for vcn0 and vcn1.
> >>
> >> Signed-off-by: Jesse Zhang <jesse.zhang@xxxxxxx>
> >> ---
> >>  drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c | 19 +++++++++++++------
> >>  1 file changed, 13 insertions(+), 6 deletions(-)
> >>
> >> diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c
> >> b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c
> >> index ef3dfd44a022..82b90f1e6f33 100644
> >> --- a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c
> >> +++ b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c
> >> @@ -83,6 +83,10 @@ static const struct amdgpu_hwip_reg_entry
> >> vcn_reg_list_4_0_3[] = {
> >>
> >>  #define NORMALIZE_VCN_REG_OFFSET(offset) \
> >>               (offset & 0x1FFFF)
> >> +static int amdgpu_ih_clientid_vcns[] = {
> >> +     SOC15_IH_CLIENTID_VCN,
> >> +     SOC15_IH_CLIENTID_VCN1
> >
> > This is not valid for 4.0.3. It uses only the same client id, different node_id to distinguish. Also, there are max of 4 instances.
> >
> > I would say that entire IP instance series was done in a haste without applying thought and breaks other things including ip block mask.
> >
> > If the same client id is used, it returns -EINVAL (because of the following check) and sw init fails at step vcn_v4_0_3_sw_init / amdgpu_irq_add_id.
> >
> > amdgpu_irq_add_id:
> > if (adev->irq.client[client_id].sources[src_id] != NULL)
> >         return -EINVAL;
> >
>
> We had some side discussions on IP block-per-instance approach.
> Personally, I was not in favour of it as I thought allowing IP block to
> handle its own instances is the better approach and that could handle
> dependencies between instances. Turns out that there are more like
> handling common things for all instances as in this example.

We tried that route as well before this one and it was ugly as well.

>
> I would prefer to revert the patch series and consider all angles before
> moving forward on the approach. Will leave this to Alex/Christian/Leo on
> the final decsion.

Do the attached patches fix it?

Alex

>
> Thanks,
> Lijo
>
> > Regards
> > Jesse
> >
> >
> > Thanks,
> > Lijo
> >
> >> +};
> >>
> >>  static int vcn_v4_0_3_start_sriov(struct amdgpu_device *adev);
> >> static void vcn_v4_0_3_set_unified_ring_funcs(struct amdgpu_device
> >> *adev, int inst); @@ -150,9 +154,9 @@ static int vcn_v4_0_3_sw_init(struct amdgpu_ip_block *ip_block)
> >>       if (r)
> >>               return r;
> >>
> >> -     /* VCN DEC TRAP */
> >> -     r = amdgpu_irq_add_id(adev, SOC15_IH_CLIENTID_VCN,
> >> -             VCN_4_0__SRCID__UVD_ENC_GENERAL_PURPOSE, &adev->vcn.inst->irq);
> >> +     /* VCN UNIFIED TRAP */
> >> +     r = amdgpu_irq_add_id(adev, amdgpu_ih_clientid_vcns[inst],
> >> +                     VCN_4_0__SRCID__UVD_ENC_GENERAL_PURPOSE,
> >> +&adev->vcn.inst[inst].irq);
> >>       if (r)
> >>               return r;
> >>
> >> @@ -174,7 +178,7 @@ static int vcn_v4_0_3_sw_init(struct
> >> amdgpu_ip_block *ip_block)
> >>
> >>       ring->vm_hub = AMDGPU_MMHUB0(adev->vcn.inst[inst].aid_id);
> >>       sprintf(ring->name, "vcn_unified_%d", adev->vcn.inst[inst].aid_id);
> >> -     r = amdgpu_ring_init(adev, ring, 512, &adev->vcn.inst->irq, 0,
> >> +     r = amdgpu_ring_init(adev, ring, 512, &adev->vcn.inst[inst].irq, 0,
> >>                                AMDGPU_RING_PRIO_DEFAULT,
> >>                                &adev->vcn.inst[inst].sched_score);
> >>       if (r)
> >> @@ -1734,9 +1738,12 @@ static const struct amdgpu_irq_src_funcs vcn_v4_0_3_irq_funcs = {
> >>   */
> >>  static void vcn_v4_0_3_set_irq_funcs(struct amdgpu_device *adev, int
> >> inst)  {
> >> -     adev->vcn.inst->irq.num_types++;
> >> +     if (adev->vcn.harvest_config & (1 << inst))
> >> +             return;
> >> +
> >> +     adev->vcn.inst[inst].irq.num_types = adev->vcn.num_enc_rings + 1;
> >>
> >> -     adev->vcn.inst->irq.funcs = &vcn_v4_0_3_irq_funcs;
> >> +     adev->vcn.inst[inst].irq.funcs = &vcn_v4_0_3_irq_funcs;
> >>  }
> >>
> >>  static void vcn_v4_0_3_print_ip_state(struct amdgpu_ip_block
> >> *ip_block, struct drm_printer *p)
From 939a93c835c130720a346dfb6baad297f1f26e25 Mon Sep 17 00:00:00 2001
From: Alex Deucher <alexander.deucher@xxxxxxx>
Date: Tue, 12 Nov 2024 23:26:52 -0500
Subject: [PATCH 2/2] drm/amdgpu: use a single set of interrupt funcs for vcn
 4.0.3

There a single client and source id.  The node id from the IV
ring is used to determine which instance the interrupt belongs
to.

Signed-off-by: Alex Deucher <alexander.deucher@xxxxxxx>
---
 drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c
index 88cbf1a88a07..8534f5370094 100644
--- a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c
+++ b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c
@@ -153,7 +153,7 @@ static int vcn_v4_0_3_sw_init(struct amdgpu_ip_block *ip_block)
 	/* VCN DEC TRAP */
 	r = amdgpu_irq_add_id(adev, SOC15_IH_CLIENTID_VCN,
 		VCN_4_0__SRCID__UVD_ENC_GENERAL_PURPOSE, &adev->vcn.inst->irq);
-	if (r)
+	if (r && r != -EEXIST)
 		return r;
 
 	volatile struct amdgpu_vcn4_fw_shared *fw_shared;
@@ -1741,7 +1741,7 @@ static const struct amdgpu_irq_src_funcs vcn_v4_0_3_irq_funcs = {
  */
 static void vcn_v4_0_3_set_irq_funcs(struct amdgpu_device *adev, int inst)
 {
-	adev->vcn.inst->irq.num_types++;
+	adev->vcn.inst->irq.num_types = 1;
 
 	adev->vcn.inst->irq.funcs = &vcn_v4_0_3_irq_funcs;
 }
-- 
2.47.0

From c9814f2b031a8a08c53a67e91e25f4e06b3e0b19 Mon Sep 17 00:00:00 2001
From: Alex Deucher <alexander.deucher@xxxxxxx>
Date: Tue, 12 Nov 2024 22:51:01 -0500
Subject: [PATCH 1/2] drm/amdgpu: return -EEXIST if an interrupt handler exists

So we can tell is one is already registered.

Signed-off-by: Alex Deucher <alexander.deucher@xxxxxxx>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
index 19ce4da285e8..2f66b8bba3c7 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
@@ -408,7 +408,7 @@ int amdgpu_irq_add_id(struct amdgpu_device *adev,
 	}
 
 	if (adev->irq.client[client_id].sources[src_id] != NULL)
-		return -EINVAL;
+		return -EEXIST;
 
 	if (source->num_types && !source->enabled_types) {
 		atomic_t *types;
-- 
2.47.0