Re: [PATCH 2/2] drm/amdgpu: fix vcn sw init failed

Alex Deucher <alexdeucher@xxxxxxxxx> · Wed, 13 Nov 2024 00:24:38 -0500

On Wed, Nov 13, 2024 at 12:03 AM Lazar, Lijo <lijo.lazar@xxxxxxx> wrote:
>
>
>
> On 11/13/2024 10:16 AM, Alex Deucher wrote:
> > On Tue, Nov 12, 2024 at 10:24 PM Lazar, Lijo <lijo.lazar@xxxxxxx> wrote:
> >>
> >>
> >>
> >> On 11/13/2024 7:58 AM, Zhang, Jesse(Jie) wrote:
> >>> [AMD Official Use Only - AMD Internal Distribution Only]
> >>>
> >>> Hi, Lijo,
> >>>
> >>> -----Original Message-----
> >>> From: Lazar, Lijo <Lijo.Lazar@xxxxxxx>
> >>> Sent: Tuesday, November 12, 2024 10:54 PM
> >>> To: Zhang, Jesse(Jie) <Jesse.Zhang@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> >>> Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Koenig, Christian <Christian.Koenig@xxxxxxx>; Prosyak, Vitaly <Vitaly.Prosyak@xxxxxxx>; Huang, Tim <Tim.Huang@xxxxxxx>
> >>> Subject: Re: [PATCH 2/2] drm/amdgpu: fix vcn sw init failed
> >>>
> >>>
> >>>
> >>> On 11/12/2024 8:00 PM, Jesse.zhang@xxxxxxx wrote:
> >>>> [ 2875.870277] [drm:amdgpu_device_init [amdgpu]] *ERROR* sw_init of IP
> >>>> block <vcn_v4_0_3> failed -22 [ 2875.880494] amdgpu 0000:01:00.0:
> >>>> amdgpu: amdgpu_device_ip_init failed [ 2875.887689] amdgpu
> >>>> 0000:01:00.0: amdgpu: Fatal error during GPU init [ 2875.894791] amdgpu 0000:01:00.0: amdgpu: amdgpu: finishing device.
> >>>>
> >>>> Add irqs with different IRQ source pointer for vcn0 and vcn1.
> >>>>
> >>>> Signed-off-by: Jesse Zhang <jesse.zhang@xxxxxxx>
> >>>> ---
> >>>>  drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c | 19 +++++++++++++------
> >>>>  1 file changed, 13 insertions(+), 6 deletions(-)
> >>>>
> >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c
> >>>> b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c
> >>>> index ef3dfd44a022..82b90f1e6f33 100644
> >>>> --- a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c
> >>>> +++ b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c
> >>>> @@ -83,6 +83,10 @@ static const struct amdgpu_hwip_reg_entry
> >>>> vcn_reg_list_4_0_3[] = {
> >>>>
> >>>>  #define NORMALIZE_VCN_REG_OFFSET(offset) \
> >>>>               (offset & 0x1FFFF)
> >>>> +static int amdgpu_ih_clientid_vcns[] = {
> >>>> +     SOC15_IH_CLIENTID_VCN,
> >>>> +     SOC15_IH_CLIENTID_VCN1
> >>>
> >>> This is not valid for 4.0.3. It uses only the same client id, different node_id to distinguish. Also, there are max of 4 instances.
> >>>
> >>> I would say that entire IP instance series was done in a haste without applying thought and breaks other things including ip block mask.
> >>>
> >>> If the same client id is used, it returns -EINVAL (because of the following check) and sw init fails at step vcn_v4_0_3_sw_init / amdgpu_irq_add_id.
> >>>
> >>> amdgpu_irq_add_id:
> >>> if (adev->irq.client[client_id].sources[src_id] != NULL)
> >>>         return -EINVAL;
> >>>
> >>
> >> We had some side discussions on IP block-per-instance approach.
> >> Personally, I was not in favour of it as I thought allowing IP block to
> >> handle its own instances is the better approach and that could handle
> >> dependencies between instances. Turns out that there are more like
> >> handling common things for all instances as in this example.
> >
> > We tried that route as well before this one and it was ugly as well.
> >
> >>
> >> I would prefer to revert the patch series and consider all angles before
> >> moving forward on the approach. Will leave this to Alex/Christian/Leo on
> >> the final decsion.
> >
> > Do the attached patches fix it?
> >
>
> This is kind of a piece-meal fix. This doesn't address the larger
> problem of how to handle things common for all IP instances.

I think we'll need to handle them as we encounter them.  We can always
split common stuff out to helpers which can be used by multiple
instances.  But I think once we get past this refactoring it will put
us in a better place for dealing with multiple IP instances.  Consider
the case of a part with multiple blocks of the same type with
different IP versions.  Those would not easily be handled with a
single IP block handling multiple IP instances.

Alex

>
> Thanks,
> Lijo
>
> > Alex
> >
> >>
> >> Thanks,
> >> Lijo
> >>
> >>> Regards
> >>> Jesse
> >>>
> >>>
> >>> Thanks,
> >>> Lijo
> >>>
> >>>> +};
> >>>>
> >>>>  static int vcn_v4_0_3_start_sriov(struct amdgpu_device *adev);
> >>>> static void vcn_v4_0_3_set_unified_ring_funcs(struct amdgpu_device
> >>>> *adev, int inst); @@ -150,9 +154,9 @@ static int vcn_v4_0_3_sw_init(struct amdgpu_ip_block *ip_block)
> >>>>       if (r)
> >>>>               return r;
> >>>>
> >>>> -     /* VCN DEC TRAP */
> >>>> -     r = amdgpu_irq_add_id(adev, SOC15_IH_CLIENTID_VCN,
> >>>> -             VCN_4_0__SRCID__UVD_ENC_GENERAL_PURPOSE, &adev->vcn.inst->irq);
> >>>> +     /* VCN UNIFIED TRAP */
> >>>> +     r = amdgpu_irq_add_id(adev, amdgpu_ih_clientid_vcns[inst],
> >>>> +                     VCN_4_0__SRCID__UVD_ENC_GENERAL_PURPOSE,
> >>>> +&adev->vcn.inst[inst].irq);
> >>>>       if (r)
> >>>>               return r;
> >>>>
> >>>> @@ -174,7 +178,7 @@ static int vcn_v4_0_3_sw_init(struct
> >>>> amdgpu_ip_block *ip_block)
> >>>>
> >>>>       ring->vm_hub = AMDGPU_MMHUB0(adev->vcn.inst[inst].aid_id);
> >>>>       sprintf(ring->name, "vcn_unified_%d", adev->vcn.inst[inst].aid_id);
> >>>> -     r = amdgpu_ring_init(adev, ring, 512, &adev->vcn.inst->irq, 0,
> >>>> +     r = amdgpu_ring_init(adev, ring, 512, &adev->vcn.inst[inst].irq, 0,
> >>>>                                AMDGPU_RING_PRIO_DEFAULT,
> >>>>                                &adev->vcn.inst[inst].sched_score);
> >>>>       if (r)
> >>>> @@ -1734,9 +1738,12 @@ static const struct amdgpu_irq_src_funcs vcn_v4_0_3_irq_funcs = {
> >>>>   */
> >>>>  static void vcn_v4_0_3_set_irq_funcs(struct amdgpu_device *adev, int
> >>>> inst)  {
> >>>> -     adev->vcn.inst->irq.num_types++;
> >>>> +     if (adev->vcn.harvest_config & (1 << inst))
> >>>> +             return;
> >>>> +
> >>>> +     adev->vcn.inst[inst].irq.num_types = adev->vcn.num_enc_rings + 1;
> >>>>
> >>>> -     adev->vcn.inst->irq.funcs = &vcn_v4_0_3_irq_funcs;
> >>>> +     adev->vcn.inst[inst].irq.funcs = &vcn_v4_0_3_irq_funcs;
> >>>>  }
> >>>>
> >>>>  static void vcn_v4_0_3_print_ip_state(struct amdgpu_ip_block
> >>>> *ip_block, struct drm_printer *p)