Re: [PATCH 2/2] drm/amdgpu: fix vcn sw init failed

Alex Deucher <alexdeucher@xxxxxxxxx> · Wed, 13 Nov 2024 16:43:48 -0500

On Wed, Nov 13, 2024 at 12:32 AM Lazar, Lijo <lijo.lazar@xxxxxxx> wrote:
>
>
>
> On 11/13/2024 10:54 AM, Alex Deucher wrote:
> > On Wed, Nov 13, 2024 at 12:03 AM Lazar, Lijo <lijo.lazar@xxxxxxx> wrote:
> >>
> >>
> >>
> >> On 11/13/2024 10:16 AM, Alex Deucher wrote:
> >>> On Tue, Nov 12, 2024 at 10:24 PM Lazar, Lijo <lijo.lazar@xxxxxxx> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 11/13/2024 7:58 AM, Zhang, Jesse(Jie) wrote:
> >>>>> [AMD Official Use Only - AMD Internal Distribution Only]
> >>>>>
> >>>>> Hi, Lijo,
> >>>>>
> >>>>> -----Original Message-----
> >>>>> From: Lazar, Lijo <Lijo.Lazar@xxxxxxx>
> >>>>> Sent: Tuesday, November 12, 2024 10:54 PM
> >>>>> To: Zhang, Jesse(Jie) <Jesse.Zhang@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> >>>>> Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Koenig, Christian <Christian.Koenig@xxxxxxx>; Prosyak, Vitaly <Vitaly.Prosyak@xxxxxxx>; Huang, Tim <Tim.Huang@xxxxxxx>
> >>>>> Subject: Re: [PATCH 2/2] drm/amdgpu: fix vcn sw init failed
> >>>>>
> >>>>>
> >>>>>
> >>>>> On 11/12/2024 8:00 PM, Jesse.zhang@xxxxxxx wrote:
> >>>>>> [ 2875.870277] [drm:amdgpu_device_init [amdgpu]] *ERROR* sw_init of IP
> >>>>>> block <vcn_v4_0_3> failed -22 [ 2875.880494] amdgpu 0000:01:00.0:
> >>>>>> amdgpu: amdgpu_device_ip_init failed [ 2875.887689] amdgpu
> >>>>>> 0000:01:00.0: amdgpu: Fatal error during GPU init [ 2875.894791] amdgpu 0000:01:00.0: amdgpu: amdgpu: finishing device.
> >>>>>>
> >>>>>> Add irqs with different IRQ source pointer for vcn0 and vcn1.
> >>>>>>
> >>>>>> Signed-off-by: Jesse Zhang <jesse.zhang@xxxxxxx>
> >>>>>> ---
> >>>>>>  drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c | 19 +++++++++++++------
> >>>>>>  1 file changed, 13 insertions(+), 6 deletions(-)
> >>>>>>
> >>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c
> >>>>>> b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c
> >>>>>> index ef3dfd44a022..82b90f1e6f33 100644
> >>>>>> --- a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c
> >>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c
> >>>>>> @@ -83,6 +83,10 @@ static const struct amdgpu_hwip_reg_entry
> >>>>>> vcn_reg_list_4_0_3[] = {
> >>>>>>
> >>>>>>  #define NORMALIZE_VCN_REG_OFFSET(offset) \
> >>>>>>               (offset & 0x1FFFF)
> >>>>>> +static int amdgpu_ih_clientid_vcns[] = {
> >>>>>> +     SOC15_IH_CLIENTID_VCN,
> >>>>>> +     SOC15_IH_CLIENTID_VCN1
> >>>>>
> >>>>> This is not valid for 4.0.3. It uses only the same client id, different node_id to distinguish. Also, there are max of 4 instances.
> >>>>>
> >>>>> I would say that entire IP instance series was done in a haste without applying thought and breaks other things including ip block mask.
> >>>>>
> >>>>> If the same client id is used, it returns -EINVAL (because of the following check) and sw init fails at step vcn_v4_0_3_sw_init / amdgpu_irq_add_id.
> >>>>>
> >>>>> amdgpu_irq_add_id:
> >>>>> if (adev->irq.client[client_id].sources[src_id] != NULL)
> >>>>>         return -EINVAL;
> >>>>>
> >>>>
> >>>> We had some side discussions on IP block-per-instance approach.
> >>>> Personally, I was not in favour of it as I thought allowing IP block to
> >>>> handle its own instances is the better approach and that could handle
> >>>> dependencies between instances. Turns out that there are more like
> >>>> handling common things for all instances as in this example.
> >>>
> >>> We tried that route as well before this one and it was ugly as well.
> >>>
> >>>>
> >>>> I would prefer to revert the patch series and consider all angles before
> >>>> moving forward on the approach. Will leave this to Alex/Christian/Leo on
> >>>> the final decsion.
> >>>
> >>> Do the attached patches fix it?
> >>>
> >>
> >> This is kind of a piece-meal fix. This doesn't address the larger
> >> problem of how to handle things common for all IP instances.
> >
> > I think we'll need to handle them as we encounter them.  We can always
> > split common stuff out to helpers which can be used by multiple
> > instances.
>
> I don't think so. It made a fundamental change. We changed the base
> layer of considering IP as a single block. A common swinit or swfini is
> no longer the case. Consider how a sysfs initialization like
> enable_isolation could be handled if the same approach is taken for GFX IP.
>
> I would still say that we broke the current foundation with this
> approach and hoping that uppper layer fixes can help to hold things
> together. Or, it needs a start-from-scratch approach.

This was the original intention when we first designed the driver and
the IP block structures.  Unfortunately all of the early chips only
had one instance of each IP block and since then we've just built up
technical debt with respect to the instance handling.  That said, I
think the rework of this level at this point is probably too much, in
digging through the current state, there are just too many corner
cases to fix up.  I agree we should probably forgo it at this point.

Alex

Alex

>
> Thanks,
> Lijo
>
>   But I think once we get past this refactoring it will put
> > us in a better place for dealing with multiple IP instances.  Consider
> > the case of a part with multiple blocks of the same type with
> > different IP versions.  Those would not easily be handled with a
> > single IP block handling multiple IP instances.
> >
> > Alex
> >
> >>
> >> Thanks,
> >> Lijo
> >>
> >>> Alex
> >>>
> >>>>
> >>>> Thanks,
> >>>> Lijo
> >>>>
> >>>>> Regards
> >>>>> Jesse
> >>>>>
> >>>>>
> >>>>> Thanks,
> >>>>> Lijo
> >>>>>
> >>>>>> +};
> >>>>>>
> >>>>>>  static int vcn_v4_0_3_start_sriov(struct amdgpu_device *adev);
> >>>>>> static void vcn_v4_0_3_set_unified_ring_funcs(struct amdgpu_device
> >>>>>> *adev, int inst); @@ -150,9 +154,9 @@ static int vcn_v4_0_3_sw_init(struct amdgpu_ip_block *ip_block)
> >>>>>>       if (r)
> >>>>>>               return r;
> >>>>>>
> >>>>>> -     /* VCN DEC TRAP */
> >>>>>> -     r = amdgpu_irq_add_id(adev, SOC15_IH_CLIENTID_VCN,
> >>>>>> -             VCN_4_0__SRCID__UVD_ENC_GENERAL_PURPOSE, &adev->vcn.inst->irq);
> >>>>>> +     /* VCN UNIFIED TRAP */
> >>>>>> +     r = amdgpu_irq_add_id(adev, amdgpu_ih_clientid_vcns[inst],
> >>>>>> +                     VCN_4_0__SRCID__UVD_ENC_GENERAL_PURPOSE,
> >>>>>> +&adev->vcn.inst[inst].irq);
> >>>>>>       if (r)
> >>>>>>               return r;
> >>>>>>
> >>>>>> @@ -174,7 +178,7 @@ static int vcn_v4_0_3_sw_init(struct
> >>>>>> amdgpu_ip_block *ip_block)
> >>>>>>
> >>>>>>       ring->vm_hub = AMDGPU_MMHUB0(adev->vcn.inst[inst].aid_id);
> >>>>>>       sprintf(ring->name, "vcn_unified_%d", adev->vcn.inst[inst].aid_id);
> >>>>>> -     r = amdgpu_ring_init(adev, ring, 512, &adev->vcn.inst->irq, 0,
> >>>>>> +     r = amdgpu_ring_init(adev, ring, 512, &adev->vcn.inst[inst].irq, 0,
> >>>>>>                                AMDGPU_RING_PRIO_DEFAULT,
> >>>>>>                                &adev->vcn.inst[inst].sched_score);
> >>>>>>       if (r)
> >>>>>> @@ -1734,9 +1738,12 @@ static const struct amdgpu_irq_src_funcs vcn_v4_0_3_irq_funcs = {
> >>>>>>   */
> >>>>>>  static void vcn_v4_0_3_set_irq_funcs(struct amdgpu_device *adev, int
> >>>>>> inst)  {
> >>>>>> -     adev->vcn.inst->irq.num_types++;
> >>>>>> +     if (adev->vcn.harvest_config & (1 << inst))
> >>>>>> +             return;
> >>>>>> +
> >>>>>> +     adev->vcn.inst[inst].irq.num_types = adev->vcn.num_enc_rings + 1;
> >>>>>>
> >>>>>> -     adev->vcn.inst->irq.funcs = &vcn_v4_0_3_irq_funcs;
> >>>>>> +     adev->vcn.inst[inst].irq.funcs = &vcn_v4_0_3_irq_funcs;
> >>>>>>  }
> >>>>>>
> >>>>>>  static void vcn_v4_0_3_print_ip_state(struct amdgpu_ip_block
> >>>>>> *ip_block, struct drm_printer *p)