On Wed, Nov 13, 2024 at 12:03 AM Lazar, Lijo <lijo.lazar@xxxxxxx> wrote: > > > > On 11/13/2024 10:16 AM, Alex Deucher wrote: > > On Tue, Nov 12, 2024 at 10:24 PM Lazar, Lijo <lijo.lazar@xxxxxxx> wrote: > >> > >> > >> > >> On 11/13/2024 7:58 AM, Zhang, Jesse(Jie) wrote: > >>> [AMD Official Use Only - AMD Internal Distribution Only] > >>> > >>> Hi, Lijo, > >>> > >>> -----Original Message----- > >>> From: Lazar, Lijo <Lijo.Lazar@xxxxxxx> > >>> Sent: Tuesday, November 12, 2024 10:54 PM > >>> To: Zhang, Jesse(Jie) <Jesse.Zhang@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx > >>> Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Koenig, Christian <Christian.Koenig@xxxxxxx>; Prosyak, Vitaly <Vitaly.Prosyak@xxxxxxx>; Huang, Tim <Tim.Huang@xxxxxxx> > >>> Subject: Re: [PATCH 2/2] drm/amdgpu: fix vcn sw init failed > >>> > >>> > >>> > >>> On 11/12/2024 8:00 PM, Jesse.zhang@xxxxxxx wrote: > >>>> [ 2875.870277] [drm:amdgpu_device_init [amdgpu]] *ERROR* sw_init of IP > >>>> block <vcn_v4_0_3> failed -22 [ 2875.880494] amdgpu 0000:01:00.0: > >>>> amdgpu: amdgpu_device_ip_init failed [ 2875.887689] amdgpu > >>>> 0000:01:00.0: amdgpu: Fatal error during GPU init [ 2875.894791] amdgpu 0000:01:00.0: amdgpu: amdgpu: finishing device. > >>>> > >>>> Add irqs with different IRQ source pointer for vcn0 and vcn1. > >>>> > >>>> Signed-off-by: Jesse Zhang <jesse.zhang@xxxxxxx> > >>>> --- > >>>> drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c | 19 +++++++++++++------ > >>>> 1 file changed, 13 insertions(+), 6 deletions(-) > >>>> > >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c > >>>> b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c > >>>> index ef3dfd44a022..82b90f1e6f33 100644 > >>>> --- a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c > >>>> +++ b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c > >>>> @@ -83,6 +83,10 @@ static const struct amdgpu_hwip_reg_entry > >>>> vcn_reg_list_4_0_3[] = { > >>>> > >>>> #define NORMALIZE_VCN_REG_OFFSET(offset) \ > >>>> (offset & 0x1FFFF) > >>>> +static int amdgpu_ih_clientid_vcns[] = { > >>>> + SOC15_IH_CLIENTID_VCN, > >>>> + SOC15_IH_CLIENTID_VCN1 > >>> > >>> This is not valid for 4.0.3. It uses only the same client id, different node_id to distinguish. Also, there are max of 4 instances. > >>> > >>> I would say that entire IP instance series was done in a haste without applying thought and breaks other things including ip block mask. > >>> > >>> If the same client id is used, it returns -EINVAL (because of the following check) and sw init fails at step vcn_v4_0_3_sw_init / amdgpu_irq_add_id. > >>> > >>> amdgpu_irq_add_id: > >>> if (adev->irq.client[client_id].sources[src_id] != NULL) > >>> return -EINVAL; > >>> > >> > >> We had some side discussions on IP block-per-instance approach. > >> Personally, I was not in favour of it as I thought allowing IP block to > >> handle its own instances is the better approach and that could handle > >> dependencies between instances. Turns out that there are more like > >> handling common things for all instances as in this example. > > > > We tried that route as well before this one and it was ugly as well. > > > >> > >> I would prefer to revert the patch series and consider all angles before > >> moving forward on the approach. Will leave this to Alex/Christian/Leo on > >> the final decsion. > > > > Do the attached patches fix it? > > > > This is kind of a piece-meal fix. This doesn't address the larger > problem of how to handle things common for all IP instances. I think we'll need to handle them as we encounter them. We can always split common stuff out to helpers which can be used by multiple instances. But I think once we get past this refactoring it will put us in a better place for dealing with multiple IP instances. Consider the case of a part with multiple blocks of the same type with different IP versions. Those would not easily be handled with a single IP block handling multiple IP instances. Alex > > Thanks, > Lijo > > > Alex > > > >> > >> Thanks, > >> Lijo > >> > >>> Regards > >>> Jesse > >>> > >>> > >>> Thanks, > >>> Lijo > >>> > >>>> +}; > >>>> > >>>> static int vcn_v4_0_3_start_sriov(struct amdgpu_device *adev); > >>>> static void vcn_v4_0_3_set_unified_ring_funcs(struct amdgpu_device > >>>> *adev, int inst); @@ -150,9 +154,9 @@ static int vcn_v4_0_3_sw_init(struct amdgpu_ip_block *ip_block) > >>>> if (r) > >>>> return r; > >>>> > >>>> - /* VCN DEC TRAP */ > >>>> - r = amdgpu_irq_add_id(adev, SOC15_IH_CLIENTID_VCN, > >>>> - VCN_4_0__SRCID__UVD_ENC_GENERAL_PURPOSE, &adev->vcn.inst->irq); > >>>> + /* VCN UNIFIED TRAP */ > >>>> + r = amdgpu_irq_add_id(adev, amdgpu_ih_clientid_vcns[inst], > >>>> + VCN_4_0__SRCID__UVD_ENC_GENERAL_PURPOSE, > >>>> +&adev->vcn.inst[inst].irq); > >>>> if (r) > >>>> return r; > >>>> > >>>> @@ -174,7 +178,7 @@ static int vcn_v4_0_3_sw_init(struct > >>>> amdgpu_ip_block *ip_block) > >>>> > >>>> ring->vm_hub = AMDGPU_MMHUB0(adev->vcn.inst[inst].aid_id); > >>>> sprintf(ring->name, "vcn_unified_%d", adev->vcn.inst[inst].aid_id); > >>>> - r = amdgpu_ring_init(adev, ring, 512, &adev->vcn.inst->irq, 0, > >>>> + r = amdgpu_ring_init(adev, ring, 512, &adev->vcn.inst[inst].irq, 0, > >>>> AMDGPU_RING_PRIO_DEFAULT, > >>>> &adev->vcn.inst[inst].sched_score); > >>>> if (r) > >>>> @@ -1734,9 +1738,12 @@ static const struct amdgpu_irq_src_funcs vcn_v4_0_3_irq_funcs = { > >>>> */ > >>>> static void vcn_v4_0_3_set_irq_funcs(struct amdgpu_device *adev, int > >>>> inst) { > >>>> - adev->vcn.inst->irq.num_types++; > >>>> + if (adev->vcn.harvest_config & (1 << inst)) > >>>> + return; > >>>> + > >>>> + adev->vcn.inst[inst].irq.num_types = adev->vcn.num_enc_rings + 1; > >>>> > >>>> - adev->vcn.inst->irq.funcs = &vcn_v4_0_3_irq_funcs; > >>>> + adev->vcn.inst[inst].irq.funcs = &vcn_v4_0_3_irq_funcs; > >>>> } > >>>> > >>>> static void vcn_v4_0_3_print_ip_state(struct amdgpu_ip_block > >>>> *ip_block, struct drm_printer *p)