On 11/13/2024 10:54 AM, Alex Deucher wrote: > On Wed, Nov 13, 2024 at 12:03 AM Lazar, Lijo <lijo.lazar@xxxxxxx> wrote: >> >> >> >> On 11/13/2024 10:16 AM, Alex Deucher wrote: >>> On Tue, Nov 12, 2024 at 10:24 PM Lazar, Lijo <lijo.lazar@xxxxxxx> wrote: >>>> >>>> >>>> >>>> On 11/13/2024 7:58 AM, Zhang, Jesse(Jie) wrote: >>>>> [AMD Official Use Only - AMD Internal Distribution Only] >>>>> >>>>> Hi, Lijo, >>>>> >>>>> -----Original Message----- >>>>> From: Lazar, Lijo <Lijo.Lazar@xxxxxxx> >>>>> Sent: Tuesday, November 12, 2024 10:54 PM >>>>> To: Zhang, Jesse(Jie) <Jesse.Zhang@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx >>>>> Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Koenig, Christian <Christian.Koenig@xxxxxxx>; Prosyak, Vitaly <Vitaly.Prosyak@xxxxxxx>; Huang, Tim <Tim.Huang@xxxxxxx> >>>>> Subject: Re: [PATCH 2/2] drm/amdgpu: fix vcn sw init failed >>>>> >>>>> >>>>> >>>>> On 11/12/2024 8:00 PM, Jesse.zhang@xxxxxxx wrote: >>>>>> [ 2875.870277] [drm:amdgpu_device_init [amdgpu]] *ERROR* sw_init of IP >>>>>> block <vcn_v4_0_3> failed -22 [ 2875.880494] amdgpu 0000:01:00.0: >>>>>> amdgpu: amdgpu_device_ip_init failed [ 2875.887689] amdgpu >>>>>> 0000:01:00.0: amdgpu: Fatal error during GPU init [ 2875.894791] amdgpu 0000:01:00.0: amdgpu: amdgpu: finishing device. >>>>>> >>>>>> Add irqs with different IRQ source pointer for vcn0 and vcn1. >>>>>> >>>>>> Signed-off-by: Jesse Zhang <jesse.zhang@xxxxxxx> >>>>>> --- >>>>>> drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c | 19 +++++++++++++------ >>>>>> 1 file changed, 13 insertions(+), 6 deletions(-) >>>>>> >>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c >>>>>> b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c >>>>>> index ef3dfd44a022..82b90f1e6f33 100644 >>>>>> --- a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c >>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c >>>>>> @@ -83,6 +83,10 @@ static const struct amdgpu_hwip_reg_entry >>>>>> vcn_reg_list_4_0_3[] = { >>>>>> >>>>>> #define NORMALIZE_VCN_REG_OFFSET(offset) \ >>>>>> (offset & 0x1FFFF) >>>>>> +static int amdgpu_ih_clientid_vcns[] = { >>>>>> + SOC15_IH_CLIENTID_VCN, >>>>>> + SOC15_IH_CLIENTID_VCN1 >>>>> >>>>> This is not valid for 4.0.3. It uses only the same client id, different node_id to distinguish. Also, there are max of 4 instances. >>>>> >>>>> I would say that entire IP instance series was done in a haste without applying thought and breaks other things including ip block mask. >>>>> >>>>> If the same client id is used, it returns -EINVAL (because of the following check) and sw init fails at step vcn_v4_0_3_sw_init / amdgpu_irq_add_id. >>>>> >>>>> amdgpu_irq_add_id: >>>>> if (adev->irq.client[client_id].sources[src_id] != NULL) >>>>> return -EINVAL; >>>>> >>>> >>>> We had some side discussions on IP block-per-instance approach. >>>> Personally, I was not in favour of it as I thought allowing IP block to >>>> handle its own instances is the better approach and that could handle >>>> dependencies between instances. Turns out that there are more like >>>> handling common things for all instances as in this example. >>> >>> We tried that route as well before this one and it was ugly as well. >>> >>>> >>>> I would prefer to revert the patch series and consider all angles before >>>> moving forward on the approach. Will leave this to Alex/Christian/Leo on >>>> the final decsion. >>> >>> Do the attached patches fix it? >>> >> >> This is kind of a piece-meal fix. This doesn't address the larger >> problem of how to handle things common for all IP instances. > > I think we'll need to handle them as we encounter them. We can always > split common stuff out to helpers which can be used by multiple > instances. I don't think so. It made a fundamental change. We changed the base layer of considering IP as a single block. A common swinit or swfini is no longer the case. Consider how a sysfs initialization like enable_isolation could be handled if the same approach is taken for GFX IP. I would still say that we broke the current foundation with this approach and hoping that uppper layer fixes can help to hold things together. Or, it needs a start-from-scratch approach. Thanks, Lijo But I think once we get past this refactoring it will put > us in a better place for dealing with multiple IP instances. Consider > the case of a part with multiple blocks of the same type with > different IP versions. Those would not easily be handled with a > single IP block handling multiple IP instances. > > Alex > >> >> Thanks, >> Lijo >> >>> Alex >>> >>>> >>>> Thanks, >>>> Lijo >>>> >>>>> Regards >>>>> Jesse >>>>> >>>>> >>>>> Thanks, >>>>> Lijo >>>>> >>>>>> +}; >>>>>> >>>>>> static int vcn_v4_0_3_start_sriov(struct amdgpu_device *adev); >>>>>> static void vcn_v4_0_3_set_unified_ring_funcs(struct amdgpu_device >>>>>> *adev, int inst); @@ -150,9 +154,9 @@ static int vcn_v4_0_3_sw_init(struct amdgpu_ip_block *ip_block) >>>>>> if (r) >>>>>> return r; >>>>>> >>>>>> - /* VCN DEC TRAP */ >>>>>> - r = amdgpu_irq_add_id(adev, SOC15_IH_CLIENTID_VCN, >>>>>> - VCN_4_0__SRCID__UVD_ENC_GENERAL_PURPOSE, &adev->vcn.inst->irq); >>>>>> + /* VCN UNIFIED TRAP */ >>>>>> + r = amdgpu_irq_add_id(adev, amdgpu_ih_clientid_vcns[inst], >>>>>> + VCN_4_0__SRCID__UVD_ENC_GENERAL_PURPOSE, >>>>>> +&adev->vcn.inst[inst].irq); >>>>>> if (r) >>>>>> return r; >>>>>> >>>>>> @@ -174,7 +178,7 @@ static int vcn_v4_0_3_sw_init(struct >>>>>> amdgpu_ip_block *ip_block) >>>>>> >>>>>> ring->vm_hub = AMDGPU_MMHUB0(adev->vcn.inst[inst].aid_id); >>>>>> sprintf(ring->name, "vcn_unified_%d", adev->vcn.inst[inst].aid_id); >>>>>> - r = amdgpu_ring_init(adev, ring, 512, &adev->vcn.inst->irq, 0, >>>>>> + r = amdgpu_ring_init(adev, ring, 512, &adev->vcn.inst[inst].irq, 0, >>>>>> AMDGPU_RING_PRIO_DEFAULT, >>>>>> &adev->vcn.inst[inst].sched_score); >>>>>> if (r) >>>>>> @@ -1734,9 +1738,12 @@ static const struct amdgpu_irq_src_funcs vcn_v4_0_3_irq_funcs = { >>>>>> */ >>>>>> static void vcn_v4_0_3_set_irq_funcs(struct amdgpu_device *adev, int >>>>>> inst) { >>>>>> - adev->vcn.inst->irq.num_types++; >>>>>> + if (adev->vcn.harvest_config & (1 << inst)) >>>>>> + return; >>>>>> + >>>>>> + adev->vcn.inst[inst].irq.num_types = adev->vcn.num_enc_rings + 1; >>>>>> >>>>>> - adev->vcn.inst->irq.funcs = &vcn_v4_0_3_irq_funcs; >>>>>> + adev->vcn.inst[inst].irq.funcs = &vcn_v4_0_3_irq_funcs; >>>>>> } >>>>>> >>>>>> static void vcn_v4_0_3_print_ip_state(struct amdgpu_ip_block >>>>>> *ip_block, struct drm_printer *p)