[AMD Official Use Only - AMD Internal Distribution Only] HI Alex, -----Original Message----- From: Lazar, Lijo <Lijo.Lazar@xxxxxxx> Sent: Wednesday, November 13, 2024 1:03 PM To: Alex Deucher <alexdeucher@xxxxxxxxx> Cc: Zhang, Jesse(Jie) <Jesse.Zhang@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Koenig, Christian <Christian.Koenig@xxxxxxx>; Prosyak, Vitaly <Vitaly.Prosyak@xxxxxxx>; Huang, Tim <Tim.Huang@xxxxxxx> Subject: Re: [PATCH 2/2] drm/amdgpu: fix vcn sw init failed On 11/13/2024 10:16 AM, Alex Deucher wrote: > On Tue, Nov 12, 2024 at 10:24 PM Lazar, Lijo <lijo.lazar@xxxxxxx> wrote: >> >> >> >> On 11/13/2024 7:58 AM, Zhang, Jesse(Jie) wrote: >>> [AMD Official Use Only - AMD Internal Distribution Only] >>> >>> Hi, Lijo, >>> >>> -----Original Message----- >>> From: Lazar, Lijo <Lijo.Lazar@xxxxxxx> >>> Sent: Tuesday, November 12, 2024 10:54 PM >>> To: Zhang, Jesse(Jie) <Jesse.Zhang@xxxxxxx>; >>> amd-gfx@xxxxxxxxxxxxxxxxxxxxx >>> Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Koenig, >>> Christian <Christian.Koenig@xxxxxxx>; Prosyak, Vitaly >>> <Vitaly.Prosyak@xxxxxxx>; Huang, Tim <Tim.Huang@xxxxxxx> >>> Subject: Re: [PATCH 2/2] drm/amdgpu: fix vcn sw init failed >>> >>> >>> >>> On 11/12/2024 8:00 PM, Jesse.zhang@xxxxxxx wrote: >>>> [ 2875.870277] [drm:amdgpu_device_init [amdgpu]] *ERROR* sw_init of >>>> IP block <vcn_v4_0_3> failed -22 [ 2875.880494] amdgpu 0000:01:00.0: >>>> amdgpu: amdgpu_device_ip_init failed [ 2875.887689] amdgpu >>>> 0000:01:00.0: amdgpu: Fatal error during GPU init [ 2875.894791] amdgpu 0000:01:00.0: amdgpu: amdgpu: finishing device. >>>> >>>> Add irqs with different IRQ source pointer for vcn0 and vcn1. >>>> >>>> Signed-off-by: Jesse Zhang <jesse.zhang@xxxxxxx> >>>> --- >>>> drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c | 19 +++++++++++++------ >>>> 1 file changed, 13 insertions(+), 6 deletions(-) >>>> >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c >>>> b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c >>>> index ef3dfd44a022..82b90f1e6f33 100644 >>>> --- a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c >>>> +++ b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c >>>> @@ -83,6 +83,10 @@ static const struct amdgpu_hwip_reg_entry >>>> vcn_reg_list_4_0_3[] = { >>>> >>>> #define NORMALIZE_VCN_REG_OFFSET(offset) \ >>>> (offset & 0x1FFFF) >>>> +static int amdgpu_ih_clientid_vcns[] = { >>>> + SOC15_IH_CLIENTID_VCN, >>>> + SOC15_IH_CLIENTID_VCN1 >>> >>> This is not valid for 4.0.3. It uses only the same client id, different node_id to distinguish. Also, there are max of 4 instances. >>> >>> I would say that entire IP instance series was done in a haste without applying thought and breaks other things including ip block mask. >>> >>> If the same client id is used, it returns -EINVAL (because of the following check) and sw init fails at step vcn_v4_0_3_sw_init / amdgpu_irq_add_id. >>> >>> amdgpu_irq_add_id: >>> if (adev->irq.client[client_id].sources[src_id] != NULL) >>> return -EINVAL; >>> >> >> We had some side discussions on IP block-per-instance approach. >> Personally, I was not in favour of it as I thought allowing IP block >> to handle its own instances is the better approach and that could >> handle dependencies between instances. Turns out that there are more >> like handling common things for all instances as in this example. > > We tried that route as well before this one and it was ugly as well. > >> >> I would prefer to revert the patch series and consider all angles >> before moving forward on the approach. Will leave this to >> Alex/Christian/Leo on the final decsion. > > Do the attached patches fix it? Yes, using your patch vcn sw init will pass. Please help to check this patch: [PATCH 1/2] drm/admgpu: fix vcn sw init failed Otherwise duplicate sysfs warnings will be displayed. sysfs: cannot create duplicate filename '/devices/pci0000:00/0000:00:01.1/0000:01:00.0/0000:02:00.0/0000:03:00.0/vcn_reset_mask' [ 562.443738] CPU: 13 PID: 4888 Comm: modprobe Tainted: G E 6.10.0+ #51 [ 562.443740] Hardware name: AMD Splinter/Splinter-RPL, BIOS VS2683299N.FD 05/10/2023 [ 562.443741] Call Trace: [ 562.443743] <TASK> [ 562.443746] dump_stack_lvl+0x70/0x90 [ 562.443751] dump_stack+0x14/0x20 [ 562.443753] sysfs_warn_dup+0x60/0x80 [ 562.443757] sysfs_add_file_mode_ns+0x126/0x130 [ 562.443760] sysfs_create_file_ns+0x68/0xa0 [ 562.443762] device_create_file+0x46/0x90 [ 562.443766] amdgpu_vcn_sysfs_reset_mask_init+0x1c/0x30 [amdgpu] [ 562.443991] vcn_v4_0_3_sw_init+0x270/0x3e0 [amdgpu] [ 562.444120] amdgpu_device_init+0x1a0e/0x35a0 [amdgpu] Regards Jesse This is kind of a piece-meal fix. This doesn't address the larger problem of how to handle things common for all IP instances. Thanks, Lijo > Alex > >> >> Thanks, >> Lijo >> >>> Regards >>> Jesse >>> >>> >>> Thanks, >>> Lijo >>> >>>> +}; >>>> >>>> static int vcn_v4_0_3_start_sriov(struct amdgpu_device *adev); >>>> static void vcn_v4_0_3_set_unified_ring_funcs(struct amdgpu_device >>>> *adev, int inst); @@ -150,9 +154,9 @@ static int vcn_v4_0_3_sw_init(struct amdgpu_ip_block *ip_block) >>>> if (r) >>>> return r; >>>> >>>> - /* VCN DEC TRAP */ >>>> - r = amdgpu_irq_add_id(adev, SOC15_IH_CLIENTID_VCN, >>>> - VCN_4_0__SRCID__UVD_ENC_GENERAL_PURPOSE, &adev->vcn.inst->irq); >>>> + /* VCN UNIFIED TRAP */ >>>> + r = amdgpu_irq_add_id(adev, amdgpu_ih_clientid_vcns[inst], >>>> + VCN_4_0__SRCID__UVD_ENC_GENERAL_PURPOSE, >>>> +&adev->vcn.inst[inst].irq); >>>> if (r) >>>> return r; >>>> >>>> @@ -174,7 +178,7 @@ static int vcn_v4_0_3_sw_init(struct >>>> amdgpu_ip_block *ip_block) >>>> >>>> ring->vm_hub = AMDGPU_MMHUB0(adev->vcn.inst[inst].aid_id); >>>> sprintf(ring->name, "vcn_unified_%d", adev->vcn.inst[inst].aid_id); >>>> - r = amdgpu_ring_init(adev, ring, 512, &adev->vcn.inst->irq, 0, >>>> + r = amdgpu_ring_init(adev, ring, 512, >>>> + &adev->vcn.inst[inst].irq, 0, >>>> AMDGPU_RING_PRIO_DEFAULT, >>>> &adev->vcn.inst[inst].sched_score); >>>> if (r) >>>> @@ -1734,9 +1738,12 @@ static const struct amdgpu_irq_src_funcs vcn_v4_0_3_irq_funcs = { >>>> */ >>>> static void vcn_v4_0_3_set_irq_funcs(struct amdgpu_device *adev, >>>> int >>>> inst) { >>>> - adev->vcn.inst->irq.num_types++; >>>> + if (adev->vcn.harvest_config & (1 << inst)) >>>> + return; >>>> + >>>> + adev->vcn.inst[inst].irq.num_types = adev->vcn.num_enc_rings >>>> + + 1; >>>> >>>> - adev->vcn.inst->irq.funcs = &vcn_v4_0_3_irq_funcs; >>>> + adev->vcn.inst[inst].irq.funcs = &vcn_v4_0_3_irq_funcs; >>>> } >>>> >>>> static void vcn_v4_0_3_print_ip_state(struct amdgpu_ip_block >>>> *ip_block, struct drm_printer *p)