RE: [PATCH 2/2] drm/amdgpu: fix amdgpu_irq_put call trace in vcn_v4_0_hw_fini

"Zhou1, Tao" <Tao.Zhou1@xxxxxxx> · Mon, 8 May 2023 11:05:17 +0000

[AMD Official Use Only - General]

> -----Original Message-----
> From: amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> On Behalf Of Horatio
> Zhang
> Sent: Monday, May 8, 2023 6:20 PM
> To: amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> Cc: Liu, HaoPing (Alan) <HaoPing.Liu@xxxxxxx>; Zhang, Horatio
> <Hongkun.Zhang@xxxxxxx>; Xu, Feifei <Feifei.Xu@xxxxxxx>; Zhou1, Tao
> <Tao.Zhou1@xxxxxxx>; Jiang, Sonny <Sonny.Jiang@xxxxxxx>; Limonciello,
> Mario <Mario.Limonciello@xxxxxxx>; Liu, Leo <Leo.Liu@xxxxxxx>; Zhang,
> Hawking <Hawking.Zhang@xxxxxxx>
> Subject: [PATCH 2/2] drm/amdgpu: fix amdgpu_irq_put call trace in
> vcn_v4_0_hw_fini
> 
> During suspend, the vcn_v4_0_hw_fini function uses amdgpu_irq_put to
> disable the irq of each vcn.inst, but the irq was not enabled during the
> resume process, which results in a call trace during the GPU reset process.
> 
> [   44.563572] RIP: 0010:amdgpu_irq_put+0xa4/0xc0 [amdgpu]
> [   44.563629] RSP: 0018:ffffb36740edfc90 EFLAGS: 00010246
> [   44.563630] RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000
> [   44.563630] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
> [   44.563631] RBP: ffffb36740edfcb0 R08: 0000000000000000 R09: 0000000000000000
> [   44.563631] R10: 0000000000000000 R11: 0000000000000000 R12: ffff954c568e2ea8
> [   44.563631] R13: 0000000000000000 R14: ffff954c568c0000 R15: ffff954c568e2ea8
> [   44.563632] FS:  0000000000000000(0000) GS:ffff954f584c0000(0000) knlGS:0000000000000000
> [   44.563632] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   44.563633] CR2: 00007f028741ba70 CR3: 000000026ca10000 CR4: 0000000000750ee0
> [   44.563633] PKRU: 55555554
> [   44.563633] Call Trace:
> [   44.563634]  <TASK>
> [   44.563634]  vcn_v4_0_hw_fini+0x62/0x160 [amdgpu]
> [   44.563700]  vcn_v4_0_suspend+0x13/0x30 [amdgpu]
> [   44.563755]  amdgpu_device_ip_suspend_phase2+0x240/0x470 [amdgpu]
> [   44.563806]  amdgpu_device_ip_suspend+0x41/0x80 [amdgpu]
> [   44.563858]  amdgpu_device_pre_asic_reset+0xd9/0x4a0 [amdgpu]
> [   44.563909]  amdgpu_device_gpu_recover.cold+0x548/0xcf1 [amdgpu]
> [   44.564006]  amdgpu_debugfs_reset_work+0x4c/0x80 [amdgpu]
> [   44.564061]  process_one_work+0x21f/0x400
> [   44.564062]  worker_thread+0x200/0x3f0
> [   44.564063]  ? process_one_work+0x400/0x400
> [   44.564064]  kthread+0xee/0x120
> [   44.564065]  ? kthread_complete_and_exit+0x20/0x20
> [   44.564066]  ret_from_fork+0x22/0x30
> 
> Fixes: ea5309de7388 ("drm/amdgpu: add VCN 4.0 RAS poison consumption handling")
> Signed-off-by: Horatio Zhang <Hongkun.Zhang@xxxxxxx>
> ---
>  drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c | 17 ++++++++++++++++-
>  1 file changed, 16 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c
> index bf0674039598..b55eb1bf3e30 100644
> --- a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c
> @@ -281,6 +281,21 @@ static int vcn_v4_0_hw_init(void *handle)
>  	return r;
>  }
> 
> +static int vcn_v4_0_late_init(void *handle)
> +{
> +	struct amdgpu_device *adev = (struct amdgpu_device *)handle;
> +	int i;
> +
> +	for (i = 0; i < adev->vcn.num_vcn_inst; ++i) {
> +		if (adev->vcn.harvest_config & (1 << i))
> +			continue;
> +
> +		amdgpu_irq_get(adev, &adev->vcn.inst[i].irq, 0);

[Tao] We could also check its return value and exit if r is non-zero, but either way is fine with me.

> +	}
> +
> +	return 0;
> +}
> +
>  /**
>   * vcn_v4_0_hw_fini - stop the hardware block
>   *
> @@ -2047,7 +2062,7 @@ static void vcn_v4_0_set_irq_funcs(struct amdgpu_device *adev)
>  static const struct amd_ip_funcs vcn_v4_0_ip_funcs = {
>  	.name = "vcn_v4_0",
>  	.early_init = vcn_v4_0_early_init,
> -	.late_init = NULL,
> +	.late_init = vcn_v4_0_late_init,
>  	.sw_init = vcn_v4_0_sw_init,
>  	.sw_fini = vcn_v4_0_sw_fini,
>  	.hw_init = vcn_v4_0_hw_init,
> --
> 2.34.1
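
For reference, the return-value check suggested in the inline comment could look roughly like the sketch below. This is a user-space mock, not the driver code: the mock_* types and mock_irq_get() are hypothetical stand-ins for struct amdgpu_device and amdgpu_irq_get(), assumed here to return 0 on success and a negative errno on failure. Only the loop shape mirrors the posted vcn_v4_0_late_init().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MOCK_MAX_VCN_INST 4	/* hypothetical bound, for this mock only */

/* Hypothetical stand-ins for the kernel structures; only the fields the
 * loop touches are modelled. */
struct mock_irq_src {
	int refcount;	/* how many times the source has been "enabled" */
	int fail;	/* set to 1 to make mock_irq_get() fail */
};

struct mock_vcn_inst {
	struct mock_irq_src irq;
};

struct mock_device {
	struct {
		unsigned int num_vcn_inst;
		unsigned int harvest_config;
		struct mock_vcn_inst inst[MOCK_MAX_VCN_INST];
	} vcn;
};

/* Mock of amdgpu_irq_get(): assumed to return 0 on success and a
 * negative errno on failure. */
static int mock_irq_get(struct mock_device *adev, struct mock_irq_src *src,
			unsigned int type)
{
	(void)adev;
	(void)type;
	assert(src != NULL);

	if (src->fail)
		return -EINVAL;

	src->refcount++;
	return 0;
}

/* Same loop as the posted late_init, but propagating the first error. */
static int mock_vcn_late_init(struct mock_device *adev)
{
	unsigned int i;
	int r;

	for (i = 0; i < adev->vcn.num_vcn_inst; ++i) {
		/* Skip harvested (disabled) instances. */
		if (adev->vcn.harvest_config & (1u << i))
			continue;

		r = mock_irq_get(adev, &adev->vcn.inst[i].irq, 0);
		if (r)
			return r;
	}

	return 0;
}
```

Note that on an early return this sketch leaves any references already taken on earlier instances in place; whether those should be dropped before returning is a separate decision for the real patch.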