Re: [v4 1/5] drm/amdgpu: clear adev->in_suspend flag when fails to suspend

Gerry Liu <gerry@xxxxxxxxxxxxxxxxx> · Tue, 14 Jan 2025 11:12:38 +0800



> 2025年1月11日 02:57，Mario Limonciello <mario.limonciello@xxxxxxx> 写道：
> 
> On 1/9/2025 20:08, Jiang Liu wrote:
>> Clear adev->in_suspend flag when fails to suspend, otherwise it will
>> cause too much warnings like:
>> [ 1802.212027] ------------[ cut here ]------------
>> [ 1802.212028] WARNING: CPU: 97 PID: 11282 at drivers/gpu/drm/amd/amdgpu/amdgpu_object.c:452 amdgpu_bo_free_kernel+0xf9/0x120 [amdgpu]
>> [ 1802.212198] Modules linked in: amdgpu(E-) tcp_diag(E) inet_diag(E) rfkill(E) intel_rapl_msr(E) intel_rapl_common(E) intel_uncore_frequency(E) intel_uncore_frequency_common(E) i10nm_edac(E) nfit(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) coretemp(E) kvm_intel(E) kvm(E) crct10dif_pclmul(E) crc32_pclmul(E) ghash_clmulni_intel(E) rapl(E) iTCO_wdt(E) pmt_telemetry(E) iTCO_vendor_support(E) pmt_class(E) intel_cstate(E) snd_hda_intel(E) ipmi_ssif(E) snd_intel_dspcfg(E) snd_hda_codec(E) snd_hda_core(E) snd_hwdep(E) snd_pcm(E) snd_timer(E) intel_uncore(E) snd(E) cdc_ether(E) pcspkr(E) i2c_i801(E) idxd(E) mei_me(E) usbnet(E) ses(E) isst_if_mmio(E) isst_if_mbox_pci(E) mii(E) joydev(E) soundcore(E) isst_if_common(E) idxd_bus(E) mei(E) enclosure(E) intel_vsec(E) i2c_smbus(E) i2c_ismt(E) sunrpc(E) acpi_power_meter(E) ipmi_si(E) acpi_ipmi(E) ipmi_devintf(E) acpi_pad(E) ipmi_msghandler(E) vfat(E) fat(E) sg(E) video(E) amdxcp(E) drm_ttm_helper(E) ttm(E) drm_exec(E) gpu_sched(E) drm_suballoc_helper(E) crc32c_intel(E) drm_buddy(E)
>> [ 1802.212223]  ast(E) drm_shmem_helper(E) drm_display_helper(E) i2c_algo_bit(E) drm_kms_helper(E) virtio_net(E) mpt3sas(E) drm(E) net_failover(E) ahci(E) raid_class(E) failover(E) libahci(E) scsi_transport_sas(E) dimlib(E) libata(E) i2c_core(E) wmi(E) pinctrl_emmitsburg(E) [last unloaded: amdgpu(E)]
>> [ 1802.212231] CPU: 97 PID: 11282 Comm: rmmod Kdump: loaded Tainted: G        W   E      6.10.0+ #2
>> [ 1802.212232] Hardware name: Alibaba Alibaba Cloud ECS/Alibaba Cloud ECS, BIOS 3.0.ES.AL.P.087.05 04/07/2024
>> [ 1802.212232] RIP: 0010:amdgpu_bo_free_kernel+0xf9/0x120 [amdgpu]
>> [ 1802.212401] Code: 00 00 00 4d 85 e4 74 08 49 c7 04 24 00 00 00 00 48 85 ed 74 08 48 c7 45 00 00 00 00 00 5b 5d 41 5c 41 5d 41 5e c3 cc cc cc cc <0f> 0b e9 3b ff ff ff 3d 00 fe ff ff 74 b3 49 8b be b0 10 ff ff 4c
>> [ 1802.212402] RSP: 0018:ffffbe0e92087cb0 EFLAGS: 00010202
>> [ 1802.212403] RAX: 0000000000000206 RBX: ffff9cbc06c37f10 RCX: 000000000020000d
>> [ 1802.212404] RDX: ffff9cbc06c37f18 RSI: ffff9cbc06c37f58 RDI: ffff9cbc06c37f10
>> [ 1802.212407] RBP: ffff9cbc06c37f18 R08: ffff9cba42388800 R09: 000000000020000d
>> [ 1802.212409] R10: 0000000000040000 R11: 0000000000000006 R12: ffff9cbc06c37f58
>> [ 1802.212410] R13: ffff9cba4238dc00 R14: ffff9cbc06c0ef50 R15: ffff9cbc06c00000
>> [ 1802.212411] FS:  00007f927dc4e740(0000) GS:ffff9db47e880000(0000) knlGS:0000000000000000
>> [ 1802.212412] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 1802.212413] CR2: 000056342b2be1a8 CR3: 00000003f1546005 CR4: 0000000000f70ef0
>> [ 1802.212413] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> [ 1802.212414] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
>> [ 1802.212415] PKRU: 55555554
>> [ 1802.212415] Call Trace:
>> [ 1802.212416]  <TASK>
>> [ 1802.212417]  ? __warn+0x83/0x130
>> [ 1802.212419]  ? amdgpu_bo_free_kernel+0xf9/0x120 [amdgpu]
>> [ 1802.212586]  ? __report_bug+0xea/0x100
>> [ 1802.212588]  ? report_bug+0x24/0x70
>> [ 1802.212589]  ? handle_bug+0x3c/0x70
>> [ 1802.212590]  ? exc_invalid_op+0x18/0x70
>> [ 1802.212591]  ? asm_exc_invalid_op+0x1a/0x20
>> [ 1802.212594]  ? amdgpu_bo_free_kernel+0xf9/0x120 [amdgpu]
>> [ 1802.212746]  amdgpu_ring_fini+0x91/0x120 [amdgpu]
>> [ 1802.212901]  amdgpu_jpeg_sw_fini+0xb2/0xe0 [amdgpu]
>> [ 1802.213106]  amdgpu_device_ip_fini.isra.0+0xb1/0x1c0 [amdgpu]
>> [ 1802.213247]  amdgpu_device_fini_sw+0x49/0x290 [amdgpu]
>> [ 1802.213413]  amdgpu_driver_release_kms+0x16/0x30 [amdgpu]
>> [ 1802.213576]  devm_drm_dev_init_release+0x4e/0x70 [drm]
>> [ 1802.213602]  release_nodes+0x35/0xb0
>> [ 1802.213605]  devres_release_all+0x8f/0xd0
>> [ 1802.213606]  device_unbind_cleanup+0xe/0x70
>> [ 1802.213609]  device_release_driver_internal+0x1bc/0x200
>> [ 1802.213611]  driver_detach+0x48/0x90
>> [ 1802.213613]  bus_remove_driver+0x6d/0xf0
>> [ 1802.213615]  pci_unregister_driver+0x2e/0xb0
>> [ 1802.213618]  amdgpu_exit+0x15/0x1c4 [amdgpu]
>> [ 1802.213861]  __do_sys_delete_module.constprop.0+0x176/0x310
>> [ 1802.213863]  do_syscall_64+0x5d/0x170
>> [ 1802.213866]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
>> [ 1802.213869] RIP: 0033:0x7f927d6620cb
>> [ 1802.213870] Code: 73 01 c3 48 8b 0d a5 6d 19 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 b0 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 75 6d 19 00 f7 d8 64 89 01 48
>> [ 1802.213871] RSP: 002b:00007fff031d9d78 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
>> [ 1802.213873] RAX: ffffffffffffffda RBX: 000055d8d6df69e0 RCX: 00007f927d6620cb
>> [ 1802.213873] RDX: 000000000000000a RSI: 0000000000000800 RDI: 000055d8d6df6a48
>> [ 1802.213874] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
>> [ 1802.213875] R10: 00007f927d7aaac0 R11: 0000000000000206 R12: 00007fff031d9fa0
>> [ 1802.213876] R13: 00007fff031da5f1 R14: 000055d8d6df62a0 R15: 000055d8d6df69e0
>> [ 1802.213878]  </TASK>
>> [ 1802.213879] ---[ end trace 0000000000000000 ]---
> 
> This patch seems like it's attacking two different problems; a failure to suspend and a failure to resume.  Some comments inline on each.
> 
> 
>> Signed-off-by: Jiang Liu <gerry@xxxxxxxxxxxxxxxxx>
>> ---
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 13 +++++++++----
>>  1 file changed, 9 insertions(+), 4 deletions(-)
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index 0a121aab5c74..5ff53a3b9851 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -4910,7 +4910,7 @@ int amdgpu_device_suspend(struct drm_device *dev, bool notify_clients)
>>  		amdgpu_virt_fini_data_exchange(adev);
>>  		r = amdgpu_virt_request_full_gpu(adev, false);
>>  		if (r)
>> -			return r;
>> +			goto error_out;
>>  	}
>>    	if (amdgpu_acpi_smart_shift_update(dev, AMDGPU_SS_DEV_D3))
>> @@ -4930,7 +4930,7 @@ int amdgpu_device_suspend(struct drm_device *dev, bool notify_clients)
>>    	r = amdgpu_device_evict_resources(adev);
>>  	if (r)
>> -		return r;
>> +		goto error_out;
>>    	amdgpu_ttm_set_buffer_funcs_status(adev, false);
>>  @@ -4943,9 +4943,12 @@ int amdgpu_device_suspend(struct drm_device *dev, bool notify_clients)
>>    	r = amdgpu_dpm_notify_rlc_state(adev, false);
>>  	if (r)
>> -		return r;
>> +		goto error_out;
> 
> For the changes to suspend, I would worry that this isn't really "unwinding" the suspend.  I think that if you just "abort" and say in_suspend = false this is going to leave some IP in an inconsistent state depending on where you failed.  For example let's (hypothetically) say this is a dGPU that you failed to evict resources.
> 
> You would have still called amdgpu_device_ip_suspend_phase1() which means that all the valid blocks would have had amdgpu_ip_block_suspend() called.  This calls an individual block's suspend() callback but not it's resume().
> 
> The failure code for amdgpu_device_suspend() would bubble up to the pmops handlers.  I think if it's called from runtime pm (amdgpu_pmops_runtime_suspend) then the runtime support is no longer functional.  If it's from amdgpu_pmops_suspend() or amdgpu_pmops_freeze() though I don't think anyone calls asic reset.
> 
> To change that flag I think you either need to unwind everything or we need a recovery path in the callers to do amdgpu_asic_reset().
Hi Mario,
	You are right. My fix is just hiding the warning instead of solving the issue.
I have tried to handle the failure at amdgpu_pmops_suspend() by calling amdgpu_asic_reset(). But that doesn’t really fix the issue, it will cause following `amd-smi` hang forever with many kernel warnings.
So seem I can’t find a reasonable way to handle the situation and have to drop this patch:(


> 
>>    	return 0;
>> +error_out:
>> +	adev->in_suspend = false;
>> +	return r;
>>  }
>>    /**
>> @@ -5007,8 +5010,10 @@ int amdgpu_device_resume(struct drm_device *dev, bool notify_clients)
>>  		amdgpu_virt_release_full_gpu(adev, true);
>>  	}
>>  -	if (r)
>> +	if (r) {
>> +		adev->in_suspend = false;
>>  		return r;
>> +	}
>>    	/* Make sure IB tests flushed */
>>  	flush_delayed_work(&adev->delayed_init_work);
> 
> With the resume path is the failure to resume a real problem?  Or is that just hypothetical?
> 
> Similarly; I don't think you can just say you're not in suspend anymore.  If this is just hypothetical I think this chunk should be dropped.
> 
> If it's real, again maybe a better recovery is a reset.