Re: [PATCH] drm/amdgpu: disable cstate properly for driver reloading scenario

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 





On 3/2/2023 8:56 PM, Deucher, Alexander wrote:
[AMD Official Use Only - General]

-----Original Message-----
From: Quan, Evan <Evan.Quan@xxxxxxx>
Sent: Thursday, March 2, 2023 4:31 AM
To: Lazar, Lijo <Lijo.Lazar@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx;
Deucher, Alexander <Alexander.Deucher@xxxxxxx>
Subject: RE: [PATCH] drm/amdgpu: disable cstate properly for driver
reloading scenario

[AMD Official Use Only - General]



-----Original Message-----
From: Lazar, Lijo <Lijo.Lazar@xxxxxxx>
Sent: Thursday, March 2, 2023 5:21 PM
To: Quan, Evan <Evan.Quan@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>
Subject: Re: [PATCH] drm/amdgpu: disable cstate properly for driver
reloading scenario



On 3/2/2023 2:43 PM, Quan, Evan wrote:
[AMD Official Use Only - General]



-----Original Message-----
From: Lazar, Lijo <Lijo.Lazar@xxxxxxx>
Sent: Thursday, March 2, 2023 4:28 PM
To: Quan, Evan <Evan.Quan@xxxxxxx>; amd-
gfx@xxxxxxxxxxxxxxxxxxxxx
Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>
Subject: Re: [PATCH] drm/amdgpu: disable cstate properly for driver
reloading scenario



On 3/2/2023 12:28 PM, Evan Quan wrote:
Gpu reset might be needed during driver reloading. To guard
that(gpu
reset) work, df cstate needs to be disabled properly.

Signed-off-by: Evan Quan <evan.quan@xxxxxxx>
Change-Id: I5c074c265c0b08a67b6934ae1ad9aa3fed245461
---
    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 9 +++++++++
    1 file changed, 9 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 51bbeaa1f311..3c854461ef32 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2816,6 +2816,15 @@ static int
amdgpu_device_ip_fini_early(struct
amdgpu_device *adev)
    	amdgpu_device_set_pg_state(adev,
AMD_PG_STATE_UNGATE);
    	amdgpu_device_set_cg_state(adev,
AMD_CG_STATE_UNGATE);

+	/*
+	 * Get df cstate disabled properly on driver unloading.
+	 * Since on the succeeding driver reloading, gpu reset might
+	 * be required. And cstate disabled is a prerequisite for
+	 * that(gpu reset).
+	 */
+	if (amdgpu_dpm_set_df_cstate(adev, DF_CSTATE_DISALLOW))
+		dev_warn(adev->dev, "Failed to disallow df cstate");
+

This looks more like a firmware bug. Driver sends the Unload
message to
FW.
In that case FW should disable all features including C-state.
Driver does not send the Unload message. We want PMFM alive and
ready
for handling possible gpu reset on reloading.


Actually, soc21_need_reset_on_init code itself has a bug. PSP won't
get unloaded by default on ring destruction. Even if PSP stops, it
could just keep the heartbeat value as non-zero (just that it won't
increment).

Probably, that needs to be fixed first rather than keeping PMFW alive
for a reset.
As I remembered, the change(asic reset during reloading) seemed
introduced to address some sriov issues.
@Deucher, Alexander might share more backgrounds about this.
To be honest, I'm not a fan of this(perform asic reset during reloading).

I'm open to doing it a better way.  We did it for two reasons:
1. often times the device was left in a weird state after the driver unload/VM killed. Etc.  We needed a way to put the device into a known good state so the driver could re-initialize it.  Plus, IIRC, on some of the older ASICS, once the SMU or PSP firmware was loaded, there was no way to reload it without a reset so you needed one anyway.  This is largely why we have to reset for S4 as well.
2. Some large servers didn't power off PCI devices on reboots to save time.  This left the devices with whatever state they had before the system was rebooted which led to driver initialization problems on subsequent boots because the device was in an unknown state.

If there is a better way to handle these situations, I'm all for it.


Are those cases valid still? We have this for GFX9 - reset done only for pass through. And some of the GFX9 ASICs are used in large servers.

        /* Just return false for soc15 GPUs.  Reset does not seem to
         * be necessary.
         */
        if (!amdgpu_passthrough(adev))
                return false;

Thanks,
Lijo

Alex



Evan

Thanks,
Lijo

BR
Evan

Thanks,
Lijo

    	amdgpu_amdkfd_suspend(adev, false);

    	/* Workaroud for ASICs need to disable SMC first */



[Index of Archives]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux