On Wed, Nov 29, 2023 at 1:02 PM Mario Limonciello <mario.limonciello@xxxxxxx> wrote: > > On 11/29/2023 02:51, Ma Jun wrote: > > Some platforms can't resume from d3cold state, So add a > > new module parameter to disable d3cold state for debugging > > purpose or workaround. > > > > Signed-off-by: Ma Jun <Jun.Ma2@xxxxxxx> > > --- > > This patch is essentially an 'amdgpu knob' for d3cold on the root port. > At least for debugging purposes we also have a sysfs file > 'd3cold_allowed' that will enact the same behavior. > > I do have a patch that I proposed to PCI core that stops d3cold_allowed > from working in favor of requesting pcie_port_pm=off to be used instead > for debugging purposes. > > However that's a 'relatively big' debugging knob however as it will > apply to all PCIe root ports. > > Considering above I'm in favor of this being available as a localized > debugging path for just the root port the dGPU is connected to. What functionality does this option provide that runpm=0 does not? AFAIK, the pci core should not enter d3cold at runtime if the driver doesn't call pm_runtime_allow(). Alex > > Some comments below though: > > > drivers/gpu/drm/amd/amdgpu/amdgpu.h | 1 + > > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 7 +++++++ > > drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 8 ++++++++ > > 3 files changed, 16 insertions(+) > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h > > index a9f54df9d33e..db9f60790267 100644 > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h > > @@ -166,6 +166,7 @@ extern char amdgpu_lockup_timeout[AMDGPU_MAX_TIMEOUT_PARAM_LENGTH]; > > extern int amdgpu_dpm; > > extern int amdgpu_fw_load_type; > > extern int amdgpu_aspm; > > +extern int amdgpu_d3cold; > > extern int amdgpu_runtime_pm; > > extern uint amdgpu_ip_block_mask; > > extern int amdgpu_bapm; > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > index 22b6a910b7f2..90501c44e7d0 100644 > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > @@ -264,6 +264,13 @@ bool amdgpu_device_supports_px(struct drm_device *dev) > > bool amdgpu_device_supports_boco(struct drm_device *dev) > > { > > struct amdgpu_device *adev = drm_to_adev(dev); > > + struct pci_dev *parent; > > + > > + if (!amdgpu_d3cold) { > > + parent = pcie_find_root_port(adev->pdev); > > + pci_d3cold_disable(parent); > > + return false; > > + } > > > > if (adev->has_pr3 || > > ((adev->flags & AMD_IS_PX) && amdgpu_is_atpx_hybrid())) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c > > index 5f14f04cb553..c9fbb8bd4169 100644 > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c > > @@ -145,6 +145,7 @@ char amdgpu_lockup_timeout[AMDGPU_MAX_TIMEOUT_PARAM_LENGTH]; > > int amdgpu_dpm = -1; > > int amdgpu_fw_load_type = -1; > > int amdgpu_aspm = -1; > > +int amdgpu_d3cold = -1; > > If this was chained to a larger workaround (such as automatically > applying to a DMI quirk) it would make sense as int and with using > -1 for auto. However there is a pretty dramatic downside for using this > knob that it can break s2idle. > > In my testing I've found that the following happens on an A+A design > after s2idle with this parameter in use. > > [ 70.572270] pcieport 0000:01:00.0: Unable to change power state from > D3cold to D0, device inaccessible > [ 70.572481] pcieport 0000:02:00.0: Unable to change power state from > D3cold to D0, device inaccessible > [ 72.855769] amdgpu 0000:03:00.0: not ready 1023ms after resume; waiting > [ 73.943545] amdgpu 0000:03:00.0: not ready 2047ms after resume; waiting > [ 76.055602] amdgpu 0000:03:00.0: not ready 4095ms after resume; waiting > [ 80.279550] amdgpu 0000:03:00.0: not ready 8191ms after resume; waiting > [ 88.983562] amdgpu 0000:03:00.0: not ready 16383ms after resume; waiting > [ 105.879581] amdgpu 0000:03:00.0: not ready 32767ms after resume; waiting > [ 142.743646] amdgpu 0000:03:00.0: not ready 65535ms after resume; > giving up > [ 142.743793] amdgpu 0000:03:00.0: Unable to change power state from > D3cold to D0, device inaccessible > [ 142.804011] snd_hda_intel 0000:03:00.1: Unable to change power state > from D3cold to D0, device inaccessible > > So I don't see us ever automatically using this and it should be > debugging only. IOW this doesn't need to be integer; it can be boolean. > > > int amdgpu_runtime_pm = -1; > > uint amdgpu_ip_block_mask = 0xffffffff; > > int amdgpu_bapm = -1; > > @@ -359,6 +360,13 @@ module_param_named(fw_load_type, amdgpu_fw_load_type, int, 0444); > > MODULE_PARM_DESC(aspm, "ASPM support (1 = enable, 0 = disable, -1 = auto)"); > > module_param_named(aspm, amdgpu_aspm, int, 0444); > > > > +/** > > + * DOC: d3cold (int) > > If you flip it to boolean as I suggested this should probably either > rename to disable_d3cold or you should default to TRUE. > > > + * To disable d3cold (1 = enable, 0 = disable). The default is -1 (auto, enabled). > > + */ > > +MODULE_PARM_DESC(d3cold, "d3cold support (1 = enable, 0 = disable, -1 = auto)"); > > +module_param_named(d3cold, amdgpu_d3cold, int, 0444); > > + > > /** > > * DOC: runpm (int) > > * Override for runtime power management control for dGPUs. The amdgpu driver can dynamically power down >