RE: [PATCHv4] drm/amdgpu: disable ASPM on Intel Alder Lake based systems

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



[Public]



> -----Original Message-----
> From: Alex Deucher <alexdeucher@xxxxxxxxx>
> Sent: Wednesday, April 20, 2022 16:14
> To: Paul Menzel <pmenzel@xxxxxxxxxxxxx>
> Cc: Gong, Richard <Richard.Gong@xxxxxxx>; Dave Airlie <airlied@xxxxxxxx>;
> Pan, Xinhui <Xinhui.Pan@xxxxxxx>; LKML <linux-kernel@xxxxxxxxxxxxxxx>;
> Maling list - DRI developers <dri-devel@xxxxxxxxxxxxxxxxxxxxx>; amd-gfx list
> <amd-gfx@xxxxxxxxxxxxxxxxxxxxx>; Daniel Vetter <daniel@xxxxxxxx>; Deucher,
> Alexander <Alexander.Deucher@xxxxxxx>; Koenig, Christian
> <Christian.Koenig@xxxxxxx>; Limonciello, Mario
> <Mario.Limonciello@xxxxxxx>
> Subject: Re: [PATCHv4] drm/amdgpu: disable ASPM on Intel Alder Lake based
> systems
> 
> On Wed, Apr 20, 2022 at 5:02 PM Paul Menzel <pmenzel@xxxxxxxxxxxxx>
> wrote:
> >
> > Dear Richard,
> >
> >
> > Am 20.04.22 um 22:56 schrieb Gong, Richard:
> >
> > > On 4/20/2022 3:48 PM, Paul Menzel wrote:
> >
> > >> Am 20.04.22 um 22:40 schrieb Alex Deucher:
> > >>> On Wed, Apr 20, 2022 at 4:29 PM Paul Menzel
> <pmenzel@xxxxxxxxxxxxx>
> > >>> wrote:
> > >>
> > >>>> Am 19.04.22 um 23:46 schrieb Gong, Richard:
> > >>>>
> > >>>>> On 4/14/2022 2:52 AM, Paul Menzel wrote:
> > >>>>>> [Cc: -kernel test robot <lkp@xxxxxxxxx>]
> > >>>>
> > >>>> [...]
> > >>>>
> > >>>>>> Am 13.04.22 um 15:00 schrieb Alex Deucher:
> > >>>>>>> On Wed, Apr 13, 2022 at 3:43 AM Paul Menzel wrote:
> > >>>>>>
> > >>>>>>>> Thank you for sending out v4.
> > >>>>>>>>
> > >>>>>>>> Am 12.04.22 um 23:50 schrieb Richard Gong:
> > >>>>>>>>> Active State Power Management (ASPM) feature is enabled since
> > >>>>>>>>> kernel 5.14.
> > >>>>>>>>> There are some AMD GFX cards (such as WX3200 and RX640) that
> won't
> > >>>>>>>>> work
> > >>>>>>>>> with ASPM-enabled Intel Alder Lake based systems. Using these
> GFX
> > >>>>>>>>> cards as
> > >>>>>>>>> video/display output, Intel Alder Lake based systems will hang
> > >>>>>>>>> during
> > >>>>>>>>> suspend/resume.
> > >>>>
> > >>>> [Your email program wraps lines in cited text for some reason, making
> > >>>> the citation harder to read.]
> > >>>>
> > >>>>>>>>
> > >>>>>>>> I am still not clear, what "hang during suspend/resume" means. I
> > >>>>>>>> guess
> > >>>>>>>> suspending works fine? During resume (S3 or S0ix?), where does
> > >>>>>>>> it hang?
> > >>>>>>>> The system is functional, but there are only display problems?
> > >>>>> System freeze after suspend/resume.
> > >>>>
> > >>>> But you see certain messages still? At what point does it freeze
> > >>>> exactly? In the bug report you posted Linux messages.
> > >>>>
> > >>>>>>>>> The issue was initially reported on one system (Dell Precision
> > >>>>>>>>> 3660
> > >>>>>>>>> with
> > >>>>>>>>> BIOS version 0.14.81), but was later confirmed to affect at
> > >>>>>>>>> least 4
> > >>>>>>>>> Alder
> > >>>>>>>>> Lake based systems.
> > >>>>>>>>>
> > >>>>>>>>> Add extra check to disable ASPM on Intel Alder Lake based
> systems.
> > >>>>>>>>>
> > >>>>>>>>> Fixes: 0064b0ce85bb ("drm/amd/pm: enable ASPM by default")
> > >>>>>>>>> Link:
> > >>>>>>>>>
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgitlab.fr
> eedesktop.org%2Fdrm%2Famd%2F-
> %2Fissues%2F1885&amp;data=05%7C01%7Cmario.limonciello%40amd.com%7
> Ce74863210c324bc6fda608da2312b506%7C3dd8961fe4884e608e11a82d994e1
> 83d%7C0%7C0%7C637860860514174025%7CUnknown%7CTWFpbGZsb3d8eyJ
> WIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C
> 3000%7C%7C%7C&amp;sdata=NUGXlybuH3volccVuN%2BGQ0kXwsOfCqM%2F
> wqHL6%2F%2FGYUc%3D&amp;reserved=0
> > >>>>>>>>>
> > >>>>
> > >>>> Thank you Microsoft Outlook for keeping us safe. :(
> > >>>>
> > >>>>>>>>>
> > >>>>>>>>> Reported-by: kernel test robot <lkp@xxxxxxxxx>
> > >>>>>>>>
> > >>>>>>>> This tag is a little confusing. Maybe clarify that it was for an
> > >>>>>>>> issue
> > >>>>>>>> in a previous patch iteration?
> > >>>>>
> > >>>>> I did describe in change-list version 3 below, which corrected the
> > >>>>> build
> > >>>>> error with W=1 option.
> > >>>>>
> > >>>>> It is not good idea to add the description for that to the commit
> > >>>>> message, this is why I add descriptions on change-list version 3.
> > >>>>
> > >>>> Do as you wish, but the current style is confusing, and readers of the
> > >>>> commit are going to think, the kernel test robot reported the problem
> > >>>> with AMD VI ASICs and Intel Alder Lake systems.
> > >>>>
> > >>>>>>>>
> > >>>>>>>>> Signed-off-by: Richard Gong <richard.gong@xxxxxxx>
> > >>>>>>>>> ---
> > >>>>>>>>> v4: s/CONFIG_X86_64/CONFIG_X86
> > >>>>>>>>>        enhanced check logic
> > >>>>>>>>> v3: s/intel_core_asom_chk/aspm_support_quirk_check
> > >>>>>>>>>        correct build error with W=1 option
> > >>>>>>>>> v2: correct commit description
> > >>>>>>>>>        move the check from chip family to problematic platform
> > >>>>>>>>> ---
> > >>>>>>>>>     drivers/gpu/drm/amd/amdgpu/vi.c | 17 ++++++++++++++++-
> > >>>>>>>>>     1 file changed, 16 insertions(+), 1 deletion(-)
> > >>>>>>>>>
> > >>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/vi.c
> > >>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/vi.c
> > >>>>>>>>> index 039b90cdc3bc..b33e0a9bee65 100644
> > >>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/vi.c
> > >>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/vi.c
> > >>>>>>>>> @@ -81,6 +81,10 @@
> > >>>>>>>>>     #include "mxgpu_vi.h"
> > >>>>>>>>>     #include "amdgpu_dm.h"
> > >>>>>>>>>
> > >>>>>>>>> +#if IS_ENABLED(CONFIG_X86)
> > >>>>>>>>> +#include <asm/intel-family.h>
> > >>>>>>>>> +#endif
> > >>>>>>>>> +
> > >>>>>>>>>     #define ixPCIE_LC_L1_PM_SUBSTATE    0x100100C6
> > >>>>>>>>>     #define
> > >>>>>>>>>
> PCIE_LC_L1_PM_SUBSTATE__LC_L1_SUBSTATES_OVERRIDE_EN_MASK
> > >>>>>>>>> 0x00000001L
> > >>>>>>>>>     #define
> PCIE_LC_L1_PM_SUBSTATE__LC_PCI_PM_L1_2_OVERRIDE_MASK
> > >>>>>>>>> 0x00000002L
> > >>>>>>>>> @@ -1134,13 +1138,24 @@ static void vi_enable_aspm(struct
> > >>>>>>>>> amdgpu_device *adev)
> > >>>>>>>>>                 WREG32_PCIE(ixPCIE_LC_CNTL, data);
> > >>>>>>>>>     }
> > >>>>>>>>>
> > >>>>>>>>> +static bool aspm_support_quirk_check(void)
> > >>>>>>>>> +{
> > >>>>>>>>> +     if (IS_ENABLED(CONFIG_X86)) {
> > >>>>>>>>> +             struct cpuinfo_x86 *c = &cpu_data(0);
> > >>>>>>>>> +
> > >>>>>>>>> +             return !(c->x86 == 6 && c->x86_model ==
> > >>>>>>>>> INTEL_FAM6_ALDERLAKE);
> > >>>>>>>>> +     }
> > >>>>>>>>> +
> > >>>>>>>>> +     return true;
> > >>>>>>>>> +}
> > >>>>>>>>> +
> > >>>>>>>>>     static void vi_program_aspm(struct amdgpu_device *adev)
> > >>>>>>>>>     {
> > >>>>>>>>>         u32 data, data1, orig;
> > >>>>>>>>>         bool bL1SS = false;
> > >>>>>>>>>         bool bClkReqSupport = true;
> > >>>>>>>>>
> > >>>>>>>>> -     if (!amdgpu_device_should_use_aspm(adev))
> > >>>>>>>>> +     if (!amdgpu_device_should_use_aspm(adev) ||
> > >>>>>>>>> !aspm_support_quirk_check())
> > >>>>>>>>>                 return;
> > >>>>>>>>
> > >>>>>>>> Can users still forcefully enable ASPM with the parameter
> > >>>>>>>> `amdgpu.aspm`?
> > >>>>>>>>
> > >>>>> As Mario mentioned in a separate reply, we can't forcefully enable
> > >>>>> ASPM
> > >>>>> with the parameter 'amdgpu.aspm'.
> > >>>>
> > >>>> That would be a regression on systems where ASPM used to work. Hmm.
> I
> > >>>> guess, you could say, there are no such systems.
> > >>>>
> > >>>>>>>>>
> > >>>>>>>>>         if (adev->flags & AMD_IS_APU ||
> > >>>>>>>>
> > >>>>>>>> If I remember correctly, there were also newer cards, where ASPM
> > >>>>>>>> worked
> > >>>>>>>> with Intel Alder Lake, right? Can only the problematic
> > >>>>>>>> generations for
> > >>>>>>>> WX3200 and RX640 be excluded from ASPM?
> > >>>>>>>
> > >>>>>>> This patch only disables it for the generatioaon that was
> > >>>>>>> problematic.
> > >>>>>>
> > >>>>>> Could that please be made clear in the commit message summary, and
> > >>>>>> message?
> > >>>>>
> > >>>>> Are you ok with the commit messages below?
> > >>>>
> > >>>> Please change the commit message summary. Maybe:
> > >>>>
> > >>>> drm/amdgpu: VI: Disable ASPM on Intel Alder Lake based systems
> > >>>>
> > >>>>> Active State Power Management (ASPM) feature is enabled since
> > >>>>> kernel 5.14.
> > >>>>>
> > >>>>> There are some AMD GFX cards (such as WX3200 and RX640) that won't
> > >>>>> work
> > >>>>> with ASPM-enabled Intel Alder Lake based systems. Using these GFX
> > >>>>> cards as
> > >>>>> video/display output, Intel Alder Lake based systems will freeze after
> > >>>>> suspend/resume.
> > >>>>
> > >>>> Something like:
> > >>>>
> > >>>> On Intel Alder Lake based systems using ASPM with AMD GFX Volcanic
> > >>>> Islands (VI) cards, like WX3200 and RX640, graphics don't initialize
> > >>>> when resuming from S0ix(?).
> > >>>>
> > >>>>
> > >>>>> The issue was initially reported on one system (Dell Precision 3660
> > >>>>> with
> > >>>>> BIOS version 0.14.81), but was later confirmed to affect at least 4
> > >>>>> Alder
> > >>>>> Lake based systems.
> > >>>>
> > >>>> Which ones?
> > >>>>
> > >>>>> Add extra check to disable ASPM on Intel Alder Lake based systems with
> > >>>>> problematic generation GFX cards.
> > >>>>
> > >>>> ... with the problematic Volcanic Islands GFX cards.
> > >>>>
> > >>>>>>
> > >>>>>> Loosely related, is there a public (or internal issue) to analyze how
> > >>>>>> to get ASPM working for VI generation devices with Intel Alder Lake?
> > >>>>>
> > >>>>> As Alex mentioned, we need support from Intel. We don't have any
> > >>>>> update
> > >>>>> on that.
> > >>>>
> > >>>> It'd be great to get that fixed properly.
> > >>>>
> > >>>> Last thing, please don't hate me, does Linux log, that ASPM is
> > >>>> disabled?
> > >>>
> > >>> I'm not sure what gets logged at the platform level with respect to
> > >>> ASPM, but whether or not the driver enables ASPM is tied to whether
> > >>> ASPM is allowed at the platform level or not so if the platform
> > >>> indicates that ASPM is not supported, the driver won't enable it.  The
> > >>> driver does not log whether ASPM is enabled or not if that is what you
> > >>> are asking.  As to whether or not it should, it comes down to how much
> > >>> stuff is worth indiciating in the log.  The driver is already pretty
> > >>> chatty by driver standards.
> > >>
> > >> I specifically mean, Linux should log the quirks it applies. (As a
> > >> normal user, I'd also expect ASPM to work nowadays, so a message, that
> > >> it's disabled would help a lot.)
> > >
> > > In general rule we shouldn't generate additional log unless something
> > > went wrong with the system.
> >
> > Please run `dmesg` and see that your statement is false. That's what log
> > levels are for, and in your case, it would be at least error level.
> > Also, I claim, something indeed went wrong, because a quirk had to be
> > applied. So please add a notice log level, that ASPM gets disabled:
> >
> > Disable ASPM on Alder Lake with Volcanic Islands card due to resume
> > problems. System energy consumption might be higher than expected.
> 
> ASPM does not save that much power.  I doubt you could really measure
> it effectively without dedicated equipment.  Adding too many of these
> types of messages just leads to lots of useless bug reports.  Users
> see the message and file bugs.

IMO warn and error level definitely lead to bug reports.  I've seen plenty of
these filed even from Paul that the levels are wrong.

*If* there was a message added it should be info or notice.




[Index of Archives]     [Linux DRI Users]     [Linux Intel Graphics]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [XFree86]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Linux Kernel]     [Linux SCSI]     [XFree86]
  Powered by Linux