Re: drm/amdgpu: AMDGPU unusable since 6.12.1 and it looks like no one cares.

On Tue, Jan 21, 2025 at 12:50 AM Deucher, Alexander
<Alexander.Deucher@xxxxxxx> wrote:
>
> [Public]
>
> > -----Original Message-----
> > From: amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> On Behalf Of Pavel
> > Nikulin
> > Sent: Sunday, January 19, 2025 2:29 PM
> > To: Alex Deucher <alexdeucher@xxxxxxxxx>
> > Cc: amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> > Subject: Re: drm/amdgpu: AMDGPU unusable since 6.12.1 and it looks like no one
> > cares.
> >
> > On Sun, Jan 19, 2025 at 5:53 PM Pavel Nikulin <pavel@xxxxxxxxxxxx> wrote:
> > >
> > > On Fri, Jan 17, 2025 at 6:08 PM Alex Deucher <alexdeucher@xxxxxxxxx> wrote:
> > > >
> > > > On Fri, Jan 17, 2025 at 7:27 AM Pavel Nikulin <pavel@xxxxxxxxxxxx> wrote:
> > > > >
> > > > > I think it persists as of 6.12.9 and today's firmware version from git.
> > > > >
> > > > > Hardware Asus um560.6
> > > > >
> > > > > > It only happens when the AC adaptor is disconnected and the
> > > > > > screen refresh frequency is set to 120 Hz. It does not happen at
> > > > > > any other refresh frequency, or when the charger is connected.
> > > > > >
> > > > > > It might be happening in Windows too, but at a much lower rate, about
> > > > > > once a month. The Windows version might be applying some mitigations.
> > > > >
> > > > > Trying to catch what may be a prelude to hang never worked. It's
> > > > > just instahang, without panic, or anything. I cannot debug it
> > > > > without JTAGing the CPU, for which I have no equipment, nor am I
> > > > > sure if there are even JTAG headers exposed on the laptop motherboard.
> > > >
> > > > Please file a bug report and attach your dmesg output.
> > > > https://gitlab.freedesktop.org/drm/amd/-/issues
> > > >
> > > > Alex
> > >
> > > > Unfortunately, my dmesg would be the same as anyone else's.
> > > > However, I have made the following observations:
> > >
> > > Disabling PSR with debug mask makes it stable.
> > >
> > > > If I set the refresh frequency to 60 Hz, the LPDDR memory clocks wiggle
> > > > around 600 MHz, and keep going back and forth (spread spectrum
> > > > working).
> > > >
> > > > If I switch to any other frequency, they stay steady at 937 MHz (spread
> > > > spectrum stops working), and hangs happen.
> > >
> > > If I disconnect antennas from the MT7925 WiFi module, the issues are
> > > gone (as well as the wifi connectivity.)
> > >
> > > If I RFKILL the mt7925, both wifi, and bluetooth, it may still hang.
> > >
> > > If I nevertheless try to connect by putting the open laptop right next
> > > to the access point, the laptop will hang.
> > >
> > > But if I only try to do the same with 2.4GHz bluetooth mouse, it will
> > > continue to work. If I connect to 2.4GHz wifi, it will still hang
> > > after a few minutes.
> > >
> > > > If I use the RTL8156BG-based Type-C USB dongle and disconnect the
> > > > power, it works stably. If I keep the connection going on the Type-C
> > > > dongle, but switch on wifi and set it as the default route, everything
> > > > works stably, regardless of whether I connect to 5GHz or 2.4GHz wifi.
> > > >
> > > > Putting grounding tape around the DP cables and around the wifi
> > > > module did not do anything conclusive.
> > >
> > > If I try to manually set the GPU performance to high, it marginally
> > > improves the hanging rate.
> > >
> > > DP 2.0, and 2.1 works on 600MHz, 1.4 on 300MHz, 1.2 on 150MHz
> > > depending on link speed, which I can't measure
> > >
> > > > So, here is what I think may have happened during the transition from
> > > > 6.11 to 6.12:
> > >
> > > - Something PCIE related (ASPM, other PCIE frequency/power settings)
> > > - Something PSR related (PSR raises memory clock rate, disables spread
> > > spectrum)
> > > - Something power related (undervoltage happens when type-C port, or
> > > power is not plugged in)
> > > - Something RF related (rendered less likely by it keeping working
> > > with type-C ethernet dongle plugged in, but not active)
> > >
> > > > My guess is that it is an interplay between the PCIE and PSR settings.
> > > > Less likely, a hardware problem.
> > > >
> > > > I do remember that someone with a similar bug traced the breakage to
> > > > a PCIE related commit.
> > >
> > > Do you want me to still put all of the above into a bug ticket on gitlab?
> >
> > > The following kernel command line parameters stabilise the system:
> > >
> > pcie_aspm=off
> > amdgpu_debugmask=0x200
> > amdgpu_debugmask=0x10
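A note on the two mask values above: they are listed as two separate options, and when an integer option is repeated on the kernel command line, the later value usually replaces the earlier one. If both bits are meant to be active at once, they can be OR-ed into a single value. The sketch below only does that arithmetic. The option name is copied from the message as written; amdgpu's display debug mask is normally spelled amdgpu.dcdebugmask, so the spelling should be checked against the running kernel. The helper names are made up for illustration.

```python
def combine_masks(*masks):
    """OR several bitmask values into one integer."""
    combined = 0
    for mask in masks:
        combined |= mask
    return combined


def cmdline_option(name, *masks):
    """Format a single 'name=0x...' option from several mask values.

    'name' is whatever option spelling the kernel in use accepts;
    this helper does not validate it.
    """
    return f"{name}={combine_masks(*masks):#x}"


# The two values quoted in the thread, merged into one option.
print(cmdline_option("amdgpu_debugmask", 0x200, 0x10))
```

With the two values from the thread this prints a single option carrying 0x210, which could then be placed next to pcie_aspm=off.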
>
> There were a bunch of PSR related fixes that went into 6.13 (and cc'ed stable, so should eventually make their way to 6.12) last week.  Can you try an updated 6.13 kernel without those debug options?
>
> Alex
>

I am running git 9bffa1ad25b8b3b95d8f463e5c24dabe3c87d54d. Does
anyone here have a recent Ryzen based ASUS laptop, and hardware debug
gear?
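
For anyone who wants to compare the memory clock behaviour quoted above (wiggling around 600 MHz at 60 Hz, pinned at 937 MHz otherwise) without extra hardware, here is a minimal sketch that picks the active level out of amdgpu's pp_dpm_mclk sysfs file. It assumes the usual layout of one level per line with the active one marked by "*". The card0 path and the sample text are assumptions for illustration, not output captured from the affected laptop.

```python
import re

# Assumed location; the card index may differ on other systems.
MCLK_PATH = "/sys/class/drm/card0/device/pp_dpm_mclk"

# Matches lines such as "2: 937Mhz *" and captures the MHz value.
_ACTIVE_LEVEL = re.compile(r"\s*\d+:\s*(\d+)\s*MHz\s*\*", re.IGNORECASE)


def active_mclk_mhz(text):
    """Return the MHz value of the level marked with '*', or None."""
    for line in text.splitlines():
        match = _ACTIVE_LEVEL.match(line)
        if match:
            return int(match.group(1))
    return None


def read_active_mclk(path=MCLK_PATH):
    """Read the sysfs file and return the active memory clock in MHz."""
    with open(path, "r", encoding="ascii") as handle:
        return active_mclk_mhz(handle.read())


# Made-up sample in the assumed format, used only to exercise the parser.
SAMPLE = "0: 400Mhz\n1: 600Mhz\n2: 937Mhz *\n"
print(active_mclk_mhz(SAMPLE))
```

Calling read_active_mclk() once a second while switching refresh rates, on battery and on AC, would give a log that can be attached to a bug report.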



