On 10/2/2024, 6:29:59 AM, "Linux regression tracking (Thorsten Leemhuis)" wrote:
> [CCing Richard, who apparently faces the same problem according to a
> recent comment in the bugzilla ticket mentioned earlier:
> https://bugzilla.kernel.org/show_bug.cgi?id=219331#c8
>
> CCing Mario, who might be interested in this and is a good contact when
> it comes to issues with AMD stuff like this.
>
> CCing the Btrfs list as JFYI, as all three reporters afaics see Btrfs
> misbehavior or corruptions due to this.
>
> Considered to bring Linus in, but decided to wait a bit before doing so.]

This patch from Basavaraj Natikar seems to solve the issue for me:

https://lore.kernel.org/linux-input/20241003160454.3017229-1-Basavaraj.Natikar@xxxxxxx/

Tested-by: Chris Hixon <linux-kernel-bugs@xxxxxxxxxxxxx>

My original report:

https://lore.kernel.org/all/3b129b1f-8636-456a-80b4-0f6cce0eef63@xxxxxxxxxxxxx/

Reported-by: Chris Hixon <linux-kernel-bugs@xxxxxxxxxxxxx>

Thanks!

>
> On 01.10.24 23:40, Chris Hixon wrote:
>> On 10/1/2024, 12:56:49 PM, "Linux regression tracking (Thorsten Leemhuis)" wrote:
>
>>> Basavaraj Natikar, I noticed a report about a regression in
>>> bugzilla.kernel.org that appears to be caused by a change of yours:
>>>
>>> 2105e8e00da467 ("HID: amd_sfh: Improve boot time when SFH is available")
>>> [v6.9-rc1]
>>>
>>> As many (most?) kernel developers don't keep an eye on the bug tracker,
>>> I decided to write this mail. To quote from
>>> https://bugzilla.kernel.org/show_bug.cgi?id=219331 :
>>>
>>>> I am getting bad page map errors on kernel version 6.9 or newer.
>>>> They always appear within a few minutes of the system being on, if
>>>> not immediately upon booting. My system is a Dell Inspiron 7405.
>> [...]
>>>> [ 23.234632] systemd-journald[611]: File /var/log/journal/a4e3170bc5be4f52a2080fb7b9f93cf0/user-1000.journal corrupted or uncleanly shut down, renaming and replacing.
>>>> [ 23.580724] rfkill: input handler enabled
>>>> [ 25.652067] rfkill: input handler disabled
>>
>>>> [ 34.222362] pcie_mp2_amd 0000:03:00.7: Failed to discover, sensors not enabled is 0
>>>> [ 34.222379] pcie_mp2_amd 0000:03:00.7: amd_sfh_hid_client_init failed err -95
>>
>> No sensors detected - do we all have that in common?
>
> Skyler, Richard?
>
>>>> [...]
>>> See the ticket for more details and the bisection result. Skyler, the
>>> reporter (CCed), later also added:
>>>
>>>> Occasionally I will not get the usual bad page map error, but
>>>> instead some BTRFS errors followed by the file system going read-only.
>>>
>>> Note, we had an earlier regression caused by this change reported by
>>> Chris Hixon that maybe was not solved completely:
>>> https://lore.kernel.org/all/3b129b1f-8636-456a-80b4-0f6cce0eef63@xxxxxxxxxxxxx/
>>
>> This looks like the same issue I reported.
>
> And sounds a lot like what Richard sees, who also sees disk corruption
> with Btrfs (see https://bugzilla.redhat.com/show_bug.cgi?id=2314331 ).
>
>>> Chris Hixon: do you still encounter errors, or was your issue
>>> resolved/vanished somehow?
>>
>> I still encounter errors with every kernel/patch I've tested. I've blacklisted
>> the amd_sfh module as a workaround, but when the module is inserted, a crash
>> similar to those reported will happen soon after the (45 second?)
>> detection/initialization timeout. It seems to affect whatever part of the
>> kernel next becomes active. I've had disk corruption as well, when BTRFS is
>> affected by the memory corruption,
>
> Skyler, did you see btrfs disk corruption as well, just like Chris and
> Richard did?
>
>> so I've ended up testing on a USB stick I
>> can reformat if necessary. I haven't tested new patches/kernels in a while
>> though. I'll get back to you after I've tried the latest mainline. Also note
>> that I've tried Fedora Rawhide's debug kernel,
>
> From what I see it seems all three of you are using Fedora.
> Wonder if that is a coincidence.

Note: I don't think it's a Fedora issue. I've had the problem on multiple
distros, with any kernel >= 6.9 - anything with the "bad" commit.

>> which has a ton of debugging
>> options including KASAN, but nothing seems to point the finger at something
>> originating in amd_sfh code. Is it possible the hardware itself (the mp2/sfh
>> chip) is corrupting memory somehow after some misstep in
>> initialization/de-initialization? Also if you look at my report, you'll see I
>> have no devices/sensors detected by amd_sfh - I wonder if other reporters all
>> have this in common? (noted in dmesg output above from another user)
>
> Given that Basavaraj Natikar never really addressed Chris' earlier report
> from months ago and the severeness of the problem I'd wonder if we
> should revert the culprit to resolve this quickly, unless some proper
> fix comes into sight soon. Sadly from a quick look that would require
> multiple reverts afaics. :-/
>
> Ciao, Thorsten