On 7/17/2024 4:51 PM, Linux regression tracking (Thorsten Leemhuis) wrote:
> On 15.07.24 06:39, Chris Hixon wrote:
>> System: HP ENVY x360 Convertible 15-ds1xxx; AMD Ryzen 7 4700U with
>> Radeon Graphics
>>
>> Problem commits (introduced in v6.9-rc1):
>> 6296562f30b1 HID: amd_sfh: Extend MP2 register access to SFH
>> 2105e8e00da4 HID: amd_sfh: Improve boot time when SFH is available
>>
>> It appears amd_sfh commits 6296562f30b1 and 2105e8e00da4 correlate with
>> some form of memory/page corruption.
>
> Hi! From a quick search on lore it looks like Basavaraj Natikar, who
> authored those two commits, has been inactive for a few days. This is
> totally fine, but given the nature of the problem slightly unfortunate.
> That's why I'm trying to raise awareness of this report by adding the
> subsystem maintainers, a few lists, and a few people to the list of
> recipients that were involved in the submission of those two patches.
> With a bit of luck somebody might be able to help out.
>
> Ciao, Thorsten
>
>> On my system, this typically
>> presents itself as a page dump followed by BTRFS errors, usually
>> involving "corrupt leaf" (see dmesg output below); often the BTRFS
>> filesystem becomes read-only afterwards. Note that the underlying NVME
>> disk seems fine, and the BTRFS filesystem does not actually appear to be
>> corrupt when booted/checked from kernels without this bug (no BTRFS
>> errors or I/O errors reported on non-problem kernels).
>>
>> I have no problems when I blacklist the amd_sfh module (any kernel
>> version), or revert both commits 6296562f30b1 and 2105e8e00da4 (on
>> stable, linux-6.9.y). I have no problems on any recent linux-mainline
>> (v6.10{,-rc*}) when reverting these two commits (in addition to
>> reverting 7902ec988a9a and 6856f079cd45 to successfully build the
>> kernel). I have had no problems with any v6.6.y, v6.7.y, or v6.8.y version.
>>
>> It is curious BTRFS always seems involved, but problems go away with
>> these amd_sfh commits reverted (or amd_sfh disabled).
>>
>> Further notes:
>>
>> I have not specifically used the amd_sfh module for anything. As far as
>> I've been able to determine, my system has the "Sensor Fusion Hub" mp2
>> chip, but has no supported sensors/sub-devices (or I need to do
>> something to enable them, or there is an error while detecting
>> sensors?). All logs I've checked contain something like:
>>
>> Jul 09 04:14:37 arch kernel: pcie_mp2_amd 0000:04:00.7: enabling device
>> (0000 -> 0002)
>> Jul 09 04:15:07 arch kernel: pcie_mp2_amd 0000:04:00.7: Failed to
>> discover, sensors not enabled is 0
>> Jul 09 04:15:07 arch kernel: pcie_mp2_amd 0000:04:00.7:
>> amd_sfh_hid_client_init failed err -95
>>
>> Excerpt from lshw:
>> *-generic:1 UNCLAIMED
>> description: Signal processing controller
>> product: Sensor Fusion Hub
>> vendor: Advanced Micro Devices, Inc. [AMD]
>> physical id: 0.7
>> bus info: pci@0000:04:00.7
>> version: 00
>> width: 32 bits
>> clock: 33MHz
>> capabilities: pm pciexpress msi msix cap_list
>> configuration: latency=0
>> resources: memory:fe000000-fe0fffff
>> memory:fe4cc000-fe4cdfff

Could you please check with the latest version, including the patch below?
https://lore.kernel.org/all/20240718111616.3012155-1-Basavaraj.Natikar@xxxxxxx/

Thanks,
--
Basavaraj

>>
>> How I tracked down the problem commits:
>>
>> I was not able to successfully "git bisect" this bug - I seemed to run
>> into a mess of unrelated problems/errors that sent me down a rabbit hole
>> chasing who knows what.
>> I had already manually narrowed down the bug to
>> amd_sfh by blacklisting modules, so I reverted each
>> drivers/hid/amd-sfh-hid commit on the stable linux-6.9.y branch (v6.9.8
>> known "bad"), back to v6.6 (known "good"), and then manually bisected
>> the revert commits, landing on "HID: amd_sfh: Improve boot time when SFH
>> is available" (2105e8e00da4) as the first "bad" commit.
>>
>> I wanted to be able to test with only the "bad" commit(s) removed; it
>> turns out 6296562f30b1 ("HID: amd_sfh: Extend MP2 register access to
>> SFH") also needs to be reverted to do that. Everything seems fine with
>> these two commits reverted (again, this is on the stable linux-6.9.y
>> branch).
>>
>> When testing, "bad" commits usually quickly display some variation of
>> the page dump/BTRFS errors, similar to the dmesg output below. I
>> consider commits "good" if the system survives "stress-ng --all 2
>> --vm-bytes 50% --minimize --syslog --status 10 -t 5m" (run as a non-root
>> user), which was usually followed by building the next test kernel. The
>> "bad" commits often show errors before I even get to the stress test.
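
For anyone who wants to automate the good/bad check described above, here is
a minimal sketch in Python. It only looks for the message fragments that
appear in the dmesg excerpts of this report; the function names and the idea
of scanning the log automatically are assumptions for illustration, since
the testing in the report was done by hand. Quote markers and mail-client
line wrapping are flattened before matching, so it also works on logs pasted
from a mail.

```python
# Message fragments copied from the dmesg excerpts in this report.
# The list is illustrative, not an exhaustive set of BTRFS failure messages.
BAD_SIGNATURES = (
    "page dumped because: eb page dump",
    "corrupt leaf",
    "write time tree block corruption detected",
    "forced readonly",
)


def normalize(log_text):
    """Drop leading mail quote markers and join wrapped lines with spaces."""
    pieces = []
    for line in log_text.splitlines():
        # Remove any leading '>' quote markers and the spaces around them.
        pieces.append(line.lstrip("> ").strip())
    # Collapse all runs of whitespace so wrapped messages match again.
    return " ".join(" ".join(pieces).split())


def find_bad_signatures(log_text):
    """Return the known signatures found in the log, in list order."""
    flat = normalize(log_text)
    return [sig for sig in BAD_SIGNATURES if sig in flat]


def classify(log_text):
    """Label a log 'bad' if any signature is present, otherwise 'good'."""
    return "bad" if find_bad_signatures(log_text) else "good"
```

A log captured after the stress test (for example the saved output of dmesg)
can be passed to classify() as a string. A "good" result only means none of
the listed fragments were seen; it does not prove the kernel is free of the
corruption.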
>>
>> Examples of error messages from dmesg:
>>
>> [ 653.364343] page: refcount:4 mapcount:0 mapping:00000000b159289f
>> index:0x585a7cec pfn:0x10b5c1
>> [ 653.364353] memcg:ffff8f2600918000
>> [ 653.364354] aops:btree_aops ino:1
>> [ 653.364358] flags:
>> 0x17ffffd000802a(uptodate|lru|private|writeback|node=0|zone=2|lastcpupid=0x1fffff)
>> [ 653.364361] page_type: 0xffffffff()
>> [ 653.364363] raw: 0017ffffd000802a fffff1da87ee3288 fffff1da842d70c8
>> ffff8f260c719458
>> [ 653.364365] raw: 00000000585a7cec ffff8f26cd09e0f0 00000004ffffffff
>> ffff8f2600918000
>> [ 653.364366] page dumped because: eb page dump
>> [ 653.364367] BTRFS critical (device dm-0): corrupt leaf: root=7
>> block=6071604133888 slot=159, unexpected item end, have 2768254010
>> expect 13379
>> [ 653.364371] BTRFS info (device dm-0): leaf 6071604133888 gen 679995
>> total ptrs 353 free space 322 owner 7
>> [ 653.364373] item 0 key (18446744073709551606 128 1062871883776)
>> itemoff 16271 itemsize 12
>> [ 653.364375] item 1 key (18446744073709551606 128 1062871896064)
>> itemoff 16263 itemsize 8
>> [ 653.364376] item 2 key (18446744073709551606 128 1062871904256)
>> itemoff 16255 itemsize 8
>> ...
>> [ 653.364762] item 350 key (18446744073709551606 128 1062879260672)
>> itemoff 9227 itemsize 12
>> [ 653.364763] item 351 key (18446744073709551606 128 1062879272960)
>> itemoff 9223 itemsize 4
>> [ 653.364764] item 352 key (18446744073709551606 128 1062879277056)
>> itemoff 9147 itemsize 76
>> [ 653.364766] BTRFS error (device dm-0): block=6071604133888 write time
>> tree block corruption detected
>> [ 653.375440] BTRFS: error (device dm-0) in
>> btrfs_commit_transaction:2511: errno=-5 IO failure (Error while writing
>> out transaction)
>> [ 653.375453] BTRFS info (device dm-0 state E): forced readonly
>> [ 653.375458] BTRFS warning (device dm-0 state E): Skipping commit of
>> aborted transaction.
>> [ 653.375461] BTRFS error (device dm-0 state EA): Transaction aborted
>> (error -5)
>> [ 653.375465] BTRFS: error (device dm-0 state EA) in
>> cleanup_transaction:2005: errno=-5 IO failure
>> [ 653.375582] BTRFS warning (device dm-0 state EA): Skipping commit of
>> aborted transaction.
>> [ 653.375586] BTRFS: error (device dm-0 state EA) in
>> cleanup_transaction:2005: errno=-5 IO failure
>>
>> Another example:
>>
>> [ 5478.134046] page: refcount:4 mapcount:0 mapping:0000000010080c01
>> index:0x5459ff30 pfn:0x168c7f
>> [ 5478.134054] memcg:ffff89c240988000
>> [ 5478.134056] aops:btree_aops ino:1
>> [ 5478.134061] flags:
>> 0x17ffffd800802a(uptodate|lru|private|writeback|node=0|zone=2|lastcpupid=0x1fffff)
>> [ 5478.134064] page_type: 0xffffffff()
>> [ 5478.134066] raw: 0017ffffd800802a ffffcc5d043e2bc8 ffffcc5d05a08c88
>> ffff89c249968338
>> [ 5478.134068] raw: 000000005459ff30 ffff89c246fa22d0 00000004ffffffff
>> ffff89c240988000
>> [ 5478.134069] page dumped because: eb page dump
>> [ 5478.134071] BTRFS critical (device dm-0): corrupt leaf: root=2161
>> block=5796594384896 slot=84 ino=2434728, invalid inode generation: has
>> 72057594122450740 expect (0, 664473]
>> [ 5478.134075] BTRFS info (device dm-0): leaf 5796594384896 gen 664472
>> total ptrs 120 free space 1223 owner 2161
>> [ 5478.134077] item 0 key (2434713 24 3817753667) itemoff 16210
>> itemsize 73
>> [ 5478.134078] item 1 key (2434713 108 0) itemoff 15359 itemsize 851
>> [ 5478.134080] inline extent data size 830
>> [ 5478.134081] item 2 key (2434714 1 0) itemoff 15199 itemsize 160
>> [ 5478.134082] inode generation 636724 size 758 mode 100644
>> [ 5478.134083] item 3 key (2434714 12 2348495) itemoff 15181 itemsize 18
>> ...
>> [ 5478.134242] item 117 key (2434733 108 0) itemoff 4398 itemsize 329
>> [ 5478.134243] inline extent data size 308
>> [ 5478.134244] item 118 key (2434734 1 0) itemoff 4238 itemsize 160
>> [ 5478.134245] inode generation 636724 size 30 mode 40755
>> [ 5478.134245] item 119 key (2434734 12 2434375) itemoff 4223 itemsize 15
>> [ 5478.134247] BTRFS error (device dm-0): block=5796594384896 write time
>> tree block corruption detected
>> [ 5478.263726] BTRFS: error (device dm-0) in
>> btrfs_commit_transaction:2511: errno=-5 IO failure (Error while writing
>> out transaction)
>> [ 5478.263733] BTRFS info (device dm-0 state E): forced readonly
>> [ 5478.263736] BTRFS warning (device dm-0 state E): Skipping commit of
>> aborted transaction.
>> [ 5478.263737] BTRFS error (device dm-0 state EA): Transaction aborted
>> (error -5)
>> [ 5478.263739] BTRFS: error (device dm-0 state EA) in
>> cleanup_transaction:2005: errno=-5 IO failure
>> [ 5478.264582] BTRFS warning (device dm-0 state EA): Skipping commit of
>> aborted transaction.
>> [ 5478.264595] BTRFS: error (device dm-0 state EA) in
>> cleanup_transaction:2005: errno=-5 IO failure
>
> #regzbot ^introduced: 6296562f30b1
> #regzbot summary: hid: amd_sfh: memory/page corruption correlated with
> 6296562f30b1 or 2105e8e00da4
> #regzbot ignore-activity