BUG in amd_sfh_get_report

Hello,

I just rebooted my server this morning and was greeted by this bug:
--------------
[    9.251535] BUG: unable to handle page fault for address: ffffffff85600000
[    9.254214] #PF: supervisor read access in kernel mode
[    9.257295] #PF: error_code(0x0000) - not-present page
[    9.259928] PGD 181a25067 P4D 181a25067 PUD 181a26063 PMD 0 
[    9.259940] Oops: 0000 [#1] PREEMPT SMP NOPTI
[    9.259945] CPU: 11 PID: 723 Comm: (udev-worker) Tainted: P           O       6.6.42 #1-NixOS
[    9.259949] Hardware name:  /Default string, BIOS FP7R2_B5D_04A.45 06/14/2023
[    9.259950] RIP: 0010:amd_sfh_get_report+0x43/0x140 [amd_sfh]
[    9.272030] Code: 00 48 8b 68 08 8b 45 10 85 c0 0f 84 d9 00 00 00 49 89 fc 41 89 f6 41 89 d7 31 db eb 0d 48 83 c3 01 48 39 c3 0f 84 bf 00 00 00 <4c> 39 64 dd 68 75 ec 48 8b 44 24 30 48 33 05 92 d3 c7 c2 be c0 0d
[    9.272037] RSP: 0018:ffffc90000f8fb40 EFLAGS: 00010287
[    9.272041] RAX: 0000000048000000 RBX: 0000000000545c2d RCX: 0000000000000000
[    9.272043] RDX: 0000000000000002 RSI: 0000000000000001 RDI: ffff88812ce84000
[    9.272045] RBP: ffffffff82bd1e30 R08: ffffc90000f8fbd8 R09: ffffc90000f8fbd8
[    9.272046] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88812ce84000
[    9.272047] R13: 0000000000000001 R14: 0000000000000001 R15: 0000000000000002
[    9.272049] FS:  00007f7175005100(0000) GS:ffff88838ff80000(0000) knlGS:0000000000000000
[    9.272050] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    9.272051] CR2: ffffffff85600000 CR3: 0000000117900000 CR4: 0000000000f50ee0
[    9.297338] PKRU: 55555554
[    9.297345] Call Trace:
[    9.297353]  <TASK>
[    9.297360]  ? __die+0x23/0x80
[    9.297371]  ? page_fault_oops+0x171/0x500
[    9.297376]  ? srso_alias_return_thunk+0x5/0xfbef5
[    9.297382]  ? srso_alias_return_thunk+0x5/0xfbef5
[    9.297384]  ? search_bpf_extables+0x5f/0x90
[    9.319385]  ? exc_page_fault+0x158/0x160
[    9.319397]  ? asm_exc_page_fault+0x26/0x30
[    9.319403]  ? __pfx_css_release+0x10/0x10
[    9.319417]  ? amd_sfh_get_report+0x43/0x140 [amd_sfh]
[    9.319426]  amdtp_hid_request+0x3e/0x60 [amd_sfh]
[    9.319435]  sensor_hub_get_feature+0xad/0x180 [hid_sensor_hub]
[    9.319448]  hid_sensor_parse_common_attributes+0x217/0x320 [hid_sensor_iio_common]
[    9.319457]  hid_accel_3d_probe+0xb7/0x320 [hid_sensor_accel_3d]
[    9.319463]  ? srso_alias_return_thunk+0x5/0xfbef5
[    9.319466]  platform_probe+0x44/0xa0
[    9.319474]  really_probe+0x1ac/0x3f0
[    9.319478]  ? __pfx___driver_attach+0x10/0x10
[    9.319480]  __driver_probe_device+0x78/0x170
[    9.319482]  driver_probe_device+0x1f/0xa0
[    9.319485]  __driver_attach+0xea/0x1e0
[    9.319487]  bus_for_each_dev+0x8c/0xe0
[    9.319493]  bus_add_driver+0x14d/0x280
[    9.319497]  driver_register+0x5d/0x120
[    9.319500]  ? __pfx_hid_accel_3d_platform_driver_init+0x10/0x10 [hid_sensor_accel_3d]
[    9.319504]  do_one_initcall+0x5d/0x330
[    9.319513]  do_init_module+0x90/0x270
[    9.319517]  __do_sys_init_module+0x18a/0x1c0
[    9.319520]  ? srso_alias_return_thunk+0x5/0xfbef5
[    9.319525]  do_syscall_64+0x39/0x90
[    9.319530]  entry_SYSCALL_64_after_hwframe+0x78/0xe2
[    9.319534] RIP: 0033:0x7f7174b1a61e
[    9.319579] Code: 48 8b 0d 0d 68 0d 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d da 67 0d 00 f7 d8 64 89 01 48
[    9.319581] RSP: 002b:00007ffda3136658 EFLAGS: 00000246 ORIG_RAX: 00000000000000af
[    9.319584] RAX: ffffffffffffffda RBX: 000055a3f7725040 RCX: 00007f7174b1a61e
[    9.319585] RDX: 00007f7175181304 RSI: 0000000000007fd0 RDI: 000055a3f773edb0
[    9.319587] RBP: 000055a3f773edb0 R08: 0000000000000000 R09: 0000000000000000
[    9.319588] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f7175181304
[    9.319589] R13: 0000000000020000 R14: 000055a3f771fa40 R15: 0000000000000000
[    9.319593]  </TASK>
[    9.319594] Modules linked in: hid_sensor_gyro_3d hid_sensor_magn_3d snd_sof_amd_renoir intel_rapl_msr(+) edac_core nls_iso8859_1 hid_sensor_accel_3d(+) snd_sof_amd_acp rtw88_core intel_rapl_common hid_sensor_trigger nls_cp437 industrialio_triggered_buffer snd_sof_pci kfifo_buf snd_sof_xtensa_dsp hid_sensor_iio_common vfat industrialio snd_sof fat kvm_amd mac80211 snd_sof_utils hid_sensor_hub snd_hda_codec_realtek snd_hda_codec_hdmi kvm snd_hda_codec_generic drm_exec snd_soc_core amdxcp snd_usb_audio drm_buddy irqbypass snd_compress snd_hda_intel crc32_pclmul eeepc_wmi(-) polyval_clmulni ac97_bus snd_intel_dspcfg gpu_sched btusb asus_wmi snd_pcm_dmaengine polyval_generic snd_intel_sdw_acpi snd_usbmidi_lib gf128mul drm_suballoc_helper btrtl battery snd_pci_ps ghash_clmulni_intel snd_ump drm_ttm_helper snd_hda_codec snd_rpl_pci_acp6x btintel snd_rawmidi ttm btbcm snd_acp_pci ledtrig_audio input_leds sha512_ssse3 snd_seq_device snd_hda_core sparse_keymap btmtk evdev wmi_bmof snd_acp_legacy_common sha256_ssse3
[    9.319659]  drm_display_helper mc led_class snd_pci_acp6x snd_hwdep sha1_ssse3 cfg80211 mac_hid bluetooth r8169 aesni_intel snd_pcm cec crypto_simd cryptd snd_pci_acp5x i2c_algo_bit sp5100_tco snd_rn_pci_acp3x realtek snd_timer tpm_crb snd_acp_config mdio_devres ecdh_generic uas watchdog video snd tpm_tis amd_pmf snd_soc_acpi tiny_power_button rfkill ecc rapl usb_storage crc16 libphy libarc4 soundcore k10temp amd_sfh(+) i2c_piix4 ccp snd_pci_acp3x backlight wmi thermal tpm_tis_core platform_profile button acpi_tad serio_raw zfs(PO+) nfsd spl(O) tun tap auth_rpcgss macvlan nfs_acl lockd bridge grace stp llc fuse sunrpc efi_pstore configfs nfnetlink zram efivarfs tpm rng_core dmi_sysfs ip_tables x_tables autofs4 hid_generic sd_mod usbhid atkbd libps2 hid vivaldi_fmap ahci libahci libata nvme xhci_pci xhci_pci_renesas nvme_core scsi_mod xhci_hcd nvme_common t10_pi crc64_rocksoft crc64 crc_t10dif crct10dif_generic crct10dif_pclmul scsi_common crct10dif_common rtc_cmos i8042 serio dm_mod dax btrfs blake2b_generic
[    9.319743]  libcrc32c crc32c_generic crc32c_intel xor raid6_pq
[    9.319749] CR2: ffffffff85600000
[    9.319752] ---[ end trace 0000000000000000 ]---
[    9.444407] RIP: 0010:amd_sfh_get_report+0x43/0x140 [amd_sfh]
[    9.563701] Code: 00 48 8b 68 08 8b 45 10 85 c0 0f 84 d9 00 00 00 49 89 fc 41 89 f6 41 89 d7 31 db eb 0d 48 83 c3 01 48 39 c3 0f 84 bf 00 00 00 <4c> 39 64 dd 68 75 ec 48 8b 44 24 30 48 33 05 92 d3 c7 c2 be c0 0d
[    9.563707] RSP: 0018:ffffc90000f8fb40 EFLAGS: 00010287
[    9.563710] RAX: 0000000048000000 RBX: 0000000000545c2d RCX: 0000000000000000
[    9.563711] RDX: 0000000000000002 RSI: 0000000000000001 RDI: ffff88812ce84000
[    9.563713] RBP: ffffffff82bd1e30 R08: ffffc90000f8fbd8 R09: ffffc90000f8fbd8
[    9.563714] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88812ce84000
[    9.563715] R13: 0000000000000001 R14: 0000000000000001 R15: 0000000000000002
[    9.563716] FS:  00007f7175005100(0000) GS:ffff88838ff80000(0000) knlGS:0000000000000000
[    9.594612] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    9.594617] CR2: ffffffff85600000 CR3: 0000000117900000 CR4: 0000000000f50ee0
[    9.594619] PKRU: 55555554
[    9.594622] note: (udev-worker)[723] exited with irqs disabled
------

Thankfully the system was able to boot. I'm not quite sure whether it's
related, but udev got a thread stuck trying to remove the device
(probably because the thread died with some lock held) and everything
was very slow. Something else crashed shortly after, so I didn't have
time to investigate the broken state all that much.

- 6.6.42 kernel from nixos unstable
- CPU identified as AMD Ryzen 7 7735HS with Radeon Graphics in
/proc/cpuinfo
- this card:
05:00.7 Signal processing controller [1180]: Advanced Micro Devices, Inc. [AMD] Sensor Fusion Hub [1022:15e4]


I'd offer to test mainline but I cannot reboot this machine easily, and
passing the card to qemu unfortunately didn't reproduce the problem
(amd_sfh_dis_sts_v2() != 0 so the driver doesn't load, and skipping that
check doesn't help). I'm afraid I won't be of much help with further
debugging, but hopefully this gives a starting point.

I unfortunately have no easy way to get debug info, but a quick look
at the disassembly suggests that amd_sfh_get_report+0x43 is the
access to cli_data->hid_sensor_hubs[i]:
('i++')
     aa6:       48 83 c3 01             add    $0x1,%rbx
('i < cli_data->num_hid_devices' check)
     aaa:       48 39 c3                cmp    %rax,%rbx
     aad:       0f 84 bf 00 00 00       je     b72 <amd_sfh_get_report+0x102>
(amd_sfh_get_report+0x43;
'if (cli_data->hid_sensor_hubs[i] == hid) {'
0x68 is the offset of hid_sensor_hubs in struct amdtp_cl_data;
the registers / bug address also match rbp+8*rbx+0x68 = ffffffff85600000)
     ab3:       4c 39 64 dd 68          cmp    %r12,0x68(%rbp,%rbx,8)
     ab8:       75 ec                   jne    aa6 <amd_sfh_get_report+0x36>
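
The address arithmetic mentioned above can be double-checked with a
small user-space sketch. The function names are made up for
illustration; only the 0x68 displacement, the scale of 8 and the
register values come from the oops and the disassembly.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Effective address of the faulting instruction
 *     cmp %r12,0x68(%rbp,%rbx,8)
 * i.e. base + index * 8 + displacement.
 * (Hypothetical helper, not part of the driver.)
 */
uint64_t oops_fault_address(uint64_t rbp, uint64_t rbx)
{
	return rbp + 8 * rbx + 0x68;
}

/* Plug in RBP and RBX from the register dump and compare with CR2. */
void oops_check(void)
{
	uint64_t rbp = 0xffffffff82bd1e30ULL; /* RBP from the oops */
	uint64_t rbx = 0x545c2dULL;           /* RBX from the oops, i.e. 'i' */
	uint64_t cr2 = 0xffffffff85600000ULL; /* CR2 from the oops */

	assert(oops_fault_address(rbp, rbx) == cr2);
}
```

With these values 'i' had reached 0x545c2d (about 5.5 million), far
past anything a small array of HID devices could hold.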


So the problem would be that num_hid_devices somehow holds 0x48000000
(the value in RAX), which lets i run to far too high values?
I can't fault the num_hid_devices init for a given cli_data in
amd_sfh_hid_client_init, but amd_sfh_get_report() might have been called
on something that's not quite valid yet, or is in the process of being
removed?...
I'm sorry, my previous reboot was a while ago so I can't even tell if
it's reproducible, but the code hasn't changed all that much recently,
so this is probably a race condition, which would explain why I hadn't
seen it before...
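
To illustrate the suspected failure mode, here is a simplified
user-space model of the lookup loop. It is not the driver code: the
struct, the function name and the array size of 6 are assumptions made
for the sketch, and the clamp only shows how a bogus count could be
kept from walking off the array. It would not address whatever makes
the count bogus in the first place.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed array size for this model only, not taken from the driver. */
#define MODEL_MAX_HID_DEVICES 6

/* Hypothetical stand-in for the relevant fields of struct amdtp_cl_data. */
struct model_cl_data {
	unsigned int num_hid_devices;
	const void *hid_sensor_hubs[MODEL_MAX_HID_DEVICES];
};

/*
 * Return the index of 'hid' in hid_sensor_hubs, or -1 if it is absent.
 * The count is clamped to the array size, so a garbage value such as
 * 0x48000000 cannot make the loop read past the end of the array.
 */
int model_find_hub(const struct model_cl_data *cli, const void *hid)
{
	unsigned int n, i;

	assert(cli != NULL);

	n = cli->num_hid_devices;
	if (n > MODEL_MAX_HID_DEVICES)
		n = MODEL_MAX_HID_DEVICES;

	for (i = 0; i < n; i++) {
		if (cli->hid_sensor_hubs[i] == hid)
			return (int)i;
	}
	return -1;
}
```

Without the clamp, a lookup for a pointer that is not in the array
would keep comparing entries until it either happened to match or hit
an unmapped page, which is what the oops looks like.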

(... And I honestly have no idea what this driver is all for even after
having looked at the code so I've just blacklisted the module for now,
good luck!)
-- 
Dominique Martinet | Asmadeus




