On 21.1.2023 5.55, stephen@xxxxxxxxxx wrote:
Hi all,
Today on my desktop running 6.1.6-arch1, I encountered the following
page fault. I looked for recent commits since then which may be related
but didn't find too much. I know it's a distro kernel, but it's recent
enough that I thought I'd share.
[242632.503611] BUG: unable to handle page fault for address: 000000008a49c504
[242632.510560] #PF: supervisor write access in kernel mode
[242632.515864] #PF: error_code(0x0002) - not-present page
[242632.521077] PGD 16d54b067 P4D 16d54b067 PUD 0
[242632.525599] Oops: 0002 [#1] PREEMPT SMP NOPTI
[242632.530039] CPU: 4 PID: 252721 Comm: cc1 Kdump: loaded Tainted: G OE 6.1.6-arch1-3-stephen #1 cca7bc1605c8162784590202a8c158ac5b4faea6
[242632.543478] Hardware name: ASUS System Product Name/TUF GAMING X570-PRO (WI-FI), BIOS 3603 03/20/2021
[242632.552757] RIP: 0010:xhci_irq+0x16fe/0x1fc0
[242632.557114] Code: 4c 8b 54 24 18 e9 c3 f5 ff ff 8d 45 e6 83 f8 01 0f 86 5e ff ff ff e9 4e ff ff ff 48 8b 44 24 18 44 89 c2 44 29 ca 48 8b 40 28 <89> 90 84 00 00 00 e9 ab f2 ff ff 49 8b 95 60 02 00 00 44 89 c9 48
[242632.575907] RSP: 0000:ffffa8061cf43df8 EFLAGS: 00010082
[242632.581206] RAX: 000000008a49c480 RBX: ffffa80600131730 RCX: 000000000000000d
[242632.588411] RDX: 00000000d313edc4 RSI: ffff957d86596460 RDI: ffff957d80efc260
[242632.595617] RBP: 000000000000000d R08: 0000000000000040 R09: 0000000000000039
[242632.602820] R10: 0000000000000000 R11: ffff957d865a7000 R12: ffff957d80f76140
[242632.610024] R13: ffff957d80efc000 R14: 0000000000000081 R15: ffffa80600131730
[242632.617227] FS: 00007ff4bfa38ac0(0000) GS:ffff958c6e900000(0000) knlGS:0000000000000000
[242632.625383] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[242632.631203] CR2: 000000008a49c504 CR3: 0000000405c38000 CR4: 0000000000750ee0
[242632.638411] PKRU: 55555554
[242632.641202] Call Trace:
[242632.643735] <TASK>
[242632.645924] __handle_irq_event_percpu+0x4d/0x190
[242632.650715] handle_irq_event+0x3b/0x80
[242632.654641] handle_edge_irq+0x9a/0x260
[242632.658569] __common_interrupt+0x46/0xa0
[242632.662668] common_interrupt+0x43/0xa0
[242632.666593] asm_common_interrupt+0x26/0x40
[242632.670866] RIP: 0033:0x636540
[242632.674027] Code: 05 dd 56 1e 02 0f b6 40 2c d0 e8 83 e0 01 c3 66 66 2e 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 48 8b 05 bd 56 1e 02 80 48 2b 10 c3 <f3> 0f 1e fa 48 8b 0d ad 56 1e 02 48 85 c9 75 10 eb 6e 66 0f 1f 44
[242632.692825] RSP: 002b:00007ffe2144df78 EFLAGS: 00000246
[242632.698133] RAX: 0000000000000006 RBX: 00007ff4ba906400 RCX: 00007ff4bed2ded0
[242632.705336] RDX: 0000000000000001 RSI: 0000000000000006 RDI: 0000000000000000
[242632.712541] RBP: 000000007ff231ad R08: 0000000000000001 R09: 0000000000000072
[242632.719747] R10: 0000000000000000 R11: 2675089da1f1c2c7 R12: 000000007ff231ad
[242632.726949] R13: 7ff231ad7ff231ad R14: 00007ff4bee11cf0 R15: 00007ff4bee11cf0
[242632.734161] </TASK>
[242632.736440] Modules linked in: uinput nls_utf8 cifs cifs_arc4 cifs_md4 dns_resolver fscache netfs rfcomm hid_logitech_hidpp mousedev joydev uvcvideo videobuf2_vmalloc videobuf2_memops snd_usb_audio videobuf2_v4l2 videobuf2_common snd_usbmidi_lib snd_rawmidi videodev vfat snd_seq_device fat hid_plantronics mc hid_logitech_dj cmac algif_hash algif_skcipher nct6775 af_alg nct6775_core hwmon_vid bnep intel_rapl_msr intel_rapl_common edac_mce_amd iwlmvm kvm_amd ccp snd_hda_codec_realtek snd_hda_codec_generic mac80211 snd_hda_codec_hdmi libarc4 kvm snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi irqbypass btusb snd_hda_codec crct10dif_pclmul crc32_pclmul btrtl polyval_clmulni snd_hda_core polyval_generic btbcm snd_hwdep gf128mul iwlwifi btintel ghash_clmulni_intel snd_pcm btmtk sha512_ssse3 eeepc_wmi asus_wmi bluetooth aesni_intel snd_timer cfg80211 crypto_simd ledtrig[242632.503611] BUG: unable to handle page fault for address: 000000008a49c504
[242632.510560] #PF: supervisor write access in kernel mode
[242632.515864] #PF: error_code(0x0002) - not-present page
[242632.521077] PGD 16d54b067 P4D 16d54b067 PUD 0
[242632.525599] Oops: 0002 [#1] PREEMPT SMP NOPTI
[242632.530039] CPU: 4 PID: 252721 Comm: cc1 Kdump: loaded Tainted: G OE 6.1.6-arch1-3-stephen #1 cca7bc1605c8162784590202a8c158ac5b4faea6_audio snd cryptd uas sparse_keymap sp5100_tco zenpower(OE) platform_profile rapl usb_storage pcspkr ecdh_generic wmi_bmof
[242632.736509] soundcore i2c_piix4 igc rfkill acpi_cpufreq mac_hid dm_multipath dm_mod crypto_user fuse bpf_preload ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 usbhid nvme sr_mod nvme_core crc32c_intel cdrom xhci_pci nvme_common xhci_pci_renesas amdgpu drm_ttm_helper ttm video wmi gpu_sched drm_buddy drm_display_helper cec
[242632.853084] CR2: 000000008a49c504
I used drgn (https://github.com/osandov/drgn) to get a nicer stack
trace:
#0 process_bulk_intr_td (drivers/usb/host/xhci-ring.c:2502:38)
#1 handle_tx_event (drivers/usb/host/xhci-ring.c:2867:4)
#2 xhci_handle_event (drivers/usb/host/xhci-ring.c:2946:9)
#3 xhci_irq (drivers/usb/host/xhci-ring.c:3080:9)
#4 __handle_irq_event_percpu (kernel/irq/handle.c:158:9)
#5 handle_irq_event_percpu (kernel/irq/handle.c:193:11)
#6 handle_irq_event (kernel/irq/handle.c:210:8)
#7 handle_edge_irq (kernel/irq/chip.c:819:3)
#8 generic_handle_irq_desc (./include/linux/irqdesc.h:158:2)
#9 handle_irq (arch/x86/kernel/irq.c:231:3)
#10 __common_interrupt (arch/x86/kernel/irq.c:250:3)
#11 common_interrupt (arch/x86/kernel/irq.c:240:1)
#12 asm_common_interrupt+0x26/0x2b (./arch/x86/include/asm/idtentry.h:640)
The code in question:
static int process_bulk_intr_td(struct xhci_hcd *xhci, struct xhci_virt_ep *ep,
struct xhci_ring *ep_ring, struct xhci_td *td,
union xhci_trb *ep_trb, struct xhci_transfer_event *event)
{
/* many lines omitted */
if (ep_trb == td->last_trb)
td->urb->actual_length = requested - remaining;
I was able to use crash and some disassembly to confirm that RAX was the
register being dereferenced, and it has a value 000000008a49c504 which
was recorded in the page fault output. However, working backwards, I
found that the td and td->urb pointers were both valid. So I'm assuming
there's a race here somewhere. With crash, I found two other stacks on
CPU which seemed suspicious:
PID: 252783 TASK: ffff9586455a0000 CPU: 13 COMMAND: "cc1"
#0 [fffffe000030ce60] crash_nmi_callback at ffffffff84e657c6
#1 [fffffe000030ce68] nmi_handle at ffffffff84e321e1
#2 [fffffe000030ceb0] default_do_nmi at ffffffff85a1ee40
#3 [fffffe000030ced0] exc_nmi at ffffffff85a1f066
#4 [fffffe000030cef0] end_repeat_nmi at ffffffff85c01e3e
[exception RIP: native_queued_spin_lock_slowpath+109]
RIP: ffffffff85a3519d RSP: ffffa8061cffbc60 RFLAGS: 00000002
RAX: 0000000000000101 RBX: ffffffff874c25ec RCX: ffff957d8004c808
RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffffff874c25ec
RBP: 0000000000000046 R8: 0000000000000000 R9: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffffa8061cffbd48
R13: 0000000000000002 R14: 0000000000000000 R15: 0000000000000000
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0000
--- <NMI exception stack> ---
#5 [ffffa8061cffbc60] native_queued_spin_lock_slowpath at ffffffff85a3519d
#6 [ffffa8061cffbc80] oops_begin at ffffffff84e31c9c
#7 [ffffa8061cffbc98] page_fault_oops at ffffffff84e81ec1
#8 [ffffa8061cffbd18] exc_page_fault at ffffffff85a21774
#9 [ffffa8061cffbd40] asm_exc_page_fault at ffffffff85c01266
[exception RIP: xhci_irq+5886]
RIP: ffffffff856cd58e RSP: ffffa8061cffbdf8 RFLAGS: 00010003
RAX: 00000000edf2e780 RBX: ffffa80600209ca0 RCX: 000000000000000d
RDX: 0000000036bd10c4 RSI: ffff957d87169ae0 RDI: ffff957d9ebae260
RBP: 000000000000000d R8: 0000000000000020 R9: 0000000000000011
R10: 0000000000000000 R11: ffff957e7ccf5880 R12: ffff957e11552380
R13: ffff957d9ebae000 R14: 0000000000000081 R15: ffffa80600209ca0
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0000
#10 [ffffa8061cffbe98] __handle_irq_event_percpu at ffffffff84f1fdbd
#11 [ffffa8061cffbec8] handle_irq_event at ffffffff84f1ff8b
#12 [ffffa8061cffbee8] handle_edge_irq at ffffffff84f24afa
#13 [ffffa8061cffbf08] __common_interrupt at ffffffff84e30556
#14 [ffffa8061cffbf30] common_interrupt at ffffffff85a1e9b3
#15 [ffffa8061cffbf50] asm_common_interrupt at ffffffff85c014e6
RIP: 000000000082ce8d RSP: 00007ffc76405600 RFLAGS: 00000296
RAX: 000000000000006c RBX: 00007fe2941b88c0 RCX: 00000000038ef8d7
RDX: 00000000a0d7e714 RSI: 0000000000000011 RDI: 00000000038ef8d0
RBP: 00007fe2941b88c0 R8: 00000000257c7983 R9: 000000007ff3facb
R10: 00007fe29837c000 R11: 00000000257c7983 R12: 00000000ae574838
R13: 00007fe297708c90 R14: 00007ffc76405800 R15: 00000000038ef8d0
ORIG_RAX: ffffffffffffffff CS: 0033 SS: 002b
PID: 640 TASK: ffff957d9e924180 CPU: 31 COMMAND: "usb-storage"
#0 [fffffe0000732e60] crash_nmi_callback at ffffffff84e657c6
#1 [fffffe0000732e68] nmi_handle at ffffffff84e321e1
#2 [fffffe0000732eb0] default_do_nmi at ffffffff85a1ee40
#3 [fffffe0000732ed0] exc_nmi at ffffffff85a1f066
#4 [fffffe0000732ef0] end_repeat_nmi at ffffffff85c01e3e
[exception RIP: native_queued_spin_lock_slowpath+111]
RIP: ffffffff85a3519f RSP: ffffa8060142bc00 RFLAGS: 00000002
RAX: 0000000000000101 RBX: ffff957d80efc2ac RCX: 0000000000000001
RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff957d80efc2ac
RBP: 0000000000000001 R8: 0000000000000001 R9: ffff957dbff46980
R10: 000000000000001f R11: ffff957d86617a40 R12: ffff957d80efc2ac
R13: ffff957d81dae840 R14: ffff957dbff46980 R15: ffff957d80efc010
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
--- <NMI exception stack> ---
#5 [ffffa8060142bc00] native_queued_spin_lock_slowpath at ffffffff85a3519f
#6 [ffffa8060142bc20] _raw_spin_lock_irqsave at ffffffff85a34a8d
#7 [ffffa8060142bc30] xhci_urb_enqueue at ffffffff856c2e89
#8 [ffffa8060142bc90] usb_hcd_submit_urb at ffffffff8568e7e4
#9 [ffffa8060142bd50] usb_stor_msg_common at ffffffffc13912ee [usb_storage]
#10 [ffffa8060142bd90] usb_stor_bulk_transfer_buf at ffffffffc1391685 [usb_storage]
#11 [ffffa8060142bdb8] usb_stor_Bulk_transport at ffffffffc1392089 [usb_storage]
#12 [ffffa8060142be00] usb_stor_invoke_transport at ffffffffc1392563 [usb_storage]
#13 [ffffa8060142beb8] usb_stor_control_thread at ffffffffc139332c [usb_storage]
#14 [ffffa8060142bf18] kthread at ffffffff84ec73ce
#15 [ffffa8060142bf50] ret_from_fork at ffffffff84e03572
CPU 13 had the exact same page fault at the same instruction, but with a
different RAX value. CPU 31 has a suspicious looking stack (given that
we faulted on a field td->urb), but I don't know anything in this
subsystem very well.
I have the core dump and kernel info and would be happy to fetch more
information if yo uwanted. I don't think it's (easily) reproducible.
I just wanted to share in case this catches anyone's eye. I admit
between some amdgpu crashes and this, my desktop hasn't been as stable
as I'd like, so I'll be reporting any other crashes if they happen.
Thanks for this report.
In normal cases the URB should not be altered until xhci driver gives it back,
and this normally happens after xHC handled the last transfer block (TRB) of this URB.
In some special cases (short transfer or error on multi TRB URB) I think we might give
back the URB before the last TRB is handled, but TD should be removed from list
of queued TDs, so event handling shouldn't reach process_intr_bulk_td() and dereference
td->urb. Unless there's a bug.
Were there any usb related error messages in dmesg before this?
Thanks
Mathias