Lockup on USB and block devices

Steven Rostedt <rostedt@xxxxxxxxxxx> · Tue, 1 Oct 2019 13:44:30 -0400

Not sure who to blame, but my server locked up when upgraded (accessing
volume group information), and echoing in "w" into sysrq-trigger showed
a bit of information.

First, looking at my dmesg I see that my usb-storage is hung up, for
whatever reason. Thus, this could be the source of all issues.

[5434447.145737] INFO: task usb-storage:32246 blocked for more than 120 seconds.
[5434447.145740]       Not tainted 5.2.4-custom #4

(BTW, I was upgrading to my 5.2.17 kernel when this happened)

[5434447.145741] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[5434447.145743] usb-storage     D    0 32246      2 0x80004000
[5434447.145745] Call Trace:
[5434447.145749]  ? __schedule+0x1e8/0x600
[5434447.145752]  schedule+0x29/0x90
[5434447.145755]  schedule_timeout+0x208/0x300
[5434447.145765]  ? usb_hcd_submit_urb+0xbe/0xb90 [usbcore]
[5434447.145773]  ? usb_alloc_urb+0x23/0x70 [usbcore]
[5434447.145782]  ? usb_sg_init+0x92/0x2b0 [usbcore]
[5434447.145791]  ? usb_hcd_submit_urb+0xbe/0xb90 [usbcore]
[5434447.145795]  ? __switch_to_asm+0x34/0x70
[5434447.145798]  wait_for_completion+0x111/0x180
[5434447.145800]  ? wake_up_q+0x70/0x70
[5434447.145809]  usb_sg_wait+0xfa/0x150 [usbcore]
[5434447.145814]  usb_stor_bulk_transfer_sglist.part.4+0x64/0xb0 [usb_storage]
[5434447.145818]  usb_stor_bulk_srb+0x49/0x80 [usb_storage]
[5434447.145821]  usb_stor_Bulk_transport+0x167/0x3e0 [usb_storage]
[5434447.145824]  ? schedule+0x29/0x90
[5434447.145828]  ? usb_stor_disconnect+0xb0/0xb0 [usb_storage]
[5434447.145832]  usb_stor_invoke_transport+0x3a/0x4e0 [usb_storage]
[5434447.145835]  ? wait_for_completion_interruptible+0x12d/0x1d0
[5434447.145837]  ? wake_up_q+0x70/0x70
[5434447.145841]  usb_stor_control_thread+0x1c5/0x270 [usb_storage]
[5434447.145845]  kthread+0x116/0x130
[5434447.145847]  ? kthread_create_worker_on_cpu+0x70/0x70
[5434447.145851]  ret_from_fork+0x35/0x40

But then this messed up with this I presume:

[5434567.980526] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[5434567.980529] pool            D    0 32364      1 0x00000000
[5434567.980532] Call Trace:
[5434567.980543]  ? __schedule+0x1e8/0x600
[5434567.980547]  schedule+0x29/0x90
[5434567.980558]  scsi_block_when_processing_errors+0xb7/0xc0 [scsi_mod]
[5434567.980564]  ? finish_wait+0x80/0x80
[5434567.980568]  sd_open+0x49/0x170 [sd_mod]
[5434567.980574]  __blkdev_get+0x45a/0x510
[5434567.980577]  ? bd_acquire+0xb0/0xb0
[5434567.980579]  blkdev_get+0x108/0x300
[5434567.980582]  ? bd_acquire+0xb0/0xb0
[5434567.980588]  do_dentry_open+0x13a/0x370
[5434567.980593]  path_openat+0x2c6/0x1450
[5434567.980598]  ? part_stat_show+0x427/0x450
[5434567.980603]  do_filp_open+0x93/0x100
[5434567.980608]  ? __check_object_size+0x15d/0x189
[5434567.980612]  do_sys_open+0x186/0x220
[5434567.980616]  do_syscall_64+0x4e/0x100
[5434567.980621]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[5434567.980625] RIP: 0033:0x7f8b4b034d0e
[5434567.980632] Code: Bad RIP value.
[5434567.980633] RSP: 002b:00007f8b49a389d0 EFLAGS: 00000293 ORIG_RAX: 0000000000000101
[5434567.980636] RAX: ffffffffffffffda RBX: 00007f8b3c035490 RCX: 00007f8b4b034d0e
[5434567.980638] RDX: 0000000000000800 RSI: 000055d82f508d30 RDI: 00000000ffffff9c
[5434567.980639] RBP: 0000000000000000 R08: 0000000000000000 R09: 000055d82f3eff80
[5434567.980641] R10: 0000000000000000 R11: 0000000000000293 R12: 00007f8b49a38bd0
[5434567.980642] R13: 00007f8b49a38af8 R14: 0000000000000000 R15: 00007f8b3c035490

and all access to __blkdev_get() blocks on this volume too:

[5437427.042327] vgs             D    0 32353      1 0x80004004
[5437427.042334] Call Trace:
[5437427.042341]  ? __schedule+0x1e8/0x600
[5437427.042347]  schedule+0x29/0x90
[5437427.042354]  schedule_timeout+0x208/0x300
[5437427.042363]  ? __kfifo_from_user_r+0xb0/0xb0 
[5437427.042369]  ? __percpu_ref_switch_mode+0xcd/0x170
[5437427.042375]  ? __wake_up_common_lock+0x89/0xc0
[5437427.042381]  wait_for_completion+0x111/0x180 
[5437427.042387]  ? wake_up_q+0x70/0x70
[5437427.042395]  exit_aio+0xdc/0xe3
[5437427.042404]  mmput+0x28/0x120
[5437427.042411]  do_exit+0x28b/0xb90
[5437427.042420]  ? __switch_to+0x3a6/0x460
[5437427.042425]  ? _cond_resched+0x15/0x30
[5437427.042428]  ? mutex_lock+0xe/0x30
[5437427.042430]  ? aio_read_events+0x28a/0x320
[5437427.042434]  do_group_exit+0x3a/0xa0
[5437427.042438]  get_signal+0x14b/0x840
[5437427.042444]  do_signal+0x36/0x660
[5437427.042449]  ? __hrtimer_init+0xb0/0xb0
[5437427.042452]  ? do_io_getevents+0x7c/0xc0
[5437427.042456]  exit_to_usermode_loop+0x71/0xe0
[5437427.042459]  do_syscall_64+0xf2/0x100
[5437427.042463]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[5437427.042465] RIP: 0033:0x7fc12f303f59
[5437427.042470] Code: Bad RIP value.
[5437427.042472] RSP: 002b:00007fffffad42e8 EFLAGS: 00000246 ORIG_RAX: 00000000000000d0
[5437427.042474] RAX: fffffffffffffffc RBX: 00007fc12efa2700 RCX: 00007fc12f303f59
[5437427.042475] RDX: 0000000000000040 RSI: 0000000000000001 RDI: 00007fc12f974000
[5437427.042477] RBP: 00007fc12f974000 R08: 0000000000000000 R09: 00000002fffffe30
[5437427.042478] R10: 00007fffffad4370 R11: 0000000000000246 R12: 0000000000000001
[5437427.042479] R13: 0000000000000002 R14: 0000000000000040 R15: 00007fffffad4370

Full dmesg is attached.

-- Steve
Attachment:
dmesg.xz

Description: application/xz