Re: [PATCHv3 2/2] nvme: cancel requests for real

On Fri, May 29, 2020 at 4:19 AM Alan Adamson <alan.adamson@xxxxxxxxxx> wrote:
>
> It did come back.  Just took a little longer.
>
> Alan
>
> [  394.319378] nvme nvme0: I/O 960 QID 26 timeout, reset controller
> [  394.334228] nvme nvme0: new error during reset
> [  394.340125] nvme nvme0: Shutdown timeout set to 10 seconds
> [  394.343452] nvme nvme0: 56/0/0 default/read/poll queues
> [  394.343557] FAULT_INJECTION: forcing a failure.
>                 name fail_io_timeout, interval 1, probability 100, space
> 0, times 995
> [  394.343559] CPU: 53 PID: 0 Comm: swapper/53 Not tainted 5.7.0-rc7+ #2
> [  394.343559] Hardware name: Oracle Corporation ORACLE SERVER
> X6-2/ASM,MOTHERBOARD,1U, BIOS 38050100 08/30/2016
> [  394.343560] Call Trace:
> [  394.343561]  <IRQ>
> [  394.343563]  dump_stack+0x6d/0x9a
> [  394.343564]  should_fail.cold.5+0x32/0x42
> [  394.343566]  blk_should_fake_timeout+0x26/0x30
> [  394.343566]  blk_mq_complete_request+0x15/0x30
> [  394.343569]  nvme_irq+0xd9/0x1f0 [nvme]
> [  394.343571]  __handle_irq_event_percpu+0x44/0x190
> [  394.343573]  handle_irq_event_percpu+0x32/0x80
> [  394.343574]  handle_irq_event+0x3b/0x5a
> [  394.343575]  handle_edge_irq+0x87/0x190
> [  394.343577]  do_IRQ+0x54/0xe0
> [  394.343578]  common_interrupt+0xf/0xf
> [  394.343579]  </IRQ>
> [  394.343581] RIP: 0010:cpuidle_enter_state+0xc1/0x400
> [  394.343582] Code: ff e8 e3 41 93 ff 80 7d c7 00 74 17 9c 58 0f 1f 44
> 00 00 f6 c4 02 0f 85 d2 02 00 00 31 ff e8 16 c3 99 ff fb 66 0f 1f 44 00
> 00 <45> 85 e4 0f 88 3d 02 00 00 49 63 c4 48 8d 14 40 48 8d 0c c5 00 00
> [  394.343583] RSP: 0018:ffffae4ecc6cbe40 EFLAGS: 00000246 ORIG_RAX:
> ffffffffffffffdd
> [  394.343584] RAX: ffff9b537fc6cc40 RBX: ffffce4ebfc43200 RCX:
> 000000000000001f
> [  394.343585] RDX: 0000005bd0b51f04 RSI: 0000000031573862 RDI:
> 0000000000000000
> [  394.343585] RBP: ffffae4ecc6cbe80 R08: 0000000000000002 R09:
> 000000000002c4c0
> [  394.343586] R10: 011c65ca3dee2b0c R11: ffff9b537fc6bb44 R12:
> 0000000000000002
> [  394.343586] R13: ffffffffa374c120 R14: ffffffffa374c208 R15:
> ffffffffa374c1f0
> [  394.343588]  cpuidle_enter+0x2e/0x40
> [  394.343590]  call_cpuidle+0x23/0x40
> [  394.343591]  do_idle+0x230/0x270
> [  394.343592]  cpu_startup_entry+0x1d/0x20
> [  394.343594]  start_secondary+0x170/0x1c0
> [  394.343596]  secondary_startup_64+0xb6/0xc0
> [  424.524341] nvme nvme0: I/O 960 QID 26 timeout, aborting
> [  424.524389] nvme nvme0: Abort status: 0x0
> [  454.729622] nvme nvme0: I/O 960 QID 26 timeout, reset controller
> [  454.740442] nvme nvme0: new error during reset
> [  454.750403] nvme nvme0: Shutdown timeout set to 10 seconds
> [  454.753508] nvme nvme0: 56/0/0 default/read/poll queues
> [  454.753596] FAULT_INJECTION: forcing a failure.
>                 name fail_io_timeout, interval 1, probability 100, space
> 0, times 994
> [  454.753598] CPU: 53 PID: 0 Comm: swapper/53 Not tainted 5.7.0-rc7+ #2
> [  454.753599] Hardware name: Oracle Corporation ORACLE SERVER
> X6-2/ASM,MOTHERBOARD,1U, BIOS 38050100 08/30/2016
> [  454.753599] Call Trace:
> [  454.753600]  <IRQ>
> [  454.753603]  dump_stack+0x6d/0x9a
> [  454.753604]  should_fail.cold.5+0x32/0x42
> [  454.753605]  blk_should_fake_timeout+0x26/0x30
> [  454.753606]  blk_mq_complete_request+0x15/0x30
> [  454.753609]  nvme_irq+0xd9/0x1f0 [nvme]
> [  454.753612]  __handle_irq_event_percpu+0x44/0x190
> [  454.753613]  handle_irq_event_percpu+0x32/0x80
> [  454.753615]  handle_irq_event+0x3b/0x5a
> [  454.753616]  handle_edge_irq+0x87/0x190
> [  454.753617]  do_IRQ+0x54/0xe0
> [  454.753619]  common_interrupt+0xf/0xf
> [  454.753620]  </IRQ>
> [  454.753622] RIP: 0010:cpuidle_enter_state+0xc1/0x400
> [  454.753623] Code: ff e8 e3 41 93 ff 80 7d c7 00 74 17 9c 58 0f 1f 44
> 00 00 f6 c4 02 0f 85 d2 02 00 00 31 ff e8 16 c3 99 ff fb 66 0f 1f 44 00
> 00 <45> 85 e4 0f 88 3d 02 00 00 49 63 c4 48 8d 14 40 48 8d 0c c5 00 00
> [  454.753624] RSP: 0018:ffffae4ecc6cbe40 EFLAGS: 00000246 ORIG_RAX:
> ffffffffffffffdd
> [  454.753625] RAX: ffff9b537fc6cc40 RBX: ffffce4ebfc43200 RCX:
> 000000000000001f
> [  454.753626] RDX: 00000069e16d295b RSI: 0000000031573862 RDI:
> 0000000000000000
> [  454.753626] RBP: ffffae4ecc6cbe80 R08: 0000000000000002 R09:
> 000000000002c4c0
> [  454.753627] R10: 011c65eebaf0c79e R11: ffff9b537fc6bb44 R12:
> 0000000000000002
> [  454.753627] R13: ffffffffa374c120 R14: ffffffffa374c208 R15:
> ffffffffa374c1f0
> [  454.753630]  cpuidle_enter+0x2e/0x40
> [  454.753632]  call_cpuidle+0x23/0x40
> [  454.753633]  do_idle+0x230/0x270
> [  454.753634]  cpu_startup_entry+0x1d/0x20
> [  454.753636]  start_secondary+0x170/0x1c0
> [  454.753637]  secondary_startup_64+0xb6/0xc0
> [  484.934587] nvme nvme0: I/O 960 QID 26 timeout, aborting
> [  484.934641] nvme nvme0: Abort status: 0x0
> [  492.613861] INFO: task kworker/u114:2:1182 blocked for more than 245
> seconds.
> [  492.613884]       Not tainted 5.7.0-rc7+ #2
> [  492.613912] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [  492.613948] kworker/u114:2  D    0  1182      2 0x80004000
> [  492.613953] Workqueue: nvme-wq nvme_scan_work [nvme_core]
> [  492.613954] Call Trace:
> [  492.613956]  __schedule+0x2dc/0x710
> [  492.613958]  ? __kfifo_to_user_r+0x90/0x90
> [  492.613960]  schedule+0x44/0xb0
> [  492.613961]  blk_mq_freeze_queue_wait+0x4b/0xb0
> [  492.613962]  ? finish_wait+0x80/0x80
> [  492.613963]  blk_mq_freeze_queue+0x1a/0x20
> [  492.613966]  nvme_update_disk_info+0x62/0x3b0 [nvme_core]
> [  492.613968]  __nvme_revalidate_disk+0x8d/0x140 [nvme_core]
> [  492.613970]  nvme_revalidate_disk+0xa4/0x140 [nvme_core]
> [  492.613971]  ? blk_mq_run_hw_queue+0xba/0x100
> [  492.613973]  revalidate_disk+0x2b/0xa0
> [  492.613975]  nvme_validate_ns+0x46/0x5b0 [nvme_core]
> [  492.613978]  ? __nvme_submit_sync_cmd+0xe0/0x1b0 [nvme_core]
> [  492.613980]  nvme_scan_work+0x25a/0x310 [nvme_core]
> [  492.613983]  process_one_work+0x1ab/0x380
> [  492.613984]  worker_thread+0x37/0x3b0
> [  492.613985]  kthread+0x120/0x140
> [  492.613990]  ? create_worker+0x1b0/0x1b0
> [  492.613992]  ? kthread_park+0x90/0x90
> [  492.614011]  ret_from_fork+0x35/0x40
> [  515.139879] nvme nvme0: I/O 960 QID 26 timeout, reset controller
> [  515.150634] nvme nvme0: new error during reset

It looks like your kernel has the '[PATCH 0/3] blk-mq/nvme: improve
nvme-pci reset handler' patchset applied, and it seems to work well.

> [  515.158310] blk_update_request: I/O error, dev nvme0n1, sector 0 op
> 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
> [  515.158346] Buffer I/O error on dev nvme0n1, logical block 0, async
> page read
> [  515.160800] nvme nvme0: Shutdown timeout set to 10 seconds
> [  515.163954] nvme nvme0: 56/0/0 default/read/poll queues

The controller is reset back to a normal state even though a timeout
is triggered during the reset.

The hang in the scan work context just means that blk_mq_freeze_queue()
waits too long (245 seconds) because the I/O request times out and is
retried 5 times; that cannot be avoided if you fail every I/O request
1000 times.
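
For reference, the quoted log shows the same request (I/O 960 on QID
26) timing out at roughly 30-second intervals, so the cumulative wait
can be read straight off the timestamps (assuming the default
30-second io_timeout; everything below comes from your log):

    394s  I/O 960 QID 26 timeout -> reset controller
    424s  I/O 960 QID 26 timeout -> abort
    454s  I/O 960 QID 26 timeout -> reset controller
    484s  I/O 960 QID 26 timeout -> abort
    515s  I/O 960 QID 26 timeout -> reset controller

That is already ~120 seconds across four back-to-back cycles for a
single request, and with roughly 1000 injected failures still pending,
the freeze wait in the scan work keeps growing past the 245-second
hung-task window reported at 492s.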

As I suggested in another thread, if you stop failing I/O requests
after injecting failures for some time, the scan work will run to
completion and everything will be fine. That said, NVMe's error
handling does become better after applying the patches of '[PATCH 0/3]
blk-mq/nvme: improve nvme-pci reset handler'.
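
Concretely, something like the following bounds the injection so the
retried I/O can eventually complete (the knobs are the standard ones
from Documentation/fault-injection/fault-injection.rst; the debugfs
mount point and the nvme0n1 device name are assumptions for
illustration, and your log shows 'times' counting down from about
1000):

    # enable timeout injection for the test disk (assumed nvme0n1)
    echo 1 > /sys/block/nvme0n1/io-timeout-fail

    # fail every eligible request, but at most 10 times in total;
    # once 'times' reaches 0 no further timeouts are faked, the
    # pending retries complete, and blk_mq_freeze_queue() can return
    echo 100 > /sys/kernel/debug/fail_io_timeout/probability
    echo 1 > /sys/kernel/debug/fail_io_timeout/interval
    echo 10 > /sys/kernel/debug/fail_io_timeout/times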

Thanks,
Ming Lei


