Hello -

I have an HP DL380 Gen 9 with a RAID5 array built from 6 INTEL SSDPE2MX020T4 devices. That RAID device makes up a volume group with a couple of logical volumes with XFS filesystems backing VM storage.

Twice now in two months the RAID array has become mostly unresponsive:

May 08 03:33:21 host kernel: INFO: task worker:1798511 blocked for more than 120 seconds.
May 08 03:33:21 host kernel: Not tainted 4.18.0-348.23.1.el8_5.x86_64 #1
May 08 03:33:21 host kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 08 03:33:21 host kernel: task:worker state:D stack: 0 pid:1798511 ppid: 1 flags:0x000043a0
May 08 03:33:21 host kernel: Call Trace:
May 08 03:33:21 host kernel: __schedule+0x2bd/0x760
May 08 03:33:21 host kernel: ? finish_wait+0x80/0x80
May 08 03:33:21 host kernel: schedule+0x37/0xa0
May 08 03:33:21 host kernel: md_bitmap_startwrite+0x16f/0x1e0
May 08 03:33:21 host kernel: ? finish_wait+0x80/0x80
May 08 03:33:21 host kernel: add_stripe_bio+0x4a3/0x7c0 [raid456]
May 08 03:33:21 host kernel: raid5_make_request+0x1bf/0xb60 [raid456]
May 08 03:33:21 host kernel: ? finish_wait+0x80/0x80
May 08 03:33:21 host kernel: ? blk_queue_split+0xd4/0x660
May 08 03:33:21 host kernel: ? finish_wait+0x80/0x80
May 08 03:33:21 host kernel: md_handle_request+0x119/0x190
May 08 03:33:21 host kernel: md_make_request+0x84/0x160
May 08 03:33:21 host kernel: generic_make_request+0x25b/0x350
May 08 03:33:21 host kernel: submit_bio+0x3c/0x160
May 08 03:33:21 host kernel: iomap_submit_ioend.isra.38+0x4a/0x70
May 08 03:33:21 host kernel: iomap_writepage_map+0x422/0x670
May 08 03:33:21 host kernel: write_cache_pages+0x197/0x420
May 08 03:33:21 host kernel: ? iomap_invalidatepage+0xe0/0xe0
May 08 03:33:21 host kernel: iomap_writepages+0x1c/0x40
May 08 03:33:21 host kernel: xfs_vm_writepages+0x64/0x90 [xfs]
May 08 03:33:21 host kernel: do_writepages+0x41/0xd0
May 08 03:33:21 host kernel: __filemap_fdatawrite_range+0xcb/0x100
May 08 03:33:21 host kernel: file_write_and_wait_range+0x4c/0xa0
May 08 03:33:21 host kernel: xfs_file_fsync+0x69/0x200 [xfs]
May 08 03:33:21 host kernel: do_fsync+0x38/0x70
May 08 03:33:21 host kernel: __x64_sys_fdatasync+0x13/0x20
May 08 03:33:21 host kernel: do_syscall_64+0x5b/0x1a0
May 08 03:33:21 host kernel: entry_SYSCALL_64_after_hwframe+0x65/0xca
May 08 03:33:21 host kernel: RIP: 0033:0x7f969efb858f
May 08 03:33:21 host kernel: Code: Unable to access opcode bytes at RIP 0x7f969efb8565.
May 08 03:33:21 host kernel: RSP: 002b:00007f94b3ffe6b0 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
May 08 03:33:21 host kernel: RAX: ffffffffffffffda RBX: 000000000000000e RCX: 00007f969efb858f
May 08 03:33:21 host kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000000000000000e
May 08 03:33:21 host kernel: RBP: 0000563f940b5b20 R08: 0000000000000000 R09: 0000000032f01b0c
May 08 03:33:21 host kernel: R10: 0000000e171e5000 R11: 0000000000000293 R12: 0000563f92a73bb4
May 08 03:33:21 host kernel: R13: 0000563f940b5b88 R14: 0000563f94097eb0 R15: 00007f94b3ffe800
May 08 03:33:21 host kernel: INFO: task worker:1799573 blocked for more than 120 seconds.
May 08 03:33:21 host kernel: Not tainted 4.18.0-348.23.1.el8_5.x86_64 #1
May 08 03:33:21 host kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 08 03:33:21 host kernel: task:worker state:D stack: 0 pid:1799573 ppid: 1 flags:0x000043a0
May 08 03:33:21 host kernel: Call Trace:
May 08 03:33:21 host kernel: __schedule+0x2bd/0x760
May 08 03:33:21 host kernel: schedule+0x37/0xa0
May 08 03:33:21 host kernel: io_schedule+0x12/0x40
May 08 03:33:21 host kernel: wait_on_page_bit+0x137/0x230
May 08 03:33:21 host kernel: ? file_fdatawait_range+0x20/0x20
May 08 03:33:21 host kernel: __filemap_fdatawait_range+0x88/0xe0
May 08 03:33:21 host kernel: file_write_and_wait_range+0x76/0xa0
May 08 03:33:21 host kernel: xfs_file_fsync+0x69/0x200 [xfs]
May 08 03:33:21 host kernel: do_fsync+0x38/0x70
May 08 03:33:21 host kernel: __x64_sys_fdatasync+0x13/0x20
May 08 03:33:21 host kernel: do_syscall_64+0x5b/0x1a0
May 08 03:33:21 host kernel: entry_SYSCALL_64_after_hwframe+0x65/0xca
May 08 03:33:21 host kernel: RIP: 0033:0x7f20c514c58f
May 08 03:33:21 host kernel: Code: Unable to access opcode bytes at RIP 0x7f20c514c565.
May 08 03:33:21 host kernel: RSP: 002b:00007f1ef4ff86b0 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
May 08 03:33:21 host kernel: RAX: ffffffffffffffda RBX: 000000000000001b RCX: 00007f20c514c58f
May 08 03:33:21 host kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000000000000001b
May 08 03:33:21 host kernel: RBP: 00005594bed1f120 R08: 0000000000000000 R09: 00000000ffffffff
May 08 03:33:21 host kernel: R10: 00007f1ef4ff86a0 R11: 0000000000000293 R12: 00005594bd72ebb4
May 08 03:33:21 host kernel: R13: 00005594bed1f188 R14: 00005594bed31c30 R15: 00007f1ef4ff8800
May 08 03:33:21 host kernel: INFO: task worker:871154 blocked for more than 120 seconds.
May 08 03:33:21 host kernel: Not tainted 4.18.0-348.23.1.el8_5.x86_64 #1
May 08 03:33:21 host kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 08 03:33:21 host kernel: task:worker state:D stack: 0 pid:871154 ppid: 1 flags:0x000043a0
May 08 03:33:21 host kernel: Call Trace:
May 08 03:33:21 host kernel: __schedule+0x2bd/0x760
May 08 03:33:21 host kernel: schedule+0x37/0xa0
May 08 03:33:21 host kernel: io_schedule+0x12/0x40
May 08 03:33:21 host kernel: wait_on_page_bit+0x137/0x230
May 08 03:33:21 host kernel: ? file_fdatawait_range+0x20/0x20
May 08 03:33:21 host kernel: __filemap_fdatawait_range+0x88/0xe0
May 08 03:33:21 host kernel: file_write_and_wait_range+0x76/0xa0
May 08 03:33:21 host kernel: xfs_file_fsync+0x69/0x200 [xfs]
May 08 03:33:21 host kernel: do_fsync+0x38/0x70
May 08 03:33:21 host kernel: __x64_sys_fdatasync+0x13/0x20
May 08 03:33:21 host kernel: do_syscall_64+0x5b/0x1a0
May 08 03:33:21 host kernel: entry_SYSCALL_64_after_hwframe+0x65/0xca
May 08 03:33:21 host kernel: RIP: 0033:0x7f13d27fd58f
May 08 03:33:21 host kernel: Code: Unable to access opcode bytes at RIP 0x7f13d27fd565.
May 08 03:33:21 host kernel: RSP: 002b:00007f0f697f96b0 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
May 08 03:33:21 host kernel: RAX: ffffffffffffffda RBX: 000000000000000e RCX: 00007f13d27fd58f
May 08 03:33:21 host kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000000000000000e
May 08 03:33:21 host kernel: RBP: 00005594f48b9010 R08: 0000000000000000 R09: 00000000ffffffff
May 08 03:33:21 host kernel: R10: 00007f0f697f96a0 R11: 0000000000000293 R12: 00005594f2222bb4
May 08 03:33:21 host kernel: R13: 00005594f48b9078 R14: 00005594f4e8ee50 R15: 00007f0f697f9800
May 08 03:33:21 host kernel: INFO: task kworker/u97:2:1790841 blocked for more than 120 seconds.
May 08 03:33:21 host kernel: Not tainted 4.18.0-348.23.1.el8_5.x86_64 #1
May 08 03:33:21 host kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 08 03:33:21 host kernel: task:kworker/u97:2 state:D stack: 0 pid:1790841 ppid: 2 flags:0x80004080
May 08 03:33:21 host kernel: Workqueue: writeback wb_workfn (flush-253:3)
May 08 03:33:21 host kernel: Call Trace:
May 08 03:33:21 host kernel: __schedule+0x2bd/0x760
May 08 03:33:21 host kernel: ? blk_flush_plug_list+0xc2/0x100
May 08 03:33:21 host kernel: ? finish_wait+0x80/0x80
May 08 03:33:21 host kernel: schedule+0x37/0xa0
May 08 03:33:21 host kernel: md_bitmap_startwrite+0x16f/0x1e0
May 08 03:33:21 host kernel: ? finish_wait+0x80/0x80
May 08 03:33:21 host kernel: add_stripe_bio+0x4a3/0x7c0 [raid456]
May 08 03:33:21 host kernel: raid5_make_request+0x1bf/0xb60 [raid456]
May 08 03:33:21 host kernel: ? finish_wait+0x80/0x80
May 08 03:33:21 host kernel: ? blk_queue_split+0xd4/0x660
May 08 03:33:21 host kernel: ? finish_wait+0x80/0x80
May 08 03:33:21 host kernel: md_handle_request+0x119/0x190
May 08 03:33:21 host kernel: md_make_request+0x84/0x160
May 08 03:33:21 host kernel: generic_make_request+0x25b/0x350
May 08 03:33:21 host kernel: submit_bio+0x3c/0x160
May 08 03:33:21 host kernel: iomap_submit_ioend.isra.38+0x4a/0x70
May 08 03:33:21 host kernel: iomap_writepage_map+0x422/0x670
May 08 03:33:21 host kernel: write_cache_pages+0x197/0x420
May 08 03:33:21 host kernel: ? iomap_invalidatepage+0xe0/0xe0
May 08 03:33:21 host kernel: iomap_writepages+0x1c/0x40
May 08 03:33:21 host kernel: xfs_vm_writepages+0x64/0x90 [xfs]
May 08 03:33:21 host kernel: do_writepages+0x41/0xd0
May 08 03:33:21 host kernel: __writeback_single_inode+0x39/0x2f0
May 08 03:33:21 host kernel: writeback_sb_inodes+0x1e6/0x450
May 08 03:33:21 host kernel: __writeback_inodes_wb+0x5f/0xc0
May 08 03:33:21 host kernel: wb_writeback+0x25b/0x2f0
May 08 03:33:21 host kernel: wb_workfn+0x344/0x4c0
May 08 03:33:21 host kernel: ? __switch_to_asm+0x35/0x70
May 08 03:33:21 host kernel: ? __switch_to_asm+0x41/0x70
May 08 03:33:21 host kernel: ? __switch_to_asm+0x35/0x70
May 08 03:33:21 host kernel: ? __switch_to_asm+0x41/0x70
May 08 03:33:21 host kernel: ? __switch_to_asm+0x35/0x70
May 08 03:33:21 host kernel: ? __switch_to_asm+0x41/0x70
May 08 03:33:21 host kernel: ? __switch_to_asm+0x35/0x70
May 08 03:33:21 host kernel: ? __switch_to_asm+0x41/0x70
May 08 03:33:21 host kernel: process_one_work+0x1a7/0x360
May 08 03:33:21 host kernel: worker_thread+0x30/0x390
May 08 03:33:21 host kernel: ? create_worker+0x1a0/0x1a0
May 08 03:33:21 host kernel: kthread+0x116/0x130
May 08 03:33:21 host kernel: ? kthread_flush_work_fn+0x10/0x10
May 08 03:33:21 host kernel: ret_from_fork+0x35/0x40

I have another nearly identical system that has run without trouble, though not with as much IO load as this one.

Is there anything else I can check to see if there is a hardware issue or if this might be an issue with the Linux RAID system? Is there a better place to ask for help?

Thank you.

-- 
Orion Poplawski
IT Systems Manager                         720-772-5637
NWRA, Boulder/CoRA Office             FAX: 303-415-9702
3380 Mitchell Lane                       orion@xxxxxxxx
Boulder, CO 80301                 https://www.nwra.com/