Hi, good morning!

We see an IO stall on the backing disk sdh when it hangs: literally no IO, apart from a very few, per this pattern in the diskstats data captured by collectl:

alslater@HPE-W5P7CGPQYL collectl % grep 21354078 sdhi.out | sed 's/.*disk//' | wc -l
1003
alslater@HPE-W5P7CGPQYL collectl % grep 21354078 sdhi.out | sed 's/.*disk//' | uniq -c
   1 8 112 sdh 21354078 11338 20953907501 1972079123 18657407 324050 16530008823 580990600 0 17845212 2553245350
   1 8 112 sdh 21354078 11338 20953907501 1972079123 18657429 324051 16530009041 580990691 0 17845254 2553245441
   1 8 112 sdh 21354078 11338 20953907501 1972079123 18657431 324051 16530009044 580990691 0 17845254 2553245441
1000 8 112 sdh 21354078 11338 20953907501 1972079123 18657433 324051 16530009047 580990691 0 17845254 2553245441

These are the 3.10-era /proc/diskstats fields: read metrics first, then writes. The read columns are frozen across the whole window (hence the grep on 21354078, the reads-completed count), and only the write columns advance, and only /very/ slightly. (A small watcher sketch for these columns is at the end of this mail.)

There is also a spike in tasks sleeping on log space, concurrent with the failure. Earlier backtraces showed hung tasks in xlog_grant_head_check, but currently, with the IO scheduler switched to noop (from deadline, which was our default) and xfssyncd dialled down to 10s, we get:

bc3: /proc/25146 xfsaild/sdh
[<ffffffffc11aa9f7>] xfs_buf_iowait+0x27/0xc0 [xfs]
[<ffffffffc11ac320>] __xfs_buf_submit+0x130/0x250 [xfs]
[<ffffffffc11ac465>] _xfs_buf_read+0x25/0x30 [xfs]
[<ffffffffc11ac569>] xfs_buf_read_map+0xf9/0x160 [xfs]
[<ffffffffc11de299>] xfs_trans_read_buf_map+0xf9/0x2d0 [xfs]
[<ffffffffc119fe9e>] xfs_imap_to_bp+0x6e/0xe0 [xfs]
[<ffffffffc11c265a>] xfs_iflush+0xda/0x250 [xfs]
[<ffffffffc11d4f16>] xfs_inode_item_push+0x156/0x1a0 [xfs]
[<ffffffffc11dd1ef>] xfsaild+0x38f/0x780 [xfs]
[<ffffffff956c32b1>] kthread+0xd1/0xe0
[<ffffffff95d801dd>] ret_from_fork_nospec_begin+0x7/0x21
[<ffffffffffffffff>] 0xffffffffffffffff

bbm: /proc/22022 xfsaild/sdh
[<ffffffffc12d09f7>] xfs_buf_iowait+0x27/0xc0 [xfs]
[<ffffffffc12d2320>] __xfs_buf_submit+0x130/0x250 [xfs]
[<ffffffffc12d2465>] _xfs_buf_read+0x25/0x30 [xfs]
[<ffffffffc12d2569>] xfs_buf_read_map+0xf9/0x160 [xfs]
[<ffffffffc1304299>] xfs_trans_read_buf_map+0xf9/0x2d0 [xfs]
[<ffffffffc12c5e9e>] xfs_imap_to_bp+0x6e/0xe0 [xfs]
[<ffffffffc12e865a>] xfs_iflush+0xda/0x250 [xfs]
[<ffffffffc12faf16>] xfs_inode_item_push+0x156/0x1a0 [xfs]
[<ffffffffc13031ef>] xfsaild+0x38f/0x780 [xfs]
[<ffffffffbe4c32b1>] kthread+0xd1/0xe0
[<ffffffffbeb801dd>] ret_from_fork_nospec_begin+0x7/0x21
[<ffffffffffffffff>] 0xffffffffffffffff

... along with cofc threads in the ISR waiting for data. (A loop for capturing stacks like these is also sketched at the end.)

What that doesn't tell us yet is which of these is the symptom and which is the cause. It might be missing or lost interrupt handling, it might be the XFS log not being pushed out hard enough, or it might be a combination of timing effects.

If there is any patch that addresses this issue, please let me know. Any pointers for further debugging would also help.

Thanks,
Priya
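
P.S. In case it helps anyone reproduce the observation above, here is a minimal watcher sketch for the stalled columns. This is an illustration, not our actual tooling; it assumes the 3.10-era /proc/diskstats layout described above, and the device name and 1s interval are placeholders:

    # Sketch: print the per-second delta of writes-completed plus the
    # in-flight count for one device; during the stall the delta sits
    # at (or near) zero.
    # 3.10-era /proc/diskstats fields after "major minor name":
    #   reads-completed reads-merged sectors-read ms-reading
    #   writes-completed writes-merged sectors-written ms-writing
    #   ios-in-flight ms-doing-io weighted-ms-doing-io
    dev=sdh      # device under test (placeholder)
    prev=0
    while sleep 1; do
        # $8 = writes completed, $12 = IOs currently in flight
        set -- $(awk -v d="$dev" '$3 == d { print $8, $12 }' /proc/diskstats)
        printf '%s writes_delta=%d in_flight=%s\n' "$(date +%T)" "$(($1 - prev))" "$2"
        prev=$1      # first line's delta is bogus since prev starts at 0
    done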
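
P.P.S. The backtraces above look like /proc/<pid>/stack output; a capture loop along these lines (again just a sketch, needs root and CONFIG_STACKTRACE) dumps the kernel stack of every task currently in uninterruptible sleep when the hang hits:

    # Sketch: snapshot the kernel stack of every D-state task.
    for pid in $(ps -eo pid=,stat= | awk '$2 ~ /^D/ { print $1 }'); do
        echo "== pid $pid ($(cat /proc/$pid/comm 2>/dev/null)) =="
        cat /proc/$pid/stack 2>/dev/null
    done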