Re: Tasks blocking forever with XFS stack traces

Carlos Maiolino <cmaiolino@xxxxxxxxxx> · Tue, 5 Nov 2019 09:54:46 +0100

Hi.

On Tue, Nov 05, 2019 at 07:27:16AM +0000, Sitsofe Wheeler wrote:
> Hi,
> 
> We have a system that has been seeing tasks with XFS calls in their
> stacks. Once these tasks start hanging with uninterruptible sleep any
> write I/O to the directory they were doing I/O to will also hang
> forever. The I/O they are doing is being done to a bind mounted directory
> atop an XFS filesystem on top of an MD device (the MD device seems to be
> still functional and isn't offline). The kernel is fairly old but I
> thought I'd post a stack in case anyone can describe this or has seen
> it before:
> 
> kernel: [425684.110424] INFO: task kworker/u162:0:58843 blocked for more than 120 seconds.
> kernel: [425684.110800]       Tainted: G           OE 4.15.0-64-generic #73-Ubuntu
> kernel: [425684.111164] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> kernel: [425684.111568] kworker/u162:0  D    0 58843      2 0x80000080
> kernel: [425684.111581] Workqueue: writeback wb_workfn (flush-9:126)
> kernel: [425684.111585] Call Trace:
> kernel: [425684.111595]  __schedule+0x24e/0x880
> kernel: [425684.111664]  ? xfs_map_blocks+0x82/0x250 [xfs]
> kernel: [425684.111668]  schedule+0x2c/0x80
> kernel: [425684.111671]  rwsem_down_read_failed+0xf0/0x160
> kernel: [425684.111675]  ? bitmap_startwrite+0x9f/0x1f0
> kernel: [425684.111679]  call_rwsem_down_read_failed+0x18/0x30
> kernel: [425684.111682]  ? call_rwsem_down_read_failed+0x18/0x30
> kernel: [425684.111685]  down_read+0x20/0x40
> kernel: [425684.111736]  xfs_ilock+0xd5/0x100 [xfs]
> kernel: [425684.111782]  xfs_map_blocks+0x82/0x250 [xfs]
> kernel: [425684.111823]  xfs_do_writepage+0x167/0x6a0 [xfs]
> kernel: [425684.111830]  ? clear_page_dirty_for_io+0x19f/0x1f0
> kernel: [425684.111834]  write_cache_pages+0x207/0x4e0
> kernel: [425684.111869]  ? xfs_vm_writepages+0xf0/0xf0 [xfs]
> kernel: [425684.111875]  ? submit_bio+0x73/0x140
> kernel: [425684.111878]  ? submit_bio+0x73/0x140
> kernel: [425684.111911]  ? xfs_setfilesize_trans_alloc.isra.13+0x3e/0x90 [xfs]
> kernel: [425684.111944]  xfs_vm_writepages+0xbe/0xf0 [xfs]
> kernel: [425684.111949]  do_writepages+0x4b/0xe0
> kernel: [425684.111954]  ? fprop_fraction_percpu+0x2f/0x80
> kernel: [425684.111958]  ? __wb_calc_thresh+0x3e/0x130
> kernel: [425684.111963]  __writeback_single_inode+0x45/0x350
> kernel: [425684.111966]  ? __writeback_single_inode+0x45/0x350
> kernel: [425684.111970]  writeback_sb_inodes+0x1e1/0x510
> kernel: [425684.111975]  __writeback_inodes_wb+0x67/0xb0
> kernel: [425684.111979]  wb_writeback+0x271/0x300
> kernel: [425684.111983]  wb_workfn+0x1bb/0x400
> kernel: [425684.111986]  ? wb_workfn+0x1bb/0x400
> kernel: [425684.111992]  process_one_work+0x1de/0x420
> kernel: [425684.111996]  worker_thread+0x32/0x410
> kernel: [425684.111999]  kthread+0x121/0x140
> kernel: [425684.112003]  ? process_one_work+0x420/0x420
> kernel: [425684.112005]  ? kthread_create_worker_on_cpu+0x70/0x70
> kernel: [425684.112009]  ret_from_fork+0x35/0x40
> kernel: [425684.112024] INFO: task kworker/74:0:9623 blocked for more than 120 seconds.
> kernel: [425684.112461]       Tainted: G           OE 4.15.0-64-generic #73-Ubuntu
> kernel: [425684.112925] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> kernel: [425684.113438] kworker/74:0    D    0  9623      2 0x80000080
> kernel: [425684.113500] Workqueue: xfs-cil/md126 xlog_cil_push_work [xfs]
> kernel: [425684.113502] Call Trace:
> kernel: [425684.113508]  __schedule+0x24e/0x880
> kernel: [425684.113559]  ? xlog_bdstrat+0x2b/0x60 [xfs]
> kernel: [425684.113564]  schedule+0x2c/0x80
> kernel: [425684.113609]  xlog_state_get_iclog_space+0x105/0x2d0 [xfs]
> kernel: [425684.113614]  ? wake_up_q+0x80/0x80
> kernel: [425684.113656]  xlog_write+0x163/0x6e0 [xfs]
> kernel: [425684.113699]  xlog_cil_push+0x2a7/0x410 [xfs]
> kernel: [425684.113740]  xlog_cil_push_work+0x15/0x20 [xfs]
> kernel: [425684.113743]  process_one_work+0x1de/0x420
> kernel: [425684.113747]  worker_thread+0x32/0x410
> kernel: [425684.113750]  kthread+0x121/0x140
> kernel: [425684.113753]  ? process_one_work+0x420/0x420
> kernel: [425684.113756]  ? kthread_create_worker_on_cpu+0x70/0x70
> kernel: [425684.113759]  ret_from_fork+0x35/0x40
> 
> Other directories on the same filesystem seem fine as do other XFS
> filesystems on the same system.

Since you mention that other directories seem to work, and given the first
stack trace you posted, it sounds like you've been keeping a single AG so busy
that it is almost unusable. But you didn't provide enough information for us to
make any real progress here, and to be honest I'm more inclined to point the
finger at your MD device.

Can you describe your MD device? Is it a RAID array? What kind? How many disks?
What's your filesystem configuration? (xfs_info <mount point>)
Do you have anything else in your dmesg other than these two stack traces? I'd
suggest posting the whole dmesg, not only what you think is relevant.
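
If it helps, the questions above could be answered in one go with something
like the rough sketch below. It is only an illustration: the function names are
made up for this example, the mount point is a placeholder, and /dev/md126 is a
guess based on the "xfs-cil/md126" workqueue name in your trace, so substitute
your real values. Most of these commands need root.

```python
import shutil
import subprocess


def report_commands(mount_point, md_device):
    """Commands whose output answers the questions asked above.

    mount_point and md_device are supplied by the caller; nothing here
    is detected automatically.
    """
    return [
        ["cat", "/proc/mdstat"],            # MD arrays, RAID level, member disks
        ["mdadm", "--detail", md_device],   # detailed state of one array
        ["xfs_info", mount_point],          # filesystem geometry (AG count etc.)
        ["dmesg"],                          # the whole kernel log, not an excerpt
    ]


def run_or_note(cmd):
    """Run one command and return its output under a '$ cmd' header.

    A missing tool is noted in the output instead of raising an error.
    """
    header = "$ " + " ".join(cmd)
    if shutil.which(cmd[0]) is None:
        return header + "\n(command not found: " + cmd[0] + ")"
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return header + "\n" + proc.stdout + proc.stderr


def collect(mount_point, md_device):
    """Concatenate the output of every command into one report string."""
    parts = [run_or_note(cmd) for cmd in report_commands(mount_point, md_device)]
    return "\n\n".join(parts)
```

For example, print(collect("/path/to/mountpoint", "/dev/md126")) run as root
would give a single block of text that can be attached to the reply.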

Better yet:

http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F

Cheers.

> 
> -- 
> Sitsofe | http://sucs.org/~sits/

-- 
Carlos

P.S. I'm removing Darrick and linux-fsdevel from CC to avoid spamming too many
people.