Thanks for these in-depth explanations. I understood my mistake:
"xfs_reclaim_inodes_nr" ensures the background job on queue #0 is
scheduled, but it also runs a synchronous reclaim, and that synchronous
pass is what gets blocked by the I/O. For some reason I focused on the
queued job and completely forgot the synchronous part.

Hence the only way to fix this hang is to fix flashcache to use a
dedicated workqueue with a rescuer thread, which has nothing to do
with XFS.
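Concretely, the change I have in mind on the flashcache side would look
roughly like the sketch below (fc_io_wq, fc_init_wq and
fc_queue_completion are illustrative names, not actual flashcache code):
queue the I/O completion work on a dedicated workqueue created with
WQ_MEM_RECLAIM, which gives the queue its own rescuer thread so it can
keep making progress even when kthreadd cannot fork new kworkers.

#include <linux/workqueue.h>

/* Illustrative only -- not actual flashcache code. */
static struct workqueue_struct *fc_io_wq;

static int fc_init_wq(void)
{
        /*
         * WQ_MEM_RECLAIM guarantees this queue a dedicated rescuer
         * thread, so queued work can still run when the system is too
         * short on memory to fork new kworkers.
         */
        fc_io_wq = alloc_workqueue("flashcache_io", WQ_MEM_RECLAIM, 0);
        if (!fc_io_wq)
                return -ENOMEM;
        return 0;
}

static void fc_queue_completion(struct work_struct *work)
{
        /* Queue I/O completion work here instead of on system_wq. */
        queue_work(fc_io_wq, work);
}

If I read your explanation below correctly, WQ_MEM_RECLAIM is also the
flag commit 7a29ac474 adds to XFS's m_log_workqueue.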
Regards,

On 02/24/2016 10:30 PM, Dave Chinner wrote:
> On Wed, Feb 24, 2016 at 06:27:29PM +0100, Jean-Tiare Le Bigot wrote:
>> Hi,
>>
>> Thanks for having a look.
>>
>> On 02/23/2016 11:45 PM, Dave Chinner wrote:
>>> On Tue, Feb 23, 2016 at 05:13:35PM +0100, Jean-Tiare Le Bigot wrote:
>>>> Hi,
>>>>
>>>> We've hit a kernel hang related to XFS reclaim under heavy I/O load
>>>> on a couple of storage servers using XFS over flashcache on a
>>>> 3.13.y kernel.
>>>>
>>>> On the crash dumps, kthreadd is blocked, waiting for XFS to reclaim
>>>> some memory, but the related reclaim job is queued on a worker_pool
>>>> stuck waiting for some I/O, itself depending on other jobs on other
>>>> queues which would require additional threads to make progress.
>>>> Unfortunately, kthreadd is blocked.
>>>> The host has plenty of memory (~128GB), about 80% of which is used
>>>> for the page cache.
>>>>
>>>> It looks like this is fixed by commit
>>>> 7a29ac474a47eb8cf212b45917683ae89d6fa13b.
>>>
>>> That commit fixed a regression introduced, IIRC, in 3.19. The
>>> problem it fixed didn't exist before then, so I doubt you are seeing
>>> the problem that the above commit fixed. Perhaps you'd like to
>>> describe your problem along with the stack traces, etc., so we have
>>> some idea of what you are trying to fix?
>>
>> On the dump we investigated, kthreadd is stuck, waiting for some memory:
>>
>> crash> bt 2
>> PID: 2      TASK: ffff881fd2a39800  CPU: 0   COMMAND: "kthreadd"
>>  #0 [ffff881fd2ab37c8] __schedule at ffffffff81724e19
>>  #1 [ffff881fd2ab3830] io_schedule at ffffffff817255fd
>>  #2 [ffff881fd2ab3848] __xfs_iunpin_wait at ffffffffa06cf269 [xfs]
>>  #3 [ffff881fd2ab38c0] xfs_iunpin_wait at ffffffffa06d2629 [xfs]
>>  #4 [ffff881fd2ab38d0] xfs_reclaim_inode at ffffffffa068ed4c [xfs]
>>  #5 [ffff881fd2ab3910] xfs_reclaim_inodes_ag at ffffffffa068f267 [xfs]
>>  #6 [ffff881fd2ab3aa0] xfs_reclaim_inodes_nr at ffffffffa068fd73 [xfs]
>>  #7 [ffff881fd2ab3ac0] xfs_fs_free_cached_objects at ffffffffa069a3a5 [xfs]
>>  #8 [ffff881fd2ab3ad0] super_cache_scan at ffffffff811c13e9
>>  #9 [ffff881fd2ab3b18] shrink_slab at ffffffff81160f27
>> #10 [ffff881fd2ab3bc0] do_try_to_free_pages at ffffffff8116405d
>> #11 [ffff881fd2ab3c38] try_to_free_pages at ffffffff8116429c
>> #12 [ffff881fd2ab3cd8] __alloc_pages_nodemask at ffffffff81158f15
>> #13 [ffff881fd2ab3e10] copy_process at ffffffff810652b3
>> #14 [ffff881fd2ab3e90] do_fork at ffffffff810669f5
>> #15 [ffff881fd2ab3ef8] kernel_thread at ffffffff81066c86
>> #16 [ffff881fd2ab3f08] kthreadd at ffffffff8108beea
>> #17 [ffff881fd2ab3f50] ret_from_fork at ffffffff817318bc
>
> So, it's waiting for a log force to complete because it needs to
> reclaim a dirty inode...
>
>> This triggered the insertion of an xfs-reclaim job for device dm-46
>> on the worker_pool bound to cpu #0.
>>
>> This xfs_reclaim job is the first *pending* job on the worker_pool.
>> The worker_pool has 2 active workers stuck in "xfs_log_worker", both
>> blocked in "xlog_state_get_iclog_space". My guess is they are waiting
>> for some underlying flashcache I/O (which won't happen, see below).
>>
>> --> busy_hash aka busy workers:
>> crash> struct worker_pool.busy_hash 0xffff88207fc12d40 | grep -o 0xf.*
>>   0xffff8806dd3b5780 --> 132626
>>       --> in xfs_log_worker
>>       --> in xlog_state_get_iclog_space
>>       --> dm-40
>>   0xffff88142d06ae80 --> 129142
>>       --> in xfs_log_worker
>>       --> in xlog_state_get_iclog_space
>>       --> dm-63
>
> These are both waiting on IO completion of log IO. And the IO
> completion is blocked behind metadata buffer completion that is
> blocked waiting on locks. Yes, this can happen when we are running
> low on memory, but it can also occur when we hit maximum workqueue
> concurrency limits. The problem is not solved by adding a rescuer
> thread to the workqueue because it ends up blocking, too.
>
>> --> pending jobs:
>> crash> list -o work_struct.entry -s work_struct.func -H ffff88207fc12d58 -x \
>>   | awk 'BEGIN{w="";c=0} {if($1=="func") printf("%2d %s --> %s\n", c, w, $4); else {w=$1;c++}}'
>>    WORK                 FUNC                  DEVICE
>>  1 ffff881fceda6ca8 --> <xfs_reclaim_worker>  dm-46
>>  2 ffff881fcd51bc28 --> <xfs_log_worker>      dm-46
>>  3 ffff88207fc0f3a0 --> <vmstat_update>       N/A
>> ...
>>
>> To make progress, this queue needs either:
>> - the I/O to complete
>> - a new thread to handle the xfs_reclaim
>>   (which does not trigger I/O in this context, if I got the code right)
>
> The reclaim worker runs background inode reclaim - it doesn't
> guarantee forwards progress. Running it won't make the log IO
> completion the kthreadd is waiting on complete any faster, because
> that's not what is preventing IO completion from running.
>
>> The I/O is stuck, so we need a new thread. The pool manager requested
>> this thread BUT the request is pending in the kthreadd queue. It is
>> #5 in line.
>
> Actually, we needed a different workqueue that is marked as a high
> priority workqueue so log IO completion doesn't get held up by
> metadata buffer IO completion waiting on locks and stalling. This
> is the problem that commit b29c70f59870 ("xfs: split metadata and
> log buffer completion to separate workqueues") addressed in 3.18.
>
> Remember how I said commit 7a29ac474 ("xfs: give all workqueues
> rescuer threads") fixed a regression we'd introduced, not the
> problem you are reporting? Yeah, in commit b29c70f59870 I forgot to
> add WQ_MEM_RECLAIM to the m_log_workqueue when we moved all the
> log IO completions to it, and that introduced the problem seen in
> subsequent kernels.
>
> Cheers,
>
> Dave.

--
Jean-Tiare Le Bigot, OVH

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs