Dave Chinner wrote:
> On Wed, Feb 25, 2015 at 11:31:17PM +0900, Tetsuo Handa wrote:
> > I got two problems (one is stall at io_schedule()
>
> This is a typical "blame the messenger" bug report. XFS is stuck in
> inode reclaim waiting for log IO completion to occur, along with all
> the other processes in xfs_log_force also stuck waiting for the
> same IO completion.

I wanted to know whether transaction-based reservations can solve these
problems. Inside the filesystem layer, I guess you can calculate how much
memory is needed for your filesystem transaction. But I'm wondering
whether we can calculate how much memory is needed inside the block
layer. If the block layer fails to reserve memory, won't file I/O fail
under extreme memory pressure? And if __GFP_NOFAIL were used inside the
block layer, won't the OOM killer deadlock problem arise?
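As far as I understand, the block layer's usual answer to this is a
mempool: a small number of objects reserved up front so that an
allocation in the I/O path can always make forward progress without
__GFP_NOFAIL. A minimal sketch of that pattern follows (the struct and
all names below are illustrative placeholders, not actual block layer
code):

----------
/*
 * Minimal sketch of the mempool pattern (illustrative names only).
 * mempool_create_slab_pool() pre-allocates min_nr objects, and
 * mempool_alloc() with a sleeping gfp mask (e.g. GFP_NOIO) waits for
 * a reserved object to be returned instead of failing, so the I/O
 * path can always make progress without __GFP_NOFAIL.
 */
#include <linux/init.h>
#include <linux/mempool.h>
#include <linux/slab.h>

struct my_io_request {		/* placeholder for a real per-I/O object */
	void *payload;
	unsigned int len;
};

static struct kmem_cache *my_io_cache;
static mempool_t *my_io_pool;

static int __init my_io_pool_init(void)
{
	my_io_cache = kmem_cache_create("my_io_request",
					sizeof(struct my_io_request),
					0, 0, NULL);
	if (!my_io_cache)
		return -ENOMEM;

	/* Reserve 16 objects up front to guarantee forward progress. */
	my_io_pool = mempool_create_slab_pool(16, my_io_cache);
	if (!my_io_pool) {
		kmem_cache_destroy(my_io_cache);
		return -ENOMEM;
	}
	return 0;
}

static struct my_io_request *my_io_request_alloc(void)
{
	/* GFP_NOIO: may sleep, but does not recurse into the I/O path. */
	return mempool_alloc(my_io_pool, GFP_NOIO);
}

static void my_io_request_free(struct my_io_request *req)
{
	mempool_free(req, my_io_pool);
}
----------

Of course this only works because the number of objects needed per I/O
is small and bounded; it does not by itself answer how much would have
to be reserved for a whole filesystem transaction, which is the part I
am unsure about.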
> You need to find where that IO completion that everything is waiting
> on has got stuck or show that it's not a lost IO and actually an
> XFS problem. e.g. has the IO stack got stuck on a mempool somewhere?

I didn't get a vmcore for this stall. But it seemed to me that

kworker/3:0H is doing

  xfs_fs_free_cached_objects()
  => xfs_reclaim_inodes_nr()
  => xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK | SYNC_WAIT, &nr_to_scan)
  => xfs_reclaim_inode() because mutex_trylock(&pag->pag_ici_reclaim_lock)
     was successful
  => xfs_iunpin_wait(ip) because xfs_ipincount(ip) was non-zero
  => __xfs_iunpin_wait()
  => waiting inside io_schedule() for somebody to unpin the inode

and kswapd0 is doing

  xfs_fs_free_cached_objects()
  => xfs_reclaim_inodes_nr()
  => xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK | SYNC_WAIT, &nr_to_scan)
  => not calling xfs_reclaim_inode() because
     mutex_trylock(&pag->pag_ici_reclaim_lock) failed due to kworker/3:0H
  => SYNC_TRYLOCK is dropped for the retry loop because of

       if (skipped && (flags & SYNC_WAIT) && *nr_to_scan > 0) {
               trylock = 0;
               goto restart;
       }

  => calling mutex_lock(&pag->pag_ici_reclaim_lock) and getting blocked
     by kworker/3:0H.

kworker/3:0H is trying to free memory, but somebody else needs memory in
order to make the forward progress kworker/3:0H is waiting for. kswapd0
is also trying to free memory, but it is blocked because kworker/3:0H
already holds the lock. Since kswapd0 cannot make forward progress, other
tasks cannot allocate memory either, and eventually the system started
stalling. Is this decoding correct?

----------
[ 1225.773411] kworker/3:0H D ffff88007cadb4f8 11632 27 2 0x00000000
[ 1225.776911] ffff88007cadb4f8 ffff88007cadb508 ffff88007cac6740 0000000000014080
[ 1225.780670] ffffffff8101cd19 ffff88007cadbfd8 0000000000014080 ffff88007c28b740
[ 1225.784431] ffff88007cac6740 ffff88007cadb540 ffff88007f8d4998 ffff88007cadb540
[ 1225.788766] Call Trace:
[ 1225.789988] [<ffffffff8101cd19>] ? read_tsc+0x9/0x10
[ 1225.792444] [<ffffffff812acbd9>] ? xfs_iunpin_wait+0x19/0x20
[ 1225.795228] [<ffffffff816b2590>] io_schedule+0xa0/0x130
[ 1225.797802] [<ffffffff812a9569>] __xfs_iunpin_wait+0xe9/0x140 arch/x86/include/asm/atomic.h:27 fs/xfs/xfs_inode.c:2433
[ 1225.800621] [<ffffffff810af3b0>] ? autoremove_wake_function+0x40/0x40
[ 1225.803770] [<ffffffff812acbd9>] xfs_iunpin_wait+0x19/0x20 fs/xfs/xfs_inode.c:2443
[ 1225.806471] [<ffffffff812a209c>] xfs_reclaim_inode+0x7c/0x360 include/linux/spinlock.h:309 fs/xfs/xfs_inode.h:144 fs/xfs/xfs_icache.c:920
[ 1225.809283] [<ffffffff812a25d7>] xfs_reclaim_inodes_ag+0x257/0x370 fs/xfs/xfs_icache.c:1105
[ 1225.812308] [<ffffffff81340839>] ? radix_tree_gang_lookup_tag+0x89/0xd0
[ 1225.815532] [<ffffffff8116fe58>] ? list_lru_walk_node+0x148/0x190
[ 1225.817951] [<ffffffff812a2783>] xfs_reclaim_inodes_nr+0x33/0x40 fs/xfs/xfs_icache.c:1166
[ 1225.819373] [<ffffffff812b3545>] xfs_fs_free_cached_objects+0x15/0x20
[ 1225.820898] [<ffffffff811c29e9>] super_cache_scan+0x169/0x170
[ 1225.822245] [<ffffffff8115aed6>] shrink_node_slabs+0x1d6/0x370
[ 1225.823588] [<ffffffff8115dd2a>] shrink_zone+0x20a/0x240
[ 1225.824830] [<ffffffff8115e0dc>] do_try_to_free_pages+0x16c/0x460
[ 1225.826230] [<ffffffff8115e48a>] try_to_free_pages+0xba/0x150
[ 1225.827570] [<ffffffff81151542>] __alloc_pages_nodemask+0x5b2/0x9d0
[ 1225.829030] [<ffffffff8119ecbc>] kmem_getpages+0x8c/0x200
[ 1225.830277] [<ffffffff811a122b>] fallback_alloc+0x17b/0x230
[ 1225.831561] [<ffffffff811a107b>] ____cache_alloc_node+0x18b/0x1c0
[ 1225.833061] [<ffffffff811a3b00>] kmem_cache_alloc+0x330/0x5c0
[ 1225.834435] [<ffffffff8133c9d9>] ? ida_pre_get+0x69/0x100
[ 1225.835719] [<ffffffff8133c9d9>] ida_pre_get+0x69/0x100
[ 1225.836963] [<ffffffff8133d312>] ida_simple_get+0x42/0xf0
[ 1225.838248] [<ffffffff81086211>] create_worker+0x31/0x1c0
[ 1225.839519] [<ffffffff81087831>] worker_thread+0x3d1/0x4d0
[ 1225.840800] [<ffffffff81087460>] ? rescuer_thread+0x3a0/0x3a0
[ 1225.842123] [<ffffffff8108c5e2>] kthread+0xd2/0xf0
[ 1225.843234] [<ffffffff81010000>] ? perf_trace_xen_mmu_ptep_modify_prot+0x90/0xf0
[ 1225.844978] [<ffffffff8108c510>] ? kthread_create_on_node+0x180/0x180
[ 1225.846481] [<ffffffff816b63fc>] ret_from_fork+0x7c/0xb0
[ 1225.847718] [<ffffffff8108c510>] ? kthread_create_on_node+0x180/0x180
[ 1225.849279] kswapd0 D ffff88007708f998 11552 45 2 0x00000000
[ 1225.850977] ffff88007708f998 0000000000000000 ffff88007c28b740 0000000000014080
[ 1225.852798] 0000000000000003 ffff88007708ffd8 0000000000014080 ffff880077ff2740
[ 1225.854575] ffff88007c28b740 0000000000000000 ffff88007948e3a8 ffff88007948e3ac
[ 1225.856358] Call Trace:
[ 1225.856928] [<ffffffff816b2799>] schedule_preempt_disabled+0x29/0x70
[ 1225.858384] [<ffffffff816b43d5>] __mutex_lock_slowpath+0x95/0x100
[ 1225.859799] [<ffffffff816b4463>] mutex_lock+0x23/0x37 arch/x86/include/asm/current.h:14 kernel/locking/mutex.h:22 kernel/locking/mutex.c:103
[ 1225.860983] [<ffffffff812a264c>] xfs_reclaim_inodes_ag+0x2cc/0x370 fs/xfs/xfs_icache.c:1034
[ 1225.862403] [<ffffffff8109eb48>] ? __enqueue_entity+0x78/0x80
[ 1225.863742] [<ffffffff810a5f37>] ? enqueue_entity+0x237/0x8f0
[ 1225.865100] [<ffffffff81340839>] ? radix_tree_gang_lookup_tag+0x89/0xd0
[ 1225.866659] [<ffffffff8116fe58>] ? list_lru_walk_node+0x148/0x190
[ 1225.868106] [<ffffffff812a2783>] xfs_reclaim_inodes_nr+0x33/0x40 fs/xfs/xfs_icache.c:1166
[ 1225.869522] [<ffffffff812b3545>] xfs_fs_free_cached_objects+0x15/0x20
[ 1225.871015] [<ffffffff811c29e9>] super_cache_scan+0x169/0x170
[ 1225.872338] [<ffffffff8115aed6>] shrink_node_slabs+0x1d6/0x370
[ 1225.873679] [<ffffffff8115dd2a>] shrink_zone+0x20a/0x240
[ 1225.874920] [<ffffffff8115ed2d>] kswapd+0x4fd/0x9c0
[ 1225.876049] [<ffffffff8115e830>] ? mem_cgroup_shrink_node_zone+0x140/0x140
[ 1225.877654] [<ffffffff8108c5e2>] kthread+0xd2/0xf0
[ 1225.878762] [<ffffffff81010000>] ? perf_trace_xen_mmu_ptep_modify_prot+0x90/0xf0
[ 1225.880495] [<ffffffff8108c510>] ? kthread_create_on_node+0x180/0x180
[ 1225.881996] [<ffffffff816b63fc>] ret_from_fork+0x7c/0xb0
[ 1225.883336] [<ffffffff8108c510>] ? kthread_create_on_node+0x180/0x180
----------

I killed mutex_lock() and memory allocation from the shrinker functions
in drivers/gpu/drm/ttm/ttm_page_alloc[_dma].c because I observed that
kswapd0 was blocked for a very long time at mutex_lock(). If kswapd0 is
blocked forever at e.g. mutex_lock() inside shrinker functions, who else
can make forward progress? Shouldn't we avoid calling functions which
could potentially block for an unpredictable duration (e.g. unkillable
locks and/or waiting for completions) from shrinker functions?
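One possible pattern (a minimal sketch, not the actual ttm code; the
lock and helper names are made-up placeholders) is to only trylock in
the shrinker and return SHRINK_STOP when the lock is contended, so that
kswapd never sleeps there:

----------
/*
 * Minimal sketch of a non-blocking shrinker (illustrative names only).
 * If the pool lock is contended, give up with SHRINK_STOP instead of
 * sleeping, so that kswapd is never stuck here for an unpredictable
 * amount of time.
 */
#include <linux/mutex.h>
#include <linux/shrinker.h>

static DEFINE_MUTEX(my_pool_lock);
static unsigned long my_pool_nr_pages;	/* pages cached by this pool */

/* Placeholder: free up to nr_to_scan cached pages, return how many. */
static unsigned long my_pool_free_pages(unsigned long nr_to_scan);

static unsigned long my_pool_count(struct shrinker *shrink,
				   struct shrink_control *sc)
{
	/* Read without the lock; an approximate count is good enough. */
	return my_pool_nr_pages;
}

static unsigned long my_pool_scan(struct shrinker *shrink,
				  struct shrink_control *sc)
{
	unsigned long freed;

	if (!mutex_trylock(&my_pool_lock))
		return SHRINK_STOP;	/* never sleep on the lock here */

	freed = my_pool_free_pages(sc->nr_to_scan);
	mutex_unlock(&my_pool_lock);
	return freed;
}

static struct shrinker my_pool_shrinker = {
	.count_objects	= my_pool_count,
	.scan_objects	= my_pool_scan,
	.seeks		= DEFAULT_SEEKS,
};
/* register_shrinker(&my_pool_shrinker) at pool init time. */
----------

Memory allocation from a shrinker is a separate problem, since anything
allocated there can itself recurse into reclaim.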
> IOWs, when you run CONFIG_XFS_DEBUG=y, you'll often get failures
> that are valuable to XFS developers but have no runtime effect on
> production systems.

Oh, I didn't know this failure is specific to CONFIG_XFS_DEBUG=y ...