On Wed, Jul 10, 2013 at 10:06:05AM +0200, Michal Hocko wrote: > On Wed 10-07-13 12:31:39, Dave Chinner wrote: > [...] > > > 20761 [<ffffffffa0305fdd>] xlog_grant_head_wait+0xdd/0x1a0 [xfs] > > > [<ffffffffa0306166>] xlog_grant_head_check+0xc6/0xe0 [xfs] > > > [<ffffffffa030627f>] xfs_log_reserve+0xff/0x240 [xfs] > > > [<ffffffffa0302ac4>] xfs_trans_reserve+0x234/0x240 [xfs] > > > [<ffffffffa02c5999>] xfs_create+0x1a9/0x5c0 [xfs] > > > [<ffffffffa02bccca>] xfs_vn_mknod+0x8a/0x1a0 [xfs] > > > [<ffffffffa02bce0e>] xfs_vn_create+0xe/0x10 [xfs] > > > [<ffffffff811763dd>] vfs_create+0xad/0xd0 > > > [<ffffffff81177e68>] lookup_open+0x1b8/0x1d0 > > > [<ffffffff8117815e>] do_last+0x2de/0x780 > > > [<ffffffff8117ae9a>] path_openat+0xda/0x400 > > > [<ffffffff8117b303>] do_filp_open+0x43/0xa0 > > > [<ffffffff81168ee0>] do_sys_open+0x160/0x1e0 > > > [<ffffffff81168f9c>] sys_open+0x1c/0x20 > > > [<ffffffff815830e9>] system_call_fastpath+0x16/0x1b > > > [<ffffffffffffffff>] 0xffffffffffffffff > > > > That's an XFS log space issue, indicating that it has run out of > > space in IO the log and it is waiting for more to come free. That > > requires IO completion to occur. > > > > > [276962.652076] INFO: task xfs-data/sda9:930 blocked for more than 480 seconds. > > > [276962.652087] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > > > [276962.652093] xfs-data/sda9 D ffff88001ffb9cc8 0 930 2 0x00000000 > > > > Oh, that's why. This is the IO completion worker... > > But that task doesn't seem to be stuck anymore (at least lockup watchdog > doesn't report it anymore and I have already rebooted to test with ext3 > :/). I am sorry if the these lockups logs were more confusing than > helpful, but they happened _long_ time ago and the system obviously > recovered from them. I am pasting only the traces for processes in D > state here again for reference. Right, there are various triggers that can get XFS out of the situation - it takes something to kick the log or metadata writeback and that can make space in the log free up and hence things get moving again. The problem will be that once in this low memory state everything in the filesystem will back up on slow memory allocation and it might take minutes to clear the backlog of IO completions.... > 20757 [<ffffffffa0305fdd>] xlog_grant_head_wait+0xdd/0x1a0 [xfs] > [<ffffffffa0306166>] xlog_grant_head_check+0xc6/0xe0 [xfs] > [<ffffffffa030627f>] xfs_log_reserve+0xff/0x240 [xfs] > [<ffffffffa0302ac4>] xfs_trans_reserve+0x234/0x240 [xfs] That is the stack of a process waiting for log space to come available. > We are wating for page under writeback but neither of the 2 paths starts > in xfs code. So I do not think waiting for PageWriteback causes a > deadlock here. The problem is this: the page that we are waiting for IO on is in the IO completion queue, but the IO compeltion requires memory allocation to complete the transaction. That memory allocation is causing memcg reclaim, which then waits for IO completion on another page, which may or may not end up in the same IO completion queue. The CMWQ can continue to process new Io completions - up to a point - so slow progress will be made. In the worst case, it can deadlock. GFP_NOFS allocation is the mechanism by which filesystems are supposed to be able to avoid this recursive deadlock... > [...] > > ... is running IO completion work and trying to commit a transaction > > that is blocked in memory allocation which is waiting for IO > > completion. It's disappeared up it's own fundamental orifice. > > > > Ok, this has absolutely nothing to do with the LRU changes - this is > > a pre-existing XFS/mm interaction problem from around 3.2. The > > question is now this: how the hell do I get memory allocation to not > > block waiting on IO completion here? This is already being done in > > GFP_NOFS allocation context here.... > > Just for reference. wait_on_page_writeback is issued only for memcg > reclaim because there is no other throttling mechanism to prevent from > too many dirty pages on the list, thus pre-mature OOM killer. See > e62e384e9d (memcg: prevent OOM with too many dirty pages) for more > details. The original patch relied on may_enter_fs but that check > disappeared by later changes by c3b94f44fc (memcg: further prevent OOM > with too many dirty pages). Aye. That's the exact code I was looking at yesterday and wondering "how the hell is waiting on page writeback valid in GFP_NOFS context?". It seems that memcg reclaim is intentionally ignoring GFP_NOFS to avoid OOM issues. That's a memcg implementation problem, not a filesystem or LRU infrastructure problem.... Cheers, Dave. -- Dave Chinner david@xxxxxxxxxxxxx -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>