Re: memory reclaim problems on fs usage

Dave Chinner <david@xxxxxxxxxxxxx> · Fri, 13 Nov 2015 07:06:41 +1100

On Thu, Nov 12, 2015 at 11:12:20PM +0900, Tetsuo Handa wrote:
> On 2015/11/12 15:06, Arkadiusz Miśkiewicz wrote:
> >On Wednesday 11 of November 2015, Tetsuo Handa wrote:
> >>Arkadiusz Mi?kiewicz wrote:
> >>>This patch is against which tree? (tried 4.1, 4.2 and 4.3)
> >>
> >>Oops. Whitespace-damaged. This patch is for vanilla 4.1.2.
> >>Reposting with one condition corrected.
> >
> >Here is log:
> >
> >http://ixion.pld-linux.org/~arekm/log-mm-1.txt.gz
> >
> >Uncompresses is 1.4MB, so not posting here.
> >
> Thank you for the log. The result is unexpected for me.
> 
> What I feel strange is that free: remained below min: level.
> While GFP_ATOMIC allocations can access memory reserves, I think that
> these free: values are too small. Memory allocated by GFP_ATOMIC should
> be released shortly, or any __GFP_WAIT allocations would stall for long.
> 
> [ 8633.753528] Node 0 Normal free:128kB min:7104kB low:8880kB
> high:10656kB active_anon:59008kB inactive_anon:75240kB
> active_file:14712kB inactive_file:3256960kB unevictable:0kB
....
> isolated(anon):0kB isolated(file):0kB present:5242880kB
> managed:5109980kB mlocked:0kB dirty:20kB writeback:0kB mapped:7368kB
....
> pages_scanned:176 all_unreclaimable? no

So: we have 3.2GB (800,000 pages) of immediately reclaimable page
cache in this zone (inactive_file) and a GFP_ATOMIC allocation
context so we can only really reclaim clean page cache pages
reliably.

So why have we only scanned *176* pages* during reclaim?  On other
OOM reports in this trace it's as low as 12.  Either that stat is
completely wrong, or we're not doing sufficient page LRU reclaim
scanning....

> [ 9662.234685] MemAlloc-Info: 3 stalling task, 0 dying task, 0 victim task.
> 
> vmstat_update() and submit_flushes() remained pending for about 110 seconds.
> If xlog_cil_push_work() were spinning inside GFP_NOFS allocation, it should be
> reported as MemAlloc: traces, but no such lines are recorded. I don't know why
> xlog_cil_push_work() did not call schedule() for so long.

I'd say it is repeatedly waiting for IO completion on log buffers to
write out the checkpoint. It's making progress, just if it's taking
multiple second per journal IO it will take a long time to write a
checkpoint. All the other blocked tasks in XFS inode reclaim are
either waiting directly on IO completion or waiting for the log to
complete a flush, so this really just looks like an overloaded IO
subsystem to me....

> Well, what steps should we try next for isolating the problem?

I think there's plenty that needs to be explained from this
information. We don't have a memory allocation livelock - slow
progress is being made on reclaiming inodes, but we
clearly have a zone imbalance and direct page reclaim is not freeing
clean inactive pages when there are huge numbers of them available
for reclaim. Somebody who understands the page reclaim code needs to
look closely at the code to explain this behaviour now, then we'll
know what information needs to be gathered next...

Cheers,

Dave.

-- 
Dave Chinner
david@xxxxxxxxxxxxx

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs