Re: memory reclaim problems on fs usage

Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx> · Fri, 13 Nov 2015 21:19:46 +0900

Dave Chinner wrote:
> So why have we only scanned *176* pages* during reclaim?  On other
> OOM reports in this trace it's as low as 12.  Either that stat is
> completely wrong, or we're not doing sufficient page LRU reclaim
> scanning....
> 
> > [ 9662.234685] MemAlloc-Info: 3 stalling task, 0 dying task, 0 victim task.
> > 
> > vmstat_update() and submit_flushes() remained pending for about 110 seconds.
> > If xlog_cil_push_work() were spinning inside GFP_NOFS allocation, it should be
> > reported as MemAlloc: traces, but no such lines are recorded. I don't know why
> > xlog_cil_push_work() did not call schedule() for so long.
> 
> I'd say it is repeatedly waiting for IO completion on log buffers to
> write out the checkpoint. It's making progress, just if it's taking
> multiple second per journal IO it will take a long time to write a
> checkpoint. All the other blocked tasks in XFS inode reclaim are
> either waiting directly on IO completion or waiting for the log to
> complete a flush, so this really just looks like an overloaded IO
> subsystem to me....

The vmstat statistics can become wrong when the vmstat_update() workqueue item
cannot be processed because an in-flight workqueue item does not call schedule().
If the in-flight workqueue item (in this case xlog_cil_push_work()) called
schedule(), the pending vmstat_update() workqueue item would be processed
and the vmstat counters would become up to date. If, as you expect,
xlog_cil_push_work() was waiting for IO completion on log buffers rather than
spinning inside a GFP_NOFS allocation, what should have happened is that
xlog_cil_push_work() called schedule() and vmstat_update() was processed.
But vmstat_update() remained pending for about 110 seconds. That's strange...
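
To make the reasoning above concrete, here is a toy user-space model (not
kernel code) of a single per-CPU worker pool. It assumes, as a
simplification, that a pending item can start only once the in-flight item
either sleeps (calls schedule()) or finishes. The function name and the
"spin"/"sleep" step labels are made up for illustration.

```python
def ticks_until_pending_runs(in_flight_steps):
    """Return the tick at which a pending work item first gets to run.

    in_flight_steps describes the in-flight item, one entry per tick:
      "spin"  - keeps the CPU without sleeping
      "sleep" - calls schedule(), so the pool can run the pending item
    If the in-flight item never sleeps, the pending item has to wait
    until the in-flight item has finished all of its steps.
    """
    for tick, step in enumerate(in_flight_steps):
        if step == "sleep":
            return tick
    return len(in_flight_steps)


# In-flight item that never sleeps: the pending item waits for all of it.
never_sleeps = ["spin"] * 110

# In-flight item that blocks on IO early: the pending item runs right away.
waits_for_io = ["spin", "sleep", "spin", "sleep"]

print(ticks_until_pending_runs(never_sleeps))
print(ticks_until_pending_runs(waits_for_io))
```

In this model, a long delay for the pending item is only possible when the
in-flight item does not sleep, which is why a 110 second delay looks
inconsistent with xlog_cil_push_work() waiting for IO completion.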

Arkadiusz is trying http://marc.info/?l=linux-mm&m=144725782107096&w=2 ,
which makes sure that the vmstat_update() workqueue item is processed
by changing wait_iff_congested() to call schedule(), and we are waiting
for test results.

Well, one of the dependent patches, "vmstat: explicitly schedule per-cpu work
on the CPU we need it to run on", might be relevant to this problem.

If http://sprunge.us/GYBb and http://sprunge.us/XWUX solve the problem
(both with swap and without swap), then the vmstat statistics
were wrong.
