Re: [PATCH] mm: disallow direct reclaim page writeback

Christoph Hellwig <hch@xxxxxxxxxxxxx> · Sun, 18 Apr 2010 15:05:26 -0400

On Sat, Apr 17, 2010 at 08:32:39PM -0400, Andrew Morton wrote:
> The poor IO patterns thing is a regression.  Some time several years
> ago (around 2.6.16, perhaps), page reclaim started to do a LOT more
> dirty-page writeback than it used to.  AFAIK nobody attempted to work
> out why, nor attempted to try to fix it.

I just know that we XFS guys have been complaining about it a lot.

But that was mostly a tuning issue - before that, writeout mostly
happened from pdflush.  If we got into kswapd or direct reclaim we
already got horrible I/O patterns - it just happened far less often.
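
To put a number on "horrible I/O patterns", here is a toy model (plain
Python, not kernel code).  It assumes that reclaim submits dirty pages
one at a time in the order it finds them on the LRU, while a flusher
walks the file in ascending offset order.  The page offsets and the
count_io_runs() helper are made up for illustration.

```python
# Toy model, not kernel code: count how many separate I/O requests result
# from writing the same dirty pages in two different orders.

def count_io_runs(offsets):
    """Count contiguous runs when pages are submitted in the given order.

    A page extends the current run only if its offset is exactly one
    past the previously submitted page; otherwise a new request starts.
    """
    runs = 0
    prev = None
    for off in offsets:
        if prev is None or off != prev + 1:
            runs += 1
        prev = off
    return runs

# Hypothetical dirty page offsets of one file, in the order an LRU scan
# might encounter them (assumed order, chosen for illustration).
lru_order = [7, 2, 5, 0, 3, 6, 1, 4]

# Flusher-style writeout walks the file in ascending offset order.
offset_order = sorted(lru_order)

reclaim_runs = count_io_runs(lru_order)
flusher_runs = count_io_runs(offset_order)
print(reclaim_runs, flusher_runs)
```

The same eight pages end up as many small requests in one case and a
single contiguous one in the other.  Real behaviour depends on the
block layer merging requests and on the on-disk layout, which this
sketch ignores.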

> Regarding simply not doing any writeout in direct reclaim (Dave's
> initial proposal): the problem is that pageout() will clean a page in
> the target zone.  Normal writeout won't do that, so we could get into a
> situation where vast amounts of writeout is happening, but none of it
> is cleaning pages in the zone which we're trying to allocate from. 
> It's quite possibly livelockable, too.

As Chris mentioned, currently btrfs and ext4 do not actually do delalloc
conversions from this path, so for typical workloads the amount of
writeout that can happen from this path is extremely limited.  And unless
we get things fixed we will have to do the same for XFS.  I'd be much
happier if we could just sort it out at the VM level, because this
means we have one sane place for this kind of policy instead of three
or more hacks down inside the filesystems.  It's rather interesting
that all people on the modern fs side completely agree here what the
problem is, but it seems rather hard to convince the VM side to do
anything about it.

> To solve the stack-usage thing: dunno, really.  One could envisage code
> which skips pageout() if we're using more than X amount of stack, but
> that sucks.

And it doesn't solve other issues, like the whole lock taking problem.

> Another possibility might be to hand the target page over
> to another thread (I suppose kswapd will do) and then synchronise with
> that thread - get_page()+wait_on_page_locked() is one way.  The helper
> thread could of course do writearound.

Allowing the flusher threads to do targeted writeout would be the
best from the FS POV.  We'll still have one source of the I/O, just
with another knob on how to select the exact region to write out.
We can still synchronously wait for the I/O for lumpy reclaim if really
necessary.
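
A rough userspace sketch of that hand-off, in Python rather than kernel
C and purely as a model: the reclaim side never writes anything itself,
it queues the target page for a flusher thread and waits for it to come
back clean.  Page, flusher(), reclaim_page() and the "window" parameter
are hypothetical names for this sketch, and the threading.Event only
stands in for something like wait_on_page_locked().

```python
import queue
import threading

class Page:
    def __init__(self, index):
        self.index = index
        self.dirty = True
        # Stand-in for waiting on the page; set once the page is clean.
        self.cleaned = threading.Event()

def flusher(requests, pages, window, log):
    """Hypothetical flusher thread: writes a range around each target page."""
    while True:
        target = requests.get()
        if target is None:  # shutdown sentinel
            break
        lo = max(0, target.index - window)
        hi = min(len(pages), target.index + window + 1)
        batch = [p for p in pages[lo:hi] if p.dirty]
        log.append([p.index for p in batch])  # one contiguous "I/O"
        for p in batch:
            p.dirty = False
            p.cleaned.set()

def reclaim_page(requests, page):
    """Reclaim side: never writes itself, only asks the flusher and waits."""
    if page.dirty:
        requests.put(page)
        page.cleaned.wait(timeout=5)
    return not page.dirty

pages = [Page(i) for i in range(16)]
requests = queue.Queue()
log = []
t = threading.Thread(target=flusher, args=(requests, pages, 2, log))
t.start()
ok = reclaim_page(requests, pages[8])
requests.put(None)
t.join()
print(ok, log)
```

The point of the shape is that all I/O is still issued from one place,
so the flusher can write the neighbouring pages along with the target
instead of a single page, and the waiter holds neither deep stack nor
filesystem locks while the write happens.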
