On Sun, 18 Apr 2010 15:05:26 -0400, Christoph Hellwig <hch@xxxxxxxxxxxxx>
wrote:
On Sat, Apr 17, 2010 at 08:32:39PM -0400, Andrew Morton wrote:
The poor IO patterns thing is a regression. Some time several years
ago (around 2.6.16, perhaps), page reclaim started to do a LOT more
dirty-page writeback than it used to. AFAIK nobody attempted to work
out why, nor attempted to try to fix it.
I just know that we XFS guys have been complaining about it a lot..
I know also that the ext3 and reisefs guys complained about this issue
as well.
But that was mostly a tuning issue - before writeout mostly happened
from pdflush. If we got into kswapd or direct reclaim we already
did get horrible I/O patterns - it just happened far less often.
Regarding simply not doing any writeout in direct reclaim (Dave's
initial proposal): the problem is that pageout() will clean a page in
the target zone. Normal writeout won't do that, so we could get into a
situation where vast amounts of writeout is happening, but none of it
is cleaning pages in the zone which we're trying to allocate from.
It's quite possibly livelockable, too.
As Chris mentioned currently btrfs and ext4 do not actually do delalloc
conversions from this path, so for typical workloads the amount of
writeout that can happen from this path is extremly limited. And unless
we get things fixed we will have to do the same for XFS. I'd be much
more happy if we could just sort it out at the VM level, because this
means we have one sane place for this kind of policy instead of three
or more hacks down inside the filesystems. It's rather interesting
that all people on the modern fs side completely agree here what the
problem is, but it seems rather hard to convince the VM side to do
anything about it.
To solve the stack-usage thing: dunno, really. One could envisage code
which skips pageout() if we're using more than X amount of stack, but
that sucks.
And it doesn't solve other issues, like the whole lock taking problem.
Another possibility might be to hand the target page over
to another thread (I suppose kswapd will do) and then synchronise with
that thread - get_page()+wait_on_page_locked() is one way. The helper
thread could of course do writearound.
Allowing the flusher threads to do targeted writeout would be the
best from the FS POV. We'll still have one source of the I/O, just
with another know on how to select the exact region to write out.
We can still synchronously wait for the I/O for lumpy reclaim if really
nessecary.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel"
in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
--
Best Regards
Sorin Faibish
Corporate Distinguished Engineer
Network Storage Group
EMC²
where information lives
Phone: 508-435-1000 x 48545
Cellphone: 617-510-0422
Email : sfaibish@xxxxxxx
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html