Re: + revert-mm-dont-reclaim-inodes-with-many-attached-pages.patch added to -mm tree

On Wed, Jan 30, 2019 at 09:33:34PM -0500, Rik van Riel wrote:
> On Thu, 2019-01-31 at 12:44 +1100, Dave Chinner wrote:
> 
> > Indeed, the fs/inode.c change definitely needs reverting, because
> > that is just *plain wrong* and breaks long-standing memory reclaim
> > behaviour.
> 
> The long-standing behavior may be wrong here, because
> of just how incredibly slow ext4 and XFS are when it
> comes to reclaiming inodes with lots of dirty pages.

That has nothing to do with the reasons the change was made. The
change was made to band-aid a regression from a different change,
one that has clearly introduced multiple regressions. This change
of >15 year old behaviour can't be justified by arguing "but what
about <this other completely different issue>!"

If you have problems with writeback and inode reclaim, do what
everyone else does: write a bug report showing where the problematic
writeback is coming from, explain why it's a problem and propose a
fix. And post it to linux-fsdevel@xxxxxxxxxxxxxxx for review.

Rik, you know how to work with upstream development, I should not
have to be telling you this.

> We have observed some real system stalls when the
> reclaim code hits an ext4 or XFS inode with dozens
> of megabytes of dirty data that needs to be synced
> out before the inode can be reclaimed.

"reclaim code" is a massive foot print. Please be specific.

> Have you observed any regressions due to not
> reclaiming inodes with cached pages attached?

That's not how we justify a change. The burden of proof of
correctness is on the person proposing the change, not on the
reviewer to prove the change is wrong. We have clear indication that
it changes the behaviour of common workloads (copying files at the
same time as compiling a kernel is a common thing!), so we should
revert it until we have a set of changes that are actually proven to
work correctly for common workloads.

> > I seriously disagree with shovelling a different, largely untested
> > and contentious change to the shrinker algorithm to try and patch
> > over the symptoms of the original change. It leaves the underlying
> > problem unfixed (dying memcgs need a reaper to shrink the remaining
> > slab objects that pin that specific memcg) and instead plays
> > "whack-a-mole" on what we alreayd know is a fundamentally broken
> > assumption (i.e. that shrinking small slabs more agressively is
> > side-effect free).
> 
> My patch shrinks small slabs with the same pressure
> as larger slabs. It also ensures that slabs from dead
> memcgs will get eventually reclaimed.
> 
> What am I missing?

That "freeable" is not a measure of slab cache size.
It's a measure of how many freeable objects are in the cache.
i.e. a large cache with nothing freeable returns the same value
as a small cache with nothing freeable.

i.e. the basic premise that "freeable is small == this is a small
cache" is flawed.

And then there are the bugs. Don't forget - the same shrinker can
run concurrently on every node via kswapd and be called by every
instance of direct reclaim on every CPU at the same time. (i.e. the
unbound concurrency problem that fucks up inode caches and results
in having to throttle (block) in the shrinker implementation so the
working sets don't get trashed by massively excessive memory reclaim
pressure).
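
To illustrate the sort of throttling I mean, here's a sketch with
hypothetical names (my_cache_free_objects() is made up for the
example - this is not the actual XFS code):

#include <linux/mutex.h>
#include <linux/shrinker.h>

static DEFINE_MUTEX(my_reclaim_lock);

static unsigned long my_cache_scan(struct shrinker *shrink,
				   struct shrink_control *sc)
{
	unsigned long freed;

	/*
	 * Every per-node kswapd and every direct reclaimer can arrive
	 * here simultaneously. Blocking on the lock (rather than
	 * returning SHRINK_STOP) throttles that unbound concurrency so
	 * excessive reclaim pressure can't trash the working set.
	 */
	mutex_lock(&my_reclaim_lock);
	freed = my_cache_free_objects(sc->nr_to_scan);	/* hypothetical */
	mutex_unlock(&my_reclaim_lock);

	return freed;
}

Any change to the shrinker algorithm has to cope with that level of
concurrency, which is why "small slab" heuristics based on a single
count_objects() sample are so fragile.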

-Dave.
-- 
Dave Chinner
dchinner@xxxxxxxxxx


