Re: [PATCH 0/4] memcg, inode: protect page cache from freeing inode

On Wed, Dec 18, 2019 at 09:16:26PM +1100, Dave Chinner wrote:
> On Tue, Dec 17, 2019 at 11:37:27PM -0500, Johannes Weiner wrote:
> > On Wed, Dec 18, 2019 at 12:51:24PM +1100, Dave Chinner wrote:
> > > On Tue, Dec 17, 2019 at 11:54:22AM -0500, Johannes Weiner wrote:
> > > > This problem exists independent of cgroup protection.
> > > > 
> > > > The inode shrinker may take down an inode that's still holding a ton
> > > > of (potentially active) page cache pages when the inode hasn't been
> > > > referenced recently.
> > > 
> > > Ok, please explain to me how are those pages getting repeated
> > > referenced and kept active without referencing the inode in some
> > > way?
> > > 
> > > e.g. active mmap pins a struct file which pins the inode.
> > > e.g. open fd pins a struct file which pins the inode.
> > > e.g. open/read/write/close keeps a dentry active in cache which pins
> > > the inode when not actively referenced by the open fd.
> > > 
> > > AFAIA, all of the cases where -file pages- are being actively
> > > referenced require also actively referencing the inode in some way.
> > > So why is the inode being reclaimed as an unreferenced inode at the
> > > end of the LRU if these are actively referenced file pages?
> > > 
> > > > IMO we shouldn't be dropping data that the VM still considers hot
> > > > compared to other data, just because the inode object hasn't been used
> > > > as recently as other inode objects (e.g. drowned in a stream of
> > > > one-off inode accesses).
> > > 
> > > It should not be drowned by one-off inode accesses because if
> > > the file data is being actively referenced then there should be
> > > frequent active references to the inode that contains the data and
> > > that should be keeping it away from the tail of the inode LRU.
> > > 
> > > If the inode is not being frequently referenced, then it
> > > isn't really part of the current working set of inodes, is it?
> > 
> > The inode doesn't have to be currently open for its data to be used
> > frequently and recently.
> 
> No, it doesn't have to be "open", but it has to be referenced if
> pages are being added to or accessed from its mapping tree.
> 
> e.g. you can do open/mmap/close, and the vma backing the mmap region
> holds a reference to the inode via vma->vm_file until munmap is
> called and the vma is torn down.
> 
> So:
> 
> > Executables that run periodically come to mind.
> 
> this requires mmap, hence an active inode reference, and so when the
> vma is torn down, the inode is moved to the head of the inode cache
> LRU. If we keep running that same executable, the inode will be
> repeatedly relocated to the head of the LRU every time the process
> running the executable exits.
> 
> > An sqlite file database that is periodically opened and queried, then
> > closed again.
> 
> dentry pins inode on open, struct file pins inode until close,
> dentry reference pins inode until shrinker reclaims dentry. Inode
> goes on head of LRU when dentry is reclaimed. Repeated cycles will
> hit either the dentry cache or if that's been reclaimed the inode
> cache will get hit.
> 
> > A git repository.
> 
> same as sqlite case, just with many files.
> 
> IOWs, all of these data references take an active reference to the
> inode and reset its position in the inode cache LRU when the last
> reference is dropped. If it's a dentry, it may not get dropped until
> memory pressure reclaims the dentry. Hence inode cache LRU order does
> not reflect the file data page LRU order in any way.
> 
> But my question still stands: how do you get page LRU references
> without inode references? And if you can't, why should having cached
> pages on the oldest unused, unreferenced inode in the LRU prevent
> its reclaim?

One of us is missing something really obvious here.

Let's say I'm routinely working with a git tree and the objects are
cached by active pages. I'm using a modified mincore() that reports
page active state, so the output here is active/present/filesize:

[hannes@computer linux]$ ~/src/mincore .git/objects/pack/*
17/17/17 .git/objects/pack/pack-1993efac574359d041b010c84d04eb0f05275bfd.idx
97/97/1168 .git/objects/pack/pack-1993efac574359d041b010c84d04eb0f05275bfd.pack
21/21/21 .git/objects/pack/pack-1d4bf264156bee8558b290123af0755292452520.idx
69/69/1487 .git/objects/pack/pack-1d4bf264156bee8558b290123af0755292452520.pack
223/223/243 .git/objects/pack/pack-1f7fde0cd5444aca2bad22d9f1f782f7b5fc5b7c.idx
261/261/25012 .git/objects/pack/pack-1f7fde0cd5444aca2bad22d9f1f782f7b5fc5b7c.pack
48/48/66 .git/objects/pack/pack-2d05108aa7d7542c3faff7b456bfa4c33aa49ddb.idx
0/0/8306 .git/objects/pack/pack-2d05108aa7d7542c3faff7b456bfa4c33aa49ddb.pack
40/40/40 .git/objects/pack/pack-4430a9ced8123449669b25879f7d4cd3f23c2df7.idx
16/16/5020 .git/objects/pack/pack-4430a9ced8123449669b25879f7d4cd3f23c2df7.pack
28/28/29 .git/objects/pack/pack-4d783e29b97258d679490f899be09d0a7fc73cf4.idx
4/4/3755 .git/objects/pack/pack-4d783e29b97258d679490f899be09d0a7fc73cf4.pack
46/46/46 .git/objects/pack/pack-5d66c70e90371495b5f1a35770e3c092347a2362.idx
166/166/2689 .git/objects/pack/pack-5d66c70e90371495b5f1a35770e3c092347a2362.pack
12/12/12 .git/objects/pack/pack-5e2d63c26589c42286cd7f15d3b076f1a0a2e895.idx
42/42/1083 .git/objects/pack/pack-5e2d63c26589c42286cd7f15d3b076f1a0a2e895.pack
38/38/38 .git/objects/pack/pack-6f7a49bdbcfd2ea4b64d57458a4f04df518a55eb.idx
129/129/2652 .git/objects/pack/pack-6f7a49bdbcfd2ea4b64d57458a4f04df518a55eb.pack
8/8/8 .git/objects/pack/pack-7053184528af47c7edacccbdbc2de25e627ea8e3.idx
4/4/743 .git/objects/pack/pack-7053184528af47c7edacccbdbc2de25e627ea8e3.pack
62/62/63 .git/objects/pack/pack-7463fe2f036d011a79a31bacb9da58455982ee4b.idx
96/96/7023 .git/objects/pack/pack-7463fe2f036d011a79a31bacb9da58455982ee4b.pack
129/129/130 .git/objects/pack/pack-7644e9848940f15642b4efebb8e4ccdcb9b2024e.idx
333/333/5060 .git/objects/pack/pack-7644e9848940f15642b4efebb8e4ccdcb9b2024e.pack
6487/6487/7557 .git/objects/pack/pack-8347268f4d6fa0f763c7d1690dcee8f933be253f.idx
12260/12260/285090 .git/objects/pack/pack-8347268f4d6fa0f763c7d1690dcee8f933be253f.pack
603/603/683 .git/objects/pack/pack-c51831234bf615a2b47a49c31f10ae480fa482dd.idx
1450/1450/23000 .git/objects/pack/pack-c51831234bf615a2b47a49c31f10ae480fa482dd.pack
657/657/757 .git/objects/pack/pack-d793ea6b319c4d19eb281f5ca2e368c24e10d91a.idx
1658/1658/21055 .git/objects/pack/pack-d793ea6b319c4d19eb281f5ca2e368c24e10d91a.pack
46037/46037/53690 .git/objects/pack/pack-ee31400e588e715113b665d7313d570553133d71.idx
105772/105772/367275 .git/objects/pack/pack-ee31400e588e715113b665d7313d570553133d71.pack
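
(For reference, the boring half of such a tool is just mincore() over a
mapping of each file; a minimal sketch is below. The stock syscall only
reports residency, so the active column in the output above relies on
my modified kernel - treat this as illustrative, not as the exact tool
I ran.)

/* sketch: report resident/total pages per file via stock mincore() */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
	long psize = sysconf(_SC_PAGESIZE);

	for (int i = 1; i < argc; i++) {
		struct stat st;
		int fd = open(argv[i], O_RDONLY);

		if (fd < 0 || fstat(fd, &st) < 0 || st.st_size == 0) {
			if (fd >= 0)
				close(fd);
			continue;
		}

		size_t pages = (st.st_size + psize - 1) / psize;
		unsigned char *vec = malloc(pages);
		void *map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
		size_t resident = 0;

		if (map != MAP_FAILED && mincore(map, st.st_size, vec) == 0)
			for (size_t p = 0; p < pages; p++)
				resident += vec[p] & 1;

		/* resident/filesize in pages; "active" needs kernel help */
		printf("%zu/%zu %s\n", resident, pages, argv[i]);

		if (map != MAP_FAILED)
			munmap(map, st.st_size);
		free(vec);
		close(fd);
	}
	return 0;
}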

Now something like updatedb, a find, or something comparable goes off and in a
short amount of time creates a ton of one-off dentries, inodes, and
file cache:

$ find /usr -type f -exec grep -q dave {} \;

LRU reclaim recognizes that the file cache produced by this operation
is not used repeatedly and lets an infinite amount of it pass through
the inactive list without disturbing my git tree workingset.

The inode/dentry reclaim doesn't do the same thing. It looks at the
references and delays the inevitable for a few more items coming
through the LRU, but eventually it lets a bunch of objects that are
only used once push out data that has been used over and over right
before this burst of metadata objects came along.

The VM goes through a ridiculous effort to implement scan resistance:
we split the LRUs into inactive/active lists, we track non-resident
cache information to tell stable states from transitions and carefully
balance the lists against each other. All in an effort to protect
established workingsets that have proven to benefit from caching
against bursts of one-off entries that do not.

Thousands of lines of complexity, years of labor, to make this work.

And then the inode shrinker just goes and drops it all on the floor:

[hannes@computer linux]$ ~/src/mincore .git/objects/pack/*
0/0/17 .git/objects/pack/pack-1993efac574359d041b010c84d04eb0f05275bfd.idx
0/0/1168 .git/objects/pack/pack-1993efac574359d041b010c84d04eb0f05275bfd.pack
0/0/21 .git/objects/pack/pack-1d4bf264156bee8558b290123af0755292452520.idx
0/0/1487 .git/objects/pack/pack-1d4bf264156bee8558b290123af0755292452520.pack
0/0/243 .git/objects/pack/pack-1f7fde0cd5444aca2bad22d9f1f782f7b5fc5b7c.idx
0/0/25012 .git/objects/pack/pack-1f7fde0cd5444aca2bad22d9f1f782f7b5fc5b7c.pack
0/0/66 .git/objects/pack/pack-2d05108aa7d7542c3faff7b456bfa4c33aa49ddb.idx
0/0/8306 .git/objects/pack/pack-2d05108aa7d7542c3faff7b456bfa4c33aa49ddb.pack
0/0/40 .git/objects/pack/pack-4430a9ced8123449669b25879f7d4cd3f23c2df7.idx
0/0/5020 .git/objects/pack/pack-4430a9ced8123449669b25879f7d4cd3f23c2df7.pack
0/0/29 .git/objects/pack/pack-4d783e29b97258d679490f899be09d0a7fc73cf4.idx
0/0/3755 .git/objects/pack/pack-4d783e29b97258d679490f899be09d0a7fc73cf4.pack
0/0/46 .git/objects/pack/pack-5d66c70e90371495b5f1a35770e3c092347a2362.idx
0/0/2689 .git/objects/pack/pack-5d66c70e90371495b5f1a35770e3c092347a2362.pack
0/0/12 .git/objects/pack/pack-5e2d63c26589c42286cd7f15d3b076f1a0a2e895.idx
0/0/1083 .git/objects/pack/pack-5e2d63c26589c42286cd7f15d3b076f1a0a2e895.pack
0/0/38 .git/objects/pack/pack-6f7a49bdbcfd2ea4b64d57458a4f04df518a55eb.idx
0/0/2652 .git/objects/pack/pack-6f7a49bdbcfd2ea4b64d57458a4f04df518a55eb.pack
0/0/8 .git/objects/pack/pack-7053184528af47c7edacccbdbc2de25e627ea8e3.idx
0/0/743 .git/objects/pack/pack-7053184528af47c7edacccbdbc2de25e627ea8e3.pack
0/0/63 .git/objects/pack/pack-7463fe2f036d011a79a31bacb9da58455982ee4b.idx
0/0/7023 .git/objects/pack/pack-7463fe2f036d011a79a31bacb9da58455982ee4b.pack
0/0/130 .git/objects/pack/pack-7644e9848940f15642b4efebb8e4ccdcb9b2024e.idx
0/0/5060 .git/objects/pack/pack-7644e9848940f15642b4efebb8e4ccdcb9b2024e.pack
0/0/7557 .git/objects/pack/pack-8347268f4d6fa0f763c7d1690dcee8f933be253f.idx
0/0/285090 .git/objects/pack/pack-8347268f4d6fa0f763c7d1690dcee8f933be253f.pack
0/0/683 .git/objects/pack/pack-c51831234bf615a2b47a49c31f10ae480fa482dd.idx
0/0/23000 .git/objects/pack/pack-c51831234bf615a2b47a49c31f10ae480fa482dd.pack
0/0/757 .git/objects/pack/pack-d793ea6b319c4d19eb281f5ca2e368c24e10d91a.idx
0/0/21055 .git/objects/pack/pack-d793ea6b319c4d19eb281f5ca2e368c24e10d91a.pack
0/0/53690 .git/objects/pack/pack-ee31400e588e715113b665d7313d570553133d71.idx
0/0/367275 .git/objects/pack/pack-ee31400e588e715113b665d7313d570553133d71.pack

This isn't a theoretical issue. The reason people keep coming up with
the same patch is because they hit exactly this problem in real life.

> > I don't want a find or an updatedb, which doesn't produce active
> > pages, and could be funneled through the cache with otherwise no side
> > effects, kick out all my linux tree git objects via the inode shrinker
> > just because I haven't run a git command in a few minutes.
> 
> That has nothing to do with this patch. updatedb and any file
> traversal that touches data are going to be treated identically to
> your precious working set because they all have nr_pages > 0.
>
> IOWs, this patch does nothing to avoid the problem of single use
> inodes streaming through the inode cache causing the reclaim of all
> inodes. It just changes the reclaim behaviour and how quickly single
> use inodes can be reclaimed. i.e. we now can't reclaim single use
> inodes when they reach the end of the LRU, we have to wait for page
> cache reclaim to free its pages before the inode can be reclaimed.

Of course it does. LRU reclaim will clean out the single-use pages,
after which those inodes will have !nr_pages and can be reclaimed.
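
(For readers without the patch in front of them: the heuristic in
question boils down to something like the following in
inode_lru_isolate() - a paraphrase from memory, not the literal
upstream diff - rotating inodes that still have page cache attached
instead of reclaiming them, so page LRU aging gets to run first:)

	/* paraphrased sketch of the contested heuristic, not the exact patch */
	if (inode->i_state & I_REFERENCED || inode->i_data.nrpages) {
		/* recently referenced, or still caching data: rotate it */
		inode->i_state &= ~I_REFERENCED;
		spin_unlock(&inode->i_lock);
		return LRU_ROTATE;
	}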

> Further, because inode LRU order is going to be different to page LRU
> order, there's going to be a lot more useless scanning trying to
> find inodes that can be reclaimed. Hence this changes cache balance,
> reduces reclaim efficiency, increases reclaim priority windup as
> less gets freed per scan, and this all ends up causing performance
> and behavioural regressions in unexpected places.

It would be better to keep the inodes off the LRU entirely as long as
they are not considered for reclaim. That would save some CPU churn.

> i.e. this makes the page cache pin the inode in memory and that's a
> major change in behaviour. That's what caused all the performance
> regressions with workloads that traverse a large single-use file set
> such as a kernel compile - most files and their data are accessed
> just once, and when they get to the end of the inode LRU we really
> want to reclaim them immediately as they'll never get accessed
> again.
> 
> To put it simply, if your goal is to avoid single use inodes from
> trashing a long term working set of cached inodes, then this
> patch does not provide the reliable or predictable object
> management algorithm you are looking for.
>
> If you want to address use-once cache trashing, how about working
> towards a *smarter LRU algorithm* for the list_lru infrastructure?
> Don't hack naive heuristics that "work for me" into the code, go
> back to the algorithm and select something that is proven to
> be resilient against use-once object storms.
> 
> i.e. The requirement is we retain quasi-LRU behaviour, but
> allow use-once objects to cycle through the LRU without disturbing
> frequently/recently referenced/active objects.  The
> per-object reference bit we currently use isn't resilient against
> large-scale use-once object cycling, so we have to improve on that.
> 
> Experience tells me we've solved this problem before, and it's right
> in your area of expertise, too. We could modify the list_lru to use
> a different LRU algorithm that is resilient against the sort of
> flooding you are worried about. We could simply use a double clock
> list like the page LRU uses - we promote frequently referenced
> inodes to the active list, instead of just setting a reference bit,
> when a reference is dropped and the inode is on the inactive list.
> And a small part of each shrinker scan count can be used to demote
> the tail of the active list to keep it slowly cycling. This way
> single use inodes will only ever pass through the inactive list
> without perturbing the active list, and we've solved the problem of
> single use inode streams trashing the working cache for all use
> cases, not just one special case....
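
For concreteness, my reading of that double-clock proposal is roughly
the sketch below; the structure and function names are invented for
illustration, this is not existing list_lru API:

/* hypothetical double-clock list_lru; names invented for illustration */
struct list_lru_clock {
	struct list_head	inactive;	/* use-once objects cycle here */
	struct list_head	active;		/* promoted on repeat reference */
	long			nr_inactive;
	long			nr_active;
};

/* called when a reference is dropped on an object already on the LRU */
static void lru_clock_deref(struct list_lru_clock *c, struct list_head *item,
			    bool seen_before)
{
	if (!seen_before)
		return;
	/* second touch while inactive: promote past the use-once stream */
	list_move(item, &c->active);
	c->nr_inactive--;
	c->nr_active++;
}

/* a small slice of each shrinker scan ages the active clock */
static void lru_clock_age(struct list_lru_clock *c, long nr)
{
	while (nr-- > 0 && !list_empty(&c->active)) {
		/* demote the coldest active object back to inactive */
		list_move(c->active.prev, &c->inactive);
		c->nr_active--;
		c->nr_inactive++;
	}
}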

I'm not opposed to any of this work, but I don't see how it would be a
prerequisite to fixing the aging inversion we're talking about here -
throwing out "unused" containers without looking at what's inside.

On the contrary, the inode scanner would already make better decisions
by simply not discarding all the usage information painstakingly
gathered by the VM.

We can talk about the implementation, of course. Repeatedly skipping
over inodes rather than physically taking them off the list can be a
scalability problem; pushing the shrinker into dirty inodes can be a
problem for certain filesystems. I didn't submit a patch for
upstreaming, I sent a diff hunk to propose an aging hierarchy.

If you agree with my concern about aging decisions here, but think
it's the best we can do given our constraints, we can talk about this
too - but we should at least document the hack currently in place.

If you disagree that the reclaim layering here is fundamentally
problematic, I'm not sure how to move on with this discussion.

> > > commit 69056ee6a8a3d576ed31e38b3b14c70d6c74edcc
> > > Author: Dave Chinner <dchinner@xxxxxxxxxx>
> > > Date:   Tue Feb 12 15:35:51 2019 -0800
> > > 
> > >     Revert "mm: don't reclaim inodes with many attached pages"
> > >     
> > >     This reverts commit a76cf1a474d7d ("mm: don't reclaim inodes with many
> > >     attached pages").
> > >     
> > >     This change causes serious changes to page cache and inode cache
> > >     behaviour and balance, resulting in major performance regressions when
> > >     combining workloads such as large file copies and kernel compiles.
> > >     
> > >       https://bugzilla.kernel.org/show_bug.cgi?id=202441
> > 
> > I don't remember this, but reading this bugzilla thread is immensely
> > frustrating.
> 
> So you're shooting the messenger as well, eh?
> 
> We went through this whole "blame XFS" circus sideshow when I found
> the commits that caused the regression. It went on right up until
> people using ext4 started reporting similar problems.
> 
> Yes, XFS users were the first to notice the issue, but that does
> not make it an XFS problem!

I cannot find details on the other filesystems in the bug report or
the changelog. Where was the time going? Was it the CPU churn of
skipping over the inodes?

> > We've been carrying this patch here in our tree for over half a decade
> > now to work around this exact stalling in the xfs shrinker:
> >
> > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > index d53a316162d6..45b3a4d07813 100644
> > --- a/fs/xfs/xfs_icache.c
> > +++ b/fs/xfs/xfs_icache.c
> > @@ -1344,7 +1344,7 @@ xfs_reclaim_inodes_nr(
> >         xfs_reclaim_work_queue(mp);
> >         xfs_ail_push_all(mp->m_ail);
> > 
> > -       return xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK | SYNC_WAIT, &nr_to_scan);
> > +       return xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK, &nr_to_scan);
> >  }
> > 
> > Because if we don't, our warmstorage machines lock up within minutes,
> > long before Roman's patch.
> 
> Oh, go cry me a river. Poor little FB, has to carry an out-of-tree
> hack that "works for them" because they don't care enough about
> fixing it to help upstream address the underlying memory reclaim 
> problems that the SYNC_WAIT flag avoids.
> 
> Indeed, we (XFS devs) have repeatedly provided evidence that this
> patch makes it relatively trivial for users to DOS systems via
> OOM-killer rampages. It does not survive my trivial "fill memory
> with inodes" test without the oom-killer killing the machine, and
> any workload that empties the page cache before the inode cache is
> prone to oom kill because nothing throttles reclaim anymore and
> there are no pages left to reclaim or swap.
>
> It is manifestly worse than what we have now, and that means it is
> not a candidate for merging. We've told FB engineers this
> *repeatedly*, and yet this horrible, broken, nasty, expedient hack
> gets raised every time "shrinker" and "XFS" are mentioned in the
> same neighbourhood.  Just stop it, please.

You don't need to be privileged to cause OOM kills in a myriad of ways
if you try to.

You don't need to run a malicious workload to have the xfs shrinker
stall out reclaimers in the presence of gigabytes of clean, easy to
reclaim cache.

We fundamentally disagree on what the horrible, broken, nasty,
expedient hack is.

> > The fact that xfs stalls on individual inodes while there might be a
> > ton of clean cache on the LRUs is an xfs problem, not a VM problem.
> 
> No, at its core it is a VM problem, because if we don't stall on
> inode reclaim in XFS then memory reclaim does far worse things to
> your machine than incur an occasional long tail latency.
> 
> You're free to use some other filesystem if you can't wait for
> upstream XFS developers to fix it properly or you can't be bothered
> to review the patches that actually attempt to fix the problem
> properly...

I'm not worried about xfs. I'm worried about these design decisions
bleeding into other parts of reclaim.

> > The right thing to do to avoid stalls in the inode shrinker is to skip
> > over the dirty inodes and yield back to LRU reclaim; not circumvent
> > page aging and drop clean inodes on the floor when those may or may
> > not hold gigabytes of cache data that the inode shrinker knows
> > *absolutely nothing* about.
> 
> *cough* [*]
> 
> https://lore.kernel.org/linux-mm/20191031234618.15403-1-david@xxxxxxxxxxxxx/
> 
> This implements exactly what you suggest - shrinkers that can
> communicate the need for backoffs to the core infrastructure and
> work deferral to kswapd rather than doing it themselves. And it uses
> that capability to implement non-blocking inode reclaim for XFS.

Does that series end in the shrinkers leaving page reclaim to the page
LRU order?

I'm asking facetiously. Don't get me wrong, I'm interested in what
your patchset promises to implement. However, I'm extremely reluctant
to dive into a series of 28 patches if this is how the discussions go.


