Re: [patch 9/9] mm: keep page cache radix tree nodes in check

Johannes Weiner <hannes@xxxxxxxxxxx> · Tue, 26 Nov 2013 18:00:10 -0500

On Wed, Nov 27, 2013 at 09:29:37AM +1100, Dave Chinner wrote:
> On Tue, Nov 26, 2013 at 04:27:25PM -0500, Johannes Weiner wrote:
> > On Tue, Nov 26, 2013 at 10:49:21AM +1100, Dave Chinner wrote:
> > > On Sun, Nov 24, 2013 at 06:38:28PM -0500, Johannes Weiner wrote:
> > > > Previously, page cache radix tree nodes were freed after reclaim
> > > > emptied out their page pointers.  But now reclaim stores shadow
> > > > entries in their place, which are only reclaimed when the inodes
> > > > themselves are reclaimed.  This is problematic for bigger files that
> > > > are still in use after they have a significant amount of their cache
> > > > reclaimed, without any of those pages actually refaulting.  The shadow
> > > > entries will just sit there and waste memory.  In the worst case, the
> > > > shadow entries will accumulate until the machine runs out of memory.
> ....
> > > ....
> > > > +	radix_tree_replace_slot(slot, page);
> > > > +	if (node) {
> > > > +		node->count++;
> > > > +		/* Installed page, can't be shadow-only anymore */
> > > > +		if (!list_empty(&node->lru))
> > > > +			list_lru_del(&workingset_shadow_nodes, &node->lru);
> > > > +	}
> > > > +	return 0;
> > > 
> > > Hmmmmm - what's the overhead of direct management of LRU removal
> > > here? Most list_lru code uses lazy removal (i.e. via the shrinker)
> > > to avoid having to touch the LRU when adding new references to an
> > > object.....
> > 
> > It's measurable in microbenchmarks, but not when any real IO is
> > involved.  The difference was in the noise even on SSD drives.
> 
> Well, it's not an SSD or two I'm worried about - it's devices that
> can do millions of IOPS where this is likely to be noticeable...
> 
> > The other list_lru users see items only once they become unused and
> > subsequent references are expected to be few and temporary, right?
> 
> They go onto the list when the refcount falls to zero, but reuse can
> be frequent when the object is referenced repeatedly by a single user.
> Lazy removal avoids taking the object off the LRU and putting it back
> on for every reference cycle...

That's true, but it's less of a concern in the radix_tree_node case
because it takes a full inactive list cycle after a refault before the
node is put back on the LRU.  The only other way back onto the list is
an unlikely, awkwardly placed partial truncation/invalidation of the
node (a full truncation would just delete the whole node anyway).
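
To make the two policies concrete, here is a rough userspace sketch,
not the patch itself.  All demo_* names and the tiny list helpers are
made up for illustration and only stand in for list_lru and struct
radix_tree_node.  The eager variant mirrors the quoted hunk (unlink on
page install); the lazy variant leaves the node linked and lets a
shrinker-style scan drop it later.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal userspace stand-in for the kernel's struct list_head. */
struct list_head { struct list_head *prev, *next; };

static void list_init(struct list_head *h) { h->prev = h->next = h; }
static bool list_is_empty(const struct list_head *h) { return h->next == h; }

static void list_add_tail_entry(struct list_head *n, struct list_head *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

static void list_del_entry(struct list_head *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	list_init(n);		/* self-linked again, i.e. "not on any list" */
}

/* Hypothetical, simplified node: only the fields the policy needs. */
struct demo_node {
	unsigned int count;	/* page pointers currently installed */
	struct list_head lru;	/* linked iff on the shadow-node LRU */
};

struct demo_lru {
	struct list_head head;
	unsigned long ops;	/* number of list modifications so far */
};

static void demo_lru_init(struct demo_lru *lru)
{
	list_init(&lru->head);
	lru->ops = 0;
}

static void demo_node_init(struct demo_node *n)
{
	n->count = 0;
	list_init(&n->lru);
}

/* Node holds only shadow entries now: make it visible to the shrinker. */
static void demo_node_shadow_only(struct demo_lru *lru, struct demo_node *n)
{
	assert(n->count == 0);
	if (list_is_empty(&n->lru)) {
		list_add_tail_entry(&n->lru, &lru->head);
		lru->ops++;
	}
}

/* Eager policy, as in the quoted hunk: unlink when a page is installed. */
static void demo_install_page_eager(struct demo_lru *lru, struct demo_node *n)
{
	n->count++;
	if (!list_is_empty(&n->lru)) {
		list_del_entry(&n->lru);
		lru->ops++;
	}
}

/* Lazy policy: leave the node linked; the scan sorts it out later. */
static void demo_install_page_lazy(struct demo_node *n)
{
	n->count++;
}

/* Shrinker-style scan: count reclaimable nodes, unlink stale ones. */
static unsigned int demo_scan(struct demo_lru *lru)
{
	unsigned int reclaimable = 0;
	struct list_head *pos = lru->head.next;

	while (pos != &lru->head) {
		struct list_head *next = pos->next;
		struct demo_node *n = (struct demo_node *)
			((char *)pos - offsetof(struct demo_node, lru));

		if (n->count > 0) {	/* has pages again: not shadow-only */
			list_del_entry(pos);
			lru->ops++;
		} else {
			reclaimable++;
		}
		pos = next;
	}
	return reclaimable;
}
```

Both variants end up doing one list operation per stale node.  What
differs is where it happens (page cache insertion vs. shrinker scan)
and how many unreclaimable nodes the scan has to walk in the meantime.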

> > We expect pages to refault in spades on certain loads, at which point
> > we may have thousands of those nodes on the list that are no longer
> > reclaimable (10k nodes for about 2.5G of cache).
> 
> Sure, look at the way the inode and dentry caches work - entire
> caches of millions of inodes and dentries often sit on the LRUs. A
> quick look at my workstation's dentry cache shows:
> 
> $ cat /proc/sys/fs/dentry-state
> 180108  170596  45      0       0       0
> 
> 180k allocated dentries, 170k sitting on the LRU...

Hm, and could a significant number of those 170k rotate on the next
shrinker scan due to recent references, or do you generally see
smaller spikes?

But as per above I think the case for lazily removing shadow nodes is
less convincing than for inodes and dentries.
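
For reference, the "10k nodes for about 2.5G of cache" figure quoted
above can be reproduced with a back-of-the-envelope calculation.  This
sketch assumes 64 slots per radix tree node and 4 KiB pages, and counts
only fully covered leaf nodes; the demo_* names are made up for
illustration.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_SLOTS_PER_NODE	64u	/* assumption: 64 slots per node */
#define DEMO_PAGE_SIZE		4096u	/* assumption: 4 KiB pages */

/* Bytes of page cache that 'nodes' leaf nodes can index. */
static uint64_t demo_cache_bytes_covered(uint64_t nodes)
{
	const uint64_t per_node =
		(uint64_t)DEMO_SLOTS_PER_NODE * DEMO_PAGE_SIZE;

	assert(nodes <= UINT64_MAX / per_node);	/* no overflow */
	return nodes * per_node;
}
```

Under those assumptions each node covers 256 KiB, so 10,000 nodes
cover 2,621,440,000 bytes, which is roughly the 2.5G mentioned.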
