Re: [PATCH v7 18/34] fs: convert fs shrinkers to new scan/count API

On Mon, May 20, 2013 at 07:25:47PM +0400, Glauber Costa wrote:
> On 05/20/2013 05:46 PM, Glauber Costa wrote:
> > On 05/20/2013 12:25 PM, Steven Whitehouse wrote:
> >> Hi,
> >>
> >> On Mon, 2013-05-20 at 00:07 +0400, Glauber Costa wrote:
> >>> From: Dave Chinner <dchinner@xxxxxxxxxx>
> >>>
> >>> Convert the filesystem shrinkers to use the new API, and standardise
> >>> some of the behaviours of the shrinkers at the same time. For
> >>> example, nr_to_scan means the number of objects to scan, not the
> >>> number of objects to free.
> >>>
> >>> I refactored the CIFS idmap shrinker a little - it really needs to
> >>> be broken up into a shrinker per tree and keep an item count with
> >>> the tree root so that we don't need to walk the tree every time the
> >>> shrinker needs to count the number of objects in the tree (i.e.
> >>> all the time under memory pressure).
> >>>
> >>> [ glommer: fixes for ext4, ubifs, nfs, cifs and glock. Fixes are
> >>>   needed mainly due to new code merged in the tree ]
> >>> Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
> >>> Signed-off-by: Glauber Costa <glommer@xxxxxxxxxx>
> >>> Acked-by: Mel Gorman <mgorman@xxxxxxx>
> >>> Acked-by: Artem Bityutskiy <artem.bityutskiy@xxxxxxxxxxxxxxx>
> >>> Acked-by: Jan Kara <jack@xxxxxxx>
> >>> CC: Steven Whitehouse <swhiteho@xxxxxxxxxx>
> >>> CC: Adrian Hunter <adrian.hunter@xxxxxxxxx>
> >>> ---
> >>>  fs/ext4/extents_status.c | 30 ++++++++++++++++------------
> >>>  fs/gfs2/glock.c          | 28 +++++++++++++++-----------
> >>>  fs/gfs2/main.c           |  3 ++-
> >>>  fs/gfs2/quota.c          | 12 +++++++-----
> >>>  fs/gfs2/quota.h          |  4 +++-
> >>>  fs/mbcache.c             | 51 ++++++++++++++++++++++++++++--------------------
> >>>  fs/nfs/dir.c             | 18 ++++++++++++++---
> >>>  fs/nfs/internal.h        |  4 +++-
> >>>  fs/nfs/super.c           |  3 ++-
> >>>  fs/nfsd/nfscache.c       | 31 ++++++++++++++++++++---------
> >>>  fs/quota/dquot.c         | 34 +++++++++++++++-----------------
> >>>  fs/ubifs/shrinker.c      | 20 +++++++++++--------
> >>>  fs/ubifs/super.c         |  3 ++-
> >>>  fs/ubifs/ubifs.h         |  3 ++-
> >>>  14 files changed, 151 insertions(+), 93 deletions(-)
> >> [snip]
> >>>  		return 0;
> >>> diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
> >>> index 3bd2748..4ddbccb 100644
> >>> --- a/fs/gfs2/glock.c
> >>> +++ b/fs/gfs2/glock.c
> >>> @@ -1428,21 +1428,22 @@ __acquires(&lru_lock)
> >>>   * gfs2_dispose_glock_lru() above.
> >>>   */
> >>>  
> >>> -static void gfs2_scan_glock_lru(int nr)
> >>> +static long gfs2_scan_glock_lru(int nr)
> >>>  {
> >>>  	struct gfs2_glock *gl;
> >>>  	LIST_HEAD(skipped);
> >>>  	LIST_HEAD(dispose);
> >>> +	long freed = 0;
> >>>  
> >>>  	spin_lock(&lru_lock);
> >>> -	while(nr && !list_empty(&lru_list)) {
> >>> +	while ((nr-- >= 0) && !list_empty(&lru_list)) {
> >>>  		gl = list_entry(lru_list.next, struct gfs2_glock, gl_lru);
> >>>  
> >>>  		/* Test for being demotable */
> >>>  		if (!test_and_set_bit(GLF_LOCK, &gl->gl_flags)) {
> >>>  			list_move(&gl->gl_lru, &dispose);
> >>>  			atomic_dec(&lru_count);
> >>> -			nr--;
> >>> +			freed++;
> >>>  			continue;
> >>>  		}
> >>>  
> >>
> >> This seems to change behaviour so that nr is no longer the number of
> >> items to be demoted, but instead the max number of items to scan in
> >> order to look for items to be demoted. Does that mean that nr has
> >> changed its meaning now?

The original code above assumes that nr_to_scan == "number of
objects to free". That is wrong: nr_to_scan has always been defined
as "the number of objects to look at *to try* to free".

Why is nr_to_scan the number of objects to look at? Because if we
take it as the number of objects to free, then we can end up walking
the entire LRU list looking for objects to free in a single callout.
And when you have several million objects on the LRU list and none
of them are freeable, that adds a huge amount of CPU overhead to the
shrinker scan loop.

And if it really means "free this many objects", then by that
definition things like the referenced bits on objects (which the
inode and dentry caches use) are meaningless, because we would
simply rotate all the referenced objects on the list, then make
another pass across the list and start freeing them until we've
freed nr_to_scan objects....

So, nr_to_scan is the number of objects in the cache to *look at*,
not the number of objects to free.
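
To make that concrete, here's a minimal sketch of a scan callout
(all names are illustrative, not from the patch). nr_to_scan bounds
how far we walk; freeable objects are pulled off for disposal and
unfreeable ones are rotated, so the number freed is usually smaller
than the number scanned:

static long example_scan_lru(long nr_to_scan)
{
	LIST_HEAD(dispose);
	long freed = 0;

	spin_lock(&example_lru_lock);
	while (nr_to_scan-- > 0 && !list_empty(&example_lru)) {
		struct example_obj *obj;

		obj = list_first_entry(&example_lru,
				       struct example_obj, lru);
		if (example_obj_freeable(obj)) {
			/* Freeable: pull it off the LRU for disposal. */
			list_move(&obj->lru, &dispose);
			freed++;
		} else {
			/*
			 * Unfreeable right now: rotate it to the other
			 * end of the LRU so repeated scan calls don't
			 * keep revisiting it.
			 */
			list_move_tail(&obj->lru, &example_lru);
		}
	}
	spin_unlock(&example_lru_lock);

	example_dispose_list(&dispose);	/* free outside the lock */
	return freed;
}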

> >> Steve.
> >>
> > No, this should be the max number to be demoted, no change.
> > This test above should then be freed < nr.
> > 
> > I will update, thanks for spotting.
> > 
> Dave,
> 
> I am auditing the other conversions now for patterns like this.
> In the ashmem driver, you wrote:
> 
> - * 'nr_to_scan' is the number of objects (pages) to prune, or 0 to
> query how
> - * many objects (pages) we have in total.
> + * 'nr_to_scan' is the number of objects to scan for freeing.
> 
> Can you please clarify what is your intention here? For me, nr_to_scan
> is still the amount we should try to free - and it has been this way for
> a while - even if we have to *scan* more objects than this, because
> some of them cannot be freed.

Then your understanding of what nr_to_scan means is wrong. See
above.

> In the shrinkers you have been converting, I actually found both kinds
> of behaviors: In some of them you test for freed >= nr_to_scan, and in
> others, --nr_to_scan > 0

That depended on the existing code in the shrinkers and whether it
was obvious or possible to make them behave properly. The ones that
check freed >= nr_to_scan are ones where I've just converted the
interface, and not touched the actual shrinker implementation.

e.g. the ubifs shrinker - I've got no idea exactly what that is
doing, what it is freeing or how much each of those function calls
frees. So rather than screw with something I didn't have the
patience to understand or the infrastructure to test, I simply changed
the interface. i.e. the shrinker behaves the same as it did before
the interface change.

> My assumption is that scanning the objects is cheap compared to the
> other cache operations we do, filling or unfilling, so whoever can
> should incur the extra couple of scans to make sure that we free as
> many objects as requested *if we can*.

No, scanning is extremely expensive. Look at the dcache - we have to
take a spinlock on every object we scan. So if we can't free any
objects at the tail of the LRU, then just how far down that list of
millions of dentries should we walk to find freeable dentries?

This is the whole point of the batch based scan loop - if we have
lots of scanning to do, we do it over multiple scan callouts. It is
up to the shrinker either to free objects or to rotate unfreeable
ones to the other end of the LRU so they aren't rescanned by
subsequent shrinker scan calls.

The shrinker does not need to guarantee progress or that objects
are being freed. We take great advantage of this in the XFS metadata
buffer shrinker to apply hierarchical reclaim priorities, ensuring
that important metadata (e.g. btree roots) is not reclaimed before
less important metadata (e.g. btree leaves). We do this inside a
scan pass simply by decrementing the "LRU age" of each buffer and
rotating it to the other end of the LRU. Buffers only get freed when
their "LRU age" reaches zero.

This is simply a more expansive implementation of the inode/dcache
referenced bit handling, but it reinforces the fact that nr_to_scan
!= nr_to_free.


> I am changing all shrinkers now to behave consistently like this, i.e.
> bailing out when freed > nr_to_scan instead of when nr_to_scan == 0 (for
> most of them it is thankfully obvious that those two things are the same).

No, that's wrong. A shrinker is allowed to free more than it
scanned. This often happens with the dentry cache (parents get
freed), but they aren't directly accounted by the shrinker in this
case.

What it comes down to is what is reported to the shrinker as an
object count, and what is actually scanned to free those objects.
Several shrinkers count pages, but scan objects that point to an
arbitrary number of pages. Some of those objects may be freeable,
some may not, but there is no 1:1 relationship between the objects
being scanned and the objects being counted/freed by the shrinker.

Hence a blanket "freed > nr_to_scan" abort is not the correct thing
to do. The shrinkers with this sort of non-linear object:cache size
relationship need further changes to the shrinker interface to work
correctly, so all I've done with them is maintain the status quo.
This patchset is for changing infrastructure, not for trying to fix
all the brokenness in the existing shrinkers.

Remember: the count/scan change is only the first step in this
process. Once this is done we can start to move away from the object
based accounting and scanning to something more appropriate for
these non-linear caches. e.g. byte based accounting and scanning.
The shrinker interface is currently very slab-cache centric, and
that is one of the reasons we have all these hacky shrinkers: the
caches they track don't fit into the neat linear slab object model.
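
For reference, the rough shape of the split interface this series
introduces - a cheap counting callout paired with a bounded scanning
callout (exact signatures in this v7 posting may differ in detail):

struct shrinker {
	/* Cheap: report how many objects the cache could scan. */
	unsigned long (*count_objects)(struct shrinker *shrink,
				       struct shrink_control *sc);
	/*
	 * Scan up to sc->nr_to_scan objects and return the number
	 * actually freed - possibly zero, possibly more than were
	 * scanned.
	 */
	unsigned long (*scan_objects)(struct shrinker *shrink,
				      struct shrink_control *sc);
	...
};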

> If you for any reason wanted nr_to_scan to mean # of objects *scanned*,
> not freed, IOW, if this is not a mistake, please say so and justify.

Justification presented. nr_to_scan has *always* meant "# of objects
*scanned*", and this patchset does not change that.

Cheers,

Dave.
> 
> Thanks.
> 
> 
> 
> 

-- 
Dave Chinner
david@xxxxxxxxxxxxx



