Re: [PATCH 4/6] nfsd: filecache: introduce NFSD_FILE_RECENT

Jeff Layton <jlayton@xxxxxxxxxx> · Mon, 10 Feb 2025 09:25:06 -0500

On Mon, 2025-02-10 at 13:31 +1100, NeilBrown wrote:
> On Mon, 10 Feb 2025, Chuck Lever wrote:
> > On 2/9/25 6:23 PM, NeilBrown wrote:
> > > On Sat, 08 Feb 2025, Chuck Lever wrote:
> > > > On 2/7/25 12:15 AM, NeilBrown wrote:
> > > > > The filecache lru is walked in 2 circumstances for 2 different reasons.
> > > > > 
> > > > > 1/ When called from the shrinker we want to discard the first few
> > > > >    entries on the list, ignoring any with NFSD_FILE_REFERENCED set
> > > > >    because they should really be at the end of the LRU as they have been
> > > > >    referenced recently.  So those ones are ROTATED.
> > > > > 
> > > > > 2/ When called from the nfsd_file_gc() timer function we want to discard
> > > > >    anything that hasn't been used since before the previous call, and
> > > > >    mark everything else as unused at this point in time.
> > > > > 
> > > > > Using the same flag for both of these can result in some unexpected
> > > > > outcomes.  If the shrinker callback clears NFSD_FILE_REFERENCED then the
> > > > > nfsd_file_gc() will think the file hasn't been used in a while, while
> > > > > really it has.
> > > > > 
> > > > > I think it is easier to reason about the behaviour if we instead have
> > > > > two flags.
> > > > > 
> > > > >  NFSD_FILE_REFERENCED means "this should be at the end of the LRU, please
> > > > >      put it there when convenient"
> > > > >  NFSD_FILE_RECENT means "this has been used recently - since the last
> > > > >      run of nfsd_file_gc()
> > > > > 
> > > > > When either caller finds an NFSD_FILE_REFERENCED entry, that entry
> > > > > should be moved to the end of the LRU and the flag cleared.  This can
> > > > > safely happen at any time.  The actual order on the lru might not be
> > > > > strictly least-recently-used, but that is normal for linux lrus.
> > > > > 
> > > > > The shrinker callback can ignore the "recent" flag.  If it ends up
> > > > > freeing something that is "recent" that simply means that memory
> > > > > pressure is sufficient to limit the acceptable cache age to less than
> > > > > the nfsd_file_gc frequency.
> > > > > 
> > > > > The gc caller should primarily focus on NFSD_FILE_RECENT.  It should
> > > > > free everything that doesn't have this flag set, and should clear the
> > > > > flag on everything else.  When it clears the flag it is convenient to
> > > > > clear the "REFERENCED" flag and move to the end of the LRU too.
> > > > > 
> > > > > With this, calls from the shrinker do not prematurely age files.  It
> > > > > will focus only on freeing those that are least recently used.
> > > > > 
> > > > > Signed-off-by: NeilBrown <neilb@xxxxxxx>
> > > > > ---
> > > > >  fs/nfsd/filecache.c | 21 +++++++++++++++++++--
> > > > >  fs/nfsd/filecache.h |  1 +
> > > > >  fs/nfsd/trace.h     |  3 +++
> > > > >  3 files changed, 23 insertions(+), 2 deletions(-)
> > > > > 
> > > > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> > > > > index 04588c03bdfe..9faf469354a5 100644
> > > > > --- a/fs/nfsd/filecache.c
> > > > > +++ b/fs/nfsd/filecache.c
> > > > > @@ -318,10 +318,10 @@ nfsd_file_check_writeback(struct nfsd_file *nf)
> > > > >  		mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK);
> > > > >  }
> > > > >  
> > > > > -
> > > > >  static bool nfsd_file_lru_add(struct nfsd_file *nf)
> > > > >  {
> > > > >  	set_bit(NFSD_FILE_REFERENCED, &nf->nf_flags);
> > > > > +	set_bit(NFSD_FILE_RECENT, &nf->nf_flags);
> > > > >  	if (list_lru_add_obj(&nfsd_file_lru, &nf->nf_lru)) {
> > > > >  		trace_nfsd_file_lru_add(nf);
> > > > >  		return true;
> > > > > @@ -528,6 +528,23 @@ nfsd_file_lru_cb(struct list_head *item, struct list_lru_one *lru,
> > > > >  	return LRU_REMOVED;
> > > > >  }
> > > > >  
> > > > > +static enum lru_status
> > > > > +nfsd_file_gc_cb(struct list_head *item, struct list_lru_one *lru,
> > > > > +		 void *arg)
> > > > > +{
> > > > > +	struct nfsd_file *nf = list_entry(item, struct nfsd_file, nf_lru);
> > > > > +
> > > > > +	if (test_and_clear_bit(NFSD_FILE_RECENT, &nf->nf_flags)) {
> > > > > +		/* "REFERENCED" really means "should be at the end of the LRU.
> > > > > +		 * As we are putting it there we can clear the flag
> > > > > +		 */
> > > > > +		clear_bit(NFSD_FILE_REFERENCED, &nf->nf_flags);
> > > > > +		trace_nfsd_file_gc_aged(nf);
> > > > > +		return LRU_ROTATE;
> > > > > +	}
> > > > > +	return nfsd_file_lru_cb(item, lru, arg);
> > > > > +}
> > > > > +
> > > > >  static void
> > > > >  nfsd_file_gc(void)
> > > > >  {
> > > > > @@ -537,7 +554,7 @@ nfsd_file_gc(void)
> > > > >  
> > > > >  	for_each_node_state(nid, N_NORMAL_MEMORY) {
> > > > >  		unsigned long nr = list_lru_count_node(&nfsd_file_lru, nid);
> > > > > -		ret += list_lru_walk_node(&nfsd_file_lru, nid, nfsd_file_lru_cb,
> > > > > +		ret += list_lru_walk_node(&nfsd_file_lru, nid, nfsd_file_gc_cb,
> > > > >  					  &dispose, &nr);
> > > > >  	}
> > > > >  	trace_nfsd_file_gc_removed(ret, list_lru_count(&nfsd_file_lru));
> > > > > diff --git a/fs/nfsd/filecache.h b/fs/nfsd/filecache.h
> > > > > index d5db6b34ba30..de5b8aa7fcb0 100644
> > > > > --- a/fs/nfsd/filecache.h
> > > > > +++ b/fs/nfsd/filecache.h
> > > > > @@ -38,6 +38,7 @@ struct nfsd_file {
> > > > >  #define NFSD_FILE_PENDING	(1)
> > > > >  #define NFSD_FILE_REFERENCED	(2)
> > > > >  #define NFSD_FILE_GC		(3)
> > > > > +#define NFSD_FILE_RECENT	(4)
> > > > >  	unsigned long		nf_flags;
> > > > >  	refcount_t		nf_ref;
> > > > >  	unsigned char		nf_may;
> > > > > diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
> > > > > index ad2c0c432d08..9af723eeb2b0 100644
> > > > > --- a/fs/nfsd/trace.h
> > > > > +++ b/fs/nfsd/trace.h
> > > > > @@ -1039,6 +1039,7 @@ DEFINE_CLID_EVENT(confirmed_r);
> > > > >  		{ 1 << NFSD_FILE_HASHED,	"HASHED" },		\
> > > > >  		{ 1 << NFSD_FILE_PENDING,	"PENDING" },		\
> > > > >  		{ 1 << NFSD_FILE_REFERENCED,	"REFERENCED" },		\
> > > > > +		{ 1 << NFSD_FILE_RECENT,	"RECENT" },		\
> > > > >  		{ 1 << NFSD_FILE_GC,		"GC" })
> > > > >  
> > > > >  DECLARE_EVENT_CLASS(nfsd_file_class,
> > > > > @@ -1317,6 +1318,7 @@ DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_lru_del_disposed);
> > > > >  DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_in_use);
> > > > >  DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_writeback);
> > > > >  DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_referenced);
> > > > > +DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_aged);
> > > > >  DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_disposed);
> > > > >  
> > > > >  DECLARE_EVENT_CLASS(nfsd_file_lruwalk_class,
> > > > > @@ -1346,6 +1348,7 @@ DEFINE_EVENT(nfsd_file_lruwalk_class, name,				\
> > > > >  	TP_ARGS(removed, remaining))
> > > > >  
> > > > >  DEFINE_NFSD_FILE_LRUWALK_EVENT(nfsd_file_gc_removed);
> > > > > +DEFINE_NFSD_FILE_LRUWALK_EVENT(nfsd_file_gc_recent);
> > > > >  DEFINE_NFSD_FILE_LRUWALK_EVENT(nfsd_file_shrinker_removed);
> > > > >  
> > > > >  TRACE_EVENT(nfsd_file_close,
> > > > 
> > > > The other patches in this series look like solid improvements. This one
> > > > could be as well, but it will take me some time to understand it.
> > > > 
> > > > I am generally in favor of replacing the logic that removes and adds
> > > > these items with a single atomic bitop, and I'm happy to see NFSD stick
> > > > with the use of an existing LRU facility while documenting its unique
> > > > requirements ("nfsd_file_gc_aged" and so on).
> > > > 
> > > > I would still prefer the backport to be lighter -- looks like the key
> > > > changes are 3/6 and 6/6. Is there any chance the series can be
> > > > reorganized to facilitate backporting? I have to ask, and the answer
> > > > might be "no", I realize.
> > > 
> > > I'm going with "no".
> > > To be honest, I was hoping that the complexity displayed here needed
> > > to work around the assumptions of list_lru what don't match our needs
> > > would be sufficient to convince you that list_lru isn't worth pursuing. 
> > > I see that didn't work.
> > 
> > Fair enough.
> > 
> > 
> > > So I am no longer invested in this patch set.  You are welcome to use it
> > > if you wish and to make any changes that you think are suitable, but I
> > > don't think it is a good direction to go and will not be offering any
> > > more code changes to support the use of list_lru here.
> > 
> > If I may observe, you haven't offered a compelling explanation of why an
> > imperfect fit between list_lru and the filecache adds more technical
> > debt than does the introduction of a bespoke LRU mechanism.
> > 
> > I'm open to that argument, but I need stronger rationale (or performance
> > data) to back it up. So far I can agree that the defect rate in this
> > area is somewhat abnormal, but that seems to be because we don't
> > understand how to use the list_lru API to its best advantage.
> > 
> 
> I would characterise the cause of the defect rate differently.
> I would say it is because we are treating this as an lru-style problem
> when it isn't an lru-style problem.  list_lru is great for lrus.  That
> isn't what we have.
> 
> What we have is a desire to keep files open between consecutive IO
> requests without any clear indication of when we have seen the last in a
> series of IO requests.  So we make a policy decision "keep files open
> until there have been no IOs for 2 seconds - then close them".
> This is a good and sensible policy that nothing to do with the "LRU"
> concept. 
> 
> We implement this policy by keeping all unused files on a list, set a
> flag every time the file is used, clearing the flag on a timer tick
> (every 2 seconds) and closing anything which still has the flag cleared
> 2 seconds later.
> 
> Still nothing in this description that is at all related to LRU
> concepts.
> 
> Now we decide that it would be good the add a shinker - fair enough as
> we don't *need* these to remain.  How should the shrinker choose files
> to close?  It probably doesn't matter beyond avoiding files that still
> have the not-timed-out flag set.
> 
> But we try to also impose an LRU disciple over the list, and we use
> list_lru.
> The interfaces for list_lru() are well documented but the intent is
> not.  Most users of list_lru (gfs2/quota might be an exception) only
> explicitly delete things from the lru when it is time to discard them
> completely.  They rely on the shrinker to detect things that are in use
> again, and to remove them.  And possibly to detect things that have been
> referenced and to rotate them.  But if the shrinker doesn't run because
> there isn't much memory pressure they are just left alone.
> 
> This is what list_lru is optimised for - for shrinker driven scanning
> which skips or removes or rotates things that can't or shouldn't
> be freed, and frees others.  You would expect to normally only scan a
> small fraction of the list, because realistically you want to keep most
> of them.
> 
> For filecache we don't want to keep them very long.  So I think it
> matters a lot less what we choose for shrinking.  I'm tempted to suggest
> we don't bother with the shrinker.  Old files will be discarded soon
> anyway if they aren't used, and slowness in memory allocation (due to
> any memory pressure) will naturally slow down the addition of new files
> to the cache.  So the cache will shrink naturally.
> 
> I'm not 100% certain of that, but I do think that the needs of the
> shrinker should not dominate the design as they currently do.
> 
> Note that maybe we *don't* need to close files so quickly.  Maybe we
> could discard the whole timer thing, and then it would make sense to use
> list_lru().  What is the cost of keeping them open?
> 
> All I can think of is that it affects unlink.  An unlinked file won't be
> removed while there is a reference to the inode.  Maybe we should
> address that by taking a lease on the file while it is in the
> filecache??  When the lease is broken, we discard the file from the
> cache. 

It may also affect other applications trying to take out leases. The
filecache has the  nfsd_file_lease_notifier that tells it when someone
is trying to take out a lease on a file. That happens then it will try
to close the file first.

>
> If that could work (which might involve creating a new internal lease
> type that is only broken on unlink), then we could remove the timeout
> and leave files in the cache indefinitely.  Then it would make perfect
> sense to use list_lru() because the problem would start to look exactly
> like an LRU problem.  But I don't think that is what we have today.
> 

The filecache already sets a fsnotify_mark on the inode to watch for
its i_nlink to go to 0, and then removes it from the cache when that
happens. I think we could keep these files open for quite a bit longer
if we chose to do so.

One thing that Chuck has brought up a few times is that maybe we should
consider making v4 not use the filecache at all. If that simplifies
things then that might be a good thing to consider as well.
-- 
Jeff Layton <jlayton@xxxxxxxxxx>