On Thu, 06 Feb 2025, Dave Chinner wrote:
> On Thu, Jan 30, 2025 at 08:34:15AM +1100, NeilBrown wrote:
> > On Tue, 28 Jan 2025, Dave Chinner wrote:
> > > On Mon, Jan 27, 2025 at 12:20:31PM +1100, NeilBrown wrote:
> > > > [
> > > > davec added to cc in case I've said something incorrect about list_lru
> > > >
> > > > Changes in this version:
> > > >  - no _bh locking
> > > >  - add name for a magic constant
> > > >  - remove unnecessary race-handling code
> > > >  - give a more meaningful name for a lock for /proc/lock_stat
> > > >  - minor cleanups suggested by Jeff
> > > >
> > > > ]
> > > >
> > > > The nfsd filecache currently uses list_lru for tracking files recently
> > > > used in NFSv3 requests which need to be "garbage collected" when they
> > > > have become idle - unused for 2-4 seconds.
> > > >
> > > > I do not believe list_lru is a good tool for this.  It does not allow
> > > > the timeout which filecache requires, so we have to add a timeout
> > > > mechanism which holds the list_lru lock while the whole list is scanned
> > > > looking for entries that haven't been recently accessed.  When the list
> > > > is largish (even a few hundred entries) this can noticeably block new
> > > > requests, which need the lock to remove a file so they can access it.
> > >
> > > Looks entirely like a trivial implementation bug in how the list_lru
> > > is walked in nfsd_file_gc().
> > >
> > > static void
> > > nfsd_file_gc(void)
> > > {
> > > 	LIST_HEAD(dispose);
> > > 	unsigned long ret;
> > >
> > > 	ret = list_lru_walk(&nfsd_file_lru, nfsd_file_lru_cb,
> > > 			    &dispose, list_lru_count(&nfsd_file_lru));
> > > 			              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > >
> > > 	trace_nfsd_file_gc_removed(ret, list_lru_count(&nfsd_file_lru));
> > > 	nfsd_file_dispose_list_delayed(&dispose);
> > > }
> > >
> > > i.e. the list_lru_walk() has been told to walk the entire list in a
> > > single lock hold if nothing blocks it.
> > >
> > > We've known this for a long, long time, and it's something we've
> > > handled for a long time with shrinkers, too.  Here's the typical way
> > > of doing a full list aging and GC pass in one go without excessively
> > > long lock holds:
> >
> > "Typical"??  Can you please point me to an existing example?
>
> Off the top of my head: shrink_dcache_sb().

shrink_dcache_sb() contains:

	} while (list_lru_count(&sb->s_dentry_lru) > 0);

i.e. it removes *everything* from the list_lru.

There are 5 callers of list_lru_walk() in the kernel.
binder_shrink_scan() appears to be buggy.  Other _scan functions use
list_lru_shrink_walk(), which is per-node as you say.

  binder_selftest_free_page() calls until the list is empty
  shrink_dcache_sb() calls until the list is empty
  nfsd_file_gc() calls for the whole list but does NOT want the list to
    necessarily be empty
  xfs_buftarg_drain() calls until the list is empty

So three of them call until the list is empty, one should be using
list_lru_shrink_walk(), and nfsd wants something quite different.

>
> However, *every single shrinker implementation in the kernel* uses
> this algorithm whether they use list-lru or not.

Yes, but we aren't doing shrinking here.  We are doing ageing.  We want
a firm upper limit on how long things remain cached after their last use.

>
> > list_lru_walk() iterates over the multiple node lists in a fixed order.
> > Suppose you have 4 nodes, each with 32 items, all of them referenced, and
> > a batch size of 64.
> > The first batch will process the 32 items on the first list, clearing the
> > referenced bit and moving them to the end of that list.
> > Then it will continue through those 32 again, freeing them all.  The
> > second batch will do the same for the second list.  The last two lists
> > won't be touched.
> >
> > list_lru_walk() is only ever used (correctly) for clearing out a
> > list_lru.  It should probably be replaced by a function with a more apt
> > name which is targeted at this: no spinlocks, no return value from the
> > call-back.
> >
> > Yes, we could change the above code to use list_lru_walk_node and wrap a
> > for loop around the whole, but then I wonder if list_lru is really
> > giving us anything of value.
>
> Apart from scalability and the ability for memory reclaim to do sane
> per-node object reclaim?  What about the fact that anyone familiar
> with list-lru doesn't need to look at how the lists are implemented
> to know how the code behaves?

What an odd thing to say... how does one become familiar except by
looking at the code?

I accept that my current approach loses per-node object reclaim.  I
don't know how important that is.  I believe the primary cost of nfsd
leaving files on the lru is not the memory usage, but the fact that the
file is held open while not in use.  As I said before, nfsd already has
some per-node awareness.  Linking the file ageing into that would be
easy enough.

>
> Using common infrastructure, even when it's not an exact perfect
> fit, holds a lot more value to the wider community than a one-off
> special snowflake implementation that only one or two people
> understand....

While there is certainly truth in that, it is also true that using
common infrastructure in a non-idiomatic way can cause confusion.
Sometimes things that are different should look different.
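
For reference, the shrink_dcache_sb() loop that the "} while (...)" line
above comes from looks roughly like this - quoted from memory, so treat
it as a sketch of the pattern rather than a verbatim copy of mainline:

	void shrink_dcache_sb(struct super_block *sb)
	{
		do {
			LIST_HEAD(dispose);

			/* walk at most 1024 entries per lock hold ... */
			list_lru_walk(&sb->s_dentry_lru,
				      dentry_lru_isolate_shrink, &dispose, 1024);
			shrink_dentry_list(&dispose);
			/* ... but keep going until the lru is completely empty */
		} while (list_lru_count(&sb->s_dentry_lru) > 0);
	}

The bounded batch keeps each lock hold short, but the loop only
terminates once the list_lru has been emptied - which is exactly what
nfsd_file_gc() must not do.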
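
To make the list_lru_walk_node option concrete, here is a rough sketch
of what a per-node, batched nfsd_file_gc() might look like.  It is
illustrative only, not a tested patch: the batch size of 64 is
arbitrary, and the accounting assumes we only want to visit the entries
that were present when the pass started.

	static void
	nfsd_file_gc(void)
	{
		unsigned long ret = 0;
		int nid;

		for_each_node_state(nid, N_NORMAL_MEMORY) {
			/* Only visit as many entries as were on this node's
			 * list when we started, so entries that
			 * nfsd_file_lru_cb rotates to the tail are not
			 * revisited in this pass.
			 */
			unsigned long remaining =
				list_lru_count_node(&nfsd_file_lru, nid);

			while (remaining > 0) {
				/* bound each lock hold to a small batch */
				unsigned long nr = min(remaining, 64UL);
				LIST_HEAD(dispose);

				remaining -= nr;
				ret += list_lru_walk_node(&nfsd_file_lru, nid,
							  nfsd_file_lru_cb,
							  &dispose, &nr);
				nfsd_file_dispose_list_delayed(&dispose);
			}
		}
		trace_nfsd_file_gc_removed(ret, list_lru_count(&nfsd_file_lru));
	}

That keeps each lock hold bounded and still ages every node's list, but
as I said above I'm not convinced it buys us much over a dedicated list
with its own timeout handling.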