Re: [PATCH 0/7] nfsd: filecache: change garbage collection lists


On Thu, Jan 30, 2025 at 08:34:15AM +1100, NeilBrown wrote:
> On Tue, 28 Jan 2025, Dave Chinner wrote:
> > On Mon, Jan 27, 2025 at 12:20:31PM +1100, NeilBrown wrote:
> > > [
> > > davec added to cc in case I've said something incorrect about list_lru
> > > 
> > > Changes in this version:
> > >   - no _bh locking
> > >   - add name for a magic constant
> > >   - remove unnecessary race-handling code
> > >   - give a more meaningful name for a lock for /proc/lock_stat
> > >   - minor cleanups suggested by Jeff
> > > 
> > > ]
> > > 
> > > The nfsd filecache currently uses list_lru for tracking files recently
> > > used in NFSv3 requests, which need to be "garbage collected" once they
> > > have become idle - unused for 2-4 seconds.
> > > 
> > > I do not believe list_lru is a good tool for this.  It does not provide
> > > the timeout which filecache requires, so we have to add a timeout
> > > mechanism which holds the list_lru lock while the whole list is scanned
> > > looking for entries that haven't been recently accessed.  When the list
> > > is largish (even a few hundred entries) this can noticeably block new
> > > requests, which need the lock to remove a file before accessing it.
> > 
> > Looks entirely like a trivial implementation bug in how the list_lru
> > is walked in nfsd_file_gc().
> > 
> > static void
> > nfsd_file_gc(void)
> > {
> >         LIST_HEAD(dispose);
> >         unsigned long ret;
> > 
> >         ret = list_lru_walk(&nfsd_file_lru, nfsd_file_lru_cb,
> >                             &dispose, list_lru_count(&nfsd_file_lru));
> > 			              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > 
> >         trace_nfsd_file_gc_removed(ret, list_lru_count(&nfsd_file_lru));
> >         nfsd_file_dispose_list_delayed(&dispose);
> > }
> > 
> > i.e. the list_lru_walk() has been told to walk the entire list in a
> > single lock hold if nothing blocks it.
> > 
> > We've known this for a long, long time, and it's something we've
> > handled for a long time with shrinkers, too. here's the typical way
> > of doing a full list aging and GC pass in one go without excessively
> > long lock holds:
> 
> "Typical"?? Can you please point me to an existing example?

Off the top of my head: shrink_dcache_sb().

However, *every single shrinker implementation in the kernel* uses
this algorithm whether it uses list-lru or not.

i.e. this "iterate the list in small batches to minimise lock hold
latency" algorithm has been used by the shrinker infrastructure
since the 2.5.x days.

list-lru was designed explicitly for use with shrinker walk
algorithms, so -typical- usage is to walk in small batches unless
it is known that no concurrent accesses to the list are possible
(e.g. during teardown).

Just because you don't know how list_lru is typically used
doesn't mean that there aren't typical uses...
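The batching pattern is easy to model outside the kernel. Here is a
self-contained userspace sketch of the aging/GC walk done in small
batches, each batch standing in for one lock hold; all struct and
function names below are made up for illustration and are not the nfsd
or list_lru API:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Illustrative stand-in for an LRU entry: "referenced" mirrors a
 * one-shot "recently used" bit that the walk callback tests and
 * clears. */
struct gc_item {
	struct gc_item *next;
	int referenced;
};

struct gc_list {
	struct gc_item *head, *tail;
	size_t count;
};

static void gc_push_tail(struct gc_list *l, struct gc_item *it)
{
	it->next = NULL;
	if (l->tail)
		l->tail->next = it;
	else
		l->head = it;
	l->tail = it;
	l->count++;
}

static struct gc_item *gc_new(int referenced)
{
	struct gc_item *it = calloc(1, sizeof(*it));

	it->referenced = referenced;
	return it;
}

/* One batch, modelling one lock hold: pop up to 'nr' entries off the
 * head, rotate referenced ones to the tail with the bit cleared, free
 * the rest.  Returns the number freed. */
static size_t gc_one_batch(struct gc_list *l, size_t nr)
{
	size_t freed = 0;

	while (nr-- && l->head) {
		struct gc_item *it = l->head;

		l->head = it->next;
		if (!l->head)
			l->tail = NULL;
		l->count--;

		if (it->referenced) {
			it->referenced = 0;
			gc_push_tail(l, it);	/* survives one more period */
		} else {
			free(it);
			freed++;
		}
	}
	return freed;
}

/* Full aging pass in small batches.  Sampling the count up front means
 * each entry is visited at most once, even though rotated entries are
 * put back on the tail of the same list. */
static size_t gc_full_pass(struct gc_list *l, size_t batch)
{
	size_t remaining = l->count, freed = 0;

	while (remaining) {
		size_t nr = remaining < batch ? remaining : batch;

		freed += gc_one_batch(l, nr);
		remaining -= nr;
	}
	return freed;
}
```

Because the walk always pops from the head and rotates survivors to the
tail, each batch resumes exactly where the previous one left off, and
nothing else ever has to remember a position in the list.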

> > IOWs, a batched walk like above resumes the walk exactly where it
> > left off, because it is always either reclaiming or rotating the
> > object at the head of the list.
> 
> "rotating the object" to the head of the per-node list, not the head of
> the whole list_lru (except in the trivial single-node case).

Yup. That's because shrinkers are numa node specific. i.e. memory
reclaim is not global, it is per-node and we have one list_lru list
per node. IOWs, the structure of list-lru is intended to be optimal
for NUMA aware memory reclaim algorithms.

Most important VFS and filesystem caches ceased to have global LRU
ordering a -long- time ago. Scalability to really large machines is
far more important than strict global LRU maintenance.

> list_lru_walk() iterates over the multiple node lists in a fixed order.
> Suppose you have 4 nodes, each with 32 items, all of them referenced, and
> a batch size of 64.
> The first batch will process the 32 items on the first list, clearing the
> referenced bit and moving them to the end of that list, then continue
> through those 32 again, freeing them all.  The second batch will do the
> same for the second list.  The last two lists won't be touched.
> 
> list_lru_walk() is only ever used (correctly) for clearing out a
> list_lru.  It should probably be replaced by a function with a more apt
> name which is targeted at this: no spinlocks, no return value from the
> call-back.
> 
> Yes, we could change the above code to use list_lru_walk_node and wrap a
> for loop around the whole, but then I wonder if list_lru is really
> giving us anything of value.

Apart from scalability and the ability for memory reclaim to do sane
per-node object reclaim? What about the fact that anyone familiar
with list-lru doesn't need to look at how the lists are implemented
to know how the code behaves?

Using common infrastructure, even when it's not an exact perfect
fit, holds a lot more value to the wider community than a one-off
special snowflake implementation that only one or two people
understand....
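For what it's worth, the unfairness Neil describes above, and the
per-node for loop he mentions, can be seen in a toy model of how a scan
budget is spent across per-node lists.  The node count, item counts and
the equal-share split below are illustrative assumptions, not what
list_lru actually implements:

```c
#include <assert.h>
#include <stddef.h>

#define NNODES 4	/* illustrative node count */

/* A single walk over all nodes spends its budget front-to-back, which
 * is roughly what one big all-nodes walk does: early nodes soak up the
 * whole scan count and later nodes are never visited. */
static void walk_front_loaded(const size_t count[NNODES],
			      size_t scanned[NNODES], size_t budget)
{
	for (int n = 0; n < NNODES; n++) {
		size_t nr = count[n] < budget ? count[n] : budget;

		scanned[n] = nr;
		budget -= nr;
	}
}

/* Wrapping a per-node walk in a loop with an equal share of the budget
 * per node visits every node, at the cost of scanning less per node. */
static void walk_per_node(const size_t count[NNODES],
			  size_t scanned[NNODES], size_t budget)
{
	size_t share = budget / NNODES;

	for (int n = 0; n < NNODES; n++)
		scanned[n] = count[n] < share ? count[n] : share;
}
```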

> Walking a linked list just to set a bit in every entry is a lot more work
> than a few list manipulations touching 5 cache-lines.

Maybe so, but the cache gc isn't a performance critical path.

> > > This patch removes the list_lru and instead uses 2 simple linked lists.
> > > When a file is accessed it is removed from whichever list it is on,
> > > then added to the tail of the first list.  Every 2 seconds the second
> > > list is moved to the "freeme" list and the first list is moved to the
> > > second list.  This avoids any need to walk a list to find old entries.
> > 
> > Yup, that's exactly what the current code does via the laundrette
> > work that schedules nfsd_file_gc() to run every two seconds.
> > 
> > > These lists are per-netns rather than global as the freeme list is
> > > per-netns as the actual freeing is done in nfsd threads which are
> > > per-netns.
> > 
> > The list_lru is actually multiple lists - it is a per-numa node list
> > and so moving to global scope linked lists per netns is going to
> > reduce scalability and increase lock contention on large machines.
> 
> Possibly we could duplicate the filecache_disposal structure across
> svc_pools instead of net namespaces.  That would fix the scalability
> issue.  Probably we should also avoid removing and re-adding the file in
> the lru on every access, as that likely causes more spinlock
> contention.  We would need to adjust the ageing mechanism but I suspect
> it could be made to work.

Typical usage of list-lru is lazy removal, i.e. we only add an object
to the LRU list if it's not already there, and only reclaim/gc removes
it from the list.  This is how the inode and dentry caches have
worked since before list_lru even existed, and it's a pattern that
is replicated across many subsystems that use LRUs for gc
purposes...

IOWs, the "object in cache" lookup fast path should never touch
the LRU at all.
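As a rough sketch of that lazy pattern (the names are hypothetical, not
the nfsd code): the lookup fast path sets a cheap per-object flag, the
shared LRU list is touched once on first use, and only gc ever takes
the object back off:

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal model of lazy LRU maintenance.  'lru_touches' counts how
 * often the (notionally lock-protected) shared list is modified. */
struct cobj {
	bool on_lru;
	bool referenced;
	int lru_touches;
};

/* Called on every cache hit. */
static void cache_access(struct cobj *o)
{
	o->referenced = true;		/* cheap per-object flag */
	if (!o->on_lru) {		/* touch the shared list only once */
		o->on_lru = true;
		o->lru_touches++;
	}
}

/* Called by gc: reclaim if idle, otherwise age the object in place. */
static bool gc_visit(struct cobj *o)
{
	if (o->referenced) {
		o->referenced = false;	/* one-shot reference bit */
		return false;		/* rotated, not reclaimed */
	}
	o->on_lru = false;		/* gc is the only remover */
	o->lru_touches++;
	return true;
}
```

However hot the object is, the list itself is modified exactly twice
over its lifetime: once on first use and once at reclaim.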

> > i.e. It's kinda hard to make any real comment on "I do not believe
> > list_lru is a good tool for this" when there is no actual
> > measurements provided to back the statement one way or the other...
> 
> It's not about measurements, it's about behavioural correctness.  Yes, I
> should have spelled that out better.  Thanks for encouraging me to do
> so.

So you are saying that you don't care that the existing code can
easily be fixed, that your one-off solution won't scale as well and
is less functional from a memory reclaim POV, and that the new
implementation is less maintainable than using generic
infrastructure to do the same work.

If that's the approach you are taking, then I don't know why you
asked me to point out all the stuff about list_lru that you didn't
seem to know about in the first place...

-Dave.

-- 
Dave Chinner
david@xxxxxxxxxxxxx
