On Thu, 06 Feb 2025, Dave Chinner wrote:
> On Thu, Jan 30, 2025 at 08:34:15AM +1100, NeilBrown wrote:
> > On Tue, 28 Jan 2025, Dave Chinner wrote:
> > > On Mon, Jan 27, 2025 at 12:20:31PM +1100, NeilBrown wrote:
> > > > [
> > > > davec added to cc in case I've said something incorrect about list_lru
> > > >
> > > > Changes in this version:
> > > >  - no _bh locking
> > > >  - add name for a magic constant
> > > >  - remove unnecessary race-handling code
> > > >  - give a more meaningful name for a lock for /proc/lock_stat
> > > >  - minor cleanups suggested by Jeff
> > > >
> > > > ]
> > > >
> > > > The nfsd filecache currently uses list_lru for tracking files recently
> > > > used in NFSv3 requests which need to be "garbage collected" when they
> > > > have become idle - unused for 2-4 seconds.
> > > >
> > > > I do not believe list_lru is a good tool for this.  It does not allow
> > > > the timeout which filecache requires, so we have to add a timeout
> > > > mechanism which holds the list_lru lock while the whole list is scanned
> > > > looking for entries that haven't been recently accessed.  When the list
> > > > is largish (even a few hundred entries) this can noticeably block new
> > > > requests, which need the lock to remove a file so they can access it.
> > >
> > > Looks entirely like a trivial implementation bug in how the list_lru
> > > is walked in nfsd_file_gc().
> > >
> > > static void
> > > nfsd_file_gc(void)
> > > {
> > > 	LIST_HEAD(dispose);
> > > 	unsigned long ret;
> > >
> > > 	ret = list_lru_walk(&nfsd_file_lru, nfsd_file_lru_cb,
> > > 			    &dispose, list_lru_count(&nfsd_file_lru));
> > > 			              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > >
> > > 	trace_nfsd_file_gc_removed(ret, list_lru_count(&nfsd_file_lru));
> > > 	nfsd_file_dispose_list_delayed(&dispose);
> > > }
> > >
> > > i.e. the list_lru_walk() has been told to walk the entire list in a
> > > single lock hold if nothing blocks it.
> > >
> > > We've known this for a long, long time, and it's something we've
> > > handled for a long time with shrinkers, too.  Here's the typical way
> > > of doing a full list aging and GC pass in one go without excessively
> > > long lock holds:
> >
> > "Typical"??  Can you please point me to an existing example?
>
> Off the top of my head: shrink_dcache_sb().

shrink_dcache_sb() contains:

	} while (list_lru_count(&sb->s_dentry_lru) > 0);

i.e. it removes *everything* from the list_lru.

There are 5 callers of list_lru_walk() in the kernel.
binder_shrink_scan() appears to be buggy.  Other _scan functions use
list_lru_shrink_walk(), which is per-node as you say.

  binder_selftest_free_page() calls until the list is empty
  shrink_dcache_sb() calls until the list is empty
  nfsd_file_gc() calls for the whole list but does NOT want the list to
    necessarily be empty
  xfs_buftarg_drain() calls until the list is empty

So three of them call until the list is empty, one should be using
list_lru_shrink_walk(), and nfsd wants something quite different.

>
> However, *every single shrinker implementation in the kernel* uses
> this algorithm whether they use list-lru or not.

Yes, but we aren't doing shrinking here.  We are doing ageing.  We want
a firm upper limit on how long things remain cached after their last use.

>
> > list_lru_walk() iterates over the multiple node lists in a fixed order.
> > Suppose you have 4 nodes, each with 32 items, all of them referenced, and
> > a batch size of 64.
> > The first batch will process the 32 items on the first list, clearing the
> > referenced bit and moving them to the end of that list.
> > Then it will continue through those 32 again, freeing them all.  The
> > second batch will do the same for the second list.  The last two lists
> > won't be touched.
> >
> > list_lru_walk() is only ever used (correctly) for clearing out a
> > list_lru.  It should probably be replaced by a function with a more apt
> > name which is targeted at this: no spinlocks, no return value from the
> > call-back.
> >
> > Yes, we could change the above code to use list_lru_walk_node and wrap a
> > for loop around the whole, but then I wonder if list_lru is really
> > giving us anything of value.
>
> Apart from scalability and the ability for memory reclaim to do sane
> per-node object reclaim?  What about the fact that anyone familiar
> with list-lru doesn't need to look at how the lists are implemented
> to know how the code behaves?

What an odd thing to say... how does one become familiar except by
looking at the code?

I accept that my current approach loses per-node object reclaim.  I
don't know how important that is.  I believe the primary cost of nfsd
leaving files on the lru is not the memory usage, but the fact that the
file is held open while not in use.  As I said before, nfsd already has
some per-node awareness.  Linking the file ageing into that would be
easy enough.

>
> Using common infrastructure, even when it's not an exact perfect
> fit, holds a lot more value to the wider community than a one-off
> special snowflake implementation that only one or two people
> understand....

While there is certainly truth in that, it is also true that using
common infrastructure in a non-idiomatic way can cause confusion.
Sometimes things that are different should look different.
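
For reference, the shrink_dcache_sb() loop that the "} while (...)" line
above comes from looks roughly like this - quoted from memory, so treat
it as a sketch of the pattern rather than a verbatim copy of mainline:

	void shrink_dcache_sb(struct super_block *sb)
	{
		do {
			LIST_HEAD(dispose);

			/* walk at most 1024 entries per lock hold ... */
			list_lru_walk(&sb->s_dentry_lru,
				      dentry_lru_isolate_shrink, &dispose, 1024);
			shrink_dentry_list(&dispose);
			/* ... but keep going until the lru is completely empty */
		} while (list_lru_count(&sb->s_dentry_lru) > 0);
	}

The bounded batch keeps each lock hold short, but the loop only
terminates once the list_lru has been emptied - which is exactly what
nfsd_file_gc() must not do.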
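
To make the list_lru_walk_node option concrete, here is a rough sketch
of what a per-node, batched nfsd_file_gc() might look like.  It is
illustrative only, not a tested patch: the batch size of 64 is
arbitrary, and the accounting assumes we only want to visit the entries
that were present when the pass started.

	static void
	nfsd_file_gc(void)
	{
		unsigned long ret = 0;
		int nid;

		for_each_node_state(nid, N_NORMAL_MEMORY) {
			/* Only visit as many entries as were on this node's
			 * list when we started, so entries that
			 * nfsd_file_lru_cb rotates to the tail are not
			 * revisited in this pass.
			 */
			unsigned long remaining =
				list_lru_count_node(&nfsd_file_lru, nid);

			while (remaining > 0) {
				/* bound each lock hold to a small batch */
				unsigned long nr = min(remaining, 64UL);
				LIST_HEAD(dispose);

				remaining -= nr;
				ret += list_lru_walk_node(&nfsd_file_lru, nid,
							  nfsd_file_lru_cb,
							  &dispose, &nr);
				nfsd_file_dispose_list_delayed(&dispose);
			}
		}
		trace_nfsd_file_gc_removed(ret, list_lru_count(&nfsd_file_lru));
	}

That keeps each lock hold bounded and still ages every node's list, but
as I said above I'm not convinced it buys us much over a dedicated list
with its own timeout handling.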