Dave Chinner <david@xxxxxxxxxxxxx> writes: > On Tue, Apr 05, 2022 at 10:18:01PM +0100, Matthew Wilcox wrote: >> On Tue, Apr 05, 2022 at 10:22:14AM -0700, Stephen Brennan wrote: >> > I can't speak for every slab cache, but I've been coming to the same >> > conclusion myself regarding the dentry cache. I think that the rate of >> > stepping through the LRU should be tied to the rate of allocations. >> > Truly in-use objects shouldn't be harmed by this, as they should get >> > referenced and rotated to the beginning of the LRU. But the one-offs >> > which are bloating the cache will be found and removed. >> >> I agree with all this. > > Same here. > >> > I've implemented a version of this patch which just takes one step >> > through the LRU on each d_alloc. It's quite interesting to watch it >> > work. You can create 5 million negative dentries in directory /A via >> > stat(), and then create 5 million negative dentries in directory /B. The >> > total dentry slab size reaches 5 million but never goes past it, since >> > the old negative dentries from /A aren't really in use, and they get >> > pruned at the same rate as negative dentries from /B get created. On the >> > other hand, if you *continue* to stat() on the dentries of /A while you >> > create negative dentries in /B, then the cache grows to 10 million, >> > since the /A dentries are actually in use. >> > >> > Maybe a solution could involve some generic list_lru machinery that can >> > let you do these sorts of incremental scans? Maybe batching them up so >> > instead of doing one every allocation, you do them every 100 or 1000? >> > It would still be up to the individual user to put this to good use in >> > the object allocation path. >> >> I feel like we need to allow the list to both shrink and grow, depending >> on how useful the entries in it are. So one counter per LRU, incremented >> on every add. When that counter gets to 100, reset it to 0 and scan >> 110 entries. Maybe 0 of them can be reclaimed; maybe 110 of them can be. >> But the list can shrink over time instead of being a "one in, one out" >> scenario. > > Yes, this is pretty much what I've been saying we should be using > the list-lru for since .... Well, let's just say it was one of the > things I wanted to be able to do when I first created the list-lru > infrastructure. > > But it is much more complex than this. One of the issues with purely > list-lru add-time accounting is that we cannot make reclaim > decisions from list-add context because the list-add can occur in > reclaim context. e.g. dentry reclaim will drop the last reference > to an inode, which then gets inserted into the the inode list-lru in > reclaim context. The very next thing the superblock shrinker does > is scan the inode cache list-lru and remove a pre-calculated number > of objects from the list. Hence in reclaim context we can be both > increasing and decreasing the size of the list-lru by significant > percentages in a very short time window. This means it will be quite > challenging to make clear decisions based purely on lru list add > operations. Plus, for the dcache, dentries are added to the LRU the first time their reference count drops to zero, but they're not removed if they gain a new reference. So at least for the dentry case, it's not clear that list_lru_add/del would be good indicators of free/in-use object counts. > Hence I think we want to connect more than just the unused entries > to periodic reclaim - we really need to know both the number of free > objects on the list-lru as well as the total number of objects > allocated that could be on the list_lru. This would give us some > comparitive measure of free objects vs active referenced objects > and that would allow better decisions to be made. The dentry branch I have works purely based on total allocated objects: no differentiation between free and active referenced objects. I could hook into dput() where reference counts drop to zero for the other part of the equation, because like I said, list_lru_{add,del} aren't reliable for the dentry cache. I have opinions about whether or not it would be helpful to add in the dput() signal, but I'd rather just try it and see. I'm learning that my opinion and intuition are not all that reliable when it comes to caching and LRU algorithms; trial and error is key. > As it is, we've recently made a necessary connection between > allocation and the list-lru via kmem_cache_alloc_lru(). This was > done as part of the list-lru/memcg rework patchset I referenced > earlier in the thread. > > This means that operations that slab objects that are kept > on list_lrus for caching are now supplied with the list_lru at > allocation time. We already use this API for inode caches (via > inode_alloc_sb()) and the dentry cache (via __d_alloc()), so we > already have the infrastructure in place to do per-allocation > list-lru accounting for inodes and dentries, not just "per list-lru > add/remove" accounting. > > Extending that to other slab-based list-lru users should be pretty > easy, and in doing so would remove another difference between memcg > and non-memcg aware list-lrus.... > >> Clearly 110 is a magic number, but intuitively, attempting to shrink >> by 10% feels reasonable. Need to strike a balance between shrinking >> quickly enough and giving the cache time to figure out which entries >> are actually useful. > > Testing will teach us where the thresholds need to be pretty > quickly. :) 100% (that is, 2 per allocation) is too high in my very cursory trial, the dentry cache didn't seem to grow much during filesystem workloads. 0% (that is, 1 per allocation) worked well enough but like Matthew says, won't tend to shrink the cache when it is necessary. Stephen > > Cheers, > > Dave. > -- > Dave Chinner > david@xxxxxxxxxxxxx