On Tue, Apr 05, 2022 at 10:18:01PM +0100, Matthew Wilcox wrote: > On Tue, Apr 05, 2022 at 10:22:14AM -0700, Stephen Brennan wrote: > > I can't speak for every slab cache, but I've been coming to the same > > conclusion myself regarding the dentry cache. I think that the rate of > > stepping through the LRU should be tied to the rate of allocations. > > Truly in-use objects shouldn't be harmed by this, as they should get > > referenced and rotated to the beginning of the LRU. But the one-offs > > which are bloating the cache will be found and removed. > > I agree with all this. Same here. > > I've implemented a version of this patch which just takes one step > > through the LRU on each d_alloc. It's quite interesting to watch it > > work. You can create 5 million negative dentries in directory /A via > > stat(), and then create 5 million negative dentries in directory /B. The > > total dentry slab size reaches 5 million but never goes past it, since > > the old negative dentries from /A aren't really in use, and they get > > pruned at the same rate as negative dentries from /B get created. On the > > other hand, if you *continue* to stat() on the dentries of /A while you > > create negative dentries in /B, then the cache grows to 10 million, > > since the /A dentries are actually in use. > > > > Maybe a solution could involve some generic list_lru machinery that can > > let you do these sorts of incremental scans? Maybe batching them up so > > instead of doing one every allocation, you do them every 100 or 1000? > > It would still be up to the individual user to put this to good use in > > the object allocation path. > > I feel like we need to allow the list to both shrink and grow, depending > on how useful the entries in it are. So one counter per LRU, incremented > on every add. When that counter gets to 100, reset it to 0 and scan > 110 entries. Maybe 0 of them can be reclaimed; maybe 110 of them can be. > But the list can shrink over time instead of being a "one in, one out" > scenario. Yes, this is pretty much what I've been saying we should be using the list-lru for since .... Well, let's just say it was one of the things I wanted to be able to do when I first created the list-lru infrastructure. But it is much more complex than this. One of the issues with purely list-lru add-time accounting is that we cannot make reclaim decisions from list-add context because the list-add can occur in reclaim context. e.g. dentry reclaim will drop the last reference to an inode, which then gets inserted into the the inode list-lru in reclaim context. The very next thing the superblock shrinker does is scan the inode cache list-lru and remove a pre-calculated number of objects from the list. Hence in reclaim context we can be both increasing and decreasing the size of the list-lru by significant percentages in a very short time window. This means it will be quite challenging to make clear decisions based purely on lru list add operations. Hence I think we want to connect more than just the unused entries to periodic reclaim - we really need to know both the number of free objects on the list-lru as well as the total number of objects allocated that could be on the list_lru. This would give us some comparitive measure of free objects vs active referenced objects and that would allow better decisions to be made. As it is, we've recently made a necessary connection between allocation and the list-lru via kmem_cache_alloc_lru(). This was done as part of the list-lru/memcg rework patchset I referenced earlier in the thread. This means that operations that slab objects that are kept on list_lrus for caching are now supplied with the list_lru at allocation time. We already use this API for inode caches (via inode_alloc_sb()) and the dentry cache (via __d_alloc()), so we already have the infrastructure in place to do per-allocation list-lru accounting for inodes and dentries, not just "per list-lru add/remove" accounting. Extending that to other slab-based list-lru users should be pretty easy, and in doing so would remove another difference between memcg and non-memcg aware list-lrus.... > Clearly 110 is a magic number, but intuitively, attempting to shrink > by 10% feels reasonable. Need to strike a balance between shrinking > quickly enough and giving the cache time to figure out which entries > are actually useful. Testing will teach us where the thresholds need to be pretty quickly. :) Cheers, Dave. -- Dave Chinner david@xxxxxxxxxxxxx