On Wed, Aug 29, 2018 at 03:34:05PM -0400, Waiman Long wrote: > On 08/28/2018 09:02 PM, Dave Chinner wrote: > > On Tue, Aug 28, 2018 at 01:19:40PM -0400, Waiman Long wrote: > > And then there's shrinker behaviour. What happens if the shrinker > > isolate callback returns LRU_ROTATE on a negative dentry? It gets > > moved to the most recent end of the list, so it won't have an > > attempt to reclaim it again until it's tried to reclaim all the real > > dentries. IOWs, it goes back to behaving like LRUs are supposed to > > behaving. > > > > IOWs, reclaim behaviour of negative dentries will be > > highly unpredictable, it will not easily retain a working set, nor > > will the working set it does retain be predictable or easy to eject > > from memory when the workload changes. > > > > Is this the behavour what we want for negative dentries? > > I am aware that the behavior is not strictly LRU for negative dentries. > This is a compromise for using one LRU list for 2 different classes of > dentries. Thus demonstrating just enough knowledge to be dangerous. We already 3 different classes of dentries on the LRU list: - in use dentries (because we use lazy removal to avoid lru list lock contention on cache hit lookups) - unused, referenced dentries - unused, unreferenced dentries. Each of these classes of dentries are treated differently by the shrinker, but the key point is that they are all aged the same way and so there's consistent maintenance of the working set under memory pressure. Putting negative dentries at the head of the list doesn't create a new "class" of object on the LRU, it just changes the ordering of the lru list. This will cause unpredictable behaviour because objects haven't had a chance to age gracefully before they are reclaimed. FYI, the inode cache has the same list_lru setup, object types and shrinker algorithm as the dentry cache, so this isn't a one-off. Indeed, the XFS buffer cache has a multi-reference heirarchy of 13 different types of {unused, referenced} buffers in it's list_lru to implement a quasi aging-NFU reclaim algorithm in it's shrinker. i.e. the list_lru infrastructure has never provided or enforced a pure LRU algorithm. It is common infrastructure intended to provide a scalable, flexible and memcg-aware FIFO-like object tracking system that interates tightly with memory reclaim to allow subsystems to implement cache reclaim algorithms that are optimal for that subsystem. IOWs, the list_lru doesn't define the reclaim algorithm the subsystem uses and there's no reason why we can't extend the infrastructure to support more complex algorithms without impacting existing subsystem reclaim algorithms at all. Only the subsystems that use the new infrastructure and algorithms would need careful examination. Of course, the overall system cache balancing behaviour under different and changing workloads would still need to be verified, but you have to do that for any cache reclaim algorithm change that is made.... > The basic idea is that negative dentries that are used only > once will go first irrespective of their age. Using MRU for negative dentries, as I've previously explained, is a flawed concept. It might be expedient to solve your specific problem, but it's not a solution to the general problem of negative dentry management. > > Perhaps a second internal LRU list in the list_lru for "immediate > > reclaim" objects would be a better solution. i.e. similar to the > > active/inactive lists used for prioritising the working set iover > > single use pages in page reclaim. negative dentries go onto the > > immediate list, real dentries go on the existing list. Both are LRU, > > and the shrinker operates on the immediate list first. When we > > rotate referenced negative dentries on the immediate list, promote > > them to the active list with all the real dentries so they age out > > with the rest of the working set. That way single use negative > > dentries get removed in LRU order in preference to the working set > > of real dentries. > > > > Being able to make changes to the list implementation easily was one > > of the reasons I hid the implementation of the list_lru from the > > interface callers use.... > > > > [...] > > I have thought about using 2 lists for dentries. That will require much > more extensive changes to the code and hence much more testing will be > needed to verify their correctness. That is the main reason why I try to > avoid doing that. i.e. expediency. However, you're changing the behaviour of core caching and memory reclaim algorithms. The amount and level of testing and verification you need to do is the same regardless of whether it's a small change or a large change. Sure, you've shown that *one* artificial micro-benchmark improves, but what about everything else? > As you have suggested, we could implement this 2-level LRU list in the > list_lru API. But it is used by other subsystems as well. Extensive > change like that will have similar issue in term of testing and > verification effort. I know what changing reclaim algorithm involves, how difficult it is to balance the competing caches quickly and to the desired ratios for acceptable performance, how difficult it is to measure and control the system reacts to transient and impulse memory pressure events, etc. I also know that "simple" and/or "obviously correct" subsystem changes can cause very unexepected system level effects, and that it's almost never what you think it is that caused the unexpected behaviour. IOWs, getting anything even slightly wrong in these algorithms will adversely affect system performance and balance significantly. Hence the bar is /always/ set high for core caching algorithm changes like this. Cheers, Dave. -- Dave Chinner david@xxxxxxxxxxxxx