On Wed, Mar 16, 2022 at 01:52:23PM +1100, Dave Chinner wrote:
> On Wed, Mar 16, 2022 at 10:07:19AM +0800, Gao Xiang wrote:
> > On Tue, Mar 15, 2022 at 01:56:18PM -0700, Roman Gushchin wrote:
> > > 
> > > > On Mar 15, 2022, at 12:56 PM, Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
> > > > 
> > > > The number of negative dentries is effectively constrained only by
> > > > memory size.  Systems which do not experience significant memory
> > > > pressure for an extended period can build up millions of negative
> > > > dentries which clog the dcache.  That can have different symptoms,
> > > > such as inotify taking a long time [1], high memory usage [2] and
> > > > even just poor lookup performance [3].  We've also seen problems
> > > > with cgroups being pinned by negative dentries, though I think we
> > > > now reparent those dentries to their parent cgroup instead.
> > > 
> > > Yes, it should be fixed already.
> > > 
> > > > We don't have a really good solution yet, and maybe some focused
> > > > brainstorming on the problem would lead to something that actually
> > > > works.
> > > 
> > > I’d be happy to join this discussion. And in my opinion it’s going
> > > beyond negative dentries: there are other types of objects which
> > > tend to grow beyond any reasonable limits if there is no memory
> > > pressure.
> > 
> > +1, we once hit a similar issue as well, and agree that it is not
> > limited to negative dentries but applies to too many LRU-ed dentries
> > and inodes in general.
> 
> Yup, any discussion solely about managing buildup of negative
> dentries doesn't acknowledge that it is just a symptom of larger
> problems that need to be addressed.
> 
> > Limiting the total number may help servers avoid shrinker spikes.
> 
> No, we don't want to set hard limits on object counts - that's just
> asking for systems that need frequent hand tuning and are impossible
> to get right under changing workloads.  Caches need to auto-size
> according to the workload's working set to find a steady state
> balance, not be bound by arbitrary limits.
> 
> But even cache sizing isn't the problem here - it's just another
> symptom.
> 
> > > A perfect example of when this happens is when a machine is almost
> > > idle for some period of time. Periodically running processes
> > > create various kernel objects (mostly vfs cache) which over time
> > > fill significant portions of the total memory. And when the need
> > > for memory arises, we realize that the memory is heavily
> > > fragmented and it’s costly to reclaim it back.
> 
> Yup, the underlying issue here is that memory reclaim does nothing
> to manage long term build-up of single use cached objects when
> *there is no memory pressure*.  There's plenty of idle time and
> spare resources to manage caches sanely, but we don't.  e.g. there
> is no periodic rotation of caches that could lead to detection and
> reclaim of single use objects (say over a period of minutes) and
> hence prevent them from filling up all of memory unnecessarily and
> creating transient memory reclaim and allocation latency spikes
> when memory finally fills up.
> 
> IOWs, negative dentries getting out of hand and shrinker spikes are
> both symptoms of the same problem: while memory allocation is free,
> memory reclaim does nothing to manage cache aging.  Hence we only
> find out we've got a badly aged cache when we finally realise it
> has filled all of memory, and then we have heaps of work to do
> before memory can be made available for allocation again....
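Just to make the "periodic rotation" idea above a bit more concrete
for myself, below is a rough, completely untested sketch of what such
a low-frequency aging pass might look like if it lived in mm/vmscan.c
and reused the internal shrink_slab() helper there.  The work item,
the two-minute interval and the use of DEF_PRIORITY as the gentlest
possible scan are all made up for illustration, not an existing
facility:

/*
 * Sketch only: a low-frequency cache aging pass, imagined as part of
 * mm/vmscan.c so it can reuse the internal shrink_slab() helper.
 * All names and numbers here are placeholders.
 */
#define CACHE_AGE_INTERVAL      (120 * HZ)      /* arbitrary: every two minutes */

static void cache_age_workfn(struct work_struct *work);
static DECLARE_DELAYED_WORK(cache_age_work, cache_age_workfn);

static void cache_age_workfn(struct work_struct *work)
{
        int nid;

        /*
         * Walk the shrinkers on every node with the gentlest priority
         * the shrinker infrastructure knows about, so each pass only
         * rotates the LRUs a little and pushes out objects that have
         * sat unreferenced since the previous pass.
         */
        for_each_online_node(nid) {
                struct mem_cgroup *memcg = mem_cgroup_iter(NULL, NULL, NULL);

                do {
                        shrink_slab(GFP_KERNEL, nid, memcg, DEF_PRIORITY);
                } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
        }

        schedule_delayed_work(&cache_age_work, CACHE_AGE_INTERVAL);
}

static int __init cache_age_init(void)
{
        schedule_delayed_work(&cache_age_work, CACHE_AGE_INTERVAL);
        return 0;
}
late_initcall(cache_age_init);

The hard part is still the feedback/priority side you describe below;
this only shows where a periodic pass could hook into what we already
have.
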
> 
> And then if you're going to talk memory reclaim, the elephant in the
> room is the lack of integration between shrinkers and the main
> reclaim infrastructure.  There's no priority determination, there's
> no progress feedback, there's no mechanism to allow shrinkers to
> throttle reclaim rather than have the reclaim infrastructure wind up
> the priority and OOM kill when a shrinker cannot make progress
> quickly, etc.  Then there's direct reclaim hammering shrinkers with
> unbound concurrency, so individual shrinkers have no chance of
> determining how much memory pressure there really is by themselves,
> not to mention the lock contention problems that unbound reclaim
> concurrency on things like LRU lists can cause.  And, of course,
> memcg based reclaim is still only tacked onto the side of the
> shrinker infrastructure...

Yeah, it's really a generic problem between cached objects and
shrinkers.  Some intelligent detection and feedback loop (even
without memory pressure) would be much better than hardcoded
numbers.  This topic has actually been raised several times before;
hoping for some progress this time.

Thanks,
Gao Xiang

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@xxxxxxxxxxxxx
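P.S. For anyone following along who hasn't written one, this is
roughly what the shrinker interface being discussed above looks like
today (from memory, against ~5.17; the foo_cache_* helpers are
imaginary stand-ins for a cache's own LRU handling):

/*
 * Sketch of a typical shrinker against the current interface; the
 * foo_cache_* helpers don't exist.
 */
static unsigned long foo_cache_count(struct shrinker *shrink,
                                     struct shrink_control *sc)
{
        /* sc->nid, sc->memcg and sc->gfp_mask are all the context we get */
        return foo_cache_lru_count(sc->nid, sc->memcg);
}

static unsigned long foo_cache_scan(struct shrinker *shrink,
                                    struct shrink_control *sc)
{
        unsigned long freed;

        /*
         * sc->nr_to_scan is only a batch size, not a pressure level,
         * and any number of direct reclaimers may call in here
         * concurrently.  Returning SHRINK_STOP ("can't make progress
         * right now") is the only backpressure signal we can send.
         */
        if (!foo_cache_trylock())
                return SHRINK_STOP;

        freed = foo_cache_lru_walk(sc->nid, sc->memcg, sc->nr_to_scan);
        foo_cache_unlock();

        return freed;
}

static struct shrinker foo_cache_shrinker = {
        .count_objects  = foo_cache_count,
        .scan_objects   = foo_cache_scan,
        .seeks          = DEFAULT_SEEKS,
        .flags          = SHRINKER_MEMCG_AWARE | SHRINKER_NUMA_AWARE,
};

static int __init foo_cache_shrinker_init(void)
{
        return register_shrinker(&foo_cache_shrinker);
}
late_initcall(foo_cache_shrinker_init);

The callbacks never see the overall reclaim priority or how many
other reclaimers are hammering them at the same time, which is
exactly the lack of feedback described above.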