On Sat, Sep 26, 2020 at 01:31:36PM -0700, Yang Shi wrote:
> Hi Dave,
>
> I was exploring making the "nr_deferred" per memcg. I looked into it and
> had some prototypes for two approaches so far:
> 1. Have a per memcg data structure for each memcg aware shrinker, just
> like what shrinker_map does.
> 2. Have "nr_deferred" on list_lru_one for memcg aware lists.
>
> Both seem feasible, however the latter one looks much cleaner, I just
> need to add two new APIs for shrinker which get and set
> "nr_deferred" respectively. And, just memcg aware shrinkers need
> define those two callouts. We just need to care about memcg aware
> shrinkers, and most memcg aware shrinkers (inode/dentry, nfs and
> workingset) use list_lru, so I'd prefer the latter one.

The list_lru is completely separate from the shrinker context. The
structure that tracks objects in a subsystem is not visible to or aware
of how the high level shrinker scanning algorithms work. Not to mention
that subsystem shrinkers can be memcg aware without using list_lru
structures to index objects owned by a given memcg. Hence I really
don't like the idea of tying the shrinker control data deeply into
subsystem cache indexing....

> But there are two memcg aware shrinkers that are not that
> straightforward to deal with:
> 1. The deferred split THP. It doesn't use list_lru, but actually I
> don't worry about this one, since it is not a cache, just some
> partially unmapped THPs. I could try to convert it to use list_lru
> later on or just kill deferred split by making vmscan split partially
> unmapped THPs. So TBH I don't think it is a blocker.

What a fantastic abuse of the reclaim infrastructure. :/

First it was just deferred work. Then it became NUMA_AWARE. Then it
became MEMCG_AWARE and.... oh, man, what a nasty hack that
SHRINKER_NONSLAB flag is so that it runs through shrink_slab_memcg()
even when memcgs are configured in but kmem tracking is disabled.
We have heaps of shrinkers that reclaim from things that aren't slab
caches, but this one is just nuts.

> 2. The fs_objects. This one looks weird. It shares the same shrinker
> with inode/dentry. The only user is XFS currently. But it looks like
> it is not really memcg aware at all, right?

It most definitely is. The VFS dentry and inode cache reclaim are memcg
aware. The fs_objects callout is for filesystem level object garbage
collection that can be done as a result of the dentry and inode caches
being reclaimed. i.e. once the VFS has reclaimed the inode attached to
the memcg, it is no longer attached to or accounted to the memcg. It is
owned by the filesystem at this point, and it is entirely up to the
filesystem when it can then be freed. Most filesystems do it in the
inode cache reclaim via the ->destroy method. XFS, OTOH, tags freeable
inodes in its internal radix trees rather than freeing them immediately
because it still may have to clean the inode before it can be freed.
Hence we defer freeing of inodes until the ->fs_objects pass....

> They are managed by radix trees, which are not per memcg by looking
> into the xfs code, so the "per list_lru nr_deferred" can't work for
> it. I thought of a couple of ways to tackle it off the top of my head:
> A. Just ignore it. If the amount of fs_objects is negligible compared
> to inode/dentry, then I think it can be just ignored and kept as is.

Ah, no, they are not negligible. Under memory pressure, the number of
objects is typically 1/3rd dentries, 1/3rd VFS inodes, 1/3rd fs objects
to be reclaimed. The dentries and VFS inodes are owned by VFS level
caches and associated with memcgs, the fs_objects are only visible to
the filesystem.

> B. Move it out of the inode/dentry shrinker. Add a dedicated shrinker
> for it, for example, sb->s_fs_obj_shrink.

No, they are there because the reclaim has to be kept in exact
proportion to the dentry and inode reclaim quantities.
That's the reason I put that code there in the first place: a separate
filesystem inode cache shrinker just didn't work well at all.

> C. Make it really memcg aware and use list_lru.

Two things. Firstly, the objects are owned by the filesystem at this
point, not memcgs. Memcgs were detached at the VFS inode reclaim layer.

Secondly, list_lru does not scale well enough for the use needed by
XFS. We use radix trees so we can do lockless batched lookups and
IO-efficient inode-order reclaim passes. We also have concurrent
reclaim capabilities because of the lockless tag lookup walks. Using a
list_lru for this substantially reduces reclaim performance and greatly
increases the CPU usage of reclaim because of contention on the
internal list_lru locks. Been there, measured that....

> I don't have any experience with the XFS code. #C seems the most
> optimal, but would be the most time consuming, and I'm not sure if it
> is worth it or not. So, #B sounds more preferred IMHO.

I think you're going completely in the wrong direction. The problem
that needs solving is integrating the shrinker scanning control state
with memcgs more tightly, not forcing every memcg aware shrinker to use
list_lru for their subsystem shrinker implementations....

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx