On Wed, Sep 30, 2020 at 12:31 AM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>
> On Sat, Sep 26, 2020 at 01:31:36PM -0700, Yang Shi wrote:
> > Hi Dave,
> >
> > I was exploring how to make "nr_deferred" per memcg. I have looked
> > into, and prototyped, two approaches so far:
> > 1. Have a per memcg data structure for each memcg aware shrinker,
> > just like what shrinker_map does.
> > 2. Have "nr_deferred" on list_lru_one for memcg aware lists.
> >
> > Both seem feasible; however, the latter looks much cleaner. I would
> > just need to add two new shrinker APIs that get and set
> > "nr_deferred" respectively, and only memcg aware shrinkers would
> > need to define those two callouts. We only need to care about memcg
> > aware shrinkers, and most memcg aware shrinkers (inode/dentry, nfs
> > and workingset) use list_lru, so I'd prefer the latter.
>
> The list_lru is completely separate from the shrinker context. The
> structure that tracks objects in a subsystem is not visible to or
> aware of how the high level shrinker scanning algorithms work. Not to
> mention that subsystem shrinkers can be memcg aware without using
> list_lru structures to index objects owned by a given memcg. Hence I
> really don't like the idea of tying the shrinker control data deeply
> into subsystem cache indexing....

I see your points. Yes, that makes sense to me.

The list_lru is a common data structure that could be used by other
subsystems, not only by memcg aware shrinkers, so putting shrinker
control data in list_lru would break the layering. Option #1 therefore
seems more appropriate. The change would look like:

struct mem_cgroup_per_node {
        ...
        struct memcg_shrinker_map __rcu *shrinker_map;
+       struct memcg_shrinker_deferred __rcu *shrinker_deferred;
        ...
}

> > But there are two memcg aware shrinkers that are not that
> > straightforward to deal with:
> > 1. The deferred split THP.
> > It doesn't use list_lru, but I actually don't worry about this
> > one, since it is not a cache, just some partially unmapped THPs. I
> > could try to convert it to use list_lru later on, or just kill the
> > deferred split by making vmscan split partially unmapped THPs. So,
> > TBH, I don't think it is a blocker.
>
> What a fantastic abuse of the reclaim infrastructure. :/
>
> First it was just deferred work. Then it became NUMA_AWARE. Then it
> became MEMCG_AWARE and....
>
> Oh, man, what a nasty hack that SHRINKER_NONSLAB flag is so that it
> runs through shrink_slab_memcg() even when memcgs are configured in
> but kmem tracking is disabled. We have heaps of shrinkers that
> reclaim things that aren't slab caches, but this one is just nuts.
>
> > 2. The fs_objects. This one looks weird. It shares the same
> > shrinker with inode/dentry. The only user is XFS currently. But it
> > looks like it is not really memcg aware at all, right?
>
> It most definitely is.
>
> The VFS dentry and inode cache reclaim are memcg aware. The
> fs_objects callout is for filesystem level object garbage collection
> that can be done as a result of the dentry and inode caches being
> reclaimed.
>
> i.e. once the VFS has reclaimed the inode attached to the memcg, it
> is no longer attached and accounted to the memcg anymore. It is
> owned by the filesystem at this point, and it is entirely up to the
> filesystem when it can then be freed. Most filesystems do it in
> the inode cache reclaim via the ->destroy method. XFS, OTOH, tags
> freeable inodes in its internal radix trees rather than freeing
> them immediately, because it may still have to clean the inode
> before it can be freed. Hence we defer freeing of inodes until the
> ->fs_objects pass....

Aha, thanks for elaborating. Now I see what it is for...

> > They are managed by radix trees, which are not per memcg judging
> > from the xfs code, so the "per list_lru nr_deferred" can't work
> > for it.
> > I thought of a couple of ways to tackle it off the top of my head:
> > A. Just ignore it. If the amount of fs_objects is negligible
> > compared to inodes/dentries, then I think it can just be ignored
> > and kept as is.
>
> Ah, no, they are not negligible. Under memory pressure, the number
> of objects is typically 1/3rd dentries, 1/3rd VFS inodes, 1/3rd fs
> objects to be reclaimed. The dentries and VFS inodes are owned by
> VFS level caches and associated with memcgs; the fs_objects are only
> visible to the filesystem.
>
> > B. Move it out of the inode/dentry shrinker. Add a dedicated
> > shrinker for it, for example, sb->s_fs_obj_shrink.
>
> No, they are there because the reclaim has to be kept in exact
> proportion to the dentry and inode reclaim quantities. That's the
> reason I put that code there in the first place: a separate
> filesystem inode cache shrinker just didn't work well at all.
>
> > C. Make it really memcg aware and use list_lru.
>
> Two things. Firstly, objects are owned by the filesystem at this
> point, not memcgs. Memcgs were detached at the VFS inode reclaim
> layer.
>
> Secondly, list_lru does not scale well enough for the use needed by
> XFS. We use radix trees so we can do lockless batched lookups and
> IO-efficient inode-order reclaim passes. We also have concurrent
> reclaim capabilities because of the lockless tagged lookup walks.
> Using a list_lru for this substantially reduces reclaim performance
> and greatly increases the CPU usage of reclaim because of contention
> on the internal list_lru locks. Been there, measured that....
>
> > I don't have any experience with XFS code. #C seems the most
> > optimal, but would be the most time consuming, and I'm not sure it
> > is worth it. So #B sounds preferable IMHO.
>
> I think you're going completely in the wrong direction.
> The problem that needs solving is integrating shrinker scanning
> control state with memcgs more tightly, not forcing every memcg
> aware shrinker to use list_lru for its subsystem shrinker
> implementation....

Thanks a lot for all the elaboration and advice. Integrating shrinker
scanning control state with memcgs more tightly makes sense to me.

> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx