On Fri, Mar 23, 2018 at 03:39:53PM +0300, Kirill Tkhai wrote: > On 23.03.2018 02:46, Dave Chinner wrote: > > On Thu, Mar 22, 2018 at 07:52:37PM +0300, Kirill Tkhai wrote: > >> Here is the problem I'm solving: https://lkml.org/lkml/2018/3/21/365. > > > > Oh, finally you tell me what the problem is that you're trying to > > solve. I *asked this several times* and got no response. Thank you > > for wasting so much of my time. > > > >> Current shrinker is not scalable. Then there are many memcg and mounts, > >> the time of iteration shrink_slab() in case of global reclaim can > >> take much time. There is times of shrink_slab() by the link. A node > >> with 200 containers may waste 4 seconds on global reclaim just to > >> iterate over all shrinkers for all cgroups, call shrinker::count_objects() > >> and receive 0 zero objects. > > > > So, your problem is the way the memcgs were tacked onto the side > > of the list_lru infrastructure and are iterated, which has basically > > nothing to do with the way the low level XFS inode shrinker behaves. > > > > /me looks at the patches > > > > /me shudders at the thought of external "cache has freeable items" > > control for the shrinking of vfs caches. > > > > Biggest problem I see with this is the scope for coherency bugs ini > > the "memcg shrinker has freeable items" tracking. If that happens, > > there's no way of getting that memcg to run it's shrinker ever > > again. That seems very, very fragile and likely to be an endless > > source of OOM bugs. The whole point of the shrinker polling > > infrastructure is that it is not susceptible to this sort of bug. > > > > Really, the problem is that there's no separate list of memcg aware > > shrinkers, so every registered shrinker has to be iterated just to > > find the one shrinker that is memcg aware. > > I don't think the logic is difficult. There are generic rules, > and the only task is to teach them memcg-aware shrinkers. Currently, > they are only super block and workingsets shrinkers, and both of > them are based on generic list_lru infrastructure. Shrinker-related > bit is also cleared in generic code (shrink_slab()) only, and > the algorhythm doesn't allow to clear it without double check. > The only principle modification I'm thinking about is we should > clear the bit only when the shrinker is called with maximum > parameters: priority and GFP. Lots of "simple logic" combined together makes for a complex mass of difficult to understand and debug code. And, really, you're not suffering from a memcg problem - you're suffering from a "there are thousands of shrinkers" scalability issue because superblocks have per-superblock shrinker contexts and you have thousands of mounted filesystems. > There are a lot of performance improving synchronizations in kernel, > and they had been refused, the kernel would have remained in the age > of big kernel lock. That's false equivalence and hyperbole. The shrinkers are not limiting what every Linux user can do with their hardware. It's not a fundamental architectural limitation. These sorts of arguments are not convincing - this is the second time I've told you this, so please stick to technical arguments and drop the dramatic "anti-progress" conspiracy theory bullshit. > > So why not just do the simple thing which is create a separate > > "memcg-aware" shrinker list (i.e. create shrinker_list_memcg > > alongside shrinker_list) and only iterate the shrinker_list_memcg > > when a memcg is passed to shrink_slab()? > > > > That means we'll only run 2 shrinkers per memcg at most (sueprblock > > and working set) per memcg reclaim call. That's a simple 10-20 line > > change, not a whole mess of new code and issues... > > It was the first optimization, which came to my head, by there is no > almost a performance profit, since memcg-aware shrinkers still be called > per every memcg, and they are the biggest part of shrinkers in the system. Sure, but a polling algorithm is not a fundamental performance limitation. The problem here is that the memcg infrastructure has caused an exponential explosion in shrinker scanning. > >> Can't we call shrink of shared objects only for top memcg? Something like > >> this: > > > > What's a "shared object", and how is it any different to a normal > > slab cache object? > > Sorry, it's erratum. I'm speaking about cached objects. I mean something like > the below. The patch makes cached objects be cleared outside the memcg iteration > cycle (it has not sense to call them for every memcg since cached object logic > just ignores memcg). The cached flag seems like a hack to me. It does nothing to address the number of shrinker callout calls (it actually increases them!), just tries to hack around something you want a specific shrinker to avoid doing. I've asked *repeatedly* for a description of the actual workload problems the XFS shrinker behaviour is causing you. In the absence of any workload description, I'm simply going to NACK any changes that try to work around this behaviour. The rest of this reply is about shrinker infrastructure and (the lack of) memcg integration. > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -482,6 +482,33 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid, > return freed; > } > > +static void shrink_cached_slab(gfp_t gfp_mask, int nid, int priority) > +{ > + struct shrinker *shrinker; > + > + if (!down_read_trylock(&shrinker_rwsem)) > + return; > + > + list_for_each_entry(shrinker, &shrinker_list, list) { > + struct shrink_control sc = { > + .gfp_mask = gfp_mask, > + .nid = nid, > + .cached = 1, > + }; > + > + if (!(shrinker->flags & SHRINKER_MEMCG_AWARE)) > + continue; > + if (!(shrinker->flags & SHRINKER_NUMA_AWARE)) > + sc.nid = 0; > + if (rwsem_is_contended(&shrinker_rwsem)) > + break; > + > + do_shrink_slab(&sc, shrinker, priority); > + } > + > + up_read(&shrinker_rwsem); Ok, so this shrinks /only/ non-memcg slabs. It's identical to calling shrink_slab() with a null memcg. We already do this in all the relevant memory reclaim paths when necessary.... > +} > + > void drop_slab_node(int nid) > { > unsigned long freed; > @@ -493,6 +520,8 @@ void drop_slab_node(int nid) > do { > freed += shrink_slab(GFP_KERNEL, nid, memcg, 0); > } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL); > + > + shrink_cached_slab(GFP_KERNEL, nid, 0); > } while (freed > 10); ... like here, where the inner loop always calls shrink_slab() with a NULL memcg as it's first iteration. So your addition here is basically a no-op - we've already run non-memcg cache reclaim by the time your addition to run non-memcg cache reclaim is called. And here: > @@ -2573,6 +2602,8 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc) > shrink_slab(sc->gfp_mask, pgdat->node_id, NULL, > sc->priority); > > + shrink_cached_slab(sc->gfp_mask, pgdat->node_id, sc->priority); > + You're adding unconditional global slab cache reclaim to a path that already does this but only in the calling contexts where it global slab reclaim is actually necessary. Doing unconditional global slab reclaim here will cause regressions. So, let's address the number of shrinker callouts from drop_caches(). You're original example from here: https://lkml.org/lkml/2018/3/21/365 | In case of global reclaim, a task has to iterate all over the memcgs | and to call all the memcg-aware shrinkers for all of them. This means, | the task has to visit 200 * 10 = 2000 shrinkers for every memcg, | and since there are 2000 memcgs, the total calls of do_shrink_slab() | are 2000 * 2000 = 4000000. The loop is: drop_slab() for_each_online_node(nid) drop_slab_node(nid); for each memcg including NULL shrink_slab(nid, memcg) With 200 containers w/ 10 mounts X 10 memcgs each. That means we have 2000 mounts and hence 2000 memcg aware superblock shrinkers. we have 2000 memcgs, so each superblock has 2000 memcgs attached to it. And then we have N nodes in the machine. So drop_slab() will call drop_slab_node() N times, and drop_slab node will iterate all the memcgs until there's no more to free. Each memcg with then iterate all shrinkers to find the memcg aware shrinkers, then do work. IOWs, we actually have a minimum scan of N * 2000 * 2000 calls to shrink_slab(), not just 2000 * 2000. The thing is, memcg shrinker contexts are not NUMA aware. They use the list array that was implemented for memory nodes and instead use it for a memcg list array, indexed by memcg ID rather than node ID. IOWs, memcgs are not numa aware, so they do not need to be called for every single node in the system. Only shrinkers in a global context (i.e. no memcg) that are NUMA aware need to be called for every node. The patchset you posted implemented that mapping by adding a bitmap of memcg-internal shrinker dirty state and a global array that points directly to the struct shrinker that the bitmap index is associated with. This externalises the dirty tracking into the memcg code, which as I've already said I don't like. The real problem here is that the shrinker itself is supposed to keep context specific call-to-call information. In the case of NUMA-aware shrinkers, that's kept in the shrinker->nr_deferred[nid] array. That's dynamically allocated for numa aware shrinkers, but all memcg's that are called for the given shrinker all smash their state into shrinker->nr_deferred[nid] - they don't have their own nr_deferred state. This actually screws up the nr_deferred state for global reclaim, and causes unexpected unfairness to memcg reclaim. i.e. Global, non-memcg reclaim can defer pressure to all memcgs that shrinker is called on for that node because the nr_deferred count is shared based on nid. That means work we defer from one call to a memcg on nid=0 will be applied to the next memcg recalim call on nid=0, regardless of what it is. And if that first shrinker is then called next on nid=1, it won't get it's nr_deferred from it's last call on nid=0, it will get whatever is currently stored on nid=1. IOWs, the memcg shrinker infrastructure itself violates the principles of memcg isolation and fairness by completely screwing up the call-to-call deferred work counts for individual memcgs and global shrinker reclaim. If you're interested in fairness and memcg isolation in reclaim, then this is the first problem you need to solve. IMO, What we should be doing here is something like this: drop_slab() { do { freed = shrink_slab_memcg() freed += shrink_slab_numa() freed += shrink_slab_simple() } while (freed < 10) } i.e. we set up a distinct separation of shrinker roles according to the contexts in which they are called. e.g. shrink_slab_memcg() { for_each_memcg(memcg) for_each_shrinker(&memcg_shrinker_list) { nr_deferred = shrinker_get_memcg_deferred(shrinker, memcg) do_shrink_slab(shrinker, 0, memcg, &nr_deferred)) shrinker_store_memcg_deferred(shrinker, memcg, nr_deferred) } } shrink_slab_numa() { for_each_online_node(nid) for_each_shrinker(&numa_shrinker_list) { nr_deferred = shrinker_get_node_deferred(shrinker, nid) do_shrink_slab(shrinker, nid, NULL, &nr_deferred) shrinker_store_node_deferred(shrinker, nid, nr_deferred) } } shrink_slab_simple() { for_each_shrinker(&simple_shrinker_list) { nr_deferred = shrinker_get_node_deferred(shrinker, 0) do_shrink_slab(shrinker, 0, NULL, &nr_deferred) shrinker_store_node_deferred(shrinker, 0, nr_deferred) } } [ Note: This implies a shrinker can be on multiple lists. That's fine, we can already do concurrent calls into a shrinker from different reclaim contexts. But in doing this, the "iterate shrinker list to find memcg aware shrinkers" goes aware. Similarly with the basic shrinkers - we stop hitting them for every node we scan in global reclaim and hence putting lots more pressure on them because they are being treated as "one cache per node" instead of a single cache. ] The simple and numa cases can share nr_deferred[0] as they currently do because a shrinker will only ever be one or the other, but we still need shrink_slab_memcg() to have per-memcg nr_deferred values because simple shrinkers can still be MEMCG_AWARE. This makes it simpler to see the different calling contexts of shrinkers, and easier for the calling code to call the correct shrinker context with the correct state. It also makes it easier to pull memcg-awareness into the struct shrinker. Further, it allows us to solve the dirty/clean shrinker skip optimisations in a generic manner, because that can now be stored on a per-node or per-memcg basis inside the struct shrinker. It's available to all shrinkers, rather than being just a memcg-specific optimisation, and it's easy to implement as part of a shrinker list walk (i.e. for_each_dirty_shrinker()). We can also add simple wrappers like super_mark_shrinker_dirty(sb) when we add an item to the LRU in the appropriate places to do the work of activating the shrinker. This seems much more appropriate to me, because it fixes the problem of scanning/polling too many shrinkers at the shrinker state/context level. That makes it generic and not memcg specific so helps non-memcg system/workloads, too. It also fixes existing call-to-call shrinker state problems that memcg reclaim currently causes and so there's multiple levels of win here. Once this is all sorted out, then we can deal with the subsystem shrinker behaviour knowing it doesn't have to work around imbalance and fairness problems in the infrastructure.... Cheers, Dave. -- Dave Chinner david@xxxxxxxxxxxxx -- To unsubscribe from this list: send the line "unsubscribe linux-xfs" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html