On Sun, 23 Jun 2024, Tejun Heo wrote:

> Hello,
>
> On Fri, Jun 21, 2024 at 02:37:43PM -0700, Namhyung Kim wrote:
> > > > - cgroup_threadgroup_rwsem
> > >
> > > This one shouldn't matter at all in setups where new cgroups are populated
> > > with CLONE_INTO_CGROUP and not migrated further. The lock isn't grabbed in
> > > such usage pattern, which should be the vast majority already, I think. Are
> > > you guys migrating tasks a lot or not using CLONE_INTO_CGROUP?
> >
> > I'm afraid there are still some use cases in Google that migrate processes
> > and/or threads between cgroups. :(
>
> I see. I wonder whether we can turn this into a cgroup lock. It's not
> straightforward tho. It's protecting migration against forking and exiting
> and the only way to turn it into per-cgroup lock would be tying it to the
> source cgroup as that's the only thing identifiable from the fork and exit
> paths. The problem is that a single atomic migration operation can pull
> tasks from multiple cgroups into one destination cgroup, even on cgroup2 due
> to the threaded cgroups. This would be pretty rare on cgroup2 but still need
> to be handled which means grabbing multiple locks from the migration path.
> Not the end of the world but a bit nasty.
>
> But, as long as it's well encapsulated and out of line, I don't see problems
> with such approach.
>
> As for cgroup_mutex, it's more complicated as the usage is more spread, but
> yeah, the only solution there too would be going for finer grained locking
> whether that's hierarchical or hashed.

Thanks all for the great discussion in the thread so far!

Beyond the discussion of the cgroup mutexes above, we also discussed
increasing the number of zones within a NUMA node. I'm thinking that this
would actually be an implementation detail, i.e. we wouldn't need to change
any user-visible interfaces like /proc/zoneinfo. IOW, we could have 64 16GB
ZONE_NORMALs spanning 1TB of memory, and we could sum up the memory resident
across all of those when describing the memory to userspace (rough sketch at
the end of this mail).

Anybody else working on any of the following, or have thoughts/ideas for how
they could be improved as core counts increase?

 - list_lrus_mutex
 - pcpu_drain_mutex
 - shrinker_mutex (formerly shrinker_rwsem)
 - vmap_purge_lock

Also, any favorite benchmarks that people use with high core counts to
measure the improvement when generic MM locks become more sharded? I can
imagine running will-it-scale on platforms with >= 256 cores per socket, but
if there are specific stress tests that can help quantify the impact, that
would be great to know about.
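
To illustrate the "implementation detail" point about multiple ZONE_NORMALs:
the toy C below is purely a sketch, not kernel code. The struct layout, the
per-node zone_instance array, and the report_node_normal() helper are all
hypothetical names made up for illustration; the only point is that reporting
sums per-instance counters so the userspace-visible totals look like one zone,
while each instance could keep its own locks and freelists internally.

	#include <stdio.h>

	/*
	 * Hypothetical illustration only: one NUMA node carved into several
	 * smaller ZONE_NORMAL instances, with reporting that sums across them
	 * so the totals userspace sees are unchanged.
	 */
	#define NR_NORMAL_INSTANCES 64		/* e.g. 64 x 16GB covering 1TB */

	struct zone_instance {
		unsigned long present_pages;	/* pages spanned by this instance */
		unsigned long free_pages;	/* pages currently free */
		/* per-instance locks, freelists, counters, ... */
	};

	struct node_normal {
		struct zone_instance inst[NR_NORMAL_INSTANCES];
	};

	/* Sum the per-instance counters when describing the node to userspace. */
	static void report_node_normal(const struct node_normal *node)
	{
		unsigned long present = 0, free = 0;

		for (int i = 0; i < NR_NORMAL_INSTANCES; i++) {
			present += node->inst[i].present_pages;
			free += node->inst[i].free_pages;
		}

		/* One aggregated "Normal" entry, as /proc/zoneinfo shows today. */
		printf("Node 0, zone Normal\n  present %lu\n  free    %lu\n",
		       present, free);
	}

	int main(void)
	{
		struct node_normal node = { 0 };

		/* 16GB per instance at 4KB pages = 4M pages (made-up numbers). */
		for (int i = 0; i < NR_NORMAL_INSTANCES; i++) {
			node.inst[i].present_pages = 4UL << 20;
			node.inst[i].free_pages = 1UL << 20;
		}

		report_node_normal(&node);
		return 0;
	}

Only the reporting path needs to iterate and aggregate; the scalability win
would come from each instance having independent locking.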