On Mon, Jul 01, 2024 at 07:51:43PM -0700, Andrew Morton wrote:
> On Mon, 1 Jul 2024 22:20:46 +0800 Yafang Shao <laoar.shao@xxxxxxxxx> wrote:
>
> > Currently, we're encountering latency spikes in our container environment
> > when a specific container with multiple Python-based tasks exits. These
> > tasks may hold the zone->lock for an extended period, significantly
> > impacting latency for other containers attempting to allocate memory.
>
> Is this locking issue well understood?

I cannot comment about others but I believe this problem to be
well-understood. The zone->lock is an incredibly large lock at this point
protecting an unbounded amount of data. As time goes by, it's just getting
worse and it was terrible even a few years ago, let alone now.

> Is anyone working on it?

Not that I'm aware of but I've paid so little attention to linux-mm in the
last few years, that's not saying much. The main problem is that it's hard
to solve quickly as splitting that lock is possible, but not trivial.

I am mildly concerned that more and more people are looking for ways of
getting around zone->lock contention using the PCP allocator. I believe that
to be a losing battle even though I added THP to the PCP caching myself. Now
we have dynamic resizing which works ok but piling on top of it are
file-backed THPs and THPs smaller than MAX_ORDER, folios in general etc.
Dealing with that within PCP has limits and adding more sysctls to deal with
corner cases is a band-aid that most users probably will miss.

Working around all the zone->lock issues in PCP just delays the inevitable
as PCP doesn't play well with overall availability (e.g. high order pages
free but on a remote CPU), fragmentation control (frag fallback because
desired page types are on a remote CPU) or scaling (because ultimately it
can still contend on zone->lock).

IIUC, pcp lists were originally about preserving cache hotness with
zone->lock contention reduction as a bonus but now it's a band-aid trying to
deal with zone->lock covering massive amounts of memory. Eventually the work
will have to be put into splitting the zone lock using something akin to
memory arenas and moving away from zone_id to identify what range of free
lists a particular page belongs to.

--
Mel Gorman
SUSE Labs