On Fri, Jan 15, 2021 at 03:55:36PM -0500, Johannes Weiner wrote: > On Fri, Jan 15, 2021 at 09:03:41AM -0800, Roman Gushchin wrote: > > On Fri, Jan 15, 2021 at 11:20:50AM -0500, Johannes Weiner wrote: > > > On Wed, Jan 13, 2021 at 03:46:54PM +0100, Michal Hocko wrote: > > > > On Tue 12-01-21 11:30:11, Johannes Weiner wrote: > > > > > When a value is written to a cgroup's memory.high control file, the > > > > > write() context first tries to reclaim the cgroup to size before > > > > > putting the limit in place for the workload. Concurrent charges from > > > > > the workload can keep such a write() looping in reclaim indefinitely. > > > > > > > > > > In the past, a write to memory.high would first put the limit in place > > > > > for the workload, then do targeted reclaim until the new limit has > > > > > been met - similar to how we do it for memory.max. This wasn't prone > > > > > to the described starvation issue. However, this sequence could cause > > > > > excessive latencies in the workload, when allocating threads could be > > > > > put into long penalty sleeps on the sudden memory.high overage created > > > > > by the write(), before that had a chance to work it off. > > > > > > > > > > Now that memory_high_write() performs reclaim before enforcing the new > > > > > limit, reflect that the cgroup may well fail to converge due to > > > > > concurrent workload activity. Bail out of the loop after a few tries. > > > > > > > > I can see that you have provided some more details in follow up replies > > > > but I do not see any explicit argument why an excessive time for writer > > > > is an actual problem. Could you be more specific? > > > > > > Our writer isn't necessarily time sensitive, but there is a difference > > > between a) the write taking a few seconds to reclaim down the > > > requested delta and b) the writer essentially turning into kswapd for > > > the workload and busy-spinning inside the kernel indefinitely. > > > > > > We've seen the writer stuck in this function for minutes, long after > > > the requested delta has been reclaimed, consuming alarming amounts of > > > CPU cycles - CPU time that should really be accounted to the workload, > > > not the system software performing the write. > > > > > > Obviously, we could work around it using timeouts and signals. In > > > fact, we may have to until the new kernel is deployed everywhere. But > > > this is the definition of an interface change breaking userspace, so > > > I'm a bit surprised by your laid-back response. > > > > > > > > Fixes: 536d3bf261a2 ("mm: memcontrol: avoid workload stalls when lowering memory.high") > > > > > Cc: <stable@xxxxxxxxxxxxxxx> # 5.8+ > > > > > > > > Why is this worth backporting to stable? The behavior is different but I > > > > do not think any of them is harmful. > > > > > > The referenced patch changed user-visible behavior in a way that is > > > causing real production problems for us. From stable-kernel-rules: > > > > > > - It must fix a real bug that bothers people (not a, "This could be a > > > problem..." type thing). > > > > > > > > Reported-by: Tejun Heo <tj@xxxxxxxxxx> > > > > > Signed-off-by: Johannes Weiner <hannes@xxxxxxxxxxx> > > > > > > > > I am not against the patch. The existing interface doesn't provide any > > > > meaningful feedback to the userspace anyway. User would have to re check > > > > to see the result of the operation. So how hard we try is really an > > > > implementation detail. > > > > > > Yeah, I wish it was a bit more consistent from an interface POV. > > > > > > Btw, if you have noticed, Roman's patch to enforce memcg->high *after* > > > trying to reclaim went into the tree at the same exact time as Chris's > > > series "mm, memcg: reclaim harder before high throttling" (commit > > > b3ff92916af3b458712110bb83976a23471c12fa). It's likely they overlap. > > > > > > Chris's patch changes memory.high reclaim on the allocation side from > > > > > > reclaim once, sleep if there is still overage > > > > > > to > > > > > > reclaim the overage as long as you make forward progress; > > > sleep after 16 no-progress loops if there is still overage > > > > > > Roman's patch describes a problem where allocating threads go to sleep > > > when memory.high is lowered by a wider step. This is exceedingly > > > unlikely after Chris's change. > > > > > > Because after Chris's change, memory.high is reclaimed on the > > > allocation side as aggressively as memory.max. The only difference is > > > that upon failure, one sleeps and the other OOMs. > > > > > > If Roman's issue were present after Chris's change, then we'd also see > > > premature OOM kills when memory.max is lowered by a large step. And I > > > have never seen that happening. > > > > > > So I suggest instead of my fix here, we revert Roman's patch instead, > > > as it should no longer be needed. Thoughts? > > > > Chris's patch was merged way earlier than mine into the kernel tree which > > was used when I observed the problem in the production. So likely it was there. > > Chris's patch was in the tree earlier, but the first release > containing it was tagged a day before you put in your change, so I > doubt it was on the production system where you observed the issue. > > As per above, it'd be very surprising to see premature sleeps when > lowering memory.high, when allocation-side reclaim keeps going until > the cgroup meets the definition of OOM. > > > I think it makes sense to try to reclaim memory first before putting > > all processes in the cgroup into reclaim mode. Even without artificial delays > > it creates some latency and btw doesn't make the reclaim process more efficient. > > It's not obvious that this is a practical problem. It certainly isn't > for memory.max, Because memory.max is usually not adjusted dynamically? > and there should be a good reason why the two should > be different aside from the documented OOM vs sleep behavior. Maybe we have different examples in our heads, but mine is a cgroup with a significant amount of relatively cold pagecache and a multi-threaded workload. Now somebody wants to tighten memory.high. Why would we put all threads into a direct reclaim? I don't see a good reason. Memory.max is different because nobody (hopefully) is adjusting it dynamically to be just a little bit bigger than the workingset. Most likely it's set once that after the creation of a cgroup. So such a problem just doesn't exist.