Re: [PATCH] mm, memcg: reclaim more aggressively before high allocator throttling

Johannes Weiner <hannes@xxxxxxxxxxx> · Thu, 21 May 2020 12:38:33 -0400

On Thu, May 21, 2020 at 04:35:15PM +0200, Michal Hocko wrote:
> On Thu 21-05-20 09:51:52, Johannes Weiner wrote:
> > On Thu, May 21, 2020 at 09:32:45AM +0200, Michal Hocko wrote:
> [...]
> > > I am not saying the looping over try_to_free_pages is wrong. I do care
> > > about the final reclaim target. That shouldn't be arbitrary. We have
> > > established a target which is proportional to the requested amount of
> > > memory. And there is a good reason for that. If any task tries to
> > > reclaim down to the high limit then this might lead to a large
> > > unfairness when heavy producers piggy back on the active reclaimer(s).
> > 
> > Why is that different than any other form of reclaim?
> 
> Because the high limit reclaim is a best effort rather than must to
> either get over reclaim watermarks and continue allocation or meet the
> hard limit requirement to continue.

It's not best effort. It's a must-meet or get put to sleep. You are
mistaken about what memory.high is.

> In an ideal world even the global resp. hard limit reclaim should
> consider fairness. They don't because that is easier but that sucks. I
> have been involved in debugging countless of issues where direct reclaim
> was taking too long because of the unfairness. Users simply see that as
> bug and I am not surprised.

Then there should be a generic fix to this problem (like the page
capturing during reclaim).

You're bringing anecdotal evidence that reclaim has a generic problem,
which nobody has seriously tried to fix in recent times, and then ask
people to hack around it in a patch that only brings the behavior for
this specific instance in line with everybody else.

I'm sorry, but this IS a black and white issue, and I think you're out
of line here. If you think reclaim fairness is a problem, it should be
on you to provide concrete data for that and propose changes on how we
do reclaim, instead of asking to hack around it in one callsite -
thereby introducing inconsistencies to userspace between different
limits, as well as inconsistencies and complications for the kernel
developers that actually work on this code (take a look at git blame).

> > > I wouldn't mind to loop over try_to_free_pages to meet the requested
> > > memcg_nr_pages_over_high target.
> > 
> > Should we do the same for global reclaim? Move reclaim to userspace
> > resume where there are no GFP_FS, GFP_NOWAIT etc. restrictions and
> > then have everybody just reclaim exactly what they asked for, and punt
> > interrupts / kthread allocations to a worker/kswapd?
> 
> This would be quite challenging considering the page allocator wouldn't
> be able to make a forward progress without doing any reclaim. But maybe
> you can be creative with watermarks.

I clarified in the follow-up email that I meant limit reclaim.

> > > > > > > Also if the current high reclaim scaling is insufficient then we should
> > > > > > > be handling that via memcg_nr_pages_over_high rather than effectivelly
> > > > > > > unbound number of reclaim retries.
> > > > > > 
> > > > > > ???
> > > > > 
> > > > > I am not sure what you are asking here.
> > > > 
> > > > You expressed that some alternate solution B would be preferable,
> > > > without any detail on why you think that is the case.
> > > > 
> > > > And it's certainly not obvious or self-explanatory - in particular
> > > > because Chris's proposal *is* obvious and self-explanatory, given how
> > > > everybody else is already doing loops around page reclaim.
> > > 
> > > Sorry, I could have been less cryptic. I hope the above and my response
> > > to Chris goes into more details why I do not like this proposal and what
> > > is the alternative. But let me summarize. I propose to use memcg_nr_pages_over_high
> > > target. If the current calculation of the target is unsufficient - e.g.
> > > in situations where the high limit excess is very large then this should
> > > be reflected in memcg_nr_pages_over_high.
> > > 
> > > Is it more clear?
> > 
> > Well you haven't made a good argument why memory.high is actually
> > different than any other form of reclaim, and why it should be the
> > only implementation of page reclaim that has special-cased handling
> > for the inherent "unfairness" or rather raciness of that operation.
> > 
> > You cut these lines from the quote:
> > 
> >   Under pressure, page reclaim can struggle to satisfy the reclaim
> >   goal and may return with less pages reclaimed than asked to.
> > 
> >   Under concurrency, a parallel allocation can invalidate the reclaim
> >   progress made by a thread.
> > 
> > Even if we *could* invest more into trying to avoid any unfairness,
> > you haven't made a point why we actually should do that here
> > specifically, yet not everywhere else.
> 
> I have tried to explain my thinking elsewhere in the thread. The bottom
> line is that high limit is a way of throttling rather than meeting a
> specific target.

That's an incorrect assumption. Of course it should meet the specific
target that the user specified.

> > (And people have tried to do it for global reclaim[1], but clearly
> > this isn't a meaningful problem in practice.)
> > 
> > I have a good reason why we shouldn't: because it's special casing
> > memory.high from other forms of reclaim, and that is a maintainability
> > problem. We've recently been discussing ways to make the memory.high
> > implementation stand out less, not make it stand out even more. There
> > is no solid reason it should be different from memory.max reclaim,
> > except that it should sleep instead of invoke OOM at the end. It's
> > already a mess we're trying to get on top of and straighten out, and
> > you're proposing to add more kinks that will make this work harder.
> 
> I do see your point of course. But I do not give the code consistency
> a higher priority than the potential unfairness aspect of the user
> visible behavior for something that can do better.

Michal, you have almost no authorship stake in this code base. Would
it be possible to defer judgement on maintainability to people who do?