On Wed 19-02-20 14:16:18, Johannes Weiner wrote:
> On Wed, Feb 19, 2020 at 07:37:31PM +0100, Michal Hocko wrote:
> > On Wed 19-02-20 13:12:19, Johannes Weiner wrote:
> > > We have received regression reports from users whose workloads moved
> > > into containers and subsequently encountered new latencies. For some
> > > users these were a nuisance, but for some it meant missing their SLA
> > > response times. We tracked those delays down to cgroup limits, which
> > > inject direct reclaim stalls into the workload where previously all
> > > reclaim was handled by kswapd.
> >
> > I am curious why this is unexpected when the high limit is explicitly
> > documented as a throttling mechanism.
>
> Memory.high is supposed to curb aggressive growth using throttling
> instead of OOM killing. However, if the workload has plenty of easily
> reclaimable memory and just needs to recycle a couple of cache pages
> to permit an allocation, there is no need to throttle the workload -
> just as there wouldn't be any need to trigger the OOM killer.
>
> So it's not unexpected, but it's unnecessarily heavy-handed: since
> memory.high allows some flexibility around the target size, we can
> move the routine reclaim activity (cache recycling) out of the main
> execution stream of the workload, just like we do with kswapd. If that
> cannot keep up, we can throttle and do direct reclaim.
>
> It doesn't change the memory.high semantics, but it allows exploiting
> the fact that we have SMP systems and can parallelize the bookkeeping.

Thanks, this describes the problem much better and I believe it all
belongs in the changelog.
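Just to make sure we are talking about the same flow, my understanding
of the proposal is roughly the sketch below. It is untested and only
meant to illustrate the control flow; the names (async_reclaim_work,
memcg_high_async_work) are made up here, not taken from your patch:

	/*
	 * Untested sketch, assuming a new struct work_struct
	 * async_reclaim_work embedded in struct mem_cgroup; names
	 * invented for illustration only.
	 */
	static void memcg_high_async_work(struct work_struct *work)
	{
		struct mem_cgroup *memcg = container_of(work,
				struct mem_cgroup, async_reclaim_work);

		/* kswapd-style recycling back down to the high limit */
		while (page_counter_read(&memcg->memory) >
		       READ_ONCE(memcg->high)) {
			if (!try_to_free_mem_cgroup_pages(memcg,
					SWAP_CLUSTER_MAX, GFP_KERNEL, true))
				break;
			cond_resched();
		}
		css_put(&memcg->css);
	}

	/* charge path, once the usage first crosses the high limit */
	if (page_counter_read(&memcg->memory) > READ_ONCE(memcg->high)) {
		if (css_tryget_online(&memcg->css)) {
			/* schedule_work() returns false if already queued */
			if (!schedule_work(&memcg->async_reclaim_work))
				css_put(&memcg->css);
		}
		/*
		 * Only when the usage grows some slack past the high
		 * limit would the task take the existing synchronous
		 * path (mem_cgroup_handle_over_high) and get throttled.
		 */
	}

IOW the worker would do the routine cache recycling asynchronously and
the current synchronous reclaim/throttling would only kick in when it
cannot keep up. Is that roughly the plan?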
> > > This patch adds asynchronous reclaim to the memory.high cgroup limit
> > > while keeping direct reclaim as a fallback. In our testing, this
> > > eliminated all direct reclaim from the affected workload.
> >
> > Who is all that work accounted to? Unless I am missing something this
> > just gets hidden in the system activity and that might hurt the
> > isolation. I do see how moving the work to a different context is
> > desirable but this work has to be accounted properly when it is going
> > to become a normal mode of operation (rather than a rare exception
> > like the existing irq context handling).
>
> Yes, the plan is to account it to the cgroup on whose behalf we're
> doing the work.

OK, great, because honestly I am not sure we can merge this work without
that being handled, I am afraid. We've had similar attempts - mostly to
parallelize work on behalf of the process (e.g. address space tear down)
- and the proper accounting was always the main obstacle, so we need to
solve this problem for other reasons as well. This doesn't sound very
different. And your example of a workload not meeting its SLAs just
shows that the amount of work required for the high limit reclaim can
be non-trivial. Somebody has to do that work and we cannot simply let
everybody else pay for it.

> The problem is that we have a general lack of usable CPU control right
> now - see Rik's work on this: https://lkml.org/lkml/2019/8/21/1208.
> For workloads that are contended on CPU, we cannot enable the CPU
> controller because the scheduling latencies are too high. And for
> workloads that aren't CPU contended, well, it doesn't really matter
> where the reclaim cycles are accounted to.
>
> Once we have the CPU controller up to speed, we can add annotations
> like these to account stretches of execution to specific cgroups.
> There just isn't much point to do it before we can actually enable CPU
> control on the real workloads where it would matter.
>
> [ This is generally work in progress: for example, if you isolate
> workloads with memory.low, kswapd CPU time isn't accounted to the
> cgroup that causes it. Swap IO issued by kswapd isn't accounted to
> the group that is getting swapped.

Well, kswapd is a system activity and as such it is acceptable that it
is accounted to the system. But in this case we are talking about a
memcg configuration which influences all other workloads by stealing
CPU cycles from them without much throttling on the consumer side -
especially when the memory is reclaimable without a lot of sleeping or
contention on locks etc.

I am absolutely aware that we will never achieve perfect isolation due
to all sorts of shared data structures, lock contention and whatnot,
but this patch alone just allows spill-over into unaccounted work way
too easily IMHO.

> The memory consumed by struct cgroup itself, the percpu allocations
> for the vmstat arrays etc., which is sizable, are not accounted to
> the cgroup that creates subgroups, and so forth. ]

-- 
Michal Hocko
SUSE Labs