On Wed 19-02-20 14:16:18, Johannes Weiner wrote:
> On Wed, Feb 19, 2020 at 07:37:31PM +0100, Michal Hocko wrote:
> > On Wed 19-02-20 13:12:19, Johannes Weiner wrote:
> > > We have received regression reports from users whose workloads moved
> > > into containers and subsequently encountered new latencies. For some
> > > users these were a nuisance, but for some it meant missing their SLA
> > > response times. We tracked those delays down to cgroup limits, which
> > > inject direct reclaim stalls into the workload where previously all
> > > reclaim was handled by kswapd.
> >
> > I am curious why this is unexpected when the high limit is explicitly
> > documented as a throttling mechanism.
>
> Memory.high is supposed to curb aggressive growth using throttling
> instead of OOM killing. However, if the workload has plenty of easily
> reclaimable memory and just needs to recycle a couple of cache pages
> to permit an allocation, there is no need to throttle the workload -
> just as there wouldn't be any need to trigger the OOM killer.
>
> So it's not unexpected, but it's unnecessarily heavy-handed: since
> memory.high allows some flexibility around the target size, we can
> move the routine reclaim activity (cache recycling) out of the main
> execution stream of the workload, just like we do with kswapd. If that
> cannot keep up, we can throttle and do direct reclaim.
>
> It doesn't change the memory.high semantics, but it allows exploiting
> the fact that we have SMP systems and can parallelize the bookkeeping.

Thanks, this describes the problem much better and I believe it all
belongs in the changelog.
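Just to make sure we are talking about the same flow, my understanding
of the proposal is roughly the sketch below. It is untested and only
meant to illustrate the control flow; the names (async_reclaim_work,
memcg_high_async_work) are made up here, not taken from your patch:

	/*
	 * Untested sketch, assuming a new struct work_struct
	 * async_reclaim_work embedded in struct mem_cgroup; names
	 * invented for illustration only.
	 */
	static void memcg_high_async_work(struct work_struct *work)
	{
		struct mem_cgroup *memcg = container_of(work,
				struct mem_cgroup, async_reclaim_work);

		/* kswapd-style recycling back down to the high limit */
		while (page_counter_read(&memcg->memory) >
		       READ_ONCE(memcg->high)) {
			if (!try_to_free_mem_cgroup_pages(memcg,
					SWAP_CLUSTER_MAX, GFP_KERNEL, true))
				break;
			cond_resched();
		}
		css_put(&memcg->css);
	}

	/* charge path, once the usage first crosses the high limit */
	if (page_counter_read(&memcg->memory) > READ_ONCE(memcg->high)) {
		if (css_tryget_online(&memcg->css)) {
			/* schedule_work() returns false if already queued */
			if (!schedule_work(&memcg->async_reclaim_work))
				css_put(&memcg->css);
		}
		/*
		 * Only when the usage grows some slack past the high
		 * limit would the task take the existing synchronous
		 * path (mem_cgroup_handle_over_high) and get throttled.
		 */
	}

IOW the worker would do the routine cache recycling asynchronously and
the current synchronous reclaim/throttling would only kick in when it
cannot keep up. Is that roughly the plan?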
> > > This patch adds asynchronous reclaim to the memory.high cgroup limit
> > > while keeping direct reclaim as a fallback. In our testing, this
> > > eliminated all direct reclaim from the affected workload.
> >
> > Who is all that work accounted to? Unless I am missing something this
> > just gets hidden in the system activity and that might hurt the
> > isolation. I do see how moving the work to a different context is
> > desirable but this work has to be accounted properly when it is going
> > to become a normal mode of operation (rather than a rare exception
> > like the existing irq context handling).
>
> Yes, the plan is to account it to the cgroup on whose behalf we're
> doing the work.

OK, great, because honestly I am not sure we can merge this work without
that being handled, I am afraid. We've had similar attempts - mostly to
parallelize work on behalf of the process (e.g. address space tear down)
- and the proper accounting was always the main obstacle, so we need to
solve this problem for other reasons as well. This doesn't sound very
different. And your example of a workload not meeting its SLAs just
shows that the amount of work required for the high limit reclaim can
be non-trivial. Somebody has to do that work and we cannot simply let
everybody else pay for it.

> The problem is that we have a general lack of usable CPU control right
> now - see Rik's work on this: https://lkml.org/lkml/2019/8/21/1208.
> For workloads that are contended on CPU, we cannot enable the CPU
> controller because the scheduling latencies are too high. And for
> workloads that aren't CPU contended, well, it doesn't really matter
> where the reclaim cycles are accounted to.
>
> Once we have the CPU controller up to speed, we can add annotations
> like these to account stretches of execution to specific cgroups.
> There just isn't much point to do it before we can actually enable CPU
> control on the real workloads where it would matter.
>
> [ This is generally work in progress: for example, if you isolate
> workloads with memory.low, kswapd CPU time isn't accounted to the
> cgroup that causes it. Swap IO issued by kswapd isn't accounted to
> the group that is getting swapped.

Well, kswapd is a system activity and as such it is acceptable that it
is accounted to the system. But in this case we are talking about a
memcg configuration which influences all other workloads by stealing
CPU cycles from them without much throttling on the consumer side -
especially when the memory is reclaimable without a lot of sleeping or
contention on locks etc.

I am absolutely aware that we will never achieve perfect isolation due
to all sorts of shared data structures, lock contention and whatnot,
but this patch alone just allows spill-over into unaccounted work way
too easily IMHO.

> The memory consumed by struct cgroup itself, the percpu allocations
> for the vmstat arrays etc., which is sizable, are not accounted to
> the cgroup that creates subgroups, and so forth. ]

-- 
Michal Hocko
SUSE Labs