On Wed 15-04-20 04:34:56, Paul Furtado wrote:
> > You can either try to use cgroup v2, which has a much better
> > memcg-aware dirty throttling implementation, so such a large amount
> > of dirty pages doesn't accumulate in the first place
>
> I'd love to use cgroup v2, however this is docker + kubernetes so that
> would require a lot of changes on our end to make happen, given how
> recently container runtimes gained cgroup v2 support.
>
> > I presume you are using the defaults for
> > /proc/sys/vm/dirty_{background_}ratio, which are a percentage of the
> > available memory. I would recommend using their resp. *_bytes
> > alternatives and using something like 500M for background and 800M
> > for dirty_bytes.
>
> We're using the defaults right now. However, given that this is a
> containerized environment, it's problematic to set these values too
> low system-wide, since the containers all have dedicated volumes with
> varying performance (from as low as 100MB/sec to gigabytes). Looking
> around, I see that there were patches in the past to set per-cgroup
> vm.dirty settings, but it doesn't look like those ever made it into
> the kernel, unless I'm missing something.

I am not aware of any such work for memcg v1.

> In practice, maybe 500M and 800M wouldn't be so bad though, and may
> improve latency in other ways. The other problem is that this also
> sets an upper bound on the minimum container size for anything that
> does do IO.

Well, this would be a conservative approach, but most allocations will
simply be throttled during reclaim. It is the restricted memory reclaim
context that is the bummer here. I have already brought up why this is
the case in the generic write(2) system call path [1]. Maybe we can
reduce the amount of NOFS requests.
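For reference, the 500M/800M thresholds suggested above could be applied
system-wide along these lines (a sketch only; the byte values are one
possible reading of "500M"/"800M", and note that writing the *_bytes
knobs automatically zeroes the corresponding *_ratio knobs):

```shell
# Sketch: set absolute dirty thresholds instead of the ratio defaults.
# Requires root. Writing *_bytes disables the *_ratio counterparts.
sysctl -w vm.dirty_background_bytes=$((500 * 1024 * 1024))   # 524288000
sysctl -w vm.dirty_bytes=$((800 * 1024 * 1024))              # 838860800

# To persist across reboots, the equivalent sysctl.conf entries
# (file path is just an example) would be:
#   vm.dirty_background_bytes = 524288000
#   vm.dirty_bytes = 838860800
```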
> That said, I'll still tune these settings in our infrastructure and
> see how things go, but it sounds like something should be done inside
> the kernel to help this situation, since it's so easy to trigger.
> Looking at the threads that led to the commits you referenced, though,
> I can see that this is complicated.

Yeah, there are certainly things that we should be doing, and reducing
the NOFS allocations is the first step. From my past experience, a
non-trivial number of those usages have turned out to be incorrect. I am
not sure how much we can do for cgroup v1, though. If tuning the global
dirty thresholds doesn't lead to a better behavior, we can think of a
band aid of some form. Something like this (only compile tested):

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 05b4ec2c6499..4e1e8d121785 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2532,6 +2536,20 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	if (mem_cgroup_wait_acct_move(mem_over_limit))
 		goto retry;
 
+	/*
+	 * Legacy memcg relies on dirty data throttling during the reclaim
+	 * but this cannot be done for GFP_NOFS requests so we might trigger
+	 * the oom way too early. Throttle here if we have way too many
+	 * dirty/writeback pages.
+	 */
+	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && !(gfp_mask & __GFP_FS)) {
+		unsigned long dirty = memcg_page_state(memcg, NR_FILE_DIRTY),
+			      writeback = memcg_page_state(memcg, NR_WRITEBACK);
+
+		if (4 * (dirty + writeback) > 3 * page_counter_read(&memcg->memory))
+			schedule_timeout_interruptible(1);
+	}
+
 	if (nr_retries--)
 		goto retry;

In other words, NOFS charges would get throttled once dirty plus
writeback pages make up more than 3/4 of the memcg's charged memory.

[1] http://lkml.kernel.org/r/20200415070228.GW4629@xxxxxxxxxxxxxx
-- 
Michal Hocko
SUSE Labs