On Wed, Apr 15, 2020 at 11:44:58AM +0200, Michal Hocko wrote:
> On Wed 15-04-20 04:34:56, Paul Furtado wrote:
> > > You can either try to use cgroup v2 which has much better memcg aware dirty
> > > throttling implementation so such a large amount of dirty pages doesn't
> > > accumulate in the first place
> >
> > I'd love to use cgroup v2, however this is docker + kubernetes so that
> > would require a lot of changes on our end to make happen, given how
> > recently container runtimes gained cgroup v2 support.
> >
> > > I presume you are using defaults for
> > > /proc/sys/vm/dirty_{background_}ratio which is a percentage of the
> > > available memory. I would recommend using their resp. *_bytes
> > > alternatives and use something like 500M for background and 800M for
> > > dirty_bytes.
> >
> > We're using the defaults right now, however, given that this is a
> > containerized environment, it's problematic to set these values too
> > low system-wide since the containers all have dedicated volumes with
> > varying performance (from as low as 100MB/sec to gigabytes). Looking
> > around, I see that there were patches in the past to set per-cgroup
> > vm.dirty settings, however it doesn't look like those ever made it
> > into the kernel unless I'm missing something.
>
> I am not aware of that work for memcg v1.
>
> > In practice, maybe 500M
> > and 800M wouldn't be so bad though and may improve latency in other
> > ways. The other problem is that this also sets an upper bound on the
> > minimum container size for anything that does do IO.
>
> Well this would be a conservative approach but most allocations will
> simply be throttled during reclaim. It is the restricted memory reclaim
> context that is the bummer here. I have already brought up why this is
> the case in the generic write(2) system call path [1]. Maybe we can
> reduce the amount of NOFS requests.
>
> > That said, I'll still tune these settings in our infrastructure and see
> > how things go, but it sounds like something should be done inside the
> > kernel to help this situation, since it's so easy to trigger, but
> > looking at the threads that led to the commits you referenced, I can
> > see that this is complicated.
>
> Yeah, there are certainly things that we should be doing and reducing
> the NOFS allocations is the first step. From my past experience
> non trivial usage has turned out to be used incorrectly. I am not sure
> how much we can do for cgroup v1 though. If tuning for global dirty
> thresholds doesn't lead to a better behavior we can think of a band aid
> of some form. Something like this (only compile tested)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 05b4ec2c6499..4e1e8d121785 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2532,6 +2536,20 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	if (mem_cgroup_wait_acct_move(mem_over_limit))
>  		goto retry;
>  
> +	/*
> +	 * Legacy memcg relies on dirty data throttling during the reclaim
> +	 * but this cannot be done for GFP_NOFS requests so we might trigger
> +	 * the oom way too early. Throttle here if we have way too many
> +	 * dirty/writeback pages.
> +	 */
> +	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && !(gfp_mask & __GFP_FS)) {
> +		unsigned long dirty = memcg_page_state(memcg, NR_FILE_DIRTY),
> +			      writeback = memcg_page_state(memcg, NR_WRITEBACK);
> +
> +		if (4*(dirty + writeback) > 3* page_counter_read(&memcg->memory))
> +			schedule_timeout_interruptible(1);
> +	}
> +
>  	if (nr_retries--)
>  		goto retry;
>
> [1] http://lkml.kernel.org/r/20200415070228.GW4629@xxxxxxxxxxxxxx
> --
> Michal Hocko
> SUSE Labs

Hi Michal,

Following up on my conversation from bugzilla here:

I am currently seeing the same issue when migrating a container from 4.14 to
5.4+ kernels. I tested this patch with a configuration where the application
reaches the cgroup memory limit while doing IO. The issue is similar to the
one described at https://bugzilla.kernel.org/show_bug.cgi?id=207273, where we
see OOMs in the write syscall due to restricted memory reclaim. I tested your
patch; however, I had to increase the timeout from 1 to 10 jiffies to make it
work, and with that it works for both 5.4 and 5.10 on my workload.
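
To be concrete, the only delta I made against your diff was the sleep length.
Sketched against the same hunk in try_charge() (the context line is reproduced
from your patch), it amounts to:

 		if (4*(dirty + writeback) > 3* page_counter_read(&memcg->memory))
-			schedule_timeout_interruptible(1);
+			schedule_timeout_interruptible(10);

For reference, 10 jiffies is roughly 40ms at HZ=250 or 10ms at HZ=1000 per
throttled charge attempt, so it is still a fairly small pause per retry.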

I also tried adjusting the dirty_bytes* settings and it worked after some
tuning; however, there is no single set of values that suits all use cases.
Hence it does not look like a viable option for me to change those defaults
and expect them to work for all kinds of workloads.
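
For completeness, the global knobs I was adjusting are the ones below; the
500M/800M figures are just the example values suggested earlier in the thread,
not values I am claiming are right for every node:

# System-wide settings, so they affect every container/volume on the host.
# Note: writing the *_bytes variants makes the paired *_ratio knobs read as 0.
sysctl -w vm.dirty_background_bytes=$((500 * 1024 * 1024))
sysctl -w vm.dirty_bytes=$((800 * 1024 * 1024))

They can be persisted via /etc/sysctl.d/ once acceptable values are found, but
as noted above any single pair of numbers is a compromise across volumes with
very different throughput.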

I think working out a fix in the kernel may be a better option, since this
issue will be seen in so many use cases where applications are used to the old
kernel behavior and suddenly start failing on newer ones. I see the same stack
trace on the 4.19 kernel too.

Here is the stack trace:

dd invoked oom-killer: gfp_mask=0x101c4a(GFP_NOFS|__GFP_HIGHMEM|__GFP_HARDWALL|__GFP_MOVABLE|__GFP_WRITE), order=0, oom_score_adj=997
CPU: 0 PID: 28766 Comm: dd Not tainted 5.4.129-62.227.amzn2.x86_64 #1
Hardware name: Amazon EC2 m5.large/, BIOS 1.0 10/16/2017
Call Trace:
 dump_stack+0x50/0x6b
 dump_header+0x4a/0x200
 oom_kill_process+0xd7/0x110
 out_of_memory+0x105/0x510
 mem_cgroup_out_of_memory+0xb5/0xd0
 try_charge+0x766/0x7c0
 mem_cgroup_try_charge+0x70/0x190
 __add_to_page_cache_locked+0x355/0x390
 ? scan_shadow_nodes+0x30/0x30
 add_to_page_cache_lru+0x4a/0xc0
 pagecache_get_page+0xf5/0x210
 grab_cache_page_write_begin+0x1f/0x40
 iomap_write_begin.constprop.34+0x1ee/0x340
 ? iomap_write_end+0x91/0x240
 iomap_write_actor+0x92/0x170
 ? iomap_dirty_actor+0x1b0/0x1b0
 iomap_apply+0xba/0x130
 ? iomap_dirty_actor+0x1b0/0x1b0
 iomap_file_buffered_write+0x62/0x90
 ? iomap_dirty_actor+0x1b0/0x1b0
 xfs_file_buffered_aio_write+0xca/0x310 [xfs]
 new_sync_write+0x11b/0x1b0
 vfs_write+0xad/0x1a0
 ksys_write+0xa1/0xe0
 do_syscall_64+0x48/0xf0
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7fc956e853ad
Code: c3 8b 07 85 c0 75 24 49 89 fb 48 89 f0 48 89 d7 48 89 ce 4c 89 c2 4d 89 ca 4c 8b 44 24 08 4c 8b 4c 24 10 4c 89 5c 24 08 0f 05 <c3> e9 8a d2 ff ff 41 54 b8 02 00 00 00 49 89 f4 be 00 88 08 00 55
RSP: 002b:00007ffdf7960058 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 00007fc956ec6b48 RCX: 00007fc956e853ad
RDX: 0000000000100000 RSI: 00007fc956cd9000 RDI: 0000000000000001
RBP: 00007fc956cd9000 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000001
R13: 0000000000000000 R14: 00005558753057a0 R15: 0000000000100000
memory: usage 30720kB, limit 30720kB, failcnt 424
memory+swap: usage 30720kB, limit 9007199254740988kB, failcnt 0
kmem: usage 2416kB, limit 9007199254740988kB, failcnt 0
Memory cgroup stats for /kubepods/burstable/pod2d356ec7-5c92-4692-a184-380253ac6fbd:
anon 1089536
file 27475968
kernel_stack 73728
slab 1941504
sock 0
shmem 0
file_mapped 0
file_dirty 0
file_writeback 0
anon_thp 0
inactive_anon 0
active_anon 1351680
inactive_file 27705344
active_file 40960
unevictable 0
slab_reclaimable 819200
slab_unreclaimable 1122304
pgfault 23397
pgmajfault 0
workingset_refault 33
workingset_activate 33
workingset_nodereclaim 0
pgrefill 119108
pgscan 124436
pgsteal 928
pgactivate 123222
pgdeactivate 119083
pglazyfree 99
pglazyfreed 0
thp_fault_alloc 0
thp_collapse_alloc 0
Tasks state (memory values in pages):
[  pid  ]   uid  tgid  total_vm      rss  pgtables_bytes  swapents  oom_score_adj  name
[  28589]     0  28589      242        1          28672         0           -998  pause
[  28703]     0  28703      399        1          40960         0            997  sh
[  28766]     0  28766      821      341          45056         0            997  dd
oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=224eacdaa07c1a67f0cf2a5c85ffc6fe29d95f971743c7c7938de26e85351075,mems_allowed=0,oom_memcg=/kubepods/burstable/pod2d356ec7-5c92-4692-a184-380253ac6fbd,task_memcg=/kubepods/burstable/pod2d356ec7-5c92-4692-a184-380253ac6fbd/224eacdaa07c1a67f0cf2a5c85ffc6fe29d95f971743c7c7938de26e85351075,task=dd,pid=28766,uid=0
Memory cgroup out of memory: Killed process 28766 (dd) total-vm:3284kB, anon-rss:1036kB, file-rss:328kB, shmem-rss:0kB, UID:0 pgtables:44kB oom_score_adj:997
oom_reaper: reaped process 28766 (dd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

Here is a snippet of the container spec:

  containers:
  - image: docker.io/library/alpine:latest
    name: dd
    command:
    - sh
    args:
    - -c
    - cat /proc/meminfo && apk add coreutils && dd if=/dev/zero of=/data/file bs=1M count=1000 && cat /proc/meminfo && echo "OK" && sleep 300
    resources:
      requests:
        memory: 30Mi
        cpu: 20m
      limits:
        memory: 30Mi

Thanks,
Anchal Agarwal