On Fri, Aug 06, 2021 at 08:42:46PM +0000, Anchal Agarwal wrote:
> On Wed, Apr 15, 2020 at 11:44:58AM +0200, Michal Hocko wrote:
> > On Wed 15-04-20 04:34:56, Paul Furtado wrote:
> > > > You can either try to use cgroup v2, which has a much better memcg-aware
> > > > dirty throttling implementation, so such a large amount of dirty pages
> > > > doesn't accumulate in the first place
> > >
> > > I'd love to use cgroup v2, however this is docker + kubernetes so that
> > > would require a lot of changes on our end to make happen, given how
> > > recently container runtimes gained cgroup v2 support.
> > >
> > > > I presume you are using the defaults for
> > > > /proc/sys/vm/dirty_{background_}ratio, which are a percentage of the
> > > > available memory. I would recommend using their resp. *_bytes
> > > > alternatives and using something like 500M for background and 800M for
> > > > dirty_bytes.
> > >
> > > We're using the defaults right now, however, given that this is a
> > > containerized environment, it's problematic to set these values too
> > > low system-wide since the containers all have dedicated volumes with
> > > varying performance (from as low as 100MB/sec to gigabytes). Looking
> > > around, I see that there were patches in the past to set per-cgroup
> > > vm.dirty settings, however it doesn't look like those ever made it
> > > into the kernel unless I'm missing something.
> >
> > I am not aware of that work for memcg v1.
> >
> > > In practice, maybe 500M
> > > and 800M wouldn't be so bad though and may improve latency in other
> > > ways. The other problem is that this also sets an upper bound on the
> > > minimum container size for anything that does do IO.
> >
> > Well, this would be a conservative approach, but most allocations will
> > simply be throttled during reclaim. It is the restricted memory reclaim
> > context that is the bummer here. I have already brought up why this is
> > the case in the generic write(2) system call path [1]. Maybe we can
> > reduce the amount of NOFS requests.
> >
> > > That said, I'll
> > > still tune these settings in our infrastructure and see how
> > > things go, but it sounds like something should be done inside the
> > > kernel to help this situation, since it's so easy to trigger, but
> > > looking at the threads that led to the commits you referenced, I can
> > > see that this is complicated.
> >
> > Yeah, there are certainly things that we should be doing, and reducing
> > the NOFS allocations is the first step. From my past experience, a
> > non-trivial portion of that usage has turned out to be incorrect. I am
> > not sure how much we can do for cgroup v1 though. If tuning the global
> > dirty thresholds doesn't lead to better behavior, we can think of a
> > band-aid of some form. Something like this (only compile tested):
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 05b4ec2c6499..4e1e8d121785 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -2532,6 +2536,20 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
> >          if (mem_cgroup_wait_acct_move(mem_over_limit))
> >                  goto retry;
> >
> > +        /*
> > +         * Legacy memcg relies on dirty data throttling during the reclaim
> > +         * but this cannot be done for GFP_NOFS requests so we might trigger
> > +         * the oom way too early. Throttle here if we have way too many
> > +         * dirty/writeback pages.
> > +         */
> > +        if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && !(gfp_mask & __GFP_FS)) {
> > +                unsigned long dirty = memcg_page_state(memcg, NR_FILE_DIRTY),
> > +                              writeback = memcg_page_state(memcg, NR_WRITEBACK);
> > +
> > +                if (4 * (dirty + writeback) > 3 * page_counter_read(&memcg->memory))
> > +                        schedule_timeout_interruptible(1);
> > +        }
> > +
> >          if (nr_retries--)
> >                  goto retry;
> >
> >
> > [1] http://lkml.kernel.org/r/20200415070228.GW4629@xxxxxxxxxxxxxx
> > --
> > Michal Hocko
> > SUSE Labs
>
> Hi Michal,
> Following up on my conversation from Bugzilla here:
> I am currently seeing the same issue when migrating a container from 4.14 to
> 5.4+ kernels. I tested this patch with a configuration where the application
> reaches the cgroup memory limit while doing IO. The issue is similar to the
> one described in https://bugzilla.kernel.org/show_bug.cgi?id=207273, where we
> see OOMs in the write syscall due to restricted memory reclaim.
> I tested your patch; however, I had to increase the timeout from 1 to 10
> jiffies to make it work, and with that it works for both 5.4 and 5.10 on my
> workload.
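> In other words, the tweak is roughly the following (a sketch; everything else
> in your diff unchanged):
>
>         if (4 * (dirty + writeback) > 3 * page_counter_read(&memcg->memory))
>                 /* sleep 10 jiffies instead of 1 before the charge is retried */
>                 schedule_timeout_interruptible(10);
>
> The longer back-off presumably gives writeback more time to make progress
> before the charge is retried.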
> I also tried adjusting the dirty_bytes* settings, and that worked after some
> tuning.
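> For reference, the kind of tuning I mean is the *_bytes alternative you
> suggested earlier; the exact values below are only illustrative:
>
>         vm.dirty_background_bytes = 524288000   # ~500M
>         vm.dirty_bytes = 838860800              # ~800M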
> However, there is no single set of values that suits all use cases, so it
> does not look like a viable option for me to change those defaults here and
> expect them to work for every kind of workload. I think working out a fix in
> the kernel may be a better option, since this issue will be seen in so many
> use cases where applications are used to the old kernel behavior and suddenly
> start failing on newer kernels.
> I see the same stack trace on the 4.19 kernel too.
>
> Here is the stack trace:
>
> dd invoked oom-killer: gfp_mask=0x101c4a(GFP_NOFS|__GFP_HIGHMEM|__GFP_HARDWALL|__GFP_MOVABLE|__GFP_WRITE), order=0, oom_score_adj=997
> CPU: 0 PID: 28766 Comm: dd Not tainted 5.4.129-62.227.amzn2.x86_64 #1
> Hardware name: Amazon EC2 m5.large/, BIOS 1.0 10/16/2017
> Call Trace:
>  dump_stack+0x50/0x6b
>  dump_header+0x4a/0x200
>  oom_kill_process+0xd7/0x110
>  out_of_memory+0x105/0x510
>  mem_cgroup_out_of_memory+0xb5/0xd0
>  try_charge+0x766/0x7c0
>  mem_cgroup_try_charge+0x70/0x190
>  __add_to_page_cache_locked+0x355/0x390
>  ? scan_shadow_nodes+0x30/0x30
>  add_to_page_cache_lru+0x4a/0xc0
>  pagecache_get_page+0xf5/0x210
>  grab_cache_page_write_begin+0x1f/0x40
>  iomap_write_begin.constprop.34+0x1ee/0x340
>  ? iomap_write_end+0x91/0x240
>  iomap_write_actor+0x92/0x170
>  ? iomap_dirty_actor+0x1b0/0x1b0
>  iomap_apply+0xba/0x130
>  ? iomap_dirty_actor+0x1b0/0x1b0
>  iomap_file_buffered_write+0x62/0x90
>  ? iomap_dirty_actor+0x1b0/0x1b0
>  xfs_file_buffered_aio_write+0xca/0x310 [xfs]
>  new_sync_write+0x11b/0x1b0
>  vfs_write+0xad/0x1a0
>  ksys_write+0xa1/0xe0
>  do_syscall_64+0x48/0xf0
>  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> RIP: 0033:0x7fc956e853ad
> Code: c3 8b 07 85 c0 75 24 49 89 fb 48 89 f0 48 89 d7 48 89 ce 4c 89 c2 4d 89 ca 4c 8b 44 24 08 4c 8b 4c 24 10 4c 89 5c 24 08 0f 05 <c3> e9 8a d2 ff ff 41 54 b8 02 00 00 00 49 89 f4 be 00 88 08 00 55
> RSP: 002b:00007ffdf7960058 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
> RAX: ffffffffffffffda RBX: 00007fc956ec6b48 RCX: 00007fc956e853ad
> RDX: 0000000000100000 RSI: 00007fc956cd9000 RDI: 0000000000000001
> RBP: 00007fc956cd9000 R08: 0000000000000000 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000001
> R13: 0000000000000000 R14: 00005558753057a0 R15: 0000000000100000
> memory: usage 30720kB, limit 30720kB, failcnt 424
> memory+swap: usage 30720kB, limit 9007199254740988kB, failcnt 0
> kmem: usage 2416kB, limit 9007199254740988kB, failcnt 0
> Memory cgroup stats for /kubepods/burstable/pod2d356ec7-5c92-4692-a184-380253ac6fbd:
> anon 1089536
> file 27475968
> kernel_stack 73728
> slab 1941504
> sock 0
> shmem 0
> file_mapped 0
> file_dirty 0
> file_writeback 0
> anon_thp 0
> inactive_anon 0
> active_anon 1351680
> inactive_file 27705344
> active_file 40960
> unevictable 0
> slab_reclaimable 819200
> slab_unreclaimable 1122304
> pgfault 23397
> pgmajfault 0
> workingset_refault 33
> workingset_activate 33
> workingset_nodereclaim 0
> pgrefill 119108
> pgscan 124436
> pgsteal 928
> pgactivate 123222
> pgdeactivate 119083
> pglazyfree 99
> pglazyfreed 0
> thp_fault_alloc 0
> thp_collapse_alloc 0
> Tasks state (memory values in pages):
> [  pid  ]   uid  tgid  total_vm   rss  pgtables_bytes  swapents  oom_score_adj  name
> [ 28589]      0 28589       242     1           28672         0           -998  pause
> [ 28703]      0 28703       399     1           40960         0            997  sh
> [ 28766]      0 28766       821   341           45056         0            997  dd
> oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=224eacdaa07c1a67f0cf2a5c85ffc6fe29d95f971743c7c7938de26e85351075,mems_allowed=0,oom_memcg=/kubepods/burstable/pod2d356ec7-5c92-4692-a184-380253ac6fbd,task_memcg=/kubepods/burstable/pod2d356ec7-5c92-4692-a184-380253ac6fbd/224eacdaa07c1a67f0cf2a5c85ffc6fe29d95f971743c7c7938de26e85351075,task=dd,pid=28766,uid=0
> Memory cgroup out of memory: Killed process 28766 (dd) total-vm:3284kB, anon-rss:1036kB, file-rss:328kB, shmem-rss:0kB, UID:0 pgtables:44kB oom_score_adj:997
> oom_reaper: reaped process 28766 (dd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
>
> Here is a snippet of the container spec:
>
> containers:
> - image: docker.io/library/alpine:latest
>   name: dd
>   command:
>   - sh
>   args:
>   - -c
>   - cat /proc/meminfo && apk add coreutils && dd if=/dev/zero of=/data/file bs=1M count=1000 && cat /proc/meminfo && echo "OK" && sleep 300
>   resources:
>     requests:
>       memory: 30Mi
>       cpu: 20m
>     limits:
>       memory: 30Mi
>
> Thanks,
> Anchal Agarwal

A gentle ping on this issue!

Thanks,
Anchal Agarwal