On Fri, Aug 06, 2021 at 08:42:46PM +0000, Anchal Agarwal wrote:
> On Wed, Apr 15, 2020 at 11:44:58AM +0200, Michal Hocko wrote:
> > On Wed 15-04-20 04:34:56, Paul Furtado wrote:
> > > > You can either try to use cgroup v2, which has a much better memcg-aware
> > > > dirty throttling implementation, so such a large amount of dirty pages
> > > > doesn't accumulate in the first place
> > >
> > > I'd love to use cgroup v2, however this is docker + kubernetes so that
> > > would require a lot of changes on our end to make happen, given how
> > > recently container runtimes gained cgroup v2 support.
> > >
> > > > I presume you are using the defaults for
> > > > /proc/sys/vm/dirty_{background_}ratio, which are a percentage of the
> > > > available memory. I would recommend using their resp. *_bytes
> > > > alternatives and using something like 500M for background and 800M for
> > > > dirty_bytes.
> > >
> > > We're using the defaults right now, however, given that this is a
> > > containerized environment, it's problematic to set these values too
> > > low system-wide since the containers all have dedicated volumes with
> > > varying performance (from as low as 100MB/sec to gigabytes). Looking
> > > around, I see that there were patches in the past to set per-cgroup
> > > vm.dirty settings, however it doesn't look like those ever made it
> > > into the kernel unless I'm missing something.
> >
> > I am not aware of that work for memcg v1.
> >
> > > In practice, maybe 500M
> > > and 800M wouldn't be so bad though and may improve latency in other
> > > ways. The other problem is that this also sets an upper bound on the
> > > minimum container size for anything that does do IO.
> >
> > Well, this would be a conservative approach, but most allocations will
> > simply be throttled during reclaim. It is the restricted memory reclaim
> > context that is the bummer here. I have already brought up why this is
> > the case in the generic write(2) system call path [1]. Maybe we can
> > reduce the amount of NOFS requests.
> >
> > > That said, I'll
> > > still tune these settings in our infrastructure and see how
> > > things go, but it sounds like something should be done inside the
> > > kernel to help this situation, since it's so easy to trigger, but
> > > looking at the threads that led to the commits you referenced, I can
> > > see that this is complicated.
> >
> > Yeah, there are certainly things that we should be doing, and reducing
> > the NOFS allocations is the first step. From my past experience, a
> > non-trivial portion of that usage has turned out to be incorrect. I am
> > not sure how much we can do for cgroup v1 though. If tuning the global
> > dirty thresholds doesn't lead to better behavior, we can think of a
> > band-aid of some form. Something like this (only compile tested):
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 05b4ec2c6499..4e1e8d121785 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -2532,6 +2536,20 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
> >          if (mem_cgroup_wait_acct_move(mem_over_limit))
> >                  goto retry;
> >
> > +        /*
> > +         * Legacy memcg relies on dirty data throttling during the reclaim
> > +         * but this cannot be done for GFP_NOFS requests so we might trigger
> > +         * the oom way too early. Throttle here if we have way too many
> > +         * dirty/writeback pages.
> > +         */
> > +        if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && !(gfp_mask & __GFP_FS)) {
> > +                unsigned long dirty = memcg_page_state(memcg, NR_FILE_DIRTY),
> > +                              writeback = memcg_page_state(memcg, NR_WRITEBACK);
> > +
> > +                if (4 * (dirty + writeback) > 3 * page_counter_read(&memcg->memory))
> > +                        schedule_timeout_interruptible(1);
> > +        }
> > +
> >          if (nr_retries--)
> >                  goto retry;
> >
> >
> > [1] http://lkml.kernel.org/r/20200415070228.GW4629@xxxxxxxxxxxxxx
> > --
> > Michal Hocko
> > SUSE Labs
>
> Hi Michal,
> Following up on my conversation from Bugzilla here:
> I am currently seeing the same issue when migrating a container from 4.14 to
> 5.4+ kernels. I tested this patch with a configuration where the application
> reaches the cgroup memory limit while doing IO. The issue is similar to the
> one described in https://bugzilla.kernel.org/show_bug.cgi?id=207273, where we
> see OOMs in the write syscall due to restricted memory reclaim.
> I tested your patch; however, I had to increase the timeout from 1 to 10
> jiffies to make it work, and with that it works for both 5.4 and 5.10 on my
> workload.
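> In other words, the tweak is roughly the following (a sketch; everything else
> in your diff unchanged):
>
>         if (4 * (dirty + writeback) > 3 * page_counter_read(&memcg->memory))
>                 /* sleep 10 jiffies instead of 1 before the charge is retried */
>                 schedule_timeout_interruptible(10);
>
> The longer back-off presumably gives writeback more time to make progress
> before the charge is retried.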
> I also tried adjusting the dirty_bytes* settings, and that worked after some
> tuning.
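> For reference, the kind of tuning I mean is the *_bytes alternative you
> suggested earlier; the exact values below are only illustrative:
>
>         vm.dirty_background_bytes = 524288000   # ~500M
>         vm.dirty_bytes = 838860800              # ~800M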
> However, there is no single set of values that suits all use cases, so it
> does not look like a viable option for me to change those defaults here and
> expect them to work for every kind of workload. I think working out a fix in
> the kernel may be a better option, since this issue will be seen in so many
> use cases where applications are used to the old kernel behavior and suddenly
> start failing on newer kernels.
> I see the same stack trace on the 4.19 kernel too.
>
> Here is the stack trace:
>
> dd invoked oom-killer: gfp_mask=0x101c4a(GFP_NOFS|__GFP_HIGHMEM|__GFP_HARDWALL|__GFP_MOVABLE|__GFP_WRITE), order=0, oom_score_adj=997
> CPU: 0 PID: 28766 Comm: dd Not tainted 5.4.129-62.227.amzn2.x86_64 #1
> Hardware name: Amazon EC2 m5.large/, BIOS 1.0 10/16/2017
> Call Trace:
>  dump_stack+0x50/0x6b
>  dump_header+0x4a/0x200
>  oom_kill_process+0xd7/0x110
>  out_of_memory+0x105/0x510
>  mem_cgroup_out_of_memory+0xb5/0xd0
>  try_charge+0x766/0x7c0
>  mem_cgroup_try_charge+0x70/0x190
>  __add_to_page_cache_locked+0x355/0x390
>  ? scan_shadow_nodes+0x30/0x30
>  add_to_page_cache_lru+0x4a/0xc0
>  pagecache_get_page+0xf5/0x210
>  grab_cache_page_write_begin+0x1f/0x40
>  iomap_write_begin.constprop.34+0x1ee/0x340
>  ? iomap_write_end+0x91/0x240
>  iomap_write_actor+0x92/0x170
>  ? iomap_dirty_actor+0x1b0/0x1b0
>  iomap_apply+0xba/0x130
>  ? iomap_dirty_actor+0x1b0/0x1b0
>  iomap_file_buffered_write+0x62/0x90
>  ? iomap_dirty_actor+0x1b0/0x1b0
>  xfs_file_buffered_aio_write+0xca/0x310 [xfs]
>  new_sync_write+0x11b/0x1b0
>  vfs_write+0xad/0x1a0
>  ksys_write+0xa1/0xe0
>  do_syscall_64+0x48/0xf0
>  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> RIP: 0033:0x7fc956e853ad
> Code: c3 8b 07 85 c0 75 24 49 89 fb 48 89 f0 48 89 d7 48 89 ce 4c 89 c2 4d 89 ca 4c 8b 44 24 08 4c 8b 4c 24 10 4c 89 5c 24 08 0f 05 <c3> e9 8a d2 ff ff 41 54 b8 02 00 00 00 49 89 f4 be 00 88 08 00 55
> RSP: 002b:00007ffdf7960058 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
> RAX: ffffffffffffffda RBX: 00007fc956ec6b48 RCX: 00007fc956e853ad
> RDX: 0000000000100000 RSI: 00007fc956cd9000 RDI: 0000000000000001
> RBP: 00007fc956cd9000 R08: 0000000000000000 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000001
> R13: 0000000000000000 R14: 00005558753057a0 R15: 0000000000100000
> memory: usage 30720kB, limit 30720kB, failcnt 424
> memory+swap: usage 30720kB, limit 9007199254740988kB, failcnt 0
> kmem: usage 2416kB, limit 9007199254740988kB, failcnt 0
> Memory cgroup stats for /kubepods/burstable/pod2d356ec7-5c92-4692-a184-380253ac6fbd:
> anon 1089536
> file 27475968
> kernel_stack 73728
> slab 1941504
> sock 0
> shmem 0
> file_mapped 0
> file_dirty 0
> file_writeback 0
> anon_thp 0
> inactive_anon 0
> active_anon 1351680
> inactive_file 27705344
> active_file 40960
> unevictable 0
> slab_reclaimable 819200
> slab_unreclaimable 1122304
> pgfault 23397
> pgmajfault 0
> workingset_refault 33
> workingset_activate 33
> workingset_nodereclaim 0
> pgrefill 119108
> pgscan 124436
> pgsteal 928
> pgactivate 123222
> pgdeactivate 119083
> pglazyfree 99
> pglazyfreed 0
> thp_fault_alloc 0
> thp_collapse_alloc 0
> Tasks state (memory values in pages):
> [  pid  ]   uid  tgid  total_vm   rss  pgtables_bytes  swapents  oom_score_adj  name
> [ 28589]      0 28589       242     1           28672         0           -998  pause
> [ 28703]      0 28703       399     1           40960         0            997  sh
> [ 28766]      0 28766       821   341           45056         0            997  dd
> oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=224eacdaa07c1a67f0cf2a5c85ffc6fe29d95f971743c7c7938de26e85351075,mems_allowed=0,oom_memcg=/kubepods/burstable/pod2d356ec7-5c92-4692-a184-380253ac6fbd,task_memcg=/kubepods/burstable/pod2d356ec7-5c92-4692-a184-380253ac6fbd/224eacdaa07c1a67f0cf2a5c85ffc6fe29d95f971743c7c7938de26e85351075,task=dd,pid=28766,uid=0
> Memory cgroup out of memory: Killed process 28766 (dd) total-vm:3284kB, anon-rss:1036kB, file-rss:328kB, shmem-rss:0kB, UID:0 pgtables:44kB oom_score_adj:997
> oom_reaper: reaped process 28766 (dd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
>
> Here is a snippet of the container spec:
>
> containers:
> - image: docker.io/library/alpine:latest
>   name: dd
>   command:
>   - sh
>   args:
>   - -c
>   - cat /proc/meminfo && apk add coreutils && dd if=/dev/zero of=/data/file bs=1M count=1000 && cat /proc/meminfo && echo "OK" && sleep 300
>   resources:
>     requests:
>       memory: 30Mi
>       cpu: 20m
>     limits:
>       memory: 30Mi
>
> Thanks,
> Anchal Agarwal

A gentle ping on this issue!

Thanks,
Anchal Agarwal