On Wed, Apr 15, 2020 at 11:44:58AM +0200, Michal Hocko wrote:
> On Wed 15-04-20 04:34:56, Paul Furtado wrote:
> > > You can either try to use cgroup v2 which has much better memcg aware dirty
> > > throttling implementation so such a large amount of dirty pages doesn't
> > > accumulate in the first place
> >
> > I'd love to use cgroup v2, however this is docker + kubernetes so that
> > would require a lot of changes on our end to make happen, given how
> > recently container runtimes gained cgroup v2 support.
> >
> > > I presume you are using defaults for
> > > /proc/sys/vm/dirty_{background_}ratio which is a percentage of the
> > > available memory. I would recommend using their resp. *_bytes
> > > alternatives and use something like 500M for background and 800M for
> > > dirty_bytes.
> >
> > We're using the defaults right now, however, given that this is a
> > containerized environment, it's problematic to set these values too
> > low system-wide since the containers all have dedicated volumes with
> > varying performance (from as low as 100MB/sec to gigabytes). Looking
> > around, I see that there were patches in the past to set per-cgroup
> > vm.dirty settings, however it doesn't look like those ever made it
> > into the kernel unless I'm missing something.
>
> I am not aware of that work for memcg v1.
>
> > In practice, maybe 500M
> > and 800M wouldn't be so bad though and may improve latency in other
> > ways. The other problem is that this also sets an upper bound on the
> > minimum container size for anything that does do IO.
>
> Well this would be a conservative approach but most allocations will
> simply be throttled during reclaim. It is the restricted memory reclaim
> context that is the bummer here. I have already brought up why this is
> the case in the generic write(2) system call path [1]. Maybe we can
> reduce the amount of NOFS requests.
>
> > That said, I'll still tune these settings in our infrastructure and see
> > how things go, but it sounds like something should be done inside the
> > kernel to help this situation, since it's so easy to trigger, but
> > looking at the threads that led to the commits you referenced, I can
> > see that this is complicated.
>
> Yeah, there are certainly things that we should be doing and reducing
> the NOFS allocations is the first step. From my past experience
> non trivial usage has turned out to be used incorrectly. I am not sure
> how much we can do for cgroup v1 though. If tuning for global dirty
> thresholds doesn't lead to a better behavior we can think of a band aid
> of some form. Something like this (only compile tested)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 05b4ec2c6499..4e1e8d121785 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2532,6 +2536,20 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	if (mem_cgroup_wait_acct_move(mem_over_limit))
>  		goto retry;
>  
> +	/*
> +	 * Legacy memcg relies on dirty data throttling during the reclaim
> +	 * but this cannot be done for GFP_NOFS requests so we might trigger
> +	 * the oom way too early. Throttle here if we have way too many
> +	 * dirty/writeback pages.
> +	 */
> +	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && !(gfp_mask & __GFP_FS)) {
> +		unsigned long dirty = memcg_page_state(memcg, NR_FILE_DIRTY),
> +			      writeback = memcg_page_state(memcg, NR_WRITEBACK);
> +
> +		if (4*(dirty + writeback) > 3* page_counter_read(&memcg->memory))
> +			schedule_timeout_interruptible(1);
> +	}
> +
>  	if (nr_retries--)
>  		goto retry;
>
> [1] http://lkml.kernel.org/r/20200415070228.GW4629@xxxxxxxxxxxxxx
> --
> Michal Hocko
> SUSE Labs

Hi Michal,

Following up on my conversation from bugzilla here:

I am currently seeing the same issue when migrating a container from 4.14 to
5.4+ kernels. I tested this patch with a configuration where the application
reaches the cgroup memory limit while doing IO. The issue is similar to the
one described at https://bugzilla.kernel.org/show_bug.cgi?id=207273, where we
see OOMs in the write syscall due to restricted memory reclaim. I tested your
patch; however, I had to increase the timeout from 1 to 10 jiffies to make it
work, and with that it works for both 5.4 and 5.10 on my workload.
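
To be concrete, the only delta I made against your diff was the sleep length.
Sketched against the same hunk in try_charge() (the context line is reproduced
from your patch), it amounts to:

 		if (4*(dirty + writeback) > 3* page_counter_read(&memcg->memory))
-			schedule_timeout_interruptible(1);
+			schedule_timeout_interruptible(10);

For reference, 10 jiffies is roughly 40ms at HZ=250 or 10ms at HZ=1000 per
throttled charge attempt, so it is still a fairly small pause per retry.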

I also tried adjusting the dirty_bytes* settings and it worked after some
tuning; however, there is no single set of values that suits all use cases.
Hence it does not look like a viable option for me to change those defaults
and expect them to work for all kinds of workloads.
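
For completeness, the global knobs I was adjusting are the ones below; the
500M/800M figures are just the example values suggested earlier in the thread,
not values I am claiming are right for every node:

# System-wide settings, so they affect every container/volume on the host.
# Note: writing the *_bytes variants makes the paired *_ratio knobs read as 0.
sysctl -w vm.dirty_background_bytes=$((500 * 1024 * 1024))
sysctl -w vm.dirty_bytes=$((800 * 1024 * 1024))

They can be persisted via /etc/sysctl.d/ once acceptable values are found, but
as noted above any single pair of numbers is a compromise across volumes with
very different throughput.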

I think working out a fix in the kernel may be a better option, since this
issue will be seen in so many use cases where applications are used to the old
kernel behavior and suddenly start failing on newer ones. I see the same stack
trace on the 4.19 kernel too.

Here is the stack trace:

dd invoked oom-killer: gfp_mask=0x101c4a(GFP_NOFS|__GFP_HIGHMEM|__GFP_HARDWALL|__GFP_MOVABLE|__GFP_WRITE), order=0, oom_score_adj=997
CPU: 0 PID: 28766 Comm: dd Not tainted 5.4.129-62.227.amzn2.x86_64 #1
Hardware name: Amazon EC2 m5.large/, BIOS 1.0 10/16/2017
Call Trace:
 dump_stack+0x50/0x6b
 dump_header+0x4a/0x200
 oom_kill_process+0xd7/0x110
 out_of_memory+0x105/0x510
 mem_cgroup_out_of_memory+0xb5/0xd0
 try_charge+0x766/0x7c0
 mem_cgroup_try_charge+0x70/0x190
 __add_to_page_cache_locked+0x355/0x390
 ? scan_shadow_nodes+0x30/0x30
 add_to_page_cache_lru+0x4a/0xc0
 pagecache_get_page+0xf5/0x210
 grab_cache_page_write_begin+0x1f/0x40
 iomap_write_begin.constprop.34+0x1ee/0x340
 ? iomap_write_end+0x91/0x240
 iomap_write_actor+0x92/0x170
 ? iomap_dirty_actor+0x1b0/0x1b0
 iomap_apply+0xba/0x130
 ? iomap_dirty_actor+0x1b0/0x1b0
 iomap_file_buffered_write+0x62/0x90
 ? iomap_dirty_actor+0x1b0/0x1b0
 xfs_file_buffered_aio_write+0xca/0x310 [xfs]
 new_sync_write+0x11b/0x1b0
 vfs_write+0xad/0x1a0
 ksys_write+0xa1/0xe0
 do_syscall_64+0x48/0xf0
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7fc956e853ad
Code: c3 8b 07 85 c0 75 24 49 89 fb 48 89 f0 48 89 d7 48 89 ce 4c 89 c2 4d 89 ca 4c 8b 44 24 08 4c 8b 4c 24 10 4c 89 5c 24 08 0f 05 <c3> e9 8a d2 ff ff 41 54 b8 02 00 00 00 49 89 f4 be 00 88 08 00 55
RSP: 002b:00007ffdf7960058 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 00007fc956ec6b48 RCX: 00007fc956e853ad
RDX: 0000000000100000 RSI: 00007fc956cd9000 RDI: 0000000000000001
RBP: 00007fc956cd9000 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000001
R13: 0000000000000000 R14: 00005558753057a0 R15: 0000000000100000
memory: usage 30720kB, limit 30720kB, failcnt 424
memory+swap: usage 30720kB, limit 9007199254740988kB, failcnt 0
kmem: usage 2416kB, limit 9007199254740988kB, failcnt 0
Memory cgroup stats for /kubepods/burstable/pod2d356ec7-5c92-4692-a184-380253ac6fbd:
anon 1089536
file 27475968
kernel_stack 73728
slab 1941504
sock 0
shmem 0
file_mapped 0
file_dirty 0
file_writeback 0
anon_thp 0
inactive_anon 0
active_anon 1351680
inactive_file 27705344
active_file 40960
unevictable 0
slab_reclaimable 819200
slab_unreclaimable 1122304
pgfault 23397
pgmajfault 0
workingset_refault 33
workingset_activate 33
workingset_nodereclaim 0
pgrefill 119108
pgscan 124436
pgsteal 928
pgactivate 123222
pgdeactivate 119083
pglazyfree 99
pglazyfreed 0
thp_fault_alloc 0
thp_collapse_alloc 0
Tasks state (memory values in pages):
[  pid  ]   uid  tgid  total_vm      rss  pgtables_bytes  swapents  oom_score_adj  name
[  28589]     0  28589      242        1          28672         0           -998  pause
[  28703]     0  28703      399        1          40960         0            997  sh
[  28766]     0  28766      821      341          45056         0            997  dd
oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=224eacdaa07c1a67f0cf2a5c85ffc6fe29d95f971743c7c7938de26e85351075,mems_allowed=0,oom_memcg=/kubepods/burstable/pod2d356ec7-5c92-4692-a184-380253ac6fbd,task_memcg=/kubepods/burstable/pod2d356ec7-5c92-4692-a184-380253ac6fbd/224eacdaa07c1a67f0cf2a5c85ffc6fe29d95f971743c7c7938de26e85351075,task=dd,pid=28766,uid=0
Memory cgroup out of memory: Killed process 28766 (dd) total-vm:3284kB, anon-rss:1036kB, file-rss:328kB, shmem-rss:0kB, UID:0 pgtables:44kB oom_score_adj:997
oom_reaper: reaped process 28766 (dd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

Here is a snippet of the container spec:

  containers:
  - image: docker.io/library/alpine:latest
    name: dd
    command:
    - sh
    args:
    - -c
    - cat /proc/meminfo && apk add coreutils && dd if=/dev/zero of=/data/file bs=1M count=1000 && cat /proc/meminfo && echo "OK" && sleep 300
    resources:
      requests:
        memory: 30Mi
        cpu: 20m
      limits:
        memory: 30Mi

Thanks,
Anchal Agarwal