On 9/21/21 5:59 AM, Michal Hocko wrote:
On Mon 20-09-21 23:38:40, Vishnu Rangayyan wrote:
Processes inside a memcg that get core dumped when little memory is
available in the memcg can have the core dump interrupted by the
oom-killer.
We saw this with qemu processes inside a memcg, as in this trace below.
The memcg was not out of memory when the core dump was triggered.
Why is it important to mention that the memcg was not oom when the
dump was triggered?
[201169.028782] qemu-kata-syste invoked oom-killer: gfp_mask=0x101c4a(GFP_NOFS|__GFP_HIGHMEM|__GFP_HARDWALL|__GFP_MOVABLE|__GFP_WRITE), order=0, oom_score_adj=-100
[...]
[201169.028863] memory: usage 12218368kB, limit 12218368kB, failcnt 1728013
It obviously is out of memory for this particular allocation from the
core dumping code.
[201169.028864] memory+swap: usage 12218368kB, limit 9007199254740988kB, failcnt 0
[201169.028864] kmem: usage 154424kB, limit 9007199254740988kB, failcnt 0
[201169.028880] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=podacfa3d53-2068-4b61-a754-fa21968b4201,mems_allowed=0-1,oom_memcg=/kubepods/burstable/podacfa3d53-2068-4b61-a754-fa21968b4201,task_memcg=/kubepods/burstable/podacfa3d53-2068-4b61-a754-fa21968b4201,task=qemu-kata-syste,pid=1887079,uid=0
[201169.028888] Memory cgroup out of memory: Killed process 1887079 (qemu-kata-syste) total-vm:13598556kB, anon-rss:39836kB, file-rss:8712kB, shmem-rss:12017992kB, UID:0 pgtables:24204kB oom_score_adj:-100
[201169.045201] oom_reaper: reaped process 1887079 (qemu-kata-syste), now anon-rss:0kB, file-rss:28kB, shmem-rss:12018016kB
This change adds an fsync, only for regular-file core dumps, based on a
configurable limit core_sync_bytes placed alongside the other core dump
params. The limit defaults to an arbitrary value of 128KB.
Setting core_sync_bytes to zero disables the sync.
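
To make the described behaviour concrete, here is a minimal user-space sketch of "sync after every N bytes written, with 0 meaning disabled". It is only an analogy and not the kernel patch itself: the function name write_with_periodic_sync and its sync_bytes parameter are hypothetical, and the kernel-side change would sit in the core dump write path rather than in a helper like this.

```python
import os
import tempfile


def write_with_periodic_sync(fd, chunks, sync_bytes=128 * 1024):
    """Write chunks to fd, calling fsync once at least sync_bytes bytes
    have been written since the last sync.

    sync_bytes == 0 disables the periodic sync, mirroring the semantics
    described for core_sync_bytes (an assumption about the patch).
    Returns the number of fsync calls made.
    """
    unsynced = 0
    syncs = 0
    for chunk in chunks:
        view = memoryview(chunk)
        while view:
            # os.write may write fewer bytes than requested.
            written = os.write(fd, view)
            view = view[written:]
            unsynced += written
        if sync_bytes and unsynced >= sync_bytes:
            # Force dirty pages for this file out to storage.
            os.fsync(fd)
            syncs += 1
            unsynced = 0
    return syncs


if __name__ == "__main__":
    fd, path = tempfile.mkstemp()
    try:
        n = write_with_periodic_sync(fd, [b"\0" * 65536] * 8)
        print("fsync calls:", n)
    finally:
        os.close(fd)
        os.remove(path)
```

Flushing this way only bounds the dirty page cache belonging to the dump file; it does not release memory charged for other reasons.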
This doesn't really explain either the problem or the solution.
My apologies for not explaining better.
Why is fsync helping at all? Why do we need a new sysctl to address the
problem, and how does it help to prevent the memcg OOM? Also, why is this
a problem in the first place?
The simple intent is to allow the core dump to succeed in low memory
situations, so that dump_emit does not push the memcg over its limit and
trigger the oom-killer. This change avoids only that particular issue.
Agree, it's not the actual problem at all. If the core dumping fails,
that sometimes prevents or delays looking into the actual issue.
The sysctl was to allow disabling this behavior or to fine tune for
special cases such as limited memory environments.
Have a look at the oom report. It says that only 8MB of the 11GB limit
is consumed by the file backed memory. The absolute majority (98%) is
sitting in the shmem and fsync will not help a wee bit there.
Agree.
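
The proportions quoted from the OOM report can be checked with a short calculation. This is a sketch that copies the figures from the trace above and, as the reply does, uses the killed task's RSS counters as a rough proxy for what is charged to the memcg; the variable names and the share() helper are illustrative only.

```python
# Values copied from the OOM report above (all in kB).
limit_kb = 12218368       # memory: limit
file_rss_kb = 8712        # file-rss of the killed task
anon_rss_kb = 39836       # anon-rss of the killed task
shmem_rss_kb = 12017992   # shmem-rss of the killed task


def share(part_kb, total_kb):
    """Return part as a percentage of total."""
    return 100.0 * part_kb / total_kb


shmem_pct = share(shmem_rss_kb, limit_kb)
file_pct = share(file_rss_kb, limit_kb)

print(f"limit:     {limit_kb / 1024 / 1024:.1f} GiB")
print(f"file-rss:  {file_rss_kb / 1024:.1f} MiB ({file_pct:.2f}% of limit)")
print(f"shmem-rss: {shmem_pct:.1f}% of limit")
```

File-backed pages come to well under one percent of the limit while shmem accounts for roughly 98%, which is why writing back the dump file's page cache cannot free a meaningful amount of memory here.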