On Sun 10-02-13 17:46:19, azurIt wrote: > >stuck in the ptrace code. > > > But this happens _after_ the cgroup was freezed and i tried to strace > one of it's processes (to see what's happening): > > Feb 8 01:29:46 server01 kernel: [ 1187.540672] grsec: From 178.40.250.111: process /usr/lib/apache2/mpm-itk/apache2(apache2:18211) attached to via ptrace by /usr/bin/strace[strace:18258] uid/euid:0/0 gid/egid:0/0, parent /usr/bin/htop[htop:2901] uid/euid:0/0 gid/egid:0/0 Hmmm, Feb 8 01:39:16 server01 kernel: [ 1757.266678] Memory cgroup out of memory: Kill process 18211 (apache2) score 725 or sacrifice child) So the process has been killed 10 minutes ago and this was really the last OOM event for group /1258: $ grep "Task in /1258/uid killed" kern2.log | tail -n2 Feb 8 01:39:16 server01 kernel: [ 1757.045021] Task in /1258/uid killed as a result of limit of /1258 Feb 8 01:39:16 server01 kernel: [ 1757.167984] Task in /1258/uid killed as a result of limit of /1258 But this was still before you started collecting data for memcg-bug-4 (2:34) so we do not know what was the previous stack unfortunatelly. > >> Why are all PIDs waiting on 'mem_cgroup_handle_oom' and there is no > >> OOM message in the log? > > > >I am not sure what you mean here but there are > >$ grep "Memory cgroup out of memory:" kern2.collected.log | wc -l > >16 > > > >OOM killer events during the time you were gathering memcg-bug-4 data. > > > >> Data in memcg-bug-4.tar.gz are only for 2 > >> minutes but i let it run for about 15-20 minutes, no single process > >> killed by OOM. > > > >I can see > >$ grep "Memory cgroup out of memory:" kern2.after.log | wc -l > >57 > > > >killed after 02:38:47 when you stopped gathering data for memcg-bug-4 > > > I meant no single process was killed inside cgroup 1258 (data from > this cgroup are in memcg-bug-4.tar.gz). > > Just get data from memcg-bug-4.tar.gz which were taken from cgroup > 1258. Are you sure about that? When I extracted all pids from timestamp directories and greped them in the log I got this: for i in `cat bug/pids` ; do grep "\[ *\<$i\>\]" kern2.log ; done Feb 8 01:31:02 server01 kernel: [ 1263.429212] [18211] 1258 18211 164338 60950 0 0 0 apache2 Feb 8 01:31:15 server01 kernel: [ 1276.655241] [18211] 1258 18211 164338 60950 0 0 0 apache2 Feb 8 01:32:29 server01 kernel: [ 1350.797835] [18211] 1258 18211 164338 60950 0 0 0 apache2 Feb 8 01:32:42 server01 kernel: [ 1363.662242] [18211] 1258 18211 164338 60950 0 0 0 apache2 Feb 8 01:32:46 server01 kernel: [ 1367.181798] [18211] 1258 18211 164338 60950 0 0 0 apache2 Feb 8 01:32:46 server01 kernel: [ 1367.381627] [18211] 1258 18211 164338 60950 0 0 0 apache2 Feb 8 01:32:46 server01 kernel: [ 1367.490896] [18211] 1258 18211 164338 60950 0 0 0 apache2 Feb 8 01:33:02 server01 kernel: [ 1383.709652] [18211] 1258 18211 164338 60950 0 0 0 apache2 Feb 8 01:36:26 server01 kernel: [ 1587.458967] [18211] 1258 18211 164338 60950 0 0 0 apache2 Feb 8 01:36:26 server01 kernel: [ 1587.558419] [18211] 1258 18211 164338 60950 0 0 0 apache2 Feb 8 01:36:26 server01 kernel: [ 1587.652474] [18211] 1258 18211 164338 60950 0 0 0 apache2 Feb 8 01:39:02 server01 kernel: [ 1743.107086] [18211] 1258 18211 164338 60950 0 0 0 apache2 Feb 8 01:39:16 server01 kernel: [ 1757.015359] [18211] 1258 18211 164338 60950 0 0 0 apache2 Feb 8 01:39:16 server01 kernel: [ 1757.133998] [18211] 1258 18211 164338 60950 0 0 0 apache2 Feb 8 01:39:16 server01 kernel: [ 1757.262992] [18211] 1258 18211 164338 60950 0 0 0 apache2 Feb 8 01:18:12 server01 kernel: [ 493.156641] [ 7888] 1293 7888 169326 64876 3 0 0 apache2 Feb 8 01:18:12 server01 kernel: [ 493.269129] [ 7888] 1293 7888 169390 64876 4 0 0 apache2 Feb 8 01:18:21 server01 kernel: [ 502.384221] [ 8011] 1293 8011 170094 65675 5 0 0 apache2 Feb 8 01:18:24 server01 kernel: [ 505.052600] [ 8011] 1293 8011 170260 65854 2 0 0 apache2 Feb 8 01:18:24 server01 kernel: [ 505.200454] [ 8011] 1293 8011 170260 65854 2 0 0 apache2 Feb 8 01:18:33 server01 kernel: [ 514.538637] [ 8054] 1258 8054 164404 60618 1 0 0 apache2 Feb 8 01:18:30 server01 kernel: [ 511.230146] [ 8102] 1293 8102 170258 65869 7 0 0 apache2 So at least 7888, 8011 and 8102 were from a different group (1293). Others were never listed in the eligible processes list which is a bit unexpected. It is also unfortunate because I cannot match them to their groups from the log. $ for i in `cat bug/pids` ; do grep "\[ *\<$i\>\]" kern2.log >/dev/null || echo "$i not listed" ; done 7265 not listed 7474 not listed 7710 not listed 7969 not listed 7988 not listed 7997 not listed 8000 not listed 8014 not listed 8016 not listed 8019 not listed 8057 not listed 8058 not listed 8059 not listed 8063 not listed 8064 not listed 8066 not listed 8067 not listed 8069 not listed 8070 not listed 8071 not listed 8072 not listed 8075 not listed 8091 not listed 8092 not listed 8094 not listed 8098 not listed 8099 not listed 8100 not listed Are you sure all of them belong to 1258 group? > Almost all processes are in 'mem_cgroup_handle_oom' so cgroup > is under OOM. You are right, almost all of them are waiting in mem_cgroup_handle_oom which suggest that they should be listed in a per group eligible tasks list. One way how this might happen is when a process which manages to get oom_lock has a fatal signal pending. Then we wouldn't get to oom_kill_process and no OOM messages would get printed. This is correct because such a task would terminate soon anyway and all the waiters would wake up eventually. If not enough memory would be freed another task would get the oom_lock and this one would trigger OOM (unless it has fatal signal pending as well). Another option would be that no task could be selected - e.g. because select_bad_process sees TIF_MEMDIE marked task - the one already killed by OOM killer but that wasn't able to terminate for some reason. 18211 could be such a task. But we do not know what was going on with it before strace attached to it. Finally it is possible that the OOM header (everything up to Kill process) was suppressed because of rate limiting. But $ grep -B1 "Kill process" kern2.log Feb 8 01:15:02 server01 kernel: [ 304.000402] [ 4969] 1258 4969 163761 59554 6 0 0 apache2 Feb 8 01:15:02 server01 kernel: [ 304.000649] Memory cgroup out of memory: Kill process 4816 (apache2) score 1000 or sacrifice child -- Feb 8 01:15:51 server01 kernel: [ 352.924573] [ 5847] 1709 5847 163433 58952 6 0 0 apache2 Feb 8 01:15:51 server01 kernel: [ 352.924761] Memory cgroup out of memory: Kill process 5212 (apache2) score 1000 or sacrifice child [...] says that the message was preceded by a process list so we can exclude rate limiting. > I assume that this is suppose to take only few seconds > while kernel finds any process and kill it (and maybe do it again > until enough of memory is freed). I was gathering the data for > about 2 and a half minutes and NO SINGLE process was killed (just > compate list of PIDs from the first and the last directory inside > memcg-bug-4.tar.gz). Even more, no single process was killed in cgroup > 1258 also after i stopped gathering the data. You can also take the > list od PID from memcg-bug-4.tar.gz and you will find only 18211 and > 8102 (which are the two stucked processes). > > So my question is: Why no process was killed inside cgroup 1258 > while it was under OOM? I would bet that there is something weird going on with pid:18211. But I do not have enough information to find out what and why. > It was under OOM for at least 2 and a half of minutes while i was > gathering the data (then i let it run for additional, cca, 10 minutes > and then killed processes by hand but i cannot proof this). Why kernel > didn't kill any process for so long and ends the OOM? As already mentioned above, select_bad_process doesn't select any task if there is one which is on the way out. Maybe this is what is going on here. > Btw, processes in cgroup 1258 (memcg-bug-4.tar.gz) are looping in this > two tasks (i pasted only first line of stack): > mem_cgroup_handle_oom+0x241/0x3b0 > 0xffffffffffffffff 0xffffffffffffffff is just a bogus entry. No idea why this happens. > Some of them are in 'poll_schedule_timeout' and then they start to > loop as above. Is this correct behavior? > For example, do (first line of stack from process 7710 from all > timestamps): for i in */7710/stack; do head -n1 $i; done Yes, this is perfectly ok, because that task starts with: $ cat bug/1360287245/7710/stack [<ffffffff81125eb9>] poll_schedule_timeout+0x49/0x70 [<ffffffff8112675b>] do_sys_poll+0x54b/0x680 [<ffffffff81126b4c>] sys_poll+0x7c/0xf0 [<ffffffff815b6866>] system_call_fastpath+0x18/0x1d [<ffffffffffffffff>] 0xffffffffffffffff and then later on it gets into OOM because of a page fault: $ cat bug/1360287250/7710/stack [<ffffffff8110ae51>] mem_cgroup_handle_oom+0x241/0x3b0 [<ffffffff8110ba83>] T.1149+0x5f3/0x600 [<ffffffff8110bf5c>] mem_cgroup_charge_common+0x6c/0xb0 [<ffffffff8110bfe5>] mem_cgroup_newpage_charge+0x45/0x50 [<ffffffff810eca1e>] do_wp_page+0x14e/0x800 [<ffffffff810edf04>] handle_pte_fault+0x264/0x940 [<ffffffff810ee718>] handle_mm_fault+0x138/0x260 [<ffffffff810270bd>] do_page_fault+0x13d/0x460 [<ffffffff815b633f>] page_fault+0x1f/0x30 [<ffffffffffffffff>] 0xffffffffffffffff And it loops in it until the end which is possible as well if the group is under permanent OOM condition and the task is not selected to be killed. Unfortunately I am not able to reproduce this behavior even if I try to hammer OOM like mad so I am afraid I cannot help you much without further debugging patches. I do realize that experimenting in your environment is a problem but I do not many options left. Please do not use strace and rather collect /proc/pid/stack instead. It would be also helpful to get group/tasks file to have a full list of tasks in the group --- >From 1139745d43cc8c56bc79c219291d1e5281799dd4 Mon Sep 17 00:00:00 2001 From: Michal Hocko <mhocko@xxxxxxx> Date: Mon, 11 Feb 2013 12:18:36 +0100 Subject: [PATCH] oom: debug skipping killing --- mm/oom_kill.c | 10 +++++++++- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 069b64e..3d759f0 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -329,6 +329,8 @@ static struct task_struct *select_bad_process(unsigned int *ppoints, if (test_tsk_thread_flag(p, TIF_MEMDIE)) { if (unlikely(frozen(p))) thaw_process(p); + printk(KERN_WARNING"XXX: pid:%d (flags:%u) is TIF_MEMDIE. Waiting for it\n", + p->pid, p->flags); return ERR_PTR(-1UL); } if (!p->mm) @@ -353,8 +355,11 @@ static struct task_struct *select_bad_process(unsigned int *ppoints, * then wait for it to finish before killing * some other task unnecessarily. */ - if (!(p->group_leader->ptrace & PT_TRACE_EXIT)) + if (!(p->group_leader->ptrace & PT_TRACE_EXIT)) { + printk(KERN_WARNING"XXX: pid:%d (flags:%u) is PF_EXITING. Waiting for it\n", + p->pid, p->flags); return ERR_PTR(-1UL); + } } } @@ -494,6 +499,7 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, * its children or threads, just set TIF_MEMDIE so it can die quickly */ if (p->flags & PF_EXITING) { + printk(KERN_WARNING"XXX: pid:%d (flags:%u). Not killing PF_EXITING\n", p->pid, p->flags); set_tsk_thread_flag(p, TIF_MEMDIE); return 0; } @@ -567,6 +573,8 @@ void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask) * its memory. */ if (fatal_signal_pending(current)) { + printk(KERN_WARNING"XXX: pid:%d (flags:%u) has fatal_signal_pending. Waiting for it\n", + p->pid, p->flags); set_thread_flag(TIF_MEMDIE); return; } -- 1.7.10.4 -- Michal Hocko SUSE Labs -- To unsubscribe from this list: send the line "unsubscribe cgroups" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html