On Fri 08-02-13 06:03:04, azurIt wrote:
> Michal, thank you very much but it just didn't work and broke
> everything :(

I am sorry to hear that. The patch should help to solve the deadlock you
have seen earlier. It can in no way solve the side effects of failing
writes, and it cannot help much if the OOM is permanent.

> This happened:
> Problem started to occur really often immediately after booting the
> new kernel, every few minutes for one of my users. But everything
> other seems to work fine so i gave it a try for a day (which was a
> mistake). I grabbed some data for you and go to sleep:
> http://watchdog.sk/lkml/memcg-bug-4.tar.gz

Do you have logs from that time period?

I have only glanced through the stacks, and most of the threads are
waiting in mem_cgroup_handle_oom (mostly from the page fault path, where
we have no option other than waiting), which suggests that your memory
limit is seriously underestimated. If you look at the number of charging
failures (the per-group memory.failcnt file), you will see 9332083
failures per group on _average_. This is a lot!
Not all of those failures end in OOM, of course, but it clearly signals
that the workload needs much more memory than the limit allows.

> Few hours later i was woke up from my sweet sweet dreams by alerts
> smses - Apache wasn't working and our system failed to restart
> it. When i observed the situation, two apache processes (of that user
> as above) were still running and it wasn't possible to kill them by
> any way. I grabbed some data for you:
> http://watchdog.sk/lkml/memcg-bug-5.tar.gz

There are only 5 groups in this one, and none of them has any memory
charged (so no OOM is going on). All tasks are somewhere in the ptrace
code.

grep cache -r .
./1360297489/memory.stat:cache 0
./1360297489/memory.stat:total_cache 65642496
./1360297491/memory.stat:cache 0
./1360297491/memory.stat:total_cache 65642496
./1360297492/memory.stat:cache 0
./1360297492/memory.stat:total_cache 65642496
./1360297490/memory.stat:cache 0
./1360297490/memory.stat:total_cache 65642496
./1360297488/memory.stat:cache 0
./1360297488/memory.stat:total_cache 65642496

which suggests that this is a parent group and the memory is charged in
a child group. I guess that all of them are under OOM, as the number
looks like they have a limit of about 62M.

> Then I logged to the console and this was waiting for me:
> http://watchdog.sk/lkml/error.jpg

This is just a warning and it should be harmless. There is just one WARN
in ptrace_check_attach:
	WARN_ON_ONCE(task_is_stopped(child))

This was introduced by
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=321fb561
and the commit description claims this shouldn't happen. I am not
familiar with this code, but it sounds like a bug in the tracing code
which is not related to the issue under discussion.

> Finally i rebooted into different kernel, wrote this e-mail and go to
> my lovely bed ;)

-- 
Michal Hocko
SUSE Labs
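
For reference, the cache/total_cache comparison from the grep output above
can be sketched in a few lines of Python. This is only an illustration: the
helper names (parse_memory_stat, charged_only_in_children, to_mib) are made
up here, the sample text just mirrors the two values shown by grep, and the
check looks at page cache only, not at rss or other counters.

```python
def parse_memory_stat(text):
    """Parse the 'key value' lines of a memory.stat file into a dict."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        # Keep only well-formed "name <unsigned integer>" lines.
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def charged_only_in_children(stats):
    """True if the group itself holds no page cache but its subtree does.

    total_cache covers the group plus its descendants, so cache == 0 with
    total_cache > 0 means the charge sits in a child group.
    """
    return stats.get("cache", 0) == 0 and stats.get("total_cache", 0) > 0


def to_mib(nbytes):
    """Convert a byte count to MiB."""
    return nbytes / (1024 * 1024)


# Sample mirroring one snapshot from the grep output (values in bytes).
sample = "cache 0\ntotal_cache 65642496\n"
stats = parse_memory_stat(sample)
print(charged_only_in_children(stats))          # True
print(round(to_mib(stats["total_cache"]), 1))   # 62.6
```

The 62.6 MiB result is where the "about 62M" guess for the limit comes from.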