Hi azur,

On Wed, Oct 09, 2013 at 08:44:50PM +0200, azurIt wrote:
> Joahnnes,
>
> i'm very sorry to say it but today something strange happened.. :) i
> was just right at the computer so i noticed it almost immediately but
> i don't have much info. Server stoped to respond from the net but i
> was already logged on ssh which was working quite fine (only a little
> slow). I was able to run commands on shell but i didn't do much
> because i was afraid that it will goes down for good soon. I noticed
> few things:
>  - htop was strange because all CPUs were doing nothing (totally
>    nothing)
>  - there were enough of free memory
>  - server load was about 90 and was raising slowly
>  - i didn't see ANY process in 'run' state
>  - i also didn't see any process with strange behavior (taking much
>    CPU, memory or so) so it wasn't obvious what to do to fix it
>  - i started to kill Apache processes, everytime i killed some, CPUs
>    did some work, but it wasn't fixing the problem
>  - finally i did 'skill -kill apache2' in shell and everything started
>    to work
>  - server monitoring wasn't sending any data so i have no graphs
>  - nothing interesting in logs
>
> I will send more info when i get some.

Somebody else reported a problem on the upstream patches as well. Any
chance you can confirm the stacks of the active but not running tasks?

It sounds like they are stuck on a waitqueue, the question is which
one. I forgot to disable OOM for __GFP_NOFAIL allocations, so they
could succeed and leak an OOM context. task structs are not
reinitialized between alloc & free so a different task could later try
to oom trylock a memcg that has been freed, fail, and wait indefinitely
on the OOM waitqueue. There might be a simpler explanation but I can't
think of anything right now.
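To make the suspected sequence concrete, here is a rough user-space
model of it. This is only a sketch: struct toy_task, struct toy_memcg,
TOY_GFP_NOFAIL, toy_try_charge() and toy_context_leaked() are all
made-up names for illustration, not kernel interfaces, and the model
assumes a charge that always hits the limit. The `fixed` flag stands in
for the "disable OOM for __GFP_NOFAIL" change.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are NOT the kernel's real types. */
#define TOY_GFP_NOFAIL 0x1u

struct toy_memcg {
	int id;
};

struct toy_task {
	struct toy_memcg *oom_memcg;	/* pending OOM context, NULL if none */
};

/*
 * Model of a charge attempt against a memcg that is over its limit.
 * With oom enabled, the failing charge records an OOM context on the
 * task. A NOFAIL charge is then let through anyway, so it "succeeds".
 * Returns true if the charge succeeded.
 */
static bool toy_try_charge(struct toy_task *t, struct toy_memcg *cg,
			   unsigned int flags, bool fixed)
{
	bool oom = true;

	assert(t && cg);

	/* Assumption: this mirrors the intent of the proposed change. */
	if (fixed && (flags & TOY_GFP_NOFAIL))
		oom = false;

	if (oom)
		t->oom_memcg = cg;	/* context set up for later handling */

	return (flags & TOY_GFP_NOFAIL) != 0;
}

/*
 * A context is leaked when the charge succeeded (so nobody goes on to
 * handle the OOM) but the task still carries the context.
 */
static bool toy_context_leaked(const struct toy_task *t, bool charged)
{
	return charged && t->oom_memcg != NULL;
}
```

In this model, a TOY_GFP_NOFAIL charge with fixed == false leaves
oom_memcg set even though the charge went through, while the same
charge with fixed == true leaves it NULL. The stale pointer is what a
later user of the recycled task struct would trip over.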
But the OOM context is definitely being leaked, so please apply the
following for your next reboot:

---
 mm/memcontrol.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 5aee2fa..83ad39b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2341,6 +2341,9 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
 	 */
 	if (!*ptr && !mm)
 		goto bypass;
+
+	if (gfp_mask & __GFP_NOFAIL)
+		oom = false;
 again:
 	if (*ptr) { /* css should be a valid one */
 		memcg = *ptr;
--
1.8.4