On Fri 08-02-13 14:56:16, azurIt wrote:
> > kernel log would be sufficient.
> 
> Full kernel log from kernel with your newest patch:
> http://watchdog.sk/lkml/kern2.log

OK, so the log says that there is a little slaughter in your yard:

$ grep "Memory cgroup out of memory:" kern2.log | wc -l
220
$ grep "Memory cgroup out of memory:" kern2.log | sed 's@.*Kill process \([0-9]*\) .*@\1@' | sort -u | wc -l
220

This means that the OOM killer never tried to kill the same task more
than once, which is good because it tells us that each killed task
managed to die before we triggered the OOM killer again. So this is
definitely not a deadlock. You are just hitting OOM very often.

$ grep "killed as a result of limit" kern2.log | sed 's@.*\] @@' | sort | uniq -c | sort -k1 -n
      1 Task in /1091/uid killed as a result of limit of /1091
      1 Task in /1223/uid killed as a result of limit of /1223
      1 Task in /1229/uid killed as a result of limit of /1229
      1 Task in /1255/uid killed as a result of limit of /1255
      1 Task in /1424/uid killed as a result of limit of /1424
      1 Task in /1470/uid killed as a result of limit of /1470
      1 Task in /1567/uid killed as a result of limit of /1567
      2 Task in /1080/uid killed as a result of limit of /1080
      3 Task in /1381/uid killed as a result of limit of /1381
      4 Task in /1185/uid killed as a result of limit of /1185
      4 Task in /1289/uid killed as a result of limit of /1289
      4 Task in /1709/uid killed as a result of limit of /1709
      5 Task in /1279/uid killed as a result of limit of /1279
      6 Task in /1020/uid killed as a result of limit of /1020
      6 Task in /1527/uid killed as a result of limit of /1527
      9 Task in /1388/uid killed as a result of limit of /1388
     17 Task in /1281/uid killed as a result of limit of /1281
     22 Task in /1599/uid killed as a result of limit of /1599
     30 Task in /1155/uid killed as a result of limit of /1155
     31 Task in /1258/uid killed as a result of limit of /1258
     71 Task in /1293/uid killed as a result of limit of /1293

So the group 1293 suffers the most. I would check how much memory the
workload in that group really needs, because this level of OOM cannot
possibly be healthy.

The log also says that the deadlock prevention implemented by the patch
triggered and some writes really did fail due to a potential OOM:

$ grep "If this message shows up" kern2.log
Feb 8 01:17:10 server01 kernel: [ 431.033593] __mem_cgroup_try_charge: task:apache2 pid:6733 got ENOMEM without OOM for memcg:ffff8803807d5600. If this message shows up very often for the same task then there is a risk that the process is not able to make any progress because of the current limit. Try to enlarge the hard limit.
Feb 8 01:22:52 server01 kernel: [ 773.556782] __mem_cgroup_try_charge: task:apache2 pid:12092 got ENOMEM without OOM for memcg:ffff8803807d5600. If this message shows up very often for the same task then there is a risk that the process is not able to make any progress because of the current limit. Try to enlarge the hard limit.
Feb 8 01:22:52 server01 kernel: [ 773.567916] __mem_cgroup_try_charge: task:apache2 pid:12093 got ENOMEM without OOM for memcg:ffff8803807d5600. If this message shows up very often for the same task then there is a risk that the process is not able to make any progress because of the current limit. Try to enlarge the hard limit.
Feb 8 01:29:00 server01 kernel: [ 1141.355693] __mem_cgroup_try_charge: task:apache2 pid:17734 got ENOMEM without OOM for memcg:ffff88036e956e00. If this message shows up very often for the same task then there is a risk that the process is not able to make any progress because of the current limit. Try to enlarge the hard limit.
Feb 8 03:30:39 server01 kernel: [ 8440.346811] __mem_cgroup_try_charge: task:apache2 pid:8687 got ENOMEM without OOM for memcg:ffff8803654d6e00. If this message shows up very often for the same task then there is a risk that the process is not able to make any progress because of the current limit. Try to enlarge the hard limit.
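If you want to keep an eye on that, a small pipeline in the same spirit
as the ones above (untested, and it assumes the message format stays
exactly as shown) will tell you whether any single task keeps hitting
that path:

$ grep "got ENOMEM without OOM" kern2.log | sed 's@.*task:\([^ ]*\) pid:\([0-9]*\).*@\1 pid:\2@' | sort | uniq -c | sort -k1 -n

In this log every pid shows up exactly once, so no task seems to be
stuck retrying behind its limit.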
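And when a group turns out to be genuinely short on memory, the usual
memcg knobs apply. A sketch only; it assumes the memcg hierarchy is
mounted at /sys/fs/cgroup/memory with the groups directly under it, and
512M is just an example value, so adjust both to your setup:

$ cat /sys/fs/cgroup/memory/1293/memory.max_usage_in_bytes   # peak usage observed so far
$ cat /sys/fs/cgroup/memory/1293/memory.limit_in_bytes       # current hard limit
$ echo 512M > /sys/fs/cgroup/memory/1293/memory.limit_in_bytes   # example: raise the hard limit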
This doesn't look particularly unhealthy. I had expected that writes
would fail more often, but it seems that the biggest memory pressure
comes from mmaps and page faults, which have no fallback other than
OOM. So my suggestion would be to reconsider the limits for the groups
to provide a more realistic environment.
-- 
Michal Hocko
SUSE Labs