On Fri, Oct 3, 2014 at 8:37 AM, Michal Hocko <mhocko@xxxxxxx> wrote:
> On Thu 02-10-14 14:04:08, Cong Wang wrote:
>> Hello again,
>>
>> I realized it is actually a series of patches:
>>
>> 3812c8c8f3953921ef18544110dafc3505c1ac62 mm: memcg: do not trap
>> chargers with full callstack on OOM
>> fb2a6fc56be66c169f8b80e07ed999ba453a2db2 mm: memcg: rework and
>> document OOM waiting and wakeup
>> 519e52473ebe9db5cdef44670d5a97f1fd53d721 mm: memcg: enable memcg OOM
>> killer only for user faults
>> 3a13c4d761b4b979ba8767f42345fed3274991b0 x86: finish user fault error
>> path with fatal signal
>> 759496ba6407c6994d6a5ce3a5e74937d7816208 arch: mm: pass userspace
>> fault flag to generic fault handler
>> 871341023c771ad233620b7a1fb3d9c7031c4e5c arch: mm: do not invoke OOM
>> killer on kernel fault OOM
>> 94bce453c78996cc4373d5da6cfabe07fcc6d9f9 arch: mm: remove obsolete
>> init OOM protection
>
> Yes, that looks like the full series.
>
>> I am not sure whether they have more dependencies.
>>
>> However, this bug is *fairly* easy to reproduce on 3.10, just using the
>> following script:
>>
>> #!/bin/bash
>>
>> TEST_DIR=/tmp/cgroup_test
>> [ -d $TEST_DIR ] || mkdir -p $TEST_DIR
>> mount -t cgroup none $TEST_DIR -o memory
>> mkdir $TEST_DIR/test
>> echo 512k > $TEST_DIR/test/memory.limit_in_bytes
>
> This is just insane. You allow only 128 pages to be charged, and the
> reclaim will have to constantly wait for each page to finish
> writeback.

This is a test case used ONLY to reproduce this bug; why does it have to
be sane? :) On the other hand, no matter how insane a test case is, as
long as it triggers hung tasks in the kernel, it is a kernel bug that
needs to be fixed.

>
>> dd if=/dev/zero of=/tmp/oom_test_big_file bs=512 count=20000000 &
>> echo $! > $TEST_DIR/test/tasks
>> rm -f /tmp/oom_test_big_file
>> umount $TEST_DIR
>>
>>
>> Run it like this:
>>
>> for i in `seq 1 1000`; do ./oom_hung.sh ; done
>
> OK, so you will eventually deplete the limit by anon charges if the pid
> makes it into the group sooner than dd allocates its 512B buffer (which
> will end up consuming the full page anyway). So the OOM is pretty much
> unavoidable. All the tasks will have minimal rss, so it is just a
> matter of luck which one gets killed. But this alone shouldn't cause a
> deadlock. Are you really sure this is the same issue discussed in the
> mentioned patch?

Why not? The OOM killer tries to kill a process that is sleeping on a
mutex it already holds; why is that not a deadlock? Given that both
cases show lots of tasks hung on inode mutexes because of OOM, I am 90%
sure they are the same.

>
>> So please consider this seriously. :)
>
> The bug has been there since the memory controller was introduced, yet
> we have only had a single report of it happening in real life, so I do
> not think this is that urgent. It was definitely not a good design
> decision that the OOM killer was invoked on top of unknown locks which
> might prevent forward progress; no question about that. Do you see the
> problem somewhere in real life? Because, to be honest, the test case is
> pretty much insane.

I am sorry for giving you the impression that it was the above test case
that hit this bug. No, we saw this bug in *production* in our data
center; it happened on 30+ machines!! :) The above insane test case is
ONLY meant to draw your attention to how serious the bug is, nothing
else.

BTW, I don't spend my working time debugging problems that don't exist
in the real world; this is a real-world bug, hit in our data center.

Thanks.
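
P.S. In case it helps with reproducing, below is a minimal sketch for
catching the hung tasks while the loop runs. It assumes sysrq
(kernel.sysrq=1) and the hung task detector (CONFIG_DETECT_HUNG_TASK)
are enabled and that it runs as root; adjust to taste.

#!/bin/bash
# Sketch: run the reproducer repeatedly and dump blocked (D state)
# tasks to the kernel log after each run.

# Make the hung task detector report quickly instead of the default 120s.
echo 30 > /proc/sys/kernel/hung_task_timeout_secs

for i in `seq 1 1000`; do
    ./oom_hung.sh
    # Ask the kernel to dump stacks of all uninterruptible tasks.
    echo w > /proc/sysrq-trigger
    # Stop as soon as a D-state task (e.g. dd stuck on an inode mutex)
    # shows up in the recent log.
    if dmesg | tail -n 100 | grep -q ' D '; then
        echo "hung task detected after iteration $i"
        break
    fi
done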