On Sun 25-11-12 13:05:24, Michal Hocko wrote:
> [Adding Kamezawa into CC]
> 
> On Sun 25-11-12 01:10:47, azurIt wrote:
> > >Could you take few snapshots over time?
> > 
> > 
> > Here it is, now from different server, snapshot was taken every
> > second for 10 minutes (hope it's enough):
> > www.watchdog.sk/lkml/memcg-bug-2.tar.gz
> 
> Hmm, interesting:
> $ grep . */memory.failcnt | cut -d: -f2 | awk 'BEGIN{min=666666}{if (prev>0) {diff=$1-prev; if (diff>max) max=diff; if (diff<min) min=diff; sum+=diff; n++} prev=$1}END{printf "min:%d max:%d avg:%f\n", min, max, sum/n}'
> min:16281 max:224048 avg:18818.943119
> 
> So there are a lot of attempts to allocate which fail, every second!
> Will get to that later.
> 
> The number of tasks in the group is stable (20):
> $ for i in *; do ls -d1 $i/[0-9]* | wc -l; done | sort | uniq -c
>     546 20
> 
> And no task has been killed or spawned:
> $ for i in *; do ls -d1 $i/[0-9]* | cut -d/ -f2; done | sort | uniq
> 24495
> 24762
> 24774
> 24796
> 24798
> 24805
> 24813
> 24827
> 24831
> 24841
> 24842
> 24863
> 24892
> 24924
> 24931
> 25130
> 25131
> 25192
> 25193
> 25243
> 
> $ for stack in [0-9]*/[0-9]*
> do
>         head -n1 $stack/stack
> done | sort | uniq -c
>    9841 [<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
>     546 [<ffffffff811109b8>] do_truncate+0x58/0xa0
>     533 [<ffffffffffffffff>] 0xffffffffffffffff
> 
> Tells us that the stacks are pretty much stable.
> $ grep do_truncate -r [0-9]* | cut -d/ -f2 | sort | uniq -c
>     546 24495
> 
> So 24495 is stuck in do_truncate:
> [<ffffffff811109b8>] do_truncate+0x58/0xa0
> [<ffffffff81121c90>] do_last+0x250/0xa30
> [<ffffffff81122547>] path_openat+0xd7/0x440
> [<ffffffff811229c9>] do_filp_open+0x49/0xa0
> [<ffffffff8110f7d6>] do_sys_open+0x106/0x240
> [<ffffffff8110f950>] sys_open+0x20/0x30
> [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
> [<ffffffffffffffff>] 0xffffffffffffffff
> 
> I suspect it is waiting for i_mutex. Who is holding that lock?
> The other tasks are blocked in mem_cgroup_handle_oom, either coming
> from the page fault path (so i_mutex can be excluded) or from
> vfs_write (24796), and the latter one is interesting:
> [<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
> [<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0
> [<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0
> [<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140
> [<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50
> [<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0
> [<ffffffff81193a18>] ext3_write_begin+0x88/0x270
> [<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290
> [<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480
> [<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0   # takes &inode->i_mutex
> [<ffffffff8111156a>] do_sync_write+0xea/0x130
> [<ffffffff81112183>] vfs_write+0xf3/0x1f0
> [<ffffffff81112381>] sys_write+0x51/0x90
> [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
> [<ffffffffffffffff>] 0xffffffffffffffff
> 
> This smells like a deadlock, albeit a strange one. The rapidly
> increasing failcnt suggests that somebody still tries to allocate, but
> who, when all of the tasks hang in mem_cgroup_handle_oom? This can be
> explained, though.
> The memcg OOM killer lets only one process (the one which manages to
> lock the hierarchy via mem_cgroup_oom_lock) call
> mem_cgroup_out_of_memory and kill a process, while the others wait on
> a wait queue. Once the killer is done it calls memcg_wakeup_oom, which
> wakes up the other tasks waiting on the queue. Those retry the charge,
> in the hope that some memory was freed in the meantime. That has not
> happened here, so they get into OOM again (and again and again).
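[To make the serialization described above concrete, here is a
simplified sketch loosely modeled on mem_cgroup_handle_oom() from this
era; it is not the exact kernel code. mem_cgroup_oom_lock,
mem_cgroup_out_of_memory and memcg_wakeup_oom are the real symbols
named above; memcg_oom_waitq and the exact signatures are assumptions,
and the custom wake function of the real waitqueue is elided.]

/*
 * Sketch of the memcg OOM serialization, not the exact kernel code.
 */
static bool memcg_oom_sketch(struct mem_cgroup *memcg, gfp_t gfp_mask)
{
        DEFINE_WAIT(wait);

        prepare_to_wait(&memcg_oom_waitq, &wait, TASK_KILLABLE);

        if (mem_cgroup_oom_lock(memcg)) {
                /* Exactly one task wins the lock and kills a victim. */
                finish_wait(&memcg_oom_waitq, &wait);
                mem_cgroup_out_of_memory(memcg, gfp_mask);
                mem_cgroup_oom_unlock(memcg);
                /* Wake the losers so they can retry their charge. */
                memcg_wakeup_oom(memcg);
        } else {
                /*
                 * Everybody else sleeps until the killer is done. If
                 * the selected victim cannot exit because it waits for
                 * a lock (here: i_mutex) held by one of these
                 * sleepers, no memory is ever freed and every retried
                 * charge lands back here.
                 */
                schedule();
                finish_wait(&memcg_oom_waitq, &wait);
        }

        /* true tells the caller to retry the charge. */
        return true;
}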
> This all usually works out, except that in this particular case I
> would bet my hat that the OOM-selected task is pid 24495, which is
> blocked on the mutex held by one of the OOM killer tasks, so it
> cannot finish and thus cannot free any memory.
> 
> It seems that the current Linus' tree is affected as well.
> 
> I will have to think about a solution, but it sounds really tricky.
> It is not just ext3 that is affected.
> 
> I guess we need to tell mem_cgroup_cache_charge that it should never
> reach OOM from add_to_page_cache_locked. This sounds quite intrusive
> to me. On the other hand it is really weird that an excessive writer
> might trigger the memcg OOM killer.

This is hackish, but it should help you in this case. Kamezawa, what
do you think about it? Should we generalize this and prepare something
like mem_cgroup_cache_charge_locked, which would add __GFP_NORETRY
automatically, and use that function whenever we charge in a locked
context? To be honest I do not like this very much, but nothing more
sensible (without touching non-memcg paths) comes to mind.
---
diff --git a/mm/filemap.c b/mm/filemap.c
index 83efee7..da50c83 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -448,7 +448,7 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
 	VM_BUG_ON(PageSwapBacked(page));
 
 	error = mem_cgroup_cache_charge(page, current->mm,
-					gfp_mask & GFP_RECLAIM_MASK);
+				(gfp_mask | __GFP_NORETRY) & GFP_RECLAIM_MASK);
 	if (error)
 		goto out;

-- 
Michal Hocko
SUSE Labs
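[As a footnote to the generalization floated in the mail: the helper
could be a thin wrapper along the following lines. This is a
hypothetical sketch; mem_cgroup_cache_charge_locked does not exist in
the tree, only the name suggested above is reused.]

/*
 * Hypothetical helper, not in the tree: charge a page cache page from
 * a context that holds locks which the memcg OOM victim might need in
 * order to exit. __GFP_NORETRY makes the charge fail instead of
 * entering the memcg OOM path, so the caller can back out and drop
 * its locks rather than deadlock.
 */
static inline int
mem_cgroup_cache_charge_locked(struct page *page, struct mm_struct *mm,
                               gfp_t gfp_mask)
{
        return mem_cgroup_cache_charge(page, mm,
                                       gfp_mask | __GFP_NORETRY);
}

[add_to_page_cache_locked() would then call this helper with
gfp_mask & GFP_RECLAIM_MASK instead of open-coding the flag as in the
diff; since __GFP_NORETRY is part of GFP_RECLAIM_MASK, the resulting
mask is the same.]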