On Tue, Oct 30, 2012 at 6:27 PM, Minchan Kim <minchan@xxxxxxxxxx> wrote:
> On Tue, Oct 30, 2012 at 06:06:56PM -0700, Luigi Semenzato wrote:
>> On Tue, Oct 30, 2012 at 5:57 PM, Minchan Kim <minchan@xxxxxxxxxx> wrote:
>> > Hi Luigi,
>> >
>> > On Tue, Oct 30, 2012 at 12:12:02PM -0700, Luigi Semenzato wrote:
>> >> On Mon, Oct 29, 2012 at 10:41 PM, David Rientjes <rientjes@xxxxxxxxxx> wrote:
>> >> > On Mon, 29 Oct 2012, Luigi Semenzato wrote:
>> >> >
>> >> >> However, now there is something that worries me more. The trace of
>> >> >> the thread with TIF_MEMDIE set shows that it has executed most of
>> >> >> do_exit() and appears to be waiting to be reaped. From my reading of
>> >> >> the code, this implies that task->exit_state should be non-zero, which
>> >> >> means that select_bad_process should have skipped that thread, which
>> >> >> means that we cannot be in the deadlock situation, and my experiments
>> >> >> are not consistent.
>> >> >>
>> >> >
>> >> > Yeah, this is what I was referring to earlier, select_bad_process() will
>> >> > not consider the thread for which you posted a stack trace for oom kill,
>> >> > so it's not deferring because of it. There are either other thread(s)
>> >> > that have been oom killed and have not yet released their memory or the oom
>> >> > killer is never being called.
>> >>
>> >> Thanks. I now have better information on what's happening.
>> >>
>> >> The "culprit" is not the OOM-killed process (the one with TIF_MEMDIE
>> >> set). It's another process that's exiting for some other reason.
>> >>
>> >> select_bad_process() checks for thread->exit_state at the beginning,
>> >> and skips processes that are exiting. But later it checks for
>> >> p->flags & PF_EXITING, and can return -1 in that case (and it does for
>> >> me).
>> >>
>> >> It turns out that do_exit() does a lot of things between setting the
>> >> thread->flags PF_EXITING bit (in exit_signals()) and setting
>> >> thread->exit_state to non-zero (in exit_notify()).
>> >> Some of those things apparently need memory. I caught one process
>> >> responsible for the ERR_PTR(-1) while it was doing this:
>> >>
>> >> [ 191.859358] VC manager R running 0 2388 1108 0x00000104
>> >> [ 191.859377] err_ptr_count = 45623
>> >> [ 191.859384] e0611b1c 00200086 f5608000 815ecd20 815ecd20 a0a9ebc3 0000002c f67cfd20
>> >> [ 191.859407] f430a060 81191c34 e0611aec 81196d79 4168ef20 00000001 e1302400 e130264c
>> >> [ 191.859428] e1302400 e0611af4 813b71d5 e0611b00 810b42f1 e1302400 e0611b0c 810b430e
>> >> [ 191.859450] Call Trace:
>> >> [ 191.859465] [<81191c34>] ? __delay+0xe/0x10
>> >> [ 191.859478] [<81196d79>] ? do_raw_spin_lock+0xa2/0xf3
>> >> [ 191.859491] [<813b71d5>] ? _raw_spin_unlock+0xd/0xf
>> >> [ 191.859504] [<810b42f1>] ? put_super+0x26/0x29
>> >> [ 191.859515] [<810b430e>] ? drop_super+0x1a/0x1d
>> >> [ 191.859527] [<8104512d>] __cond_resched+0x1b/0x2b
>> >> [ 191.859537] [<813b67a7>] _cond_resched+0x18/0x21
>> >> [ 191.859549] [<81093940>] shrink_slab+0x224/0x22f
>> >> [ 191.859562] [<81095a96>] try_to_free_pages+0x1b7/0x2e6
>> >> [ 191.859574] [<8108df2a>] __alloc_pages_nodemask+0x40a/0x61f
>> >> [ 191.859588] [<810a9dbe>] read_swap_cache_async+0x4a/0xcf
>> >> [ 191.859600] [<810a9ea4>] swapin_readahead+0x61/0x8d
>> >> [ 191.859612] [<8109fff4>] handle_pte_fault+0x310/0x5fb
>> >> [ 191.859624] [<810a0420>] handle_mm_fault+0xae/0xbd
>> >> [ 191.859637] [<8101d0f9>] do_page_fault+0x265/0x284
>> >> [ 191.859648] [<8104aa17>] ? dequeue_entity+0x236/0x252
>> >> [ 191.859660] [<8101ce94>] ? vmalloc_sync_all+0xa/0xa
>> >> [ 191.859672] [<813b7887>] error_code+0x67/0x6c
>> >> [ 191.859683] [<81191d21>] ? __get_user_4+0x11/0x17
>> >> [ 191.859695] [<81059f28>] ? exit_robust_list+0x30/0x105
>> >> [ 191.859707] [<813b71b0>] ? _raw_spin_unlock_irq+0xd/0x10
>> >> [ 191.859718] [<810446d5>] ? finish_task_switch+0x53/0x89
>> >> [ 191.859730] [<8102351d>] mm_release+0x1d/0xc3
>> >> [ 191.859740] [<81026ce9>] exit_mm+0x1d/0xe9
>> >> [ 191.859750] [<81032b87>] ? exit_signals+0x57/0x10a
>> >> [ 191.859760] [<81028082>] do_exit+0x19b/0x640
>> >> [ 191.859770] [<81058598>] ? futex_wait_queue_me+0xaa/0xbe
>> >> [ 191.859781] [<81030bbf>] ? recalc_sigpending_tsk+0x51/0x5c
>> >> [ 191.859793] [<81030beb>] ? recalc_sigpending+0x17/0x3e
>> >> [ 191.859803] [<81028752>] do_group_exit+0x63/0x86
>> >> [ 191.859813] [<81032b19>] get_signal_to_deliver+0x434/0x44b
>> >> [ 191.859825] [<81001e01>] do_signal+0x37/0x4fe
>> >> [ 191.859837] [<81048eed>] ? set_next_entity+0x36/0x9d
>> >> [ 191.859850] [<81050d8e>] ? timekeeping_get_ns+0x11/0x55
>> >> [ 191.859861] [<8105a754>] ? sys_futex+0xcb/0xdb
>> >> [ 191.859871] [<810024a7>] do_notify_resume+0x26/0x65
>> >> [ 191.859883] [<813b73a5>] work_notifysig+0xa/0x11
>> >> [ 191.859893] Kernel panic - not syncing: too many ERR_PTR
>> >>
>> >> I don't know why mm_release() would page fault, but it looks like it does.
>> >>
>> >> So the OOM killer will not kill other processes because it thinks a
>> >> process is exiting, which will free up memory. But the exiting
>> >> process needs memory to continue exiting --> deadlock. Sounds
>> >> plausible?
>> >
>> > It sounds right in your kernel, but the principal problem is the
>> > min_filelist_kbytes patch.
>> > If a normally exiting process needs a page in its exit path and there is
>> > no free page left, it ends up in the OOM path after trying to reclaim
>> > memory several times.
>> > Then,
>> > In select_bad_process,
>> >
>> >         if (task->flags & PF_EXITING) {
>> >                 if (task == current) <== true
>> >                         return OOM_SCAN_SELECT;
>> > In oom_kill_process,
>> >
>> >         if (p->flags & PF_EXITING)
>> >                 set_tsk_thread_flag(p, TIF_MEMDIE);
>> >
>> > At last, the normally exiting process would get a free page.
>> >
>> > But in your kernel that does not seem to happen, because I guess
>> > did_some_progress in __alloc_pages_direct_reclaim is never 0. The reason
>> > it is never 0 is that do_try_to_free_pages's all_unreclaimable check
>> > can't do its job because of your min_filelist_kbytes. That makes
>> > __alloc_pages_slowpath loop forever.
>> >
>> > Sounds plausible?
>>
>> Thank you Minchan, it does sound plausible, but I have little
>> experience with this and it will take some work to confirm.
>
> No problem :)
>
>>
>> I looked at the patch pretty carefully once, and I had the impression
>> its effect could be fully analyzed by logical reasoning. I will check
>> this again tomorrow, perhaps I can run some experiments. I am adding
>> Mandeep who wrote the patch.
>>
>> However, we have worse problems if we don't use that patch. Without
>> the patch, and either with or without compressed swap, the same load
>> causes horrible thrashing, with the system appearing to hang for
>> minutes. If we don't use that patch, do you have any suggestion on
>> how to improve the code thrash situation?
>
> As I said, the motivation for the patch is good for embedded systems, but
> the patch's implementation is kind of buggy. I will have a look and post if
> I'm lucky enough to find the time.
>
> BTW, a question.
>
> How do you find the proper value for min_filelist_kbytes?
> Just by experimenting with several trials?
>
> Thanks.

Yes. Mandeep can give more detail, but, as I understand this, the
value we use (50 MB) was based on experimentation. It helps that at
the moment we run Chrome OS on a relatively uniform set of devices,
with either 2 or 4 GB of RAM, no swap, binaries stored on SSD (for
backing store of text pages), and the same load (the Chrome browser).

>>
>> Thanks again!
>>
>> >>
>> >> OK, now someone is going to fix this, right? :-)
>
> --
> Kind regards,
> Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>
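
The window discussed in this thread (PF_EXITING already set, exit_state
still zero) can be modeled in a few lines of plain C. This is only a
sketch of the reasoning, not kernel code: struct sketch_task, the
SKETCH_* names, the needs_page field, and both functions are
hypothetical stand-ins invented for illustration, and the abort result
stands for the "return -1" case described above.

```c
#include <assert.h>   /* only needed for the checks in the usage note */
#include <stdbool.h>

/* Hypothetical flag value; stands in for the kernel's PF_EXITING bit. */
#define SKETCH_PF_EXITING 0x4u

/* Hypothetical outcomes of examining one task during the OOM scan. */
enum sketch_scan_result {
    SKETCH_SCAN_SKIP,      /* exit_state is non-zero: task is ignored */
    SKETCH_SCAN_CONTINUE,  /* ordinary task: keep it as a candidate */
    SKETCH_SCAN_ABORT,     /* another task is exiting: defer the kill */
    SKETCH_SCAN_SELECT     /* the exiting task is the one allocating */
};

/* Hypothetical, heavily reduced view of a task. */
struct sketch_task {
    unsigned int flags;  /* SKETCH_PF_EXITING is set early in the exit path */
    int exit_state;      /* becomes non-zero only late in the exit path */
    bool needs_page;     /* assumption: task must allocate before it can finish exiting */
};

/* Models the two checks described in the thread, in the same order. */
enum sketch_scan_result sketch_scan(const struct sketch_task *t,
                                    const struct sketch_task *current_task)
{
    if (t->exit_state)
        return SKETCH_SCAN_SKIP;
    if (t->flags & SKETCH_PF_EXITING) {
        if (t == current_task)
            return SKETCH_SCAN_SELECT;
        return SKETCH_SCAN_ABORT;
    }
    return SKETCH_SCAN_CONTINUE;
}

/*
 * True when the scan defers to an exiting task that itself cannot make
 * progress without memory, which is the suspected deadlock.
 */
bool sketch_would_stall(const struct sketch_task *t,
                        const struct sketch_task *current_task)
{
    return sketch_scan(t, current_task) == SKETCH_SCAN_ABORT && t->needs_page;
}
```

With a task that has SKETCH_PF_EXITING set, exit_state == 0 and
needs_page == true, sketch_would_stall() is true when some other task
runs the scan, and false when the exiting task itself is the one
allocating, which matches the two cases contrasted in the thread.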