On 12/13/2013 09:49 AM, Bob Liu wrote:
> On 12/13/2013 05:05 AM, Sasha Levin wrote:
>> On 12/12/2013 07:41 AM, Vlastimil Babka wrote:
>>> On 12/12/2013 06:03 AM, Bob Liu wrote:
>>>>
>>>> On 12/12/2013 11:16 AM, Sasha Levin wrote:
>>>>> On 12/11/2013 05:59 PM, Vlastimil Babka wrote:
>>>>>> On 12/09/2013 09:26 PM, Sasha Levin wrote:
>>>>>>> On 12/09/2013 12:12 PM, Vlastimil Babka wrote:
>>>>>>>> On 12/09/2013 06:05 PM, Sasha Levin wrote:
>>>>>>>>> On 12/09/2013 04:34 AM, Vlastimil Babka wrote:
>>>>>>>>>> Hello, I will look at it, thanks.
>>>>>>>>>> Do you have specific reproduction instructions?
>>>>>>>>>
>>>>>>>>> Not really, the fuzzer hit it once and I've been unable to trigger
>>>>>>>>> it again. Looking at the piece of code involved, it might have had
>>>>>>>>> something to do with hugetlbfs, so I'll crank up testing on that
>>>>>>>>> part.
>>>>>>>>
>>>>>>>> Thanks. Do you have the trinity log and the .config file? I'm
>>>>>>>> currently unable to even boot linux-next with my config/setup due
>>>>>>>> to a GPF.
>>>>>>>> Looking at the code, I wouldn't expect it to encounter a tail page
>>>>>>>> without first encountering a head page and skipping the whole huge
>>>>>>>> page. At least in the THP case, as THP pages should be split when a
>>>>>>>> vma is split. As for hugetlbfs, it should be skipped for
>>>>>>>> mlock/munlock operations completely. One of these assumptions is
>>>>>>>> probably failing here...
>>>>>>>
>>>>>>> If it helps, I've added a dump_page() in case we hit a tail page
>>>>>>> there and got:
>>>>>>>
>>>>>>> [ 980.172299] page:ffffea003e5e8040 count:0 mapcount:1 mapping: (null) index:0x0
>>>>>>> [ 980.173412] page flags: 0x2fffff80008000(tail)
>>>>>>>
>>>>>>> I can also add anything else in there to get other debug output if
>>>>>>> you think of something else useful.
>>>>>>
>>>>>> Please try the following. Thanks in advance.
>>>>>
>>>>> [ 428.499889] page:ffffea003e5c0040 count:0 mapcount:4 mapping: (null) index:0x0
>>>>> [ 428.499889] page flags: 0x2fffff80008000(tail)
>>>>> [ 428.499889] start=140117131923456 pfn=16347137 orig_start=140117130543104 page_increm=1
>>>>> vm_start=140117130543104 vm_end=140117134688256 vm_flags=135266419
>>>>> [ 428.499889] first_page pfn=16347136
>>>>> [ 428.499889] page:ffffea003e5c0000 count:204 mapcount:44 mapping:ffff880fb5c466c1 index:0x7f6f8fe00
>>>>> [ 428.499889] page flags: 0x2fffff80084068(uptodate|lru|active|head|swapbacked)
>>>>
>>>> From this print, it looks like the page is still a huge page.
>>>> One situation I can guess is a huge page which isn't PageMlocked being
>>>> passed to munlock_vma_page(). I'm not sure whether this can happen.
>>>
>>> Yes, that's quite likely the case. I would say it's not illegal for it
>>> to happen.
>>>
>>>> Please give this patch a try.
>>>
>>> I've made a simpler version that does away with the ugly page_mask
>>> thing completely. Please try that as well. Thanks.
>>>
>>> Also, while working on this I think I found another potential but much
>>> rarer problem, where munlock_vma_page() races with a THP split. That
>>> would however manifest such that some of the former tail pages would
>>> stay PageMlocked. But that still needs more thought. The bug at hand
>>> should, however, be fixed by this patch.
>>
>> Yup, this patch seems to fix the issue previously reported.
>>
>> However, I'll piggyback another thing that popped up now that the VM
>> could run for a while, which also seems to be caused by the original
>> patch.
>> It looks like a pretty straightforward deadlock, but

Sigh, put one down, patch it around... :)

> Looks like put_page() in __munlock_pagevec() needs to take the
> zone->lru_lock, which is already held when entering __munlock_pagevec().

I've come to the same conclusion, however:

> How about a fix like this?

That unfortunately removes most of the purpose of this function, which was
to avoid repeated locking. Please try this patch.

-------8<-------
From: Vlastimil Babka <vbabka@xxxxxxx>
Date: Fri, 13 Dec 2013 10:03:25 +0100
Subject: [PATCH] Deadlock in __munlock_pagevec candidate fix

---
 mm/mlock.c | 16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/mm/mlock.c b/mm/mlock.c
index a34dfdc..c97273e 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -281,10 +281,12 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
 {
         int i;
         int nr = pagevec_count(pvec);
-        int delta_munlocked = -nr;
+        int delta_munlocked;
         struct pagevec pvec_putback;
         int pgrescued = 0;
 
+        pagevec_init(&pvec_putback, 0);
+
         /* Phase 1: page isolation */
         spin_lock_irq(&zone->lru_lock);
         for (i = 0; i < nr; i++) {
@@ -313,16 +315,22 @@ skip_munlock:
                         /*
                          * We won't be munlocking this page in the next phase
                          * but we still need to release the follow_page_mask()
-                         * pin.
+                         * pin. We cannot do it under lru_lock however. If it's
+                         * the last pin, __page_cache_release would deadlock.
                          */
+                        pagevec_add(&pvec_putback, pvec->pages[i]);
                         pvec->pages[i] = NULL;
-                        put_page(page);
-                        delta_munlocked++;
                 }
         }
+        delta_munlocked = -nr + pagevec_count(&pvec_putback);
         __mod_zone_page_state(zone, NR_MLOCK, delta_munlocked);
         spin_unlock_irq(&zone->lru_lock);
 
+        /* Now we can release pins of pages that we are not munlocking */
+        for (i = 0; i < pagevec_count(&pvec_putback); i++) {
+                put_page(pvec_putback.pages[i]);
+        }
+
         /* Phase 2: page munlock */
         pagevec_init(&pvec_putback, 0);
         for (i = 0; i < nr; i++) {
-- 
1.8.4
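
As a side note for readers following along: below is a minimal, self-contained
user-space sketch of the locking rule the fix depends on; it is illustrative
only and not part of the patch above. The point being modeled is that dropping
the last pin on a page may itself need the zone's lru_lock (as the patch
comment notes for __page_cache_release), so put_page() must never be called
while that lock is already held; instead, the pages are remembered and their
pins dropped after unlocking. The names fake_zone, fake_page, fake_put(),
fake_release() and drain_batch() are invented here for the example.

/*
 * Illustrative sketch only; build with: cc -pthread sketch.c
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct fake_zone {
        pthread_mutex_t lru_lock;       /* stands in for zone->lru_lock */
};

struct fake_page {
        int refcount;
        struct fake_zone *zone;
};

/* Final release needs the zone lock, like __page_cache_release(). */
static void fake_release(struct fake_page *page)
{
        pthread_mutex_lock(&page->zone->lru_lock);
        printf("released %p under lru_lock\n", (void *)page);
        pthread_mutex_unlock(&page->zone->lru_lock);
        free(page);
}

/* Would deadlock if called with lru_lock held and the refcount hitting zero. */
static void fake_put(struct fake_page *page)
{
        if (--page->refcount == 0)
                fake_release(page);
}

/* Mirrors the shape of the patched __munlock_pagevec(). */
static void drain_batch(struct fake_zone *zone, struct fake_page **batch, int nr)
{
        struct fake_page *putback[nr > 0 ? nr : 1];
        int nputback = 0;
        int i;

        pthread_mutex_lock(&zone->lru_lock);
        for (i = 0; i < nr; i++) {
                /*
                 * Phase 1 work would go here; pages we are not going to
                 * munlock are only queued for a later put, never put while
                 * the lock is held.
                 */
                putback[nputback++] = batch[i];
                batch[i] = NULL;
        }
        pthread_mutex_unlock(&zone->lru_lock);

        /* Safe now: a final reference drop may retake lru_lock. */
        for (i = 0; i < nputback; i++)
                fake_put(putback[i]);
}

int main(void)
{
        struct fake_zone zone;
        struct fake_page *batch[2];
        int i;

        pthread_mutex_init(&zone.lru_lock, NULL);
        for (i = 0; i < 2; i++) {
                batch[i] = malloc(sizeof(**batch));
                batch[i]->refcount = 1;
                batch[i]->zone = &zone;
        }
        drain_batch(&zone, batch, 2);
        return 0;
}

The patch above has the same shape: pages that will not be munlocked are
queued on pvec_putback while zone->lru_lock is held, and their
follow_page_mask() pins are dropped only after spin_unlock_irq().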