On 12/13/2013 05:05 AM, Sasha Levin wrote:
> On 12/12/2013 07:41 AM, Vlastimil Babka wrote:
>> On 12/12/2013 06:03 AM, Bob Liu wrote:
>>>
>>> On 12/12/2013 11:16 AM, Sasha Levin wrote:
>>>> On 12/11/2013 05:59 PM, Vlastimil Babka wrote:
>>>>> On 12/09/2013 09:26 PM, Sasha Levin wrote:
>>>>>> On 12/09/2013 12:12 PM, Vlastimil Babka wrote:
>>>>>>> On 12/09/2013 06:05 PM, Sasha Levin wrote:
>>>>>>>> On 12/09/2013 04:34 AM, Vlastimil Babka wrote:
>>>>>>>>> Hello, I will look at it, thanks.
>>>>>>>>> Do you have specific reproduction instructions?
>>>>>>>>
>>>>>>>> Not really, the fuzzer hit it once and I've been unable to trigger it
>>>>>>>> again. Looking at the piece of code involved, it might have had
>>>>>>>> something to do with hugetlbfs, so I'll crank up testing on that part.
>>>>>>>
>>>>>>> Thanks. Do you have the trinity log and the .config file? I'm currently
>>>>>>> unable to even boot linux-next with my config/setup due to a GPF.
>>>>>>> Looking at the code, I wouldn't expect it to encounter a tail page
>>>>>>> without first encountering a head page and skipping the whole huge
>>>>>>> page. At least in the THP case, as THP pages should be split when a
>>>>>>> vma is split. As for hugetlbfs, it should be skipped for mlock/munlock
>>>>>>> operations completely. One of these assumptions is probably failing
>>>>>>> here...
>>>>>>
>>>>>> If it helps, I've added a dump_page() in case we hit a tail page there
>>>>>> and got:
>>>>>>
>>>>>> [  980.172299] page:ffffea003e5e8040 count:0 mapcount:1 mapping: (null) index:0x0
>>>>>> [  980.173412] page flags: 0x2fffff80008000(tail)
>>>>>>
>>>>>> I can also add anything else in there to get other debug output if you
>>>>>> think of something else useful.
>>>>>
>>>>> Please try the following. Thanks in advance.
>>>>
>>>> [  428.499889] page:ffffea003e5c0040 count:0 mapcount:4 mapping: (null) index:0x0
>>>> [  428.499889] page flags: 0x2fffff80008000(tail)
>>>> [  428.499889] start=140117131923456 pfn=16347137 orig_start=140117130543104 page_increm=1 vm_start=140117130543104 vm_end=140117134688256 vm_flags=135266419
>>>> [  428.499889] first_page pfn=16347136
>>>> [  428.499889] page:ffffea003e5c0000 count:204 mapcount:44 mapping:ffff880fb5c466c1 index:0x7f6f8fe00
>>>> [  428.499889] page flags: 0x2fffff80084068(uptodate|lru|active|head|swapbacked)
>>>
>>> From this print, it looks like the page is still a huge page.
>>> One situation I can imagine is a huge page which isn't PageMlocked being
>>> passed to munlock_vma_page(). I'm not sure whether this can happen.
>>
>> Yes, that's quite likely the case. It's not illegal for that to happen,
>> I would say.
>>
>>> Please give this patch a try.
>>
>> I've made a simpler version that does away with the ugly page_mask thing
>> completely. Please try that as well. Thanks.
>>
>> Also, while working on this I think I found another potential but much
>> rarer problem, where munlock_vma_page() races with a THP split. That
>> would however manifest as part of the former tail pages staying
>> PageMlocked. But that still needs more thought. The bug at hand should
>> however be fixed by this patch.
>
> Yup, this patch seems to fix the issue previously reported.
>
> However, I'll piggyback another thing that popped up now that the vm can
> run for a while, which also seems to be caused by the original patch.
> It looks like a pretty straightforward deadlock, but

Looks like put_page() in __munlock_pagevec() needs to take zone->lru_lock,
which is already held when entering __munlock_pagevec().

How about a fix like this?

Thanks,
-Bob

diff --git a/mm/mlock.c b/mm/mlock.c
index d480cd6..5880d63 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -291,7 +291,6 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
 	int pgrescued = 0;
 
 	/* Phase 1: page isolation */
-	spin_lock_irq(&zone->lru_lock);
 	for (i = 0; i < nr; i++) {
 		struct page *page = pvec->pages[i];
 
@@ -300,6 +299,7 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
 			int lru;
 
 			if (PageLRU(page)) {
+				spin_lock_irq(&zone->lru_lock);
 				lruvec = mem_cgroup_page_lruvec(page, zone);
 				lru = page_lru(page);
 				/*
@@ -308,6 +308,7 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
 				 */
 				ClearPageLRU(page);
 				del_page_from_lru_list(page, lruvec, lru);
+				spin_unlock_irq(&zone->lru_lock);
 			} else {
 				__munlock_isolation_failed(page);
 				goto skip_munlock;
@@ -325,8 +326,7 @@ skip_munlock:
 			delta_munlocked++;
 		}
 	}
-	__mod_zone_page_state(zone, NR_MLOCK, delta_munlocked);
-	spin_unlock_irq(&zone->lru_lock);
+	mod_zone_page_state(zone, NR_MLOCK, delta_munlocked);
 
 	/* Phase 2: page munlock */
 	pagevec_init(&pvec_putback, 0);
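For context on why this self-deadlocks: zone->lru_lock is a plain (non-recursive)
spinlock, and put_page() on the last reference ends up taking it again. Below is
a rough sketch of the put_page() slow path as I understand it from mm/swap.c of
this era; it is paraphrased from memory rather than a verbatim excerpt, so treat
the details as an assumption, not a verified trace:

/* Simplified sketch, not verbatim kernel code. */
void put_page(struct page *page)
{
	if (unlikely(PageCompound(page)))
		put_compound_page(page);
	else if (put_page_testzero(page))	/* refcount dropped to zero */
		__put_single_page(page);
}

static void __put_single_page(struct page *page)
{
	__page_cache_release(page);		/* may take zone->lru_lock */
	free_hot_cold_page(page, 0);
}

static void __page_cache_release(struct page *page)
{
	if (PageLRU(page)) {
		struct zone *zone = page_zone(page);
		struct lruvec *lruvec;
		unsigned long flags;

		/*
		 * Deadlock: __munlock_pagevec() is still holding
		 * zone->lru_lock when it reaches put_page().
		 */
		spin_lock_irqsave(&zone->lru_lock, flags);
		lruvec = mem_cgroup_page_lruvec(page, zone);
		__ClearPageLRU(page);
		del_page_from_lru_list(page, lruvec, page_off_lru(page));
		spin_unlock_irqrestore(&zone->lru_lock, flags);
	}
}

With the patch above, zone->lru_lock is only held across the actual LRU
manipulation in phase 1, so any put_page() done by __munlock_pagevec() runs
with the lock already released.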