On 02/07/2014 01:45 AM, gregkh@xxxxxxxxxxxxxxxxxxx wrote:
> 
> This is a note to let you know that I've just added the patch titled
> 
>     mm: munlock: fix potential race with THP page split
> 
> to the 3.13-stable tree which can be found at:
>     http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=summary
> 
> The filename of the patch is:
>     mm-munlock-fix-potential-race-with-thp-page-split.patch
> and it can be found in the queue-3.13 subdirectory.
> 
> If you, or anyone else, feels it should not be added to the stable tree,
> please let <stable@xxxxxxxxxxxxxxx> know about it.

Hello,

I realized this probably doesn't meet the stable kernel rules regarding
criticality and the race actually being observed to happen, so feel free
to drop it.

Vlastimil

> From 01cc2e58697e34c6ee9a40fb6cebc18bf5a1923f Mon Sep 17 00:00:00 2001
> From: Vlastimil Babka <vbabka@xxxxxxx>
> Date: Thu, 23 Jan 2014 15:52:50 -0800
> Subject: mm: munlock: fix potential race with THP page split
> 
> From: Vlastimil Babka <vbabka@xxxxxxx>
> 
> commit 01cc2e58697e34c6ee9a40fb6cebc18bf5a1923f upstream.
> 
> Since commit ff6a6da60b89 ("mm: accelerate munlock() treatment of THP
> pages") munlock skips tail pages of a munlocked THP page.  There is some
> attempt to prevent bad consequences of racing with a THP page split, but
> code inspection indicates that there are two problems that may lead to a
> non-fatal, yet wrong outcome.
> 
> First, __split_huge_page_refcount() copies flags including PageMlocked
> from the head page to the tail pages.  Clearing PageMlocked by
> munlock_vma_page() in the middle of this operation might leave some of
> the tail pages with the PageMlocked flag set.  As the head page still
> appears to be a THP page until all tail pages are processed,
> munlock_vma_page() might think it munlocked the whole THP page and skip
> all the former tail pages.  Before ff6a6da60, those pages would be
> cleared in further iterations of munlock_vma_pages_range(), but NR_MLOCK
> would still become undercounted (related to the next point).
> 
> Second, NR_MLOCK accounting is based on a call to hpage_nr_pages() after
> PageMlocked is cleared.  The accounting might also become inconsistent
> due to a race with __split_huge_page_refcount():
> 
> - undercount when HUGE_PMD_NR is subtracted, but some tail pages are
>   left with PageMlocked set and counted again (only possible before
>   ff6a6da60)
> 
> - overcount when hpage_nr_pages() sees a normal page (split has already
>   finished), but the parallel split has meanwhile cleared PageMlocked
>   from additional tail pages
> 
> This patch prevents both problems by extending the scope of lru_lock in
> munlock_vma_page().  This is convenient because:
> 
> - __split_huge_page_refcount() takes lru_lock for its whole operation
> 
> - munlock_vma_page() typically takes lru_lock anyway for page isolation
> 
> As this becomes a second function where page isolation is done with
> lru_lock already held, factor this out to a new
> __munlock_isolate_lru_page() function and clean up the code around it.
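
(Not part of the patch - just an illustration for reviewers who prefer to
play with the ordering. Below is a small userspace model of the
serialization the changelog describes; the names page_split(),
munlock_page() and HPAGE_NR are made up, and a pthread mutex stands in for
zone->lru_lock. It only sketches why reading hpage_nr_pages() and clearing
PageMlocked under the same lock that the split holds keeps the
NR_MLOCK-style counter consistent in either ordering.)

/*
 * Toy model of the munlock vs. THP-split race -- NOT kernel code.
 * Build with: gcc -O2 -pthread race_model.c
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

#define HPAGE_NR 4   /* pages in the compound page (stand-in for HPAGE_PMD_NR) */

static pthread_mutex_t lru_lock = PTHREAD_MUTEX_INITIALIZER;
static bool page_is_huge = true;          /* head still looks like a THP page */
static bool mlocked[HPAGE_NR] = { true }; /* only the head carries PageMlocked */
static long nr_mlock = HPAGE_NR;          /* models the NR_MLOCK counter */

/* Stand-in for __split_huge_page_refcount(): copies head flags to tails under the lock. */
static void *page_split(void *arg)
{
        pthread_mutex_lock(&lru_lock);
        for (int i = 1; i < HPAGE_NR; i++)
                mlocked[i] = mlocked[0];  /* tails inherit PageMlocked */
        page_is_huge = false;             /* head becomes an ordinary page */
        pthread_mutex_unlock(&lru_lock);
        return NULL;
}

/* Stand-in for the fixed munlock_vma_page(): the size read and flag clear share the lock. */
static void *munlock_page(void *arg)
{
        pthread_mutex_lock(&lru_lock);
        int nr_pages = page_is_huge ? HPAGE_NR : 1;  /* hpage_nr_pages() is stable here */
        if (mlocked[0]) {                            /* TestClearPageMlocked() */
                mlocked[0] = false;
                nr_mlock -= nr_pages;
        }
        pthread_mutex_unlock(&lru_lock);
        return NULL;
}

int main(void)
{
        pthread_t t1, t2;
        int still_mlocked = 0;

        pthread_create(&t1, NULL, page_split, NULL);
        pthread_create(&t2, NULL, munlock_page, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);

        for (int i = 0; i < HPAGE_NR; i++)
                still_mlocked += mlocked[i];
        /* With the lock held across both steps, the counter matches in either order. */
        printf("nr_mlock=%ld, pages still mlocked=%d\n", nr_mlock, still_mlocked);
        return 0;
}
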
> 
> [akpm@xxxxxxxxxxxxxxxxxxxx: avoid a coding-style ugly]
> Signed-off-by: Vlastimil Babka <vbabka@xxxxxxx>
> Cc: Sasha Levin <sasha.levin@xxxxxxxxxx>
> Cc: Michel Lespinasse <walken@xxxxxxxxxx>
> Cc: Andrea Arcangeli <aarcange@xxxxxxxxxx>
> Cc: Rik van Riel <riel@xxxxxxxxxx>
> Cc: Mel Gorman <mgorman@xxxxxxx>
> Cc: Hugh Dickins <hughd@xxxxxxxxxx>
> Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
> Signed-off-by: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
> Signed-off-by: Greg Kroah-Hartman <gregkh@xxxxxxxxxxxxxxxxxxx>
> 
> ---
>  mm/mlock.c |  104 +++++++++++++++++++++++++++++++++++--------------------------
>  1 file changed, 60 insertions(+), 44 deletions(-)
> 
> --- a/mm/mlock.c
> +++ b/mm/mlock.c
> @@ -91,6 +91,26 @@ void mlock_vma_page(struct page *page)
>  }
>  
>  /*
> + * Isolate a page from LRU with optional get_page() pin.
> + * Assumes lru_lock already held and page already pinned.
> + */
> +static bool __munlock_isolate_lru_page(struct page *page, bool getpage)
> +{
> +        if (PageLRU(page)) {
> +                struct lruvec *lruvec;
> +
> +                lruvec = mem_cgroup_page_lruvec(page, page_zone(page));
> +                if (getpage)
> +                        get_page(page);
> +                ClearPageLRU(page);
> +                del_page_from_lru_list(page, lruvec, page_lru(page));
> +                return true;
> +        }
> +
> +        return false;
> +}
> +
> +/*
>   * Finish munlock after successful page isolation
>   *
>   * Page must be locked. This is a wrapper for try_to_munlock()
> @@ -126,9 +146,9 @@ static void __munlock_isolated_page(stru
>  static void __munlock_isolation_failed(struct page *page)
>  {
>          if (PageUnevictable(page))
> -                count_vm_event(UNEVICTABLE_PGSTRANDED);
> +                __count_vm_event(UNEVICTABLE_PGSTRANDED);
>          else
> -                count_vm_event(UNEVICTABLE_PGMUNLOCKED);
> +                __count_vm_event(UNEVICTABLE_PGMUNLOCKED);
>  }
>  
>  /**
> @@ -152,28 +172,34 @@ static void __munlock_isolation_failed(s
>  unsigned int munlock_vma_page(struct page *page)
>  {
>          unsigned int nr_pages;
> +        struct zone *zone = page_zone(page);
>  
>          BUG_ON(!PageLocked(page));
>  
> -        if (TestClearPageMlocked(page)) {
> -                nr_pages = hpage_nr_pages(page);
> -                mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages);
> -                if (!isolate_lru_page(page))
> -                        __munlock_isolated_page(page);
> -                else
> -                        __munlock_isolation_failed(page);
> -        } else {
> -                nr_pages = hpage_nr_pages(page);
> -        }
> -
>          /*
> -         * Regardless of the original PageMlocked flag, we determine nr_pages
> -         * after touching the flag. This leaves a possible race with a THP page
> -         * split, such that a whole THP page was munlocked, but nr_pages == 1.
> -         * Returning a smaller mask due to that is OK, the worst that can
> -         * happen is subsequent useless scanning of the former tail pages.
> -         * The NR_MLOCK accounting can however become broken.
> +         * Serialize with any parallel __split_huge_page_refcount() which
> +         * might otherwise copy PageMlocked to part of the tail pages before
> +         * we clear it in the head page. It also stabilizes hpage_nr_pages().
>           */
> +        spin_lock_irq(&zone->lru_lock);
> +
> +        nr_pages = hpage_nr_pages(page);
> +        if (!TestClearPageMlocked(page))
> +                goto unlock_out;
> +
> +        __mod_zone_page_state(zone, NR_MLOCK, -nr_pages);
> +
> +        if (__munlock_isolate_lru_page(page, true)) {
> +                spin_unlock_irq(&zone->lru_lock);
> +                __munlock_isolated_page(page);
> +                goto out;
> +        }
> +        __munlock_isolation_failed(page);
> +
> +unlock_out:
> +        spin_unlock_irq(&zone->lru_lock);
> +
> +out:
>          return nr_pages - 1;
>  }
>  
> @@ -310,34 +336,24 @@ static void __munlock_pagevec(struct pag
>                  struct page *page = pvec->pages[i];
>  
>                  if (TestClearPageMlocked(page)) {
> -                        struct lruvec *lruvec;
> -                        int lru;
> -
> -                        if (PageLRU(page)) {
> -                                lruvec = mem_cgroup_page_lruvec(page, zone);
> -                                lru = page_lru(page);
> -                                /*
> -                                 * We already have pin from follow_page_mask()
> -                                 * so we can spare the get_page() here.
> -                                 */
> -                                ClearPageLRU(page);
> -                                del_page_from_lru_list(page, lruvec, lru);
> -                        } else {
> -                                __munlock_isolation_failed(page);
> -                                goto skip_munlock;
> -                        }
> -
> -                } else {
> -skip_munlock:
>                          /*
> -                         * We won't be munlocking this page in the next phase
> -                         * but we still need to release the follow_page_mask()
> -                         * pin. We cannot do it under lru_lock however. If it's
> -                         * the last pin, __page_cache_release would deadlock.
> +                         * We already have pin from follow_page_mask()
> +                         * so we can spare the get_page() here.
>                          */
> -                        pagevec_add(&pvec_putback, pvec->pages[i]);
> -                        pvec->pages[i] = NULL;
> +                        if (__munlock_isolate_lru_page(page, false))
> +                                continue;
> +                        else
> +                                __munlock_isolation_failed(page);
>                  }
> +
> +                /*
> +                 * We won't be munlocking this page in the next phase
> +                 * but we still need to release the follow_page_mask()
> +                 * pin. We cannot do it under lru_lock however. If it's
> +                 * the last pin, __page_cache_release() would deadlock.
> +                 */
> +                pagevec_add(&pvec_putback, pvec->pages[i]);
> +                pvec->pages[i] = NULL;
>          }
>          delta_munlocked = -nr + pagevec_count(&pvec_putback);
>          __mod_zone_page_state(zone, NR_MLOCK, delta_munlocked);
> 
> 
> Patches currently in stable-queue which might be from vbabka@xxxxxxx are
> 
> queue-3.13/mm-munlock-fix-potential-race-with-thp-page-split.patch

--
To unsubscribe from this list: send the line "unsubscribe stable" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html