2nd attempt at the description.

From ccfc6c79634f6cec69d8fb23b0e863ebfa5b893c Mon Sep 17 00:00:00 2001
From: Minchan Kim <minchan@xxxxxxxxxx>
Date: Mon, 30 Mar 2015 13:43:08 +0900
Subject: [PATCH v2] mm: make every pte dirty on do_swap_page

Basically, MADV_FREE relies on the dirty bit in the page table entry to
decide whether the VM is allowed to discard the page. In other words, if
the page table entry has the dirty bit set, the VM shouldn't discard the
page.

However, if a swap-in by a read fault happens, the page table entry
pointing to the page does not have the dirty bit set, so MADV_FREE might
discard the page wrongly. To avoid this problem, MADV_FREE did extra
checks on PageDirty and PageSwapCache. That worked because a swapped-in
page lives in the swap cache, and once it is removed from the swap cache
the page has the PG_dirty flag set. So together these two page flag
checks effectively prevent MADV_FREE from discarding the page wrongly.

The problem with the above logic is that a swapped-in page keeps PG_dirty
after it is removed from the swap cache, so the VM can no longer consider
it freeable even if madvise_free is called on it later. See the example
below for details.

    ptr = malloc();
    memset(ptr);
    ..
    .. heavy memory pressure, so all of the pages are swapped out
    ..
    var = *ptr; -> a page is swapped in and removed from the swap cache.
                   The page table entry doesn't have the dirty bit set
                   and the page descriptor has PG_dirty.
    ..
    madvise_free(ptr);
    ..
    .. heavy memory pressure again.
    .. This time, the VM cannot discard the page because the page
    .. has *PG_dirty*.

Rather than relying on PG_dirty in the page descriptor to prevent a page
from being discarded, using the dirty bit in the page table is more
straightforward and simple. So this patch sets the page table dirty bit
whenever a swap-in happens. Inherently, the page table entry of a
swapped-out page had the dirty bit set, so I think this causes no problem.

This removes the complicated logic and makes madvise_free's check for
freeable pages simple. It also solves the example mentioned above.
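The freeable check described above can be traced with a small user-space model. This is a hypothetical sketch, not kernel code: `struct page_model`, `freeable_old()`, `freeable_new()`, `swap_in_read_fault()`, `madvise_free_model()` and `scenario()` are invented names. The two predicates mirror the `page_check_references()` conditions before and after this patch, and the state transitions assume only what this description states (a page removed from the swap cache carries PG_dirty, and madvise_free then only cleans the pte).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the state MADV_FREE inspects; not kernel types. */
struct page_model {
	bool anon;      /* PageAnon() */
	bool swapcache; /* PageSwapCache() */
	bool pg_dirty;  /* PG_dirty in the page descriptor */
	bool pte_dirty; /* dirty bit in the page table entry */
};

/* Check before the patch: PG_dirty also blocks discarding. */
static bool freeable_old(const struct page_model *p)
{
	return p->anon && !p->pte_dirty && !p->swapcache && !p->pg_dirty;
}

/* Check after the patch: the pte dirty bit alone decides. */
static bool freeable_new(const struct page_model *p)
{
	return p->anon && !p->pte_dirty && !p->swapcache;
}

/*
 * Hypothetical read-fault swap-in followed by removal from the swap
 * cache, as in the example. patched selects whether the pte is made
 * dirty on swap-in (the new do_swap_page behaviour).
 */
static void swap_in_read_fault(struct page_model *p, bool patched)
{
	p->anon = true;
	p->swapcache = false; /* removed from the swap cache */
	p->pg_dirty = true;   /* page carries PG_dirty afterwards */
	p->pte_dirty = patched;
}

/* Hypothetical madvise_free on a page no longer in the swap cache:
 * only the pte dirty bit is cleared, PG_dirty is left alone. */
static void madvise_free_model(struct page_model *p)
{
	assert(!p->swapcache);
	p->pte_dirty = false;
}

/* Returns whether reclaim could discard the page in the example. */
static bool scenario(bool patched, bool call_madvise_free)
{
	struct page_model p = {0};

	swap_in_read_fault(&p, patched);
	if (call_madvise_free)
		madvise_free_model(&p);
	return patched ? freeable_new(&p) : freeable_old(&p);
}
```

With this model, `scenario(false, true)` returns false (PG_dirty keeps the page pinned even after madvise_free), while `scenario(true, true)` returns true. Without madvise_free, both variants return false, so the page is still protected from wrong discarding.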
Cc: Hugh Dickins <hughd@xxxxxxxxxx>
Cc: Cyrill Gorcunov <gorcunov@xxxxxxxxx>
Cc: Pavel Emelyanov <xemul@xxxxxxxxxxxxx>
Reported-by: Yalin Wang <yalin.wang@xxxxxxxxxxxxxx>
Signed-off-by: Minchan Kim <minchan@xxxxxxxxxx>
---
* From v1:
  * Rewrite description - Andrew

 mm/madvise.c |  1 -
 mm/memory.c  | 10 ++++++++--
 mm/rmap.c    |  2 +-
 mm/vmscan.c  |  3 +--
 4 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 22e8f0c..a045798 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -325,7 +325,6 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 				continue;
 			}
 
-			ClearPageDirty(page);
 			unlock_page(page);
 		}
 
diff --git a/mm/memory.c b/mm/memory.c
index 6743966..48ff537 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2521,9 +2521,15 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	inc_mm_counter_fast(mm, MM_ANONPAGES);
 	dec_mm_counter_fast(mm, MM_SWAPENTS);
-	pte = mk_pte(page, vma->vm_page_prot);
+
+	/*
+	 * The page being swapped in was dirty before it was swapped out,
+	 * so restore that state (i.e., pte_mkdirty) because MADV_FREE
+	 * relies on the dirty bit in the page table.
+	 */
+	pte = pte_mkdirty(mk_pte(page, vma->vm_page_prot));
 	if ((flags & FAULT_FLAG_WRITE) && reuse_swap_page(page)) {
-		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
+		pte = maybe_mkwrite(pte, vma);
 		flags &= ~FAULT_FLAG_WRITE;
 		ret |= VM_FAULT_WRITE;
 		exclusive = 1;
diff --git a/mm/rmap.c b/mm/rmap.c
index dad23a4..281e806 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1275,7 +1275,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 
 		if (flags & TTU_FREE) {
 			VM_BUG_ON_PAGE(PageSwapCache(page), page);
-			if (!dirty && !PageDirty(page)) {
+			if (!dirty) {
 				/* It's a freeable page by MADV_FREE */
 				dec_mm_counter(mm, MM_ANONPAGES);
 				goto discard;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index dc6cd51..fffebf0 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -805,8 +805,7 @@ static enum page_references page_check_references(struct page *page,
 		return PAGEREF_KEEP;
 	}
 
-	if (PageAnon(page) && !pte_dirty && !PageSwapCache(page) &&
-			!PageDirty(page))
+	if (PageAnon(page) && !pte_dirty && !PageSwapCache(page))
 		*freeable = true;
 
 	/* Reclaim if clean, defer dirty pages to writeback */
-- 
1.9.3

-- 
Kind regards,
Minchan Kim