Re: [PATCH] mm: khugepaged: fix potential page state corruption

Yang Shi <yang.shi@xxxxxxxxxxxxxxxxx> · Thu, 19 Mar 2020 09:57:47 -0700

On 3/19/20 3:49 AM, Kirill A. Shutemov wrote:
On Wed, Mar 18, 2020 at 10:39:21PM -0700, Yang Shi wrote:

On 3/18/20 5:55 PM, Yang Shi wrote:

On 3/18/20 5:12 PM, Kirill A. Shutemov wrote:
On Thu, Mar 19, 2020 at 07:19:42AM +0800, Yang Shi wrote:
When khugepaged collapses anonymous pages, the base pages would
be freed
via pagevec or free_page_and_swap_cache().  But, the anonymous page may
be added back to LRU, then it might result in the below race:

     CPU A                CPU B
khugepaged:
    unlock page
    putback_lru_page
      add to lru
                 page reclaim:
                   isolate this page
                   try_to_unmap
    page_remove_rmap <-- corrupt _mapcount

It looks nothing would prevent the pages from isolating by reclaimer.
Hm. Why should it?

try_to_unmap() doesn't exclude parallel page unmapping. _mapcount is
protected by ptl. And this particular _mapcount pin is reachable for
reclaim as it's not part of usual page table tree. Basically
try_to_unmap() will never succeeds until we give up the _mapcount on
khugepaged side.
I don't quite get. What does "not part of usual page table tree" means?

How's about try_to_unmap() acquires ptl before khugepaged?
The page table we are dealing with was detached from the process' page
table tree: see pmdp_collapse_flush(). try_to_unmap() will not see the
pte.

try_to_unmap() can only reach the ptl if split ptl is disabled
(mm->page_table_lock is used), but it still will not be able to reach pte.

Aha, got it. Thanks for explaining. I definitely missed this point. Yes, 
pmdp_collapse_flush() would clear the pmd, then others won't see the 
page table.

However, it looks the vmscan would not stop at try_to_unmap() at all, 
try_to_unmap() would just return true since pmd_present() should return 
false in pvmw. Then it would go all the way down to __remove_mapping(), 
but freezing the page would fail since try_to_unmap() doesn't actually 
drop the refcount from the pte map.

It would not result in any critical problem AFAICT, but suboptimal and 
it may causes some unnecessary I/O due to swap.

I don't see the issue right away.

The other problem is the page's active or unevictable flag might be
still set when freeing the page via free_page_and_swap_cache().
So what?
The flags may leak to page free path then kernel may complain if
DEBUG_VM is set.
Could you elaborate on what codepath you are talking about?

__put_page ->
    __put_single_page ->
        free_unref_page ->
            put_unref_page_prepare ->
                free_pcp_prepare ->
                    free_pages_prepare ->
                        free_pages_check

This check would just be run when DEBUG_VM is enabled.

The putback_lru_page() would not clear those two flags if the pages are
released via pagevec, it sounds nothing prevents from isolating active
Sorry, this is a typo. If the page is freed via pagevec, active and
unevictable flag would get cleared before freeing by page_off_lru().

But, if the page is freed by free_page_and_swap_cache(), these two flags are
not cleared. But, it seems this path is hit rare, the pages are freed by
pagevec for the most cases.

or unevictable pages.
Again, why should it? vmscan is equipped to deal with this.
I don't mean vmscan, I mean khugepaged may isolate active and
unevictable pages since it just simply walks page table.
Why it is wrong? lru_cache_add() only complains if both flags set, it
shouldn't happen.

Noting wrong about isolating active or unevictable pages. I just mean it 
seems possible active or unevictable flag may be there if the page is 
freed via free_page_add_swap_cache() path.

However I didn't really run into these problems, just in theory
by visual
inspection.

And, it also seems unnecessary to have the pages add back to LRU
again since
they are about to be freed when reaching this point.  So,
clearing active
and unevictable flags, unlocking and dropping refcount from isolate
instead of calling putback_lru_page() as what page cache collapse does.
Hm? But we do call putback_lru_page() on the way out. I do not follow.
It just calls putback_lru_page() at error path, not success path.
Putting pages back to lru on error path definitely makes sense. Here it
is the success path.
I agree that putting the apage on LRU just before free the page is
suboptimal, but I don't see it as a critical issue.

Yes, given the code analysis above, I agree. If you thought the patch is 
a fine micro-optimization, I would like to re-submit it with rectified 
commit log. Thank you for your time.