Miaohe Lin <linmiaohe@xxxxxxxxxx> writes: > On 2022/1/13 13:47, Huang, Ying wrote: >> Minchan Kim <minchan@xxxxxxxxxx> writes: >> >>> On Wed, Jan 12, 2022 at 09:46:23AM +0800, Huang, Ying wrote: >>>> Yu Zhao <yuzhao@xxxxxxxxxx> writes: >>>> >>>>> On Wed, Jan 05, 2022 at 08:34:40PM -0300, Mauricio Faria de Oliveira wrote: >>>>>> diff --git a/mm/rmap.c b/mm/rmap.c >>>>>> index 163ac4e6bcee..8671de473c25 100644 >>>>>> --- a/mm/rmap.c >>>>>> +++ b/mm/rmap.c >>>>>> @@ -1570,7 +1570,20 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, >>>>>> >>>>>> /* MADV_FREE page check */ >>>>>> if (!PageSwapBacked(page)) { >>>>>> - if (!PageDirty(page)) { >>>>>> + int ref_count = page_ref_count(page); >>>>>> + int map_count = page_mapcount(page); >>>>>> + >>>>>> + /* >>>>>> + * The only page refs must be from the isolation >>>>>> + * (checked by the caller shrink_page_list() too) >>>>>> + * and one or more rmap's (dropped by discard:). >>>>>> + * >>>>>> + * Check the reference count before dirty flag >>>>>> + * with memory barrier; see __remove_mapping(). >>>>>> + */ >>>>>> + smp_rmb(); >>>>>> + if ((ref_count - 1 == map_count) && >>>>>> + !PageDirty(page)) { >>>>>> /* Invalidate as we cleared the pte */ >>>>>> mmu_notifier_invalidate_range(mm, >>>>>> address, address + PAGE_SIZE); >>>>> >>>>> Out of curiosity, how does it work with COW in terms of reordering? >>>>> Specifically, it seems to me get_page() and page_dup_rmap() in >>>>> copy_present_pte() can happen in any order, and if page_dup_rmap() >>>>> is seen first, and direct io is holding a refcnt, this check can still >>>>> pass? >>>> >>>> I think that you are correct. >>>> >>>> After more thoughts, it appears very tricky to compare page count and >>>> map count. Even if we have added smp_rmb() between page_ref_count() and >>>> page_mapcount(), an interrupt may happen between them. During the >>>> interrupt, the page count and map count may be changed, for example, >>>> unmapped, or do_swap_page(). >>> >>> Yeah, it happens but what specific problem are you concerning from the >>> count change under race? The fork case Yu pointed out was already known >>> for breaking DIO so user should take care not to fork under DIO(Please >>> look at O_DIRECT section in man 2 open). If you could give a specific >>> example, it would be great to think over the issue. >> >> Whether is the following race possible? >> >> CPU0/Process A CPU1/Process B >> -------------- -------------- >> try_to_unmap_one >> page_mapcount() >> zap_pte_range() >> page_remove_rmap() >> atomic_add_negative(-1, &page->_mapcount) >> tlb_flush_mmu() >> ... >> put_page_testzero() >> page_count() >> > > It seems they're under the same page table Lock. This is for a page shared by 2 processes (Process A/B above). But you reminded me that an anonymous page cannot be shared between multiple processes after direct IO. Because direct IO read un-shares the page, and fork() isn't allowed, I guess ksm isn't allowed too. So the above race isn't possible. Sorry for confusing. Best Regards, Huang, Ying > Thanks. > >> Previously I thought that there's similar race in do_swap_page(). But >> after more thoughts, I found that the page is locked in do_swap_page(). >> So do_swap_page() is safe. Per my understanding, except during fork() >> as Yu pointed out, the anonymous page must be locked before increasing >> its mapcount. >> >> So, if the above race is possible, we need to guarantee to read >> page_count() before page_mapcount(). That is, something as follows, >> >> count = page_count(); >> smp_rmb(); >> mapcount = page_mapcount(); >> if (!PageDirty(page) && mapcount + 1 == count) { >> ... >> } >> >> Best Regards, >> Huang, Ying >> >>> I agree it's little tricky but it seems to be way other place has used >>> for a long time(Please look at write_protect_page in ksm.c). >>> So, here what we missing is tlb flush before the checking. >>> >>> Something like this. >>> >>> diff --git a/mm/rmap.c b/mm/rmap.c >>> index b0fd9dc19eba..b4ad9faa17b2 100644 >>> --- a/mm/rmap.c >>> +++ b/mm/rmap.c >>> @@ -1599,18 +1599,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, >>> >>> /* MADV_FREE page check */ >>> if (!PageSwapBacked(page)) { >>> - int refcount = page_ref_count(page); >>> - >>> - /* >>> - * The only page refs must be from the isolation >>> - * (checked by the caller shrink_page_list() too) >>> - * and the (single) rmap (dropped by discard:). >>> - * >>> - * Check the reference count before dirty flag >>> - * with memory barrier; see __remove_mapping(). >>> - */ >>> - smp_rmb(); >>> - if (refcount == 2 && !PageDirty(page)) { >>> + if (!PageDirty(page) && >>> + page_mapcount(page) + 1 == page_count(page)) { >>> /* Invalidate as we cleared the pte */ >>> mmu_notifier_invalidate_range(mm, >>> address, address + PAGE_SIZE); >>> diff --git a/mm/vmscan.c b/mm/vmscan.c >>> index f3162a5724de..6454ff5c576f 100644 >>> --- a/mm/vmscan.c >>> +++ b/mm/vmscan.c >>> @@ -1754,6 +1754,9 @@ static unsigned int shrink_page_list(struct list_head *page_list, >>> enum ttu_flags flags = TTU_BATCH_FLUSH; >>> bool was_swapbacked = PageSwapBacked(page); >>> >>> + if (!was_swapbacked && PageAnon(page)) >>> + flags &= ~TTU_BATCH_FLUSH; >>> + >>> if (unlikely(PageTransHuge(page))) >>> flags |= TTU_SPLIT_HUGE_PMD; >> . >>