Re: [PATCH v2] mm: fix race between MADV_FREE reclaim and blkdev direct IO read

"Huang, Ying" <ying.huang@xxxxxxxxx> · Thu, 13 Jan 2022 16:04:10 +0800

Miaohe Lin <linmiaohe@xxxxxxxxxx> writes:

> On 2022/1/13 13:47, Huang, Ying wrote:
>> Minchan Kim <minchan@xxxxxxxxxx> writes:
>> 
>>> On Wed, Jan 12, 2022 at 09:46:23AM +0800, Huang, Ying wrote:
>>>> Yu Zhao <yuzhao@xxxxxxxxxx> writes:
>>>>
>>>>> On Wed, Jan 05, 2022 at 08:34:40PM -0300, Mauricio Faria de Oliveira wrote:
>>>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>>>> index 163ac4e6bcee..8671de473c25 100644
>>>>>> --- a/mm/rmap.c
>>>>>> +++ b/mm/rmap.c
>>>>>> @@ -1570,7 +1570,20 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>>>>>>  
>>>>>>  			/* MADV_FREE page check */
>>>>>>  			if (!PageSwapBacked(page)) {
>>>>>> -				if (!PageDirty(page)) {
>>>>>> +				int ref_count = page_ref_count(page);
>>>>>> +				int map_count = page_mapcount(page);
>>>>>> +
>>>>>> +				/*
>>>>>> +				 * The only page refs must be from the isolation
>>>>>> +				 * (checked by the caller shrink_page_list() too)
>>>>>> +				 * and one or more rmap's (dropped by discard:).
>>>>>> +				 *
>>>>>> +				 * Check the reference count before dirty flag
>>>>>> +				 * with memory barrier; see __remove_mapping().
>>>>>> +				 */
>>>>>> +				smp_rmb();
>>>>>> +				if ((ref_count - 1 == map_count) &&
>>>>>> +				    !PageDirty(page)) {
>>>>>>  					/* Invalidate as we cleared the pte */
>>>>>>  					mmu_notifier_invalidate_range(mm,
>>>>>>  						address, address + PAGE_SIZE);
>>>>>
>>>>> Out of curiosity, how does it work with COW in terms of reordering?
>>>>> Specifically, it seems to me get_page() and page_dup_rmap() in
>>>>> copy_present_pte() can happen in any order, and if page_dup_rmap()
>>>>> is seen first, and direct io is holding a refcnt, this check can still
>>>>> pass?
>>>>
>>>> I think that you are correct.
>>>>
>>>> After more thoughts, it appears very tricky to compare page count and
>>>> map count.  Even if we have added smp_rmb() between page_ref_count() and
>>>> page_mapcount(), an interrupt may happen between them.  During the
>>>> interrupt, the page count and map count may be changed, for example,
>>>> unmapped, or do_swap_page().
>>>
>>> Yeah, it happens but what specific problem are you concerning from the
>>> count change under race? The fork case Yu pointed out was already known
>>> for breaking DIO so user should take care not to fork under DIO(Please
>>> look at O_DIRECT section in man 2 open). If you could give a specific
>>> example, it would be great to think over the issue.
>> 
>> Whether is the following race possible?
>> 
>> CPU0/Process A                  CPU1/Process B
>> --------------                  --------------
>> try_to_unmap_one
>>   page_mapcount()
>>                                 zap_pte_range()
>>                                   page_remove_rmap()
>>                                     atomic_add_negative(-1, &page->_mapcount)
>>                                   tlb_flush_mmu()
>>                                     ...
>>                                       put_page_testzero()
>>   page_count()
>> 
>
> It seems they're under the same page table Lock.

This is for a page shared by 2 processes (Process A/B above).  But you
reminded me that an anonymous page cannot be shared between multiple
processes after direct IO.  Because direct IO read un-shares the page,
and fork() isn't allowed, I guess ksm isn't allowed too.

So the above race isn't possible.  Sorry for confusing.

Best Regards,
Huang, Ying

> Thanks.
>
>> Previously I thought that there's similar race in do_swap_page().  But
>> after more thoughts, I found that the page is locked in do_swap_page().
>> So do_swap_page() is safe.  Per my understanding, except during fork()
>> as Yu pointed out, the anonymous page must be locked before increasing
>> its mapcount.
>> 
>> So, if the above race is possible, we need to guarantee to read
>> page_count() before page_mapcount().  That is, something as follows,
>> 
>>         count = page_count();
>>         smp_rmb();
>>         mapcount = page_mapcount();
>>         if (!PageDirty(page) && mapcount + 1 == count) {
>>                 ...
>>         }
>> 
>> Best Regards,
>> Huang, Ying
>> 
>>> I agree it's little tricky but it seems to be way other place has used
>>> for a long time(Please look at write_protect_page in ksm.c).
>>> So, here what we missing is tlb flush before the checking.
>>>
>>> Something like this.
>>>
>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>> index b0fd9dc19eba..b4ad9faa17b2 100644
>>> --- a/mm/rmap.c
>>> +++ b/mm/rmap.c
>>> @@ -1599,18 +1599,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>>>
>>>                         /* MADV_FREE page check */
>>>                         if (!PageSwapBacked(page)) {
>>> -                               int refcount = page_ref_count(page);
>>> -
>>> -                               /*
>>> -                                * The only page refs must be from the isolation
>>> -                                * (checked by the caller shrink_page_list() too)
>>> -                                * and the (single) rmap (dropped by discard:).
>>> -                                *
>>> -                                * Check the reference count before dirty flag
>>> -                                * with memory barrier; see __remove_mapping().
>>> -                                */
>>> -                               smp_rmb();
>>> -                               if (refcount == 2 && !PageDirty(page)) {
>>> +                               if (!PageDirty(page) &&
>>> +                                       page_mapcount(page) + 1 == page_count(page)) {
>>>                                         /* Invalidate as we cleared the pte */
>>>                                         mmu_notifier_invalidate_range(mm,
>>>                                                 address, address + PAGE_SIZE);
>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>> index f3162a5724de..6454ff5c576f 100644
>>> --- a/mm/vmscan.c
>>> +++ b/mm/vmscan.c
>>> @@ -1754,6 +1754,9 @@ static unsigned int shrink_page_list(struct list_head *page_list,
>>>                         enum ttu_flags flags = TTU_BATCH_FLUSH;
>>>                         bool was_swapbacked = PageSwapBacked(page);
>>>
>>> +                       if (!was_swapbacked && PageAnon(page))
>>> +                               flags &= ~TTU_BATCH_FLUSH;
>>> +
>>>                         if (unlikely(PageTransHuge(page)))
>>>                                 flags |= TTU_SPLIT_HUGE_PMD;
>> .
>>