Re: [PATCH v2] mm: fix race between MADV_FREE reclaim and blkdev direct IO read

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 2022/1/13 13:47, Huang, Ying wrote:
> Minchan Kim <minchan@xxxxxxxxxx> writes:
> 
>> On Wed, Jan 12, 2022 at 09:46:23AM +0800, Huang, Ying wrote:
>>> Yu Zhao <yuzhao@xxxxxxxxxx> writes:
>>>
>>>> On Wed, Jan 05, 2022 at 08:34:40PM -0300, Mauricio Faria de Oliveira wrote:
>>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>>> index 163ac4e6bcee..8671de473c25 100644
>>>>> --- a/mm/rmap.c
>>>>> +++ b/mm/rmap.c
>>>>> @@ -1570,7 +1570,20 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>>>>>  
>>>>>  			/* MADV_FREE page check */
>>>>>  			if (!PageSwapBacked(page)) {
>>>>> -				if (!PageDirty(page)) {
>>>>> +				int ref_count = page_ref_count(page);
>>>>> +				int map_count = page_mapcount(page);
>>>>> +
>>>>> +				/*
>>>>> +				 * The only page refs must be from the isolation
>>>>> +				 * (checked by the caller shrink_page_list() too)
>>>>> +				 * and one or more rmap's (dropped by discard:).
>>>>> +				 *
>>>>> +				 * Check the reference count before dirty flag
>>>>> +				 * with memory barrier; see __remove_mapping().
>>>>> +				 */
>>>>> +				smp_rmb();
>>>>> +				if ((ref_count - 1 == map_count) &&
>>>>> +				    !PageDirty(page)) {
>>>>>  					/* Invalidate as we cleared the pte */
>>>>>  					mmu_notifier_invalidate_range(mm,
>>>>>  						address, address + PAGE_SIZE);
>>>>
>>>> Out of curiosity, how does it work with COW in terms of reordering?
>>>> Specifically, it seems to me get_page() and page_dup_rmap() in
>>>> copy_present_pte() can happen in any order, and if page_dup_rmap()
>>>> is seen first, and direct io is holding a refcnt, this check can still
>>>> pass?
>>>
>>> I think that you are correct.
>>>
>>> After more thoughts, it appears very tricky to compare page count and
>>> map count.  Even if we have added smp_rmb() between page_ref_count() and
>>> page_mapcount(), an interrupt may happen between them.  During the
>>> interrupt, the page count and map count may be changed, for example,
>>> unmapped, or do_swap_page().
>>
>> Yeah, it happens but what specific problem are you concerning from the
>> count change under race? The fork case Yu pointed out was already known
>> for breaking DIO so user should take care not to fork under DIO(Please
>> look at O_DIRECT section in man 2 open). If you could give a specific
>> example, it would be great to think over the issue.
> 
> Whether is the following race possible?
> 
> CPU0/Process A                  CPU1/Process B
> --------------                  --------------
> try_to_unmap_one
>   page_mapcount()
>                                 zap_pte_range()
>                                   page_remove_rmap()
>                                     atomic_add_negative(-1, &page->_mapcount)
>                                   tlb_flush_mmu()
>                                     ...
>                                       put_page_testzero()
>   page_count()
> 

It seems they're under the same page table Lock.

Thanks.

> Previously I thought that there's similar race in do_swap_page().  But
> after more thoughts, I found that the page is locked in do_swap_page().
> So do_swap_page() is safe.  Per my understanding, except during fork()
> as Yu pointed out, the anonymous page must be locked before increasing
> its mapcount.
> 
> So, if the above race is possible, we need to guarantee to read
> page_count() before page_mapcount().  That is, something as follows,
> 
>         count = page_count();
>         smp_rmb();
>         mapcount = page_mapcount();
>         if (!PageDirty(page) && mapcount + 1 == count) {
>                 ...
>         }
> 
> Best Regards,
> Huang, Ying
> 
>> I agree it's little tricky but it seems to be way other place has used
>> for a long time(Please look at write_protect_page in ksm.c).
>> So, here what we missing is tlb flush before the checking.
>>
>> Something like this.
>>
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index b0fd9dc19eba..b4ad9faa17b2 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -1599,18 +1599,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>>
>>                         /* MADV_FREE page check */
>>                         if (!PageSwapBacked(page)) {
>> -                               int refcount = page_ref_count(page);
>> -
>> -                               /*
>> -                                * The only page refs must be from the isolation
>> -                                * (checked by the caller shrink_page_list() too)
>> -                                * and the (single) rmap (dropped by discard:).
>> -                                *
>> -                                * Check the reference count before dirty flag
>> -                                * with memory barrier; see __remove_mapping().
>> -                                */
>> -                               smp_rmb();
>> -                               if (refcount == 2 && !PageDirty(page)) {
>> +                               if (!PageDirty(page) &&
>> +                                       page_mapcount(page) + 1 == page_count(page)) {
>>                                         /* Invalidate as we cleared the pte */
>>                                         mmu_notifier_invalidate_range(mm,
>>                                                 address, address + PAGE_SIZE);
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index f3162a5724de..6454ff5c576f 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -1754,6 +1754,9 @@ static unsigned int shrink_page_list(struct list_head *page_list,
>>                         enum ttu_flags flags = TTU_BATCH_FLUSH;
>>                         bool was_swapbacked = PageSwapBacked(page);
>>
>> +                       if (!was_swapbacked && PageAnon(page))
>> +                               flags &= ~TTU_BATCH_FLUSH;
>> +
>>                         if (unlikely(PageTransHuge(page)))
>>                                 flags |= TTU_SPLIT_HUGE_PMD;
> .
> 




[Index of Archives]     [Linux RAID]     [Linux SCSI]     [Linux ATA RAID]     [IDE]     [Linux Wireless]     [Linux Kernel]     [ATH6KL]     [Linux Bluetooth]     [Linux Netdev]     [Kernel Newbies]     [Security]     [Git]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Device Mapper]

  Powered by Linux