Re: [PATCH v2] mm: fix race between MADV_FREE reclaim and blkdev direct IO read

Minchan Kim <minchan@xxxxxxxxxx> writes:

> On Wed, Jan 12, 2022 at 06:53:07PM -0300, Mauricio Faria de Oliveira wrote:
>> Hi Minchan Kim,
>> 
>> Thanks for handling the hard questions! :)
>> 
>> On Wed, Jan 12, 2022 at 2:33 PM Minchan Kim <minchan@xxxxxxxxxx> wrote:
>> >
>> > On Wed, Jan 12, 2022 at 09:46:23AM +0800, Huang, Ying wrote:
>> > > Yu Zhao <yuzhao@xxxxxxxxxx> writes:
>> > >
>> > > > On Wed, Jan 05, 2022 at 08:34:40PM -0300, Mauricio Faria de Oliveira wrote:
>> > > >> diff --git a/mm/rmap.c b/mm/rmap.c
>> > > >> index 163ac4e6bcee..8671de473c25 100644
>> > > >> --- a/mm/rmap.c
>> > > >> +++ b/mm/rmap.c
>> > > >> @@ -1570,7 +1570,20 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>> > > >>
>> > > >>                    /* MADV_FREE page check */
>> > > >>                    if (!PageSwapBacked(page)) {
>> > > >> -                          if (!PageDirty(page)) {
>> > > >> +                          int ref_count = page_ref_count(page);
>> > > >> +                          int map_count = page_mapcount(page);
>> > > >> +
>> > > >> +                          /*
>> > > >> +                           * The only page refs must be from the isolation
>> > > >> +                           * (checked by the caller shrink_page_list() too)
>> > > >> +                           * and one or more rmap's (dropped by discard:).
>> > > >> +                           *
>> > > >> +                           * Check the reference count before dirty flag
>> > > >> +                           * with memory barrier; see __remove_mapping().
>> > > >> +                           */
>> > > >> +                          smp_rmb();
>> > > >> +                          if ((ref_count - 1 == map_count) &&
>> > > >> +                              !PageDirty(page)) {
>> > > >>                                    /* Invalidate as we cleared the pte */
>> > > >>                                    mmu_notifier_invalidate_range(mm,
>> > > >>                                            address, address + PAGE_SIZE);
>> > > >
>> > > > Out of curiosity, how does it work with COW in terms of reordering?
>> > > > Specifically, it seems to me get_page() and page_dup_rmap() in
>> > > > copy_present_pte() can happen in any order, and if page_dup_rmap()
>> > > > is seen first, and direct io is holding a refcnt, this check can still
>> > > > pass?
>> > >
>> > > I think that you are correct.
>> > >
>> > > After more thoughts, it appears very tricky to compare page count and
>> > > map count.  Even if we have added smp_rmb() between page_ref_count() and
>> > > page_mapcount(), an interrupt may happen between them.  During the
>> > > interrupt, the page count and map count may be changed, for example,
>> > > unmapped, or do_swap_page().
>> >
>> > Yeah, it happens, but what specific problem are you concerned about
>> > from the count change under race? The fork case Yu pointed out was
>> > already known for breaking DIO, so the user should take care not to
>> > fork under DIO (please look at the O_DIRECT section in man 2 open).
>> > If you could give a specific example, it would be great to think
>> > over the issue.
>> >
>> > I agree it's a little tricky, but it seems to be the way other places
>> > have used for a long time (please look at write_protect_page in ksm.c).
>> 
>> Ah, that's great to see it's being used elsewhere, for DIO particularly!
>> 
>> > So, what we are missing here is a TLB flush before the check.
>> 
>> That shouldn't be required for this particular issue/case, IIUIC.
>> One of the things we checked early on was disabling deferred TLB flush
>> (similarly to what you've done), and it didn't help with the issue; also, the
>> issue happens on uniprocessor mode too (thus no remote CPU involved.)
>
> I guess you didn't try it with page_mapcount + 1 == page_count at that
> time?  Anyway, I agree we don't need a TLB flush here like KSM.
> I think the reason KSM does a TLB flush before the check is to
> make sure a trap triggers on a write from a user process on another core.
> However, in this MADV_FREE case, HW already guarantees the trap.
> Please see below.
>
>> 
>> 
>> >
>> > Something like this.
>> >
>> > diff --git a/mm/rmap.c b/mm/rmap.c
>> > index b0fd9dc19eba..b4ad9faa17b2 100644
>> > --- a/mm/rmap.c
>> > +++ b/mm/rmap.c
>> > @@ -1599,18 +1599,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>> >
>> >                         /* MADV_FREE page check */
>> >                         if (!PageSwapBacked(page)) {
>> > -                               int refcount = page_ref_count(page);
>> > -
>> > -                               /*
>> > -                                * The only page refs must be from the isolation
>> > -                                * (checked by the caller shrink_page_list() too)
>> > -                                * and the (single) rmap (dropped by discard:).
>> > -                                *
>> > -                                * Check the reference count before dirty flag
>> > -                                * with memory barrier; see __remove_mapping().
>> > -                                */
>> > -                               smp_rmb();
>> > -                               if (refcount == 2 && !PageDirty(page)) {
>> > +                               if (!PageDirty(page) &&
>> > +                                       page_mapcount(page) + 1 == page_count(page)) {
>> 
>> In the interest of avoiding a different race/bug, it seemed worth following the
>> suggestion outlined in __remove_mapping(), i.e., checking PageDirty()
>> after the page's reference count, with a memory barrier in between.
>
> True so it means your patch as-is is good for me.

If my understanding is correct, a shared anonymous page will be mapped
read-only.  If so, can SetPageDirty() be called concurrently on a
private anonymous page after the direct IO case has been dealt with
by comparing page_count()/page_mapcount()?
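
To make the check under discussion concrete, here is a minimal userspace
sketch (not kernel code) of the condition from the patch quoted above.
Everything in it is an assumption for illustration: struct fake_page,
fake_page_init() and madv_free_can_discard() are hypothetical names, the
three fields merely stand in for page_ref_count(), page_mapcount() and
PageDirty(), and the C11 acquire fence only approximates smp_rmb().  It
shows the arithmetic and the read ordering, and does not model the races
raised in this thread.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for the struct page state discussed. */
struct fake_page {
	atomic_int ref_count;	/* models page_ref_count() */
	atomic_int map_count;	/* models page_mapcount() */
	atomic_bool dirty;	/* models PageDirty() */
};

/* Illustration-only helper: set the three modelled fields. */
static void fake_page_init(struct fake_page *p, int refs, int maps, bool dirty)
{
	assert(p != NULL);
	atomic_init(&p->ref_count, refs);
	atomic_init(&p->map_count, maps);
	atomic_init(&p->dirty, dirty);
}

/*
 * Mirrors the shape of the patched MADV_FREE check: the only references
 * allowed are one from isolation plus one per rmap, and the dirty flag
 * is read after the counts.
 */
static bool madv_free_can_discard(struct fake_page *p)
{
	int ref_count = atomic_load_explicit(&p->ref_count,
					     memory_order_relaxed);
	int map_count = atomic_load_explicit(&p->map_count,
					     memory_order_relaxed);

	/* Approximates smp_rmb(): count loads are ordered before the dirty load. */
	atomic_thread_fence(memory_order_acquire);

	return (ref_count - 1 == map_count) &&
	       !atomic_load_explicit(&p->dirty, memory_order_relaxed);
}
```

With one mapping, a clean page with ref_count == 2 passes, while an
extra reference (for example one held by direct IO) gives ref_count == 3
and fails the check, as does a dirty page.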

Best Regards,
Huang, Ying


