Re: [PATCH v2] mm: fix race between MADV_FREE reclaim and blkdev direct IO read

Yu Zhao <yuzhao@xxxxxxxxxx> · Thu, 13 Jan 2022 00:29:51 -0700

On Wed, Jan 12, 2022 at 2:53 PM Mauricio Faria de Oliveira
<mfo@xxxxxxxxxxxxx> wrote:
>
> Hi Minchan Kim,
>
> Thanks for handling the hard questions! :)
>
> On Wed, Jan 12, 2022 at 2:33 PM Minchan Kim <minchan@xxxxxxxxxx> wrote:
> >
> > On Wed, Jan 12, 2022 at 09:46:23AM +0800, Huang, Ying wrote:
> > > Yu Zhao <yuzhao@xxxxxxxxxx> writes:
> > >
> > > > On Wed, Jan 05, 2022 at 08:34:40PM -0300, Mauricio Faria de Oliveira wrote:
> > > >> diff --git a/mm/rmap.c b/mm/rmap.c
> > > >> index 163ac4e6bcee..8671de473c25 100644
> > > >> --- a/mm/rmap.c
> > > >> +++ b/mm/rmap.c
> > > >> @@ -1570,7 +1570,20 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> > > >>
> > > >>                    /* MADV_FREE page check */
> > > >>                    if (!PageSwapBacked(page)) {
> > > >> -                          if (!PageDirty(page)) {
> > > >> +                          int ref_count = page_ref_count(page);
> > > >> +                          int map_count = page_mapcount(page);
> > > >> +
> > > >> +                          /*
> > > >> +                           * The only page refs must be from the isolation
> > > >> +                           * (checked by the caller shrink_page_list() too)
> > > >> +                           * and one or more rmap's (dropped by discard:).
> > > >> +                           *
> > > >> +                           * Check the reference count before dirty flag
> > > >> +                           * with memory barrier; see __remove_mapping().
> > > >> +                           */
> > > >> +                          smp_rmb();
> > > >> +                          if ((ref_count - 1 == map_count) &&
> > > >> +                              !PageDirty(page)) {
> > > >>                                    /* Invalidate as we cleared the pte */
> > > >>                                    mmu_notifier_invalidate_range(mm,
> > > >>                                            address, address + PAGE_SIZE);
> > > >
> > > > Out of curiosity, how does it work with COW in terms of reordering?
> > > > Specifically, it seems to me get_page() and page_dup_rmap() in
> > > > copy_present_pte() can happen in any order, and if page_dup_rmap()
> > > > is seen first, and direct io is holding a refcnt, this check can still
> > > > pass?
> > >
> > > I think that you are correct.
> > >
> > > After more thoughts, it appears very tricky to compare page count and
> > > map count.  Even if we have added smp_rmb() between page_ref_count() and
> > > page_mapcount(), an interrupt may happen between them.  During the
> > > interrupt, the page count and map count may be changed, for example,
> > > unmapped, or do_swap_page().
> >
> > Yeah, it happens but what specific problem are you concerning from the
> > count change under race? The fork case Yu pointed out was already known
> > for breaking DIO so user should take care not to fork under DIO(Please
> > look at O_DIRECT section in man 2 open). If you could give a specific
> > example, it would be great to think over the issue.
> >
> > I agree it's little tricky but it seems to be way other place has used
> > for a long time(Please look at write_protect_page in ksm.c).
>
> Ah, that's great to see it's being used elsewhere, for DIO particularly!
>
> > So, here what we missing is tlb flush before the checking.
>
> That shouldn't be required for this particular issue/case, IIUIC.
> One of the things we checked early on was disabling deferred TLB flush
> (similarly to what you've done), and it didn't help with the issue; also, the
> issue happens on uniprocessor mode too (thus no remote CPU involved.)

Fast gup doesn't block tlb flush; it only blocks IPI used when freeing
page tables. So it's expected that forcing a tlb flush doesn't fix the
problem.

But it still seems to me the fix is missing smp_mb(). IIUC, a proper
fix would require, on the dio side
inc page refcnt
smp_mb()
read pte

and on the rmap side
clear pte
smp_mb()
read page refcnt

try_grab_compound_head() implies smp_mb, but i don't think
ptep_get_and_clear() does.

mapcount, as Minchan said, probably is irrelevant given dio is already
known to be broken with fork.

I glanced at the thread and thought it might be worth menthing.