Re: [PATCH 4/4] mm: Make most walk page paths with pmd_trans_unstable() to retry

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Mon, Jun 5, 2023 at 12:20 PM Peter Xu <peterx@xxxxxxxxxx> wrote:
>
> On Mon, Jun 05, 2023 at 11:46:04AM -0700, Yang Shi wrote:
> > On Fri, Jun 2, 2023 at 4:06 PM Peter Xu <peterx@xxxxxxxxxx> wrote:
> > >
> > > For most of the page walk paths, logically it'll always be good to have the
> > > pmd retries if hit pmd_trans_unstable() race.  We can treat it as none
> > > pmd (per comment above pmd_trans_unstable()), but in most cases we're not
> > > even treating that as a none pmd.  If to fix it anyway, a retry will be the
> > > most accurate.
> > >
> > > I've went over all the pmd_trans_unstable() special cases and this patch
> > > should cover all the rest places where we should retry properly with
> > > unstable pmd.  With the newly introduced ACTION_AGAIN since 2020 we can
> > > easily achieve that.
> > >
> > > These are the call sites that I think should be fixed with it:
> > >
> > > *** fs/proc/task_mmu.c:
> > > smaps_pte_range[634]           if (pmd_trans_unstable(pmd))
> > > clear_refs_pte_range[1194]     if (pmd_trans_unstable(pmd))
> > > pagemap_pmd_range[1542]        if (pmd_trans_unstable(pmdp))
> > > gather_pte_stats[1891]         if (pmd_trans_unstable(pmd))
> > > *** mm/memcontrol.c:
> > > mem_cgroup_count_precharge_pte_range[6024] if (pmd_trans_unstable(pmd))
> > > mem_cgroup_move_charge_pte_range[6244] if (pmd_trans_unstable(pmd))
> > > *** mm/memory-failure.c:
> > > hwpoison_pte_range[794]        if (pmd_trans_unstable(pmdp))
> > > *** mm/mempolicy.c:
> > > queue_folios_pte_range[517]    if (pmd_trans_unstable(pmd))
> > > *** mm/madvise.c:
> > > madvise_cold_or_pageout_pte_range[425] if (pmd_trans_unstable(pmd))
> > > madvise_free_pte_range[625]    if (pmd_trans_unstable(pmd))
> > >
> > > IIUC most of them may or may not be a big issue even without a retry,
> > > either because they're already not strict (smaps, pte_stats, MADV_COLD,
> > > .. it can mean e.g. the statistic may be inaccurate or one less 2M chunk to
> > > cold worst case), but some of them could have functional error without the
> > > retry afaiu (e.g. pagemap, where we can have the output buffer shifted over
> > > the unstable pmd range.. so IIUC the pagemap result can be wrong).
> > >
> > > While these call sites all look fine, and don't need any change:
> > >
> > > *** include/linux/pgtable.h:
> > > pmd_devmap_trans_unstable[1418] return pmd_devmap(*pmd) || pmd_trans_unstable(pmd);
> > > *** mm/gup.c:
> > > follow_pmd_mask[695]           if (pmd_trans_unstable(pmd))
> > > *** mm/mapping_dirty_helpers.c:
> > > wp_clean_pmd_entry[131]        if (!pmd_trans_unstable(&pmdval))
> > > *** mm/memory.c:
> > > do_anonymous_page[4060]        if (unlikely(pmd_trans_unstable(vmf->pmd)))
> > > *** mm/migrate_device.c:
> > > migrate_vma_insert_page[616]   if (unlikely(pmd_trans_unstable(pmdp)))
> > > *** mm/mincore.c:
> > > mincore_pte_range[116]         if (pmd_trans_unstable(pmd)) {
> > >
> > > Signed-off-by: Peter Xu <peterx@xxxxxxxxxx>
> > > ---
> > >  fs/proc/task_mmu.c  | 17 +++++++++++++----
> > >  mm/madvise.c        |  8 ++++++--
> > >  mm/memcontrol.c     |  8 ++++++--
> > >  mm/memory-failure.c |  4 +++-
> > >  mm/mempolicy.c      |  4 +++-
> > >  5 files changed, 31 insertions(+), 10 deletions(-)
> > >
> > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > > index 6259dd432eeb..823eaba5c6bf 100644
> > > --- a/fs/proc/task_mmu.c
> > > +++ b/fs/proc/task_mmu.c
> > > @@ -631,8 +631,11 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> > >                 goto out;
> > >         }
> > >
> > > -       if (pmd_trans_unstable(pmd))
> > > +       if (pmd_trans_unstable(pmd)) {
> > > +               walk->action = ACTION_AGAIN;
> > >                 goto out;
> > > +       }
> > > +
> > >         /*
> > >          * The mmap_lock held all the way back in m_start() is what
> > >          * keeps khugepaged out of here and from collapsing things
> > > @@ -1191,8 +1194,10 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
> > >                 return 0;
> > >         }
> > >
> > > -       if (pmd_trans_unstable(pmd))
> > > +       if (pmd_trans_unstable(pmd)) {
> > > +               walk->action = ACTION_AGAIN;
> > >                 return 0;
> > > +       }
> > >
> > >         pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> > >         for (; addr != end; pte++, addr += PAGE_SIZE) {
> > > @@ -1539,8 +1544,10 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
> > >                 return err;
> > >         }
> > >
> > > -       if (pmd_trans_unstable(pmdp))
> > > +       if (pmd_trans_unstable(pmdp)) {
> > > +               walk->action = ACTION_AGAIN;
> > >                 return 0;
> >
> > Had a quick look at the pagemap code, I agree with your analysis,
> > "returning 0" may mess up pagemap, retry should be fine. But I'm
> > wondering whether we should just fill in empty entries. Anyway I don't
> > have a  strong opinion on this, just a little bit concerned by
> > potential indefinite retry.
>
> Yes, none pte is still an option.  But if we're going to fix this anyway,
> it seems better to fix it with the accurate new thing that poped up, and
> it's even less change (just apply walk->action rather than doing random
> stuff in different call sites).
>
> I see that you have worry on deadloop over this, so I hope to discuss
> altogether here.
>
> Unlike normal checks, pmd_trans_unstable() check means something must have
> changed in the past very short period or it should just never if nothing
> changed concurrently from under us, so it's not a "if (flag==true)" check
> which is even more likely to loop.
>
> If we see the places that I didn't touch, most of them suggested a retry in
> one form or another.  So if there's a worry this will also not the first
> time to do a retry (and for such a "unstable" API, that's really the most
> natural thing to do which is to retry until it's stable).

IIUC other than do_anonymous_page() suggests retry (retry page fault),
others may not, for example:
  - follow_pmd_mask: return -EBUSY
  - wp_clean_pmd_entry: actually just retry for pmd_none case, but the
pagewalk code does handle pmd_none by skipping it, so it basically
just retry once
  - min_core_pte_range: treated as unmapped range by calling
__mincore_unmapped_range

Anyway I really don't have a strong opinion on this. I may be just
over-concerned. I just thought if nobody cares whether the result is
accurate or not, why do we bother fixing those cases?

>
> So in general, it seems to me if we deadloop over pmd_trans_unstable() for
> whatever reason then something more wrong could have happened..
>
> Thanks,
>
> --
> Peter Xu
>





[Index of Archives]     [Linux ARM Kernel]     [Linux ARM]     [Linux Omap]     [Fedora ARM]     [IETF Annouce]     [Bugtraq]     [Linux OMAP]     [Linux MIPS]     [eCos]     [Asterisk Internet PBX]     [Linux API]

  Powered by Linux