On Mon, Jun 05, 2023 at 11:46:04AM -0700, Yang Shi wrote: > On Fri, Jun 2, 2023 at 4:06 PM Peter Xu <peterx@xxxxxxxxxx> wrote: > > > > For most of the page walk paths, logically it'll always be good to have the > > pmd retries if hit pmd_trans_unstable() race. We can treat it as none > > pmd (per comment above pmd_trans_unstable()), but in most cases we're not > > even treating that as a none pmd. If to fix it anyway, a retry will be the > > most accurate. > > > > I've went over all the pmd_trans_unstable() special cases and this patch > > should cover all the rest places where we should retry properly with > > unstable pmd. With the newly introduced ACTION_AGAIN since 2020 we can > > easily achieve that. > > > > These are the call sites that I think should be fixed with it: > > > > *** fs/proc/task_mmu.c: > > smaps_pte_range[634] if (pmd_trans_unstable(pmd)) > > clear_refs_pte_range[1194] if (pmd_trans_unstable(pmd)) > > pagemap_pmd_range[1542] if (pmd_trans_unstable(pmdp)) > > gather_pte_stats[1891] if (pmd_trans_unstable(pmd)) > > *** mm/memcontrol.c: > > mem_cgroup_count_precharge_pte_range[6024] if (pmd_trans_unstable(pmd)) > > mem_cgroup_move_charge_pte_range[6244] if (pmd_trans_unstable(pmd)) > > *** mm/memory-failure.c: > > hwpoison_pte_range[794] if (pmd_trans_unstable(pmdp)) > > *** mm/mempolicy.c: > > queue_folios_pte_range[517] if (pmd_trans_unstable(pmd)) > > *** mm/madvise.c: > > madvise_cold_or_pageout_pte_range[425] if (pmd_trans_unstable(pmd)) > > madvise_free_pte_range[625] if (pmd_trans_unstable(pmd)) > > > > IIUC most of them may or may not be a big issue even without a retry, > > either because they're already not strict (smaps, pte_stats, MADV_COLD, > > .. it can mean e.g. the statistic may be inaccurate or one less 2M chunk to > > cold worst case), but some of them could have functional error without the > > retry afaiu (e.g. pagemap, where we can have the output buffer shifted over > > the unstable pmd range.. so IIUC the pagemap result can be wrong). > > > > While these call sites all look fine, and don't need any change: > > > > *** include/linux/pgtable.h: > > pmd_devmap_trans_unstable[1418] return pmd_devmap(*pmd) || pmd_trans_unstable(pmd); > > *** mm/gup.c: > > follow_pmd_mask[695] if (pmd_trans_unstable(pmd)) > > *** mm/mapping_dirty_helpers.c: > > wp_clean_pmd_entry[131] if (!pmd_trans_unstable(&pmdval)) > > *** mm/memory.c: > > do_anonymous_page[4060] if (unlikely(pmd_trans_unstable(vmf->pmd))) > > *** mm/migrate_device.c: > > migrate_vma_insert_page[616] if (unlikely(pmd_trans_unstable(pmdp))) > > *** mm/mincore.c: > > mincore_pte_range[116] if (pmd_trans_unstable(pmd)) { > > > > Signed-off-by: Peter Xu <peterx@xxxxxxxxxx> > > --- > > fs/proc/task_mmu.c | 17 +++++++++++++---- > > mm/madvise.c | 8 ++++++-- > > mm/memcontrol.c | 8 ++++++-- > > mm/memory-failure.c | 4 +++- > > mm/mempolicy.c | 4 +++- > > 5 files changed, 31 insertions(+), 10 deletions(-) > > > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c > > index 6259dd432eeb..823eaba5c6bf 100644 > > --- a/fs/proc/task_mmu.c > > +++ b/fs/proc/task_mmu.c > > @@ -631,8 +631,11 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, > > goto out; > > } > > > > - if (pmd_trans_unstable(pmd)) > > + if (pmd_trans_unstable(pmd)) { > > + walk->action = ACTION_AGAIN; > > goto out; > > + } > > + > > /* > > * The mmap_lock held all the way back in m_start() is what > > * keeps khugepaged out of here and from collapsing things > > @@ -1191,8 +1194,10 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr, > > return 0; > > } > > > > - if (pmd_trans_unstable(pmd)) > > + if (pmd_trans_unstable(pmd)) { > > + walk->action = ACTION_AGAIN; > > return 0; > > + } > > > > pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl); > > for (; addr != end; pte++, addr += PAGE_SIZE) { > > @@ -1539,8 +1544,10 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end, > > return err; > > } > > > > - if (pmd_trans_unstable(pmdp)) > > + if (pmd_trans_unstable(pmdp)) { > > + walk->action = ACTION_AGAIN; > > return 0; > > Had a quick look at the pagemap code, I agree with your analysis, > "returning 0" may mess up pagemap, retry should be fine. But I'm > wondering whether we should just fill in empty entries. Anyway I don't > have a strong opinion on this, just a little bit concerned by > potential indefinite retry. Yes, none pte is still an option. But if we're going to fix this anyway, it seems better to fix it with the accurate new thing that poped up, and it's even less change (just apply walk->action rather than doing random stuff in different call sites). I see that you have worry on deadloop over this, so I hope to discuss altogether here. Unlike normal checks, pmd_trans_unstable() check means something must have changed in the past very short period or it should just never if nothing changed concurrently from under us, so it's not a "if (flag==true)" check which is even more likely to loop. If we see the places that I didn't touch, most of them suggested a retry in one form or another. So if there's a worry this will also not the first time to do a retry (and for such a "unstable" API, that's really the most natural thing to do which is to retry until it's stable). So in general, it seems to me if we deadloop over pmd_trans_unstable() for whatever reason then something more wrong could have happened.. Thanks, -- Peter Xu