On Mon, Jun 5, 2023 at 12:20 PM Peter Xu <peterx@xxxxxxxxxx> wrote: > > On Mon, Jun 05, 2023 at 11:46:04AM -0700, Yang Shi wrote: > > On Fri, Jun 2, 2023 at 4:06 PM Peter Xu <peterx@xxxxxxxxxx> wrote: > > > > > > For most of the page walk paths, logically it'll always be good to have the > > > pmd retries if hit pmd_trans_unstable() race. We can treat it as none > > > pmd (per comment above pmd_trans_unstable()), but in most cases we're not > > > even treating that as a none pmd. If to fix it anyway, a retry will be the > > > most accurate. > > > > > > I've went over all the pmd_trans_unstable() special cases and this patch > > > should cover all the rest places where we should retry properly with > > > unstable pmd. With the newly introduced ACTION_AGAIN since 2020 we can > > > easily achieve that. > > > > > > These are the call sites that I think should be fixed with it: > > > > > > *** fs/proc/task_mmu.c: > > > smaps_pte_range[634] if (pmd_trans_unstable(pmd)) > > > clear_refs_pte_range[1194] if (pmd_trans_unstable(pmd)) > > > pagemap_pmd_range[1542] if (pmd_trans_unstable(pmdp)) > > > gather_pte_stats[1891] if (pmd_trans_unstable(pmd)) > > > *** mm/memcontrol.c: > > > mem_cgroup_count_precharge_pte_range[6024] if (pmd_trans_unstable(pmd)) > > > mem_cgroup_move_charge_pte_range[6244] if (pmd_trans_unstable(pmd)) > > > *** mm/memory-failure.c: > > > hwpoison_pte_range[794] if (pmd_trans_unstable(pmdp)) > > > *** mm/mempolicy.c: > > > queue_folios_pte_range[517] if (pmd_trans_unstable(pmd)) > > > *** mm/madvise.c: > > > madvise_cold_or_pageout_pte_range[425] if (pmd_trans_unstable(pmd)) > > > madvise_free_pte_range[625] if (pmd_trans_unstable(pmd)) > > > > > > IIUC most of them may or may not be a big issue even without a retry, > > > either because they're already not strict (smaps, pte_stats, MADV_COLD, > > > .. it can mean e.g. the statistic may be inaccurate or one less 2M chunk to > > > cold worst case), but some of them could have functional error without the > > > retry afaiu (e.g. pagemap, where we can have the output buffer shifted over > > > the unstable pmd range.. so IIUC the pagemap result can be wrong). > > > > > > While these call sites all look fine, and don't need any change: > > > > > > *** include/linux/pgtable.h: > > > pmd_devmap_trans_unstable[1418] return pmd_devmap(*pmd) || pmd_trans_unstable(pmd); > > > *** mm/gup.c: > > > follow_pmd_mask[695] if (pmd_trans_unstable(pmd)) > > > *** mm/mapping_dirty_helpers.c: > > > wp_clean_pmd_entry[131] if (!pmd_trans_unstable(&pmdval)) > > > *** mm/memory.c: > > > do_anonymous_page[4060] if (unlikely(pmd_trans_unstable(vmf->pmd))) > > > *** mm/migrate_device.c: > > > migrate_vma_insert_page[616] if (unlikely(pmd_trans_unstable(pmdp))) > > > *** mm/mincore.c: > > > mincore_pte_range[116] if (pmd_trans_unstable(pmd)) { > > > > > > Signed-off-by: Peter Xu <peterx@xxxxxxxxxx> > > > --- > > > fs/proc/task_mmu.c | 17 +++++++++++++---- > > > mm/madvise.c | 8 ++++++-- > > > mm/memcontrol.c | 8 ++++++-- > > > mm/memory-failure.c | 4 +++- > > > mm/mempolicy.c | 4 +++- > > > 5 files changed, 31 insertions(+), 10 deletions(-) > > > > > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c > > > index 6259dd432eeb..823eaba5c6bf 100644 > > > --- a/fs/proc/task_mmu.c > > > +++ b/fs/proc/task_mmu.c > > > @@ -631,8 +631,11 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, > > > goto out; > > > } > > > > > > - if (pmd_trans_unstable(pmd)) > > > + if (pmd_trans_unstable(pmd)) { > > > + walk->action = ACTION_AGAIN; > > > goto out; > > > + } > > > + > > > /* > > > * The mmap_lock held all the way back in m_start() is what > > > * keeps khugepaged out of here and from collapsing things > > > @@ -1191,8 +1194,10 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr, > > > return 0; > > > } > > > > > > - if (pmd_trans_unstable(pmd)) > > > + if (pmd_trans_unstable(pmd)) { > > > + walk->action = ACTION_AGAIN; > > > return 0; > > > + } > > > > > > pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl); > > > for (; addr != end; pte++, addr += PAGE_SIZE) { > > > @@ -1539,8 +1544,10 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end, > > > return err; > > > } > > > > > > - if (pmd_trans_unstable(pmdp)) > > > + if (pmd_trans_unstable(pmdp)) { > > > + walk->action = ACTION_AGAIN; > > > return 0; > > > > Had a quick look at the pagemap code, I agree with your analysis, > > "returning 0" may mess up pagemap, retry should be fine. But I'm > > wondering whether we should just fill in empty entries. Anyway I don't > > have a strong opinion on this, just a little bit concerned by > > potential indefinite retry. > > Yes, none pte is still an option. But if we're going to fix this anyway, > it seems better to fix it with the accurate new thing that poped up, and > it's even less change (just apply walk->action rather than doing random > stuff in different call sites). > > I see that you have worry on deadloop over this, so I hope to discuss > altogether here. > > Unlike normal checks, pmd_trans_unstable() check means something must have > changed in the past very short period or it should just never if nothing > changed concurrently from under us, so it's not a "if (flag==true)" check > which is even more likely to loop. > > If we see the places that I didn't touch, most of them suggested a retry in > one form or another. So if there's a worry this will also not the first > time to do a retry (and for such a "unstable" API, that's really the most > natural thing to do which is to retry until it's stable). IIUC other than do_anonymous_page() suggests retry (retry page fault), others may not, for example: - follow_pmd_mask: return -EBUSY - wp_clean_pmd_entry: actually just retry for pmd_none case, but the pagewalk code does handle pmd_none by skipping it, so it basically just retry once - min_core_pte_range: treated as unmapped range by calling __mincore_unmapped_range Anyway I really don't have a strong opinion on this. I may be just over-concerned. I just thought if nobody cares whether the result is accurate or not, why do we bother fixing those cases? > > So in general, it seems to me if we deadloop over pmd_trans_unstable() for > whatever reason then something more wrong could have happened.. > > Thanks, > > -- > Peter Xu >