On Tue, 2019-04-16 at 10:46 -0400, Jerome Glisse wrote:
> On Sat, Apr 13, 2019 at 08:34:02AM +0000, Thomas Hellstrom wrote:
> > Hi, Jérôme
> >
> > On Fri, 2019-04-12 at 17:07 -0400, Jerome Glisse wrote:
> > > On Fri, Apr 12, 2019 at 04:04:18PM +0000, Thomas Hellstrom wrote:
> > > > This is basically apply_to_page_range with added functionality:
> > > > Allocating missing parts of the page table becomes optional, which
> > > > means that the function can be guaranteed not to error if allocation
> > > > is disabled. Also passing of the closure struct and callback function
> > > > becomes different and more in line with how things are done elsewhere.
> > > >
> > > > Finally we keep apply_to_page_range as a wrapper around
> > > > apply_to_pfn_range
> > > >
> > > > The reason for not using the page-walk code is that we want to perform
> > > > the page-walk on vmas pointing to an address space without requiring
> > > > the mmap_sem to be held rather than on vmas belonging to a process with
> > > > the mmap_sem held.
> > > >
> > > > Notable changes since RFC:
> > > > Don't export apply_to_pfn range.
> > > >
> > > > Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
> > > > Cc: Matthew Wilcox <willy@xxxxxxxxxxxxx>
> > > > Cc: Will Deacon <will.deacon@xxxxxxx>
> > > > Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> > > > Cc: Rik van Riel <riel@xxxxxxxxxxx>
> > > > Cc: Minchan Kim <minchan@xxxxxxxxxx>
> > > > Cc: Michal Hocko <mhocko@xxxxxxxx>
> > > > Cc: Huang Ying <ying.huang@xxxxxxxxx>
> > > > Cc: Souptick Joarder <jrdr.linux@xxxxxxxxx>
> > > > Cc: "Jérôme Glisse" <jglisse@xxxxxxxxxx>
> > > > Cc: linux-mm@xxxxxxxxx
> > > > Cc: linux-kernel@xxxxxxxxxxxxxxx
> > > > Signed-off-by: Thomas Hellstrom <thellstrom@xxxxxxxxxx>
> > > > ---
> > > >  include/linux/mm.h |  10 ++++
> > > >  mm/memory.c        | 130 ++++++++++++++++++++++++++++++++++-----------
> > > >  2 files changed, 108 insertions(+), 32 deletions(-)
> > > >
> > > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > > index 80bb6408fe73..b7dd4ddd6efb 100644
> > > > --- a/include/linux/mm.h
> > > > +++ b/include/linux/mm.h
> > > > @@ -2632,6 +2632,16 @@ typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
> > > >  extern int apply_to_page_range(struct mm_struct *mm, unsigned long address,
> > > >                                 unsigned long size, pte_fn_t fn, void *data);
> > > >  
> > > > +struct pfn_range_apply;
> > > > +typedef int (*pter_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
> > > > +                         struct pfn_range_apply *closure);
> > > > +struct pfn_range_apply {
> > > > +        struct mm_struct *mm;
> > > > +        pter_fn_t ptefn;
> > > > +        unsigned int alloc;
> > > > +};
> > > > +extern int apply_to_pfn_range(struct pfn_range_apply *closure,
> > > > +                              unsigned long address, unsigned long size);
> > > >  
> > > >  #ifdef CONFIG_PAGE_POISONING
> > > >  extern bool page_poisoning_enabled(void);
> > > > diff --git a/mm/memory.c b/mm/memory.c
> > > > index a95b4a3b1ae2..60d67158964f 100644
> > > > --- a/mm/memory.c
> > > > +++ b/mm/memory.c
> > > > @@ -1938,18 +1938,17 @@ int vm_iomap_memory(struct vm_area_struct *vma, phys_addr_t start, unsigned long
> > > >  }
> > > >  EXPORT_SYMBOL(vm_iomap_memory);
> > > >  
> > > > -static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
> > > > -                              unsigned long addr, unsigned long end,
> > > > -                              pte_fn_t fn, void *data)
> > > > +static int apply_to_pte_range(struct pfn_range_apply *closure, pmd_t *pmd,
> > > > +                              unsigned long addr, unsigned long end)
> > > >  {
> > > >          pte_t *pte;
> > > >          int err;
> > > >          pgtable_t token;
> > > >          spinlock_t *uninitialized_var(ptl);
> > > >  
> > > > -        pte = (mm == &init_mm) ?
> > > > +        pte = (closure->mm == &init_mm) ?
> > > >                  pte_alloc_kernel(pmd, addr) :
> > > > -                pte_alloc_map_lock(mm, pmd, addr, &ptl);
> > > > +                pte_alloc_map_lock(closure->mm, pmd, addr, &ptl);
> > > >          if (!pte)
> > > >                  return -ENOMEM;
> > > >  
> > > > @@ -1960,86 +1959,107 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
> > > >          token = pmd_pgtable(*pmd);
> > > >  
> > > >          do {
> > > > -                err = fn(pte++, token, addr, data);
> > > > +                err = closure->ptefn(pte++, token, addr, closure);
> > > >                  if (err)
> > > >                          break;
> > > >          } while (addr += PAGE_SIZE, addr != end);
> > > >  
> > > >          arch_leave_lazy_mmu_mode();
> > > >  
> > > > -        if (mm != &init_mm)
> > > > +        if (closure->mm != &init_mm)
> > > >                  pte_unmap_unlock(pte-1, ptl);
> > > >          return err;
> > > >  }
> > > >  
> > > > -static int apply_to_pmd_range(struct mm_struct *mm, pud_t *pud,
> > > > -                              unsigned long addr, unsigned long end,
> > > > -                              pte_fn_t fn, void *data)
> > > > +static int apply_to_pmd_range(struct pfn_range_apply *closure, pud_t *pud,
> > > > +                              unsigned long addr, unsigned long end)
> > > >  {
> > > >          pmd_t *pmd;
> > > >          unsigned long next;
> > > > -        int err;
> > > > +        int err = 0;
> > > >  
> > > >          BUG_ON(pud_huge(*pud));
> > > >  
> > > > -        pmd = pmd_alloc(mm, pud, addr);
> > > > +        pmd = pmd_alloc(closure->mm, pud, addr);
> > > >          if (!pmd)
> > > >                  return -ENOMEM;
> > > > +
> > > >          do {
> > > >                  next = pmd_addr_end(addr, end);
> > > > -                err = apply_to_pte_range(mm, pmd, addr, next, fn, data);
> > > > +                if (!closure->alloc && pmd_none_or_clear_bad(pmd))
> > > > +                        continue;
> > > > +                err = apply_to_pte_range(closure, pmd, addr, next);
> > > >                  if (err)
> > > >                          break;
> > > >          } while (pmd++, addr = next, addr != end);
> > > >          return err;
> > > >  }
> > > >  
> > > > -static int apply_to_pud_range(struct mm_struct *mm, p4d_t *p4d,
> > > > -                              unsigned long addr, unsigned long end,
> > > > -                              pte_fn_t fn, void *data)
> > > > +static int apply_to_pud_range(struct pfn_range_apply *closure, p4d_t *p4d,
> > > > +                              unsigned long addr, unsigned long end)
> > > >  {
> > > >          pud_t *pud;
> > > >          unsigned long next;
> > > > -        int err;
> > > > +        int err = 0;
> > > >  
> > > > -        pud = pud_alloc(mm, p4d, addr);
> > > > +        pud = pud_alloc(closure->mm, p4d, addr);
> > > >          if (!pud)
> > > >                  return -ENOMEM;
> > > > +
> > > >          do {
> > > >                  next = pud_addr_end(addr, end);
> > > > -                err = apply_to_pmd_range(mm, pud, addr, next, fn, data);
> > > > +                if (!closure->alloc && pud_none_or_clear_bad(pud))
> > > > +                        continue;
> > > > +                err = apply_to_pmd_range(closure, pud, addr, next);
> > > >                  if (err)
> > > >                          break;
> > > >          } while (pud++, addr = next, addr != end);
> > > >          return err;
> > > >  }
> > > >  
> > > > -static int apply_to_p4d_range(struct mm_struct *mm, pgd_t *pgd,
> > > > -                              unsigned long addr, unsigned long end,
> > > > -                              pte_fn_t fn, void *data)
> > > > +static int apply_to_p4d_range(struct pfn_range_apply *closure, pgd_t *pgd,
> > > > +                              unsigned long addr, unsigned long end)
> > > >  {
> > > >          p4d_t *p4d;
> > > >          unsigned long next;
> > > > -        int err;
> > > > +        int err = 0;
> > > >  
> > > > -        p4d = p4d_alloc(mm, pgd, addr);
> > > > +        p4d = p4d_alloc(closure->mm, pgd, addr);
> > > >          if (!p4d)
> > > >                  return -ENOMEM;
> > > > +
> > > >          do {
> > > >                  next = p4d_addr_end(addr, end);
> > > > -                err = apply_to_pud_range(mm, p4d, addr, next, fn, data);
> > > > +                if (!closure->alloc && p4d_none_or_clear_bad(p4d))
> > > > +                        continue;
> > > > +                err = apply_to_pud_range(closure, p4d, addr, next);
> > > >                  if (err)
> > > >                          break;
> > > >          } while (p4d++, addr = next, addr != end);
> > > >          return err;
> > > >  }
> > > >  
> > > > -/*
> > > > - * Scan a region of virtual memory, filling in page tables as necessary
> > > > - * and calling a provided function on each leaf page table.
> > > > +/**
> > > > + * apply_to_pfn_range - Scan a region of virtual memory, calling a provided
> > > > + * function on each leaf page table entry
> > > > + * @closure: Details about how to scan and what function to apply
> > > > + * @addr: Start virtual address
> > > > + * @size: Size of the region
> > > > + *
> > > > + * If @closure->alloc is set to 1, the function will fill in the page table
> > > > + * as necessary. Otherwise it will skip non-present parts.
> > > > + * Note: The caller must ensure that the range does not contain huge pages.
> > > > + * The caller must also assure that the proper mmu_notifier functions are
> > > > + * called. Either in the pte leaf function or before and after the call to
> > > > + * apply_to_pfn_range.
> > >
> > > This is wrong there should be a big FAT warning that this can only be
> > > use against mmap of device file. The page table walking above is
> > > broken for various thing you might find in any other vma like THP,
> > > device pte, hugetlbfs,
> >
> > I was figuring since we didn't export the function anymore, the warning
> > and checks could be left to its users, assuming that any other future
> > usage of this function would require mm people audit anyway. But I can
> > of course add that warning also to this function if you still want
> > that?
>
> Yeah more warning are better, people might start using this, i know
> some poeple use unexported symbol and then report bugs while they
> just were doing something illegal.
>
> >
> > ...
> >
> > > Also the mmu notifier can not be call from the pfn callback as that
> > > callback happens under page table lock (the change_pte notifier
> > > callback is useless and not enough). So it _must_ happen around the
> > > call to apply_to_pfn_range
> >
> > In the comments I was having in mind usage of, for example
> > ptep_clear_flush_notify(). But you're the mmu_notifier expert here.
> > Are you saying that function by itself would not be sufficient?
> > In that case, should I just scratch the text mentioning the pte leaf
> > function?
>
> ptep_clear_flush_notify() is useless ... i have posted patches to either
> restore it or remove it. In any case you must call mmu notifier range and
> they can not happen under lock. You usage looked fine (in the next patch)
> but i would rather have a bit of comment here to make sure people are
> also aware of that.
>
> While we can hope that people would cc mm when using mm function, it is
> not always the case. So i rather be cautious and warn in comment as much
> as possible.
>

OK. Understood.
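Just to make sure I read you correctly, the calling pattern I have in
mind for the (non-exported) users is roughly the sketch below. The
helper names (my_clean_pte, my_apply_to_range) are made up for this
mail only, and the mmu_notifier_range_init() argument list is whatever
the target tree ends up with after your notifier rework; the point is
just that the range notification brackets the whole walk, and that the
leaf callback, which runs under the page table lock, makes no notifier
calls of its own:

static int my_clean_pte(pte_t *pte, pgtable_t token, unsigned long addr,
                        struct pfn_range_apply *pra)
{
        /*
         * Modify the pte here, under the page table lock.
         * No mmu notifier calls from this context.
         */
        return 0;
}

static int my_apply_to_range(struct mm_struct *mm, unsigned long start,
                             unsigned long size)
{
        struct mmu_notifier_range range;
        struct pfn_range_apply pra = {
                .mm = mm,
                .ptefn = my_clean_pte,
                .alloc = 0,     /* Skip non-present parts, never -ENOMEM. */
        };
        int err;

        mmu_notifier_range_init(&range, mm, start, start + size);
        mmu_notifier_invalidate_range_start(&range);
        err = apply_to_pfn_range(&pra, start, size);
        mmu_notifier_invalidate_range_end(&range);

        return err;
}

If that's the pattern you mean, I'll make the kerneldoc comment spell it
out like that instead of pointing at the leaf callback.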
All this actually makes me tend to want to try a bit harder using a
slight modification to the pagewalk code instead. Don't really want to
encourage two parallel code paths doing essentially the same thing; one
good and one bad.

One thing that confuses me a bit with the pagewalk code is that callers
(for example softdirty) typically call
mmu_notifier_invalidate_range_start() around the pagewalk, but then if
it ends up splitting a pmd, mmu_notifier_invalidate_range() is called
again, within the first range. Docs aren't really clear whether that's
permitted or not. Is it?

Thanks,
Thomas

> Cheers,
> Jérôme