On Wed, 9 Aug 2023 at 08:16, Muhammad Usama Anjum
<usama.anjum@xxxxxxxxxxxxx> wrote:
> This IOCTL, PAGEMAP_SCAN on pagemap file can be used to get and/or clear
> the info about page table entries. The following operations are supported
> in this ioctl:
> - Get the information if the pages have Async Write-Protection enabled
>   (``PAGE_IS_WPALLOWED``), have been written to (``PAGE_IS_WRITTEN``), file
>   mapped (``PAGE_IS_FILE``), present (``PAGE_IS_PRESENT``), swapped
>   (``PAGE_IS_SWAPPED``) or page has pfn zero (``PAGE_IS_PFNZERO``).
> - Find pages which have been written to and/or write protect
>   (atomic ``PM_SCAN_WP_MATCHING + PM_SCAN_CHECK_WPASYNC``) the pages
>   atomically. The (``PM_SCAN_WP_MATCHING``) is used to WP the matched
>   pages. The (``PM_SCAN_CHECK_WPASYNC``) aborts the operation if
>   non-Async-Write-Protected pages are found. Get is automatically performed
>   if output buffer is specified.
>
> This IOCTL can be extended to get information about more PTE bits. The
> entire address range passed by user [start, end) is scanned until either
> the user provided buffer is full or max_pages have been found.
>
> Reviewed-by: Andrei Vagin <avagin@xxxxxxxxx>
> Reviewed-by: Michał Mirosław <mirq-linux@xxxxxxxxxxxx>

Still applies, thanks! Please find some mostly-polishing comments below.

> Signed-off-by: Michał Mirosław <mirq-linux@xxxxxxxxxxxx>
> Signed-off-by: Muhammad Usama Anjum <usama.anjum@xxxxxxxxxxxxx>
> ---
> Changes in v28:
> - Fix walk_end one last time after doing thorough testing
>
> Changes in v27:
> - Add PAGE_IS_HUGE

It seems to be missing from the commitmsg, though. But maybe listing
all the flags there is redundant, since a doc is coming anyway and the
values are listed in the .h?

[...]

> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
[...]
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +static unsigned long pagemap_thp_category(pmd_t pmd)
> +{
> +	unsigned long categories = PAGE_IS_HUGE;
> +
> +	if (pmd_present(pmd)) {
> +		categories |= PAGE_IS_PRESENT;
> +		if (!pmd_uffd_wp(pmd))
> +			categories |= PAGE_IS_WRITTEN;
> +		if (is_zero_pfn(pmd_pfn(pmd)))
> +			categories |= PAGE_IS_PFNZERO;
> +	} else if (is_swap_pmd(pmd)) {
> +		categories |= PAGE_IS_SWAPPED;
> +		if (!pmd_swp_uffd_wp(pmd))
> +			categories |= PAGE_IS_WRITTEN;
> +	}
> +
> +	return categories;
> +}

I guess THPs can't be file-backed currently, but can we somehow mark
this assumption so it can be easily found if the capability arrives?
(Sketch below.)

> +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> +
> +#ifdef CONFIG_HUGETLB_PAGE
> +static unsigned long pagemap_hugetlb_category(pte_t pte)
> +{
> +	unsigned long categories = PAGE_IS_HUGE;
> +
> +	if (pte_present(pte)) {
> +		categories |= PAGE_IS_PRESENT;
> +		if (!huge_pte_uffd_wp(pte))
> +			categories |= PAGE_IS_WRITTEN;
> +		if (!PageAnon(pte_page(pte)))
> +			categories |= PAGE_IS_FILE;
> +		if (is_zero_pfn(pte_pfn(pte)))
> +			categories |= PAGE_IS_PFNZERO;
> +	} else if (is_swap_pte(pte)) {
> +		categories |= PAGE_IS_SWAPPED;
> +		if (!pte_swp_uffd_wp_any(pte))
> +			categories |= PAGE_IS_WRITTEN;
> +	}

BTW, can a HugeTLB page be file-backed and swapped out?
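
The sketch promised above, for marking the no-file-THP assumption in
pagemap_thp_category() — maybe just a comment next to the present-page
checks is enough (untested wording, of course up to you):

	if (pmd_present(pmd)) {
		categories |= PAGE_IS_PRESENT;
		/*
		 * PAGE_IS_FILE is never set here: we assume the THPs
		 * walked are anonymous-only. If file-backed THPs ever
		 * show up in this walk, add a PageAnon() check like in
		 * pagemap_hugetlb_category().
		 */
		if (!pmd_uffd_wp(pmd))
			categories |= PAGE_IS_WRITTEN;

That way `git grep PageAnon` or a search for PAGE_IS_FILE would find it.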
> +static void pagemap_scan_backout_range(struct pagemap_scan_private *p,
> +				       unsigned long addr, unsigned long end,
> +				       unsigned long walk_end_addr)
> +{
> +	struct page_region *cur_buf = &p->cur_buf;
> +
> +	if (cur_buf->start != addr)
> +		cur_buf->end = addr;
> +	else
> +		cur_buf->start = cur_buf->end = 0;
> +
> +	p->walk_end_addr = walk_end_addr;
> +	p->found_pages -= (end - addr) / PAGE_SIZE;
> +}

When would `walk_end_addr` be different from `end`? Maybe it would be
easier to understand if the `p->walk_end_addr` update was pushed to the
callers? (Actually: the one that wants to change it.)

> +static int pagemap_scan_output(unsigned long categories,
> +			       struct pagemap_scan_private *p,
> +			       unsigned long addr, unsigned long *end)
> +{
> +	unsigned long n_pages, total_pages;
> +	int ret = 0;
> +
> +	if (!p->vec_buf)
> +		return 0;
> +
> +	categories &= p->arg.return_mask;
> +
> +	n_pages = (*end - addr) / PAGE_SIZE;
> +	if (check_add_overflow(p->found_pages, n_pages, &total_pages) || //TODO
> +	    total_pages > p->arg.max_pages) {

Re: TODO: Is there anything left to change here?

> +		size_t n_too_much = total_pages - p->arg.max_pages;
> +		*end -= n_too_much * PAGE_SIZE;
> +		n_pages -= n_too_much;
> +		ret = -ENOSPC;
> +	}

[...]

> +static int pagemap_scan_thp_entry(pmd_t *pmd, unsigned long start,
> +				  unsigned long end, struct mm_walk *walk)
> +{
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +	struct pagemap_scan_private *p = walk->private;
> +	struct vm_area_struct *vma = walk->vma;
> +	unsigned long categories;
> +	spinlock_t *ptl;
> +	int ret = 0;
> +
> +	ptl = pmd_trans_huge_lock(pmd, vma);
> +	if (!ptl)
> +		return -ENOENT;
> +
> +	categories = p->cur_vma_category | pagemap_thp_category(*pmd);
> +
> +	if (!pagemap_scan_is_interesting_page(categories, p))
> +		goto out_unlock;
> +
> +	ret = pagemap_scan_output(categories, p, start, &end);
> +	if (start == end)
> +		goto out_unlock;
> +
> +	if (~p->arg.flags & PM_SCAN_WP_MATCHING)
> +		goto out_unlock;
> +	if (~categories & PAGE_IS_WRITTEN)
> +		goto out_unlock;
> +
> +	/*
> +	 * Break huge page into small pages if the WP operation
> +	 * need to be performed is on a portion of the huge page.

"needs to be performed on ..."

> +	 */
> +	if (end != start + HPAGE_SIZE) {
> +		spin_unlock(ptl);
> +		split_huge_pmd(vma, pmd, start);
> +		pagemap_scan_backout_range(p, start, end, 0);
> +		/* Indicate to caller for processing these as normal pages */

Nit: "Report as if there was no THP." ?
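
To illustrate the walk_end_addr suggestion from earlier in this hunk,
something like (untested sketch):

	static void pagemap_scan_backout_range(struct pagemap_scan_private *p,
					       unsigned long addr, unsigned long end)
	{
		struct page_region *cur_buf = &p->cur_buf;

		if (cur_buf->start != addr)
			cur_buf->end = addr;
		else
			cur_buf->start = cur_buf->end = 0;

		p->found_pages -= (end - addr) / PAGE_SIZE;
	}

with only the THP-split caller touching walk_end_addr:

		pagemap_scan_backout_range(p, start, end);
		p->walk_end_addr = 0;	/* report as if there was no THP */

Then the common case (walk_end_addr == end) doesn't need explaining.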
> +static int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
> +				  unsigned long end, struct mm_walk *walk)
> +{
> +	struct pagemap_scan_private *p = walk->private;
> +	struct vm_area_struct *vma = walk->vma;
> +	pte_t *pte, *start_pte;
> +	unsigned long addr;
> +	bool flush = false;
> +	spinlock_t *ptl;
> +	int ret;
> +
> +	arch_enter_lazy_mmu_mode();
> +
> +	ret = pagemap_scan_thp_entry(pmd, start, end, walk);
> +	if (ret != -ENOENT) {
> +		arch_leave_lazy_mmu_mode();
> +		return ret;
> +	}
> +
> +	ret = 0;
> +	start_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, start, &ptl);
> +	if (!pte) {
> +		arch_leave_lazy_mmu_mode();
> +		walk->action = ACTION_AGAIN;
> +		return 0;
> +	}
> +
> +	for (addr = start; addr != end; pte++, addr += PAGE_SIZE) {
> +		unsigned long categories = p->cur_vma_category |
> +			pagemap_page_category(p, vma, addr, ptep_get(pte));
> +		unsigned long next = addr + PAGE_SIZE;
> +
> +		if (!pagemap_scan_is_interesting_page(categories, p))
> +			continue;
> +
> +		ret = pagemap_scan_output(categories, p, addr, &next);
> +		if (next == addr)
> +			break;
> +
> +		if (~p->arg.flags & PM_SCAN_WP_MATCHING)
> +			continue;
> +		if (~categories & PAGE_IS_WRITTEN)
> +			continue;
> +
> +		make_uffd_wp_pte(vma, addr, pte);
> +		if (!flush) {
> +			start = addr;
> +			flush = true;
> +		}

A quick improvement:

	/* instead of the `flush` declaration: */
	unsigned long flush_end = 0;

	/* instead of the `if (!flush)` block: */
	if (!flush_end)
		start = addr;
	flush_end = next;

> +	}
> +
> +	if (flush)
> +		flush_tlb_range(vma, start, addr);

And here:

	if (flush_end)
		flush_tlb_range(vma, start, flush_end);

> +	pte_unmap_unlock(start_pte, ptl);
> +	arch_leave_lazy_mmu_mode();
> +
> +	cond_resched();
> +	return ret;
> +}

[...]

> +static long do_pagemap_scan(struct mm_struct *mm, unsigned long uarg)
> +{
> +	struct mmu_notifier_range range;
> +	struct pagemap_scan_private p;
> +	unsigned long walk_start;
> +	size_t n_ranges_out = 0;
> +	int ret;
> +
> +	memset(&p, 0, sizeof(p));
> +	ret = pagemap_scan_get_args(&p.arg, uarg);
> +	if (ret)
> +		return ret;
> +
> +	p.masks_of_interest = MASKS_OF_INTEREST(p.arg);

Please inline the macro, as this is its only use.

[...]

> +	walk_start = p.arg.start;
> +	for (; walk_start < p.arg.end; walk_start = p.arg.walk_end) {

Nit: the initialization statement can now be part of the for().

> +		int n_out;
> +
> +		if (fatal_signal_pending(current)) {
> +			ret = -EINTR;
> +			break;
> +		}
> +
> +		ret = mmap_read_lock_killable(mm);
> +		if (ret)
> +			break;
> +		ret = walk_page_range(mm, walk_start, p.arg.end,
> +				      &pagemap_scan_ops, &p);
> +		mmap_read_unlock(mm);
> +
> +		n_out = pagemap_scan_flush_buffer(&p);
> +		if (n_out < 0)
> +			ret = n_out;
> +		else
> +			n_ranges_out += n_out;
> +
> +		if (!ret)
> +			p.walk_end_addr = p.arg.end;
> +
> +		if (ret != -ENOSPC || p.arg.vec_len - 1 == 0 ||
> +		    p.found_pages == p.arg.max_pages)
> +			break;

The second condition is equivalent to `p.arg.vec_len == 1`, but why is
that an ending condition? Isn't the last entry enough to gather one
more range? (The walk could have returned -ENOSPC because the buffer
was full, and after flushing it could continue with the last free
entry.)

[...]
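
For clarity, here is how the flush_end fragments suggested earlier
would fit together in pagemap_scan_pmd_entry() (sketch only, other
lines unchanged):

		unsigned long flush_end = 0;
		...
		for (addr = start; addr != end; pte++, addr += PAGE_SIZE) {
			...
			make_uffd_wp_pte(vma, addr, pte);
			if (!flush_end)
				start = addr;
			flush_end = next;
		}

		if (flush_end)
			flush_tlb_range(vma, start, flush_end);

One variable instead of two, and the flush covers exactly the range
that was write-protected even when the last iteration truncated `next`.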
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -259,6 +259,7 @@ long hugetlb_change_protection(struct vm_area_struct *vma,
>  			       unsigned long cp_flags);
>
>  bool is_hugetlb_entry_migration(pte_t pte);
> +bool is_hugetlb_entry_hwpoisoned(pte_t pte);
>  void hugetlb_unshare_all_pmds(struct vm_area_struct *vma);
>
>  #else /* !CONFIG_HUGETLB_PAGE */
> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> index b7b56871029c5..1c9d38af1015e 100644
> --- a/include/uapi/linux/fs.h
> +++ b/include/uapi/linux/fs.h
> @@ -305,4 +305,63 @@ typedef int __bitwise __kernel_rwf_t;
>  #define RWF_SUPPORTED	(RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
>  			 RWF_APPEND)
>
> +/* Pagemap ioctl */
> +#define PAGEMAP_SCAN	_IOWR('f', 16, struct pm_scan_arg)
> +
> +/* Bits are set in flags of the page_region and masks in pm_scan_args */

"Bitmasks provided in pm_scan_args masks and reported in
page_region.categories." ?

> +#define PAGE_IS_WPALLOWED	(1 << 0)
> +#define PAGE_IS_WRITTEN		(1 << 1)
> +#define PAGE_IS_FILE		(1 << 2)
> +#define PAGE_IS_PRESENT		(1 << 3)
> +#define PAGE_IS_SWAPPED		(1 << 4)
> +#define PAGE_IS_PFNZERO		(1 << 5)
> +#define PAGE_IS_HUGE		(1 << 6)
> +
> +/*
> + * struct page_region - Page region with flags
> + * @start:	Start of the region
> + * @end:	End of the region (exclusive)
> + * @categories:	PAGE_IS_* category bitmask for the region
> + */
> +struct page_region {
> +	__u64 start;
> +	__u64 end;
> +	__u64 categories;
> +};
> +
> +/* Flags for PAGEMAP_SCAN ioctl */
> +#define PM_SCAN_WP_MATCHING	(1 << 0)	/* Write protect the pages matched. */
> +#define PM_SCAN_CHECK_WPASYNC	(1 << 1)	/* Abort the scan when a non-WP-enabled page is found. */
> +
> +/*
> + * struct pm_scan_arg - Pagemap ioctl argument
> + * @size:		Size of the structure
> + * @flags:		Flags for the IOCTL
> + * @start:		Starting address of the region
> + * @end:		Ending address of the region
> + * @walk_end		Address where the scan stopped (written by kernel).
> + *			walk_end == end informs that the scan completed on entire range.

Can we ensure this holds also for the tagged pointers?

> + * @vec:		Address of page_region struct array for output
> + * @vec_len:		Length of the page_region struct array
> + * @max_pages:		Optional limit for number of returned pages (0 = disabled)
> + * @category_inverted:	PAGE_IS_* categories which values match if 0 instead of 1
> + * @category_mask:	Skip pages for which any category doesn't match
> + * @category_anyof_mask: Skip pages for which no category matches
> + * @return_mask:	PAGE_IS_* categories that are to be reported in `page_region`s returned
> + */

[...]

Best Regards
Michał Mirosław
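
PS: For anyone wanting to try the interface, a rough userspace sketch
of the "get written pages + write-protect them" operation from the
commit message (untested, error handling elided; it assumes the uapi
definitions quoted above and that async WP was already enabled on the
range via userfaultfd):

	#include <fcntl.h>
	#include <stdint.h>
	#include <string.h>
	#include <sys/ioctl.h>
	#include <linux/fs.h>

	/*
	 * Scan [start, end): report regions of written pages into vec[]
	 * and atomically write-protect them again. Presumably returns the
	 * number of page_region entries filled, or a negative error.
	 */
	static int scan_and_wp(int pagemap_fd, void *start, void *end,
			       struct page_region *vec, __u64 vec_len)
	{
		struct pm_scan_arg arg;

		memset(&arg, 0, sizeof(arg));
		arg.size = sizeof(arg);
		arg.start = (uintptr_t)start;
		arg.end = (uintptr_t)end;
		arg.vec = (uintptr_t)vec;
		arg.vec_len = vec_len;
		arg.category_mask = PAGE_IS_WRITTEN;	/* only written pages */
		arg.return_mask = PAGE_IS_WRITTEN;
		arg.flags = PM_SCAN_WP_MATCHING | PM_SCAN_CHECK_WPASYNC;

		return ioctl(pagemap_fd, PAGEMAP_SCAN, &arg);
	}

	/* ... with pagemap_fd = open("/proc/self/pagemap", O_RDONLY); */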