On Tue, Aug 8, 2023 at 11:16 PM Muhammad Usama Anjum
<usama.anjum@xxxxxxxxxxxxx> wrote:
>
> This IOCTL, PAGEMAP_SCAN, on the pagemap file can be used to get and/or
> clear the info about page table entries. The following operations are
> supported in this ioctl:
> - Get the information if the pages have Async Write-Protection enabled
>   (``PAGE_IS_WPALLOWED``), have been written to (``PAGE_IS_WRITTEN``),
>   are file mapped (``PAGE_IS_FILE``), present (``PAGE_IS_PRESENT``),
>   swapped (``PAGE_IS_SWAPPED``) or have a zero PFN (``PAGE_IS_PFNZERO``).
> - Find pages which have been written to and/or write-protect the pages
>   atomically (``PM_SCAN_WP_MATCHING + PM_SCAN_CHECK_WPASYNC``).
>   ``PM_SCAN_WP_MATCHING`` is used to WP the matched pages.
>   ``PM_SCAN_CHECK_WPASYNC`` aborts the operation if
>   non-Async-Write-Protected pages are found. Get is performed
>   automatically if an output buffer is specified.
>
> This IOCTL can be extended to get information about more PTE bits. The
> entire address range passed by the user [start, end) is scanned until
> either the user-provided buffer is full or max_pages have been found.
>
> Reviewed-by: Andrei Vagin <avagin@xxxxxxxxx>
> Reviewed-by: Michał Mirosław <mirq-linux@xxxxxxxxxxxx>
> Signed-off-by: Michał Mirosław <mirq-linux@xxxxxxxxxxxx>
> Signed-off-by: Muhammad Usama Anjum <usama.anjum@xxxxxxxxxxxxx>
> ---
> Changes in v28:
> - Fix walk_end one last time after doing thorough testing
>
> Changes in v27:
> - Add PAGE_IS_HUGE
> - Iterate until the temporary buffer is full to do fewer iterations
> - Don't check PAGE_IS_FILE if no mask needs it as it is very expensive
>   to check per pte
> - Bring is_interesting_page() outside pagemap_scan_output() to remove
>   the horrible return value check
> - Replace memcpy() with direct copy
> - Rename end_addr to walk_end_addr in pagemap_scan_private
> - Abort walk if fatal_signal_pending()
>
> Changes in v26:
> Changes made by Usama:
> - Fix the wrong breaking of loop if page isn't interesting, skip instead
> - Untag the addresses and save them into struct
> - Round off the end address to the next page
> - Correct the partial hugetlb page handling and returning the error
> - Rename PAGE_IS_WPASYNC to PAGE_IS_WPALLOWED
> - Return walk ending address in walk_end instead of returning in start
>   as there is potential of replacing the memory tag
>
> Changes by Michał:
> 1. the API:
>    a. return ranges as {begin, end} instead of {begin + len};
>    b. rename match "flags" to 'page categories' everywhere - this makes
>       it easier to differentiate the ioctl()'s categorisation of pages
>       from struct page flags;
>    c. change {required + excluded} to {inverted + required}. This was
>       rejected before, but I'd like to illustrate the difference.
>       The old interface can be translated to the new one by:
>         categories_inverted = excluded_mask
>         categories_mask = required_mask | excluded_mask
>         categories_anyof_mask = anyof_mask
>       The new way allows filtering by: A & (B | !C)
>         categories_inverted = C
>         categories_mask = A
>         categories_anyof_mask = B | C
>    e. allow no-op calls
> 2. the implementation:
>    a. gather the page-categorising and write-protecting code in one
>       place;
>    b. optimization: add whole-vma skipping for the WP usecase;
>    c. extracted output limiting code to pagemap_scan_output();
>    d. extracted range coalescing to pagemap_scan_push_range();
>    e. extracted THP entry handling to pagemap_scan_thp_entry();
>    f. added a shortcut for non-WP hugetlb scan; avoids conditional
>       locking;
>    g. extracted scan buffer handling code out of do_pagemap_scan();
>    h. rework output code to always try to write pending ranges; if
>       EFAULT is generated it always overwrites the original error code;
>       (the case of SIGKILL is needlessly trying to write the output
>       now, but this should be a rare case and ignoring it means the
>       code does not need a goto)
> 3. Change the no-GET operation condition from `arg.return_mask == 0` to
>    `arg.vec == NULL`. This will allow issuing the ioctl with
>    return_mask == 0 to gather matching ranges when the exact category
>    is not interesting. (Anticipated for CRIU scanning a large sparse
>    anonymous mapping.)
>
> Changes in v25:
> - Do proper filtering on hole as well (hole got missed earlier)
>
> Changes in v24:
> - Place WP markers in case of hole as well
>
> Changes in v23:
> - Set vec_buf_index to 0 only when vec_buf_index is set
> - Return -EFAULT instead of -EINVAL if vec is NULL
> - Correctly return the walk ending address to the page granularity
>
> Changes in v22:
> - Interface change to return walk ending address to user:
>   - Replace [start, start + len) with [start, end)
>   - Return the ending address of the address walk in start
>
> Changes in v21:
> - Abort walk instead of returning error if WP is to be performed on
>   partial hugetlb
> - Changed the data types of some variables in pagemap_scan_private to
>   long
>
> Changes in v20:
> - Correct PAGE_IS_FILE and add PAGE_IS_PFNZERO
>
> Changes in v19:
> - Interface changes such as renaming, return mask and WP can be used
>   with any flags specified in masks
> - Internal code changes
>
> Changes in v18:
> - Rebased on top of next-20230613
> - ptep_get() updates
> - Remove pmd_trans_unstable() and add ACTION_AGAIN
> - Review updates (Michał)
>
> Changes in v17:
> - Rebased on next-20230606
> - Made make_uffd_wp_*_pte() better and minor changes
>
> Changes in v16:
> - Fixed a corner case where kernel writes beyond user buffer by one
>   element
> - Bring back exclusive PM_SCAN_OP_WP
> - Cosmetic changes
>
> Changes in v15:
> - Build fixes:
>   - Use generic tlb flush function in pagemap_scan_pmd_entry() instead
>     of using the x86-specific flush function in do_pagemap_scan()
>   - Remove #ifdef from pagemap_scan_hugetlb_entry()
>   - Use mm instead of undefined vma->vm_mm
>
> Changes in v14:
> - Fix build error caused by #ifdef added at last minute in some configs
>
> Changes in v13:
> - Review updates
> - mmap_read_lock_killable() instead of mmap_read_lock()
> - Replace uffd_wp_range() with helpers which increases performance
>   drastically for OP_WP operations by reducing the number of tlb
>   flushes etc.
> - Add MMU_NOTIFY_PROTECTION_VMA notification for the memory range
>
> Changes in v12:
> - Add hugetlb support to cover all memory types
> - Merge "userfaultfd: Define dummy uffd_wp_range()" with this patch
> - Review updates to the code
>
> Changes in v11:
> - Find written pages in a better way
> - Fix a corner case (thanks Paul)
> - Improve the code/comments
> - Remove ENGAGE_WP + !GET operation
> - Shorten the commit message in favour of moving documentation to
>   pagemap.rst
>
> Changes in v10:
> - Move changes in tools/include/uapi/linux/fs.h to separate patch
> - Update commit message
>
> Changes in v8:
> - Correct is_pte_uffd_wp()
> - Improve readability and error checks
> - Remove some un-needed code
>
> Changes in v7:
> - Rebase on top of latest next
> - Fix some corner cases
> - Base soft-dirty on the uffd wp async
> - Update the terminologies
> - Optimize the memory usage inside the ioctl
> ---
>  fs/proc/task_mmu.c      | 678 ++++++++++++++++++++++++++++++++++++++++
>  include/linux/hugetlb.h |   1 +
>  include/uapi/linux/fs.h |  59 ++++
>  mm/hugetlb.c            |   2 +-
>  4 files changed, 739 insertions(+), 1 deletion(-)
>
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index c1e6531cb02ae..0e219a44e97cd 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -19,6 +19,8 @@
>  #include <linux/shmem_fs.h>
>  #include <linux/uaccess.h>
>  #include <linux/pkeys.h>
> +#include <linux/minmax.h>
> +#include <linux/overflow.h>
>
>  #include <asm/elf.h>
>  #include <asm/tlb.h>
> @@ -1749,11 +1751,687 @@ static int pagemap_release(struct inode *inode, struct file *file)
>  	return 0;
>  }
>
> +#define PM_SCAN_CATEGORIES	(PAGE_IS_WPALLOWED | PAGE_IS_WRITTEN | \
> +				 PAGE_IS_FILE | PAGE_IS_PRESENT | \
> +				 PAGE_IS_SWAPPED | PAGE_IS_PFNZERO | \
> +				 PAGE_IS_HUGE)
> +#define PM_SCAN_FLAGS		(PM_SCAN_WP_MATCHING | PM_SCAN_CHECK_WPASYNC)
> +
> +#define MASKS_OF_INTEREST(a)	(a.category_inverted | a.category_mask | \
> +				 a.category_anyof_mask | a.return_mask)
> +
> +struct pagemap_scan_private {
> +	struct pm_scan_arg arg;
> +	unsigned long masks_of_interest, cur_vma_category;
> +	struct page_region *vec_buf, cur_buf;

I think we can remove cur_buf. IMHO, it makes the code a bit more
readable. Here is a quick PoC patch:
https://gist.github.com/avagin/2e465e7c362c515ec84d72a201a28de4

> +	unsigned long vec_buf_len, vec_buf_index, found_pages, walk_end_addr;
> +	struct page_region __user *vec_out;
> +};

...

> +#ifdef CONFIG_HUGETLB_PAGE
> +static int pagemap_scan_hugetlb_entry(pte_t *ptep, unsigned long hmask,
> +				      unsigned long start, unsigned long end,
> +				      struct mm_walk *walk)
> +{
> +	struct pagemap_scan_private *p = walk->private;
> +	struct vm_area_struct *vma = walk->vma;
> +	unsigned long categories;
> +	spinlock_t *ptl;
> +	int ret = 0;
> +	pte_t pte;
> +
> +	if (~p->arg.flags & PM_SCAN_WP_MATCHING) {
> +		/* Go the short route when not write-protecting pages. */
> +
> +		pte = huge_ptep_get(ptep);
> +		categories = p->cur_vma_category | pagemap_hugetlb_category(pte);
> +
> +		if (!pagemap_scan_is_interesting_page(categories, p))
> +			return 0;
> +
> +		return pagemap_scan_output(categories, p, start, &end);
> +	}
> +
> +	i_mmap_lock_write(vma->vm_file->f_mapping);
> +	ptl = huge_pte_lock(hstate_vma(vma), vma->vm_mm, ptep);
> +
> +	pte = huge_ptep_get(ptep);
> +	categories = p->cur_vma_category | pagemap_hugetlb_category(pte);
> +
> +	if (!pagemap_scan_is_interesting_page(categories, p))
> +		goto out_unlock;
> +
> +	ret = pagemap_scan_output(categories, p, start, &end);
> +	if (start == end)
> +		goto out_unlock;
> +
> +	if (~categories & PAGE_IS_WRITTEN)
> +		goto out_unlock;
> +
> +	if (end != start + HPAGE_SIZE) {
> +		/* Partial HugeTLB page WP isn't possible. */
> +		pagemap_scan_backout_range(p, start, end, start);
> +		ret = -EINVAL;

Will this error be returned from the ioctl? If the answer is yes, it
looks wrong to me.
> +		goto out_unlock;
> +	}
> +
> +	make_uffd_wp_huge_pte(vma, start, ptep, pte);
> +	flush_hugetlb_tlb_range(vma, start, end);
> +
> +out_unlock:
> +	spin_unlock(ptl);
> +	i_mmap_unlock_write(vma->vm_file->f_mapping);
> +
> +	return ret;
> +}

....

> +static int pagemap_scan_get_args(struct pm_scan_arg *arg,
> +				 unsigned long uarg)
> +{
> +	if (copy_from_user(arg, (void __user *)uarg, sizeof(*arg)))
> +		return -EFAULT;
> +
> +	if (arg->size != sizeof(struct pm_scan_arg))
> +		return -EINVAL;
> +
> +	/* Validate requested features */
> +	if (arg->flags & ~PM_SCAN_FLAGS)
> +		return -EINVAL;
> +	if ((arg->category_inverted | arg->category_mask |
> +	     arg->category_anyof_mask | arg->return_mask) & ~PM_SCAN_CATEGORIES)
> +		return -EINVAL;
> +
> +	arg->start = untagged_addr((unsigned long)arg->start);
> +	arg->end = untagged_addr((unsigned long)arg->end);
> +	arg->vec = untagged_addr((unsigned long)arg->vec);
> +
> +	/* Validate memory pointers */
> +	if (!IS_ALIGNED(arg->start, PAGE_SIZE))
> +		return -EINVAL;
> +	if (!access_ok((void __user *)arg->start, arg->end - arg->start))
> +		return -EFAULT;
> +	if (!arg->vec && arg->vec_len)
> +		return -EFAULT;

It looks more like EINVAL.

> +	if (arg->vec && !access_ok((void __user *)arg->vec,
> +				   arg->vec_len * sizeof(struct page_region)))
> +		return -EFAULT;
> +
> +	/* Fixup default values */
> +	arg->end = ALIGN(arg->end, PAGE_SIZE);
> +	if (!arg->max_pages)
> +		arg->max_pages = ULONG_MAX;
> +
> +	return 0;
> +}
> +

Thanks,
Andrei