On 8/11/23 12:07 AM, Andrei Vagin wrote:
> On Tue, Aug 8, 2023 at 11:16 PM Muhammad Usama Anjum
> <usama.anjum@xxxxxxxxxxxxx> wrote:
>>
>> This IOCTL, PAGEMAP_SCAN on pagemap file can be used to get and/or clear
>> the info about page table entries. The following operations are supported
>> in this ioctl:
>> - Get the information if the pages have Async Write-Protection enabled
>>   (``PAGE_IS_WPALLOWED``), have been written to (``PAGE_IS_WRITTEN``), file
>>   mapped (``PAGE_IS_FILE``), present (``PAGE_IS_PRESENT``), swapped
>>   (``PAGE_IS_SWAPPED``) or page has pfn zero (``PAGE_IS_PFNZERO``).
>> - Find pages which have been written to and/or write protect
>>   (atomic ``PM_SCAN_WP_MATCHING + PM_SCAN_CHECK_WPASYNC``) the pages
>>   atomically. The (``PM_SCAN_WP_MATCHING``) is used to WP the matched
>>   pages. The (``PM_SCAN_CHECK_WPASYNC``) aborts the operation if
>>   non-Async-Write-Protected pages are found. Get is automatically performed
>>   if output buffer is specified.
>>
>> This IOCTL can be extended to get information about more PTE bits. The
>> entire address range passed by user [start, end) is scanned until either
>> the user provided buffer is full or max_pages have been found.
>>
>> Reviewed-by: Andrei Vagin <avagin@xxxxxxxxx>
>> Reviewed-by: Michał Mirosław <mirq-linux@xxxxxxxxxxxx>
>> Signed-off-by: Michał Mirosław <mirq-linux@xxxxxxxxxxxx>
>> Signed-off-by: Muhammad Usama Anjum <usama.anjum@xxxxxxxxxxxxx>
>> ---
>> Changes in v28:
>> - Fix walk_end one last time after doing thorough testing
>>
>> Changes in v27:
>> - Add PAGE_IS_HUGE
>> - Iterate until temporary buffer is full to do less iterations
>> - Don't check if PAGE_IS_FILE if no mask needs it as it is very
>>   expensive to check per pte
>> - bring is_interesting_page() outside pagemap_scan_output() to remove
>>   the horrible return value check
>> - Replace memcpy() with direct copy
>> - rename end_addr to walk_end_addr in pagemap_scan_private
>> - Abort walk if fatal_signal_pending()
>>
>> Changes in v26:
>> Changes made by Usama:
>> - Fix the wrong breaking of loop if page isn't interesting, skip instead
>> - Untag the address and save them into struct
>> - Round off the end address to next page
>> - Correct the partial hugetlb page handling and returning the error
>> - Rename PAGE_IS_WPASYNC to PAGE_IS_WPALLOWED
>> - Return walk ending address in walk_end instead of returning in start
>>   as there is potential of replacing the memory tag
>>
>> Changes by Michał:
>> 1. the API:
>>   a. return ranges as {begin, end} instead of {begin + len};
>>   b. rename match "flags" to 'page categories' everywhere - this makes
>>      it easier to differentiate the ioctl()s categorisation of pages
>>      from struct page flags;
>>   c. change {required + excluded} to {inverted + required}. This was
>>      rejected before, but I'd like to illustrate the difference.
>>      Old interface can be translated to the new by:
>>        categories_inverted = excluded_mask
>>        categories_mask = required_mask | excluded_mask
>>        categories_anyof_mask = anyof_mask
>>      The new way allows filtering by: A & (B | !C)
>>        categories_inverted = C
>>        categories_mask = A
>>        categories_anyof_mask = B | C
>>   e. allow no-op calls
>> 2. the implementation:
>>   a. gather the page-categorising and write-protecting code in one place;
>>   b. optimization: add whole-vma skipping for WP usecase;
>>   c. extracted output limiting code to pagemap_scan_output();
>>   d. extracted range coalescing to pagemap_scan_push_range();
>>   e. extracted THP entry handling to pagemap_scan_thp_entry();
>>   f. added a shortcut for non-WP hugetlb scan; avoids conditional
>>      locking;
>>   g. extracted scan buffer handling code out of do_pagemap_scan();
>>   h. rework output code to always try to write pending ranges; if EFAULT
>>      is generated it always overwrites the original error code;
>>      (the case of SIGKILL is needlessly trying to write the output
>>      now, but this should be rare case and ignoring it makes the code
>>      not needing a goto)
>> 3. Change no-GET operation condition from `arg.return_mask == 0` to
>>    `arg.vec == NULL`. This will allow issuing the ioctl with
>>    return_mask == 0 to gather matching ranges when the exact category
>>    is not interesting. (Anticipated for CRIU scanning a large sparse
>>    anonymous mapping).
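
For anyone following the {inverted + required} change in item 1c, the filtering rule can be sketched in plain userspace C. This is only a sketch under assumptions: the CAT_* bit values, struct filter and matches() are hypothetical names for illustration, and the matching rule is inferred from the translation quoted above, not copied from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical category bits, for illustration only. */
#define CAT_A (1u << 0)
#define CAT_B (1u << 1)
#define CAT_C (1u << 2)

/* Hypothetical mirror of the three filter masks named in the changelog. */
struct filter {
	uint64_t category_inverted;
	uint64_t category_mask;
	uint64_t category_anyof_mask;
};

/*
 * Assumed rule: flip the inverted bits, then require every bit of
 * category_mask and, if category_anyof_mask is non-zero, at least
 * one of its bits.
 */
static bool matches(uint64_t categories, const struct filter *f)
{
	categories ^= f->category_inverted;
	if ((categories & f->category_mask) != f->category_mask)
		return false;
	if (f->category_anyof_mask && !(categories & f->category_anyof_mask))
		return false;
	return true;
}

/* Builds the A & (B | !C) filter using the translation from item 1c. */
static struct filter a_and_b_or_not_c(void)
{
	struct filter f = {
		.category_inverted = CAT_C,
		.category_mask = CAT_A,
		.category_anyof_mask = CAT_B | CAT_C,
	};
	return f;
}
```

With this filter, a page with only A set matches (C is clear), while a page with A and C set but B clear does not.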
>>
>> Changes in v25:
>> - Do proper filtering on hole as well (hole got missed earlier)
>>
>> Changes in v24:
>> - Place WP markers in case of hole as well
>>
>> Changes in v23:
>> - Set vec_buf_index to 0 only when vec_buf_index is set
>> - Return -EFAULT instead of -EINVAL if vec is NULL
>> - Correctly return the walk ending address to the page granularity
>>
>> Changes in v22:
>> - Interface change to return walk ending address to user:
>>   - Replace [start, start + len) with [start, end)
>>   - Return the ending address of the address walk in start
>>
>> Changes in v21:
>> - Abort walk instead of returning error if WP is to be performed on
>>   partial hugetlb
>> - Changed the data types of some variables in pagemap_scan_private to
>>   long
>>
>> Changes in v20:
>> - Correct PAGE_IS_FILE and add PAGE_IS_PFNZERO
>>
>> Changes in v19:
>> - Interface changes such as renaming, return mask and WP can be used
>>   with any flags specified in masks
>> - Internal code changes
>>
>> Changes in v18:
>> - Rebased on top of next-20230613
>> - ptep_get() updates
>> - remove pmd_trans_unstable() and add ACTION_AGAIN
>> - Review updates (Micheal)
>>
>> Changes in v17:
>> - Rebased on next-20230606
>> - Made make_uffd_wp_*_pte() better and minor changes
>>
>> Changes in v16:
>> - Fixed a corner case where kernel writes beyond user buffer by one
>>   element
>> - Bring back exclusive PM_SCAN_OP_WP
>> - Cosmetic changes
>>
>> Changes in v15:
>> - Build fix:
>>   - Use generic tlb flush function in pagemap_scan_pmd_entry() instead of
>>     using x86 specific flush function in do_pagemap_scan()
>>   - Remove #ifdef from pagemap_scan_hugetlb_entry()
>>   - Use mm instead of undefined vma->vm_mm
>>
>> Changes in v14:
>> - Fix build error caused by #ifdef added at last minute in some configs
>>
>> Changes in v13:
>> - Review updates
>> - mmap_read_lock_killable() instead of mmap_read_lock()
>> - Replace uffd_wp_range() with helpers which increases performance
>>   drastically for OP_WP operations by reducing the number of tlb
>>   flushing etc
>> - Add MMU_NOTIFY_PROTECTION_VMA notification for the memory range
>>
>> Changes in v12:
>> - Add hugetlb support to cover all memory types
>> - Merge "userfaultfd: Define dummy uffd_wp_range()" with this patch
>> - Review updates to the code
>>
>> Changes in v11:
>> - Find written pages in a better way
>> - Fix a corner case (thanks Paul)
>> - Improve the code/comments
>> - remove ENGAGE_WP + ! GET operation
>> - shorten the commit message in favour of moving documentation to
>>   pagemap.rst
>>
>> Changes in v10:
>> - move changes in tools/include/uapi/linux/fs.h to separate patch
>> - update commit message
>>
>> Change in v8:
>> - Correct is_pte_uffd_wp()
>> - Improve readability and error checks
>> - Remove some un-needed code
>>
>> Changes in v7:
>> - Rebase on top of latest next
>> - Fix some corner cases
>> - Base soft-dirty on the uffd wp async
>> - Update the terminologies
>> - Optimize the memory usage inside the ioctl
>> ---
>>  fs/proc/task_mmu.c      | 678 ++++++++++++++++++++++++++++++++++++++++
>>  include/linux/hugetlb.h |   1 +
>>  include/uapi/linux/fs.h |  59 ++++
>>  mm/hugetlb.c            |   2 +-
>>  4 files changed, 739 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
>> index c1e6531cb02ae..0e219a44e97cd 100644
>> --- a/fs/proc/task_mmu.c
>> +++ b/fs/proc/task_mmu.c
>> @@ -19,6 +19,8 @@
>>  #include <linux/shmem_fs.h>
>>  #include <linux/uaccess.h>
>>  #include <linux/pkeys.h>
>> +#include <linux/minmax.h>
>> +#include <linux/overflow.h>
>>
>>  #include <asm/elf.h>
>>  #include <asm/tlb.h>
>> @@ -1749,11 +1751,687 @@ static int pagemap_release(struct inode *inode, struct file *file)
>>  	return 0;
>>  }
>>
>> +#define PM_SCAN_CATEGORIES	(PAGE_IS_WPALLOWED | PAGE_IS_WRITTEN | \
>> +				 PAGE_IS_FILE | PAGE_IS_PRESENT | \
>> +				 PAGE_IS_SWAPPED | PAGE_IS_PFNZERO | \
>> +				 PAGE_IS_HUGE)
>> +#define PM_SCAN_FLAGS		(PM_SCAN_WP_MATCHING | PM_SCAN_CHECK_WPASYNC)
>> +
>> +#define MASKS_OF_INTEREST(a)	(a.category_inverted | a.category_mask | \
>> +				 a.category_anyof_mask | a.return_mask)
>> +
>> +struct pagemap_scan_private {
>> +	struct pm_scan_arg arg;
>> +	unsigned long masks_of_interest, cur_vma_category;
>> +	struct page_region *vec_buf, cur_buf;
>
> I think we can remove cur_buf. IMHO, it makes the code a bit more readable.
> Here is a quick PoC patch:
> https://gist.github.com/avagin/2e465e7c362c515ec84d72a201a28de4
At first I wondered how cur_buf could possibly be removed. But now that we
walk the full range until the temporary buffer is full, removing it is
possible, as your PoC shows. Thank you for doing it. I've tested your change
and simplified it further.

>
>> +	unsigned long vec_buf_len, vec_buf_index, found_pages, walk_end_addr;
>> +	struct page_region __user *vec_out;
>> +};
>
> ...
>
>> +#ifdef CONFIG_HUGETLB_PAGE
>> +static int pagemap_scan_hugetlb_entry(pte_t *ptep, unsigned long hmask,
>> +				      unsigned long start, unsigned long end,
>> +				      struct mm_walk *walk)
>> +{
>> +	struct pagemap_scan_private *p = walk->private;
>> +	struct vm_area_struct *vma = walk->vma;
>> +	unsigned long categories;
>> +	spinlock_t *ptl;
>> +	int ret = 0;
>> +	pte_t pte;
>> +
>> +	if (~p->arg.flags & PM_SCAN_WP_MATCHING) {
>> +		/* Go the short route when not write-protecting pages. */
>> +
>> +		pte = huge_ptep_get(ptep);
>> +		categories = p->cur_vma_category | pagemap_hugetlb_category(pte);
>> +
>> +		if (!pagemap_scan_is_interesting_page(categories, p))
>> +			return 0;
>> +
>> +		return pagemap_scan_output(categories, p, start, &end);
>> +	}
>> +
>> +	i_mmap_lock_write(vma->vm_file->f_mapping);
>> +	ptl = huge_pte_lock(hstate_vma(vma), vma->vm_mm, ptep);
>> +
>> +	pte = huge_ptep_get(ptep);
>> +	categories = p->cur_vma_category | pagemap_hugetlb_category(pte);
>> +
>> +	if (!pagemap_scan_is_interesting_page(categories, p))
>> +		goto out_unlock;
>> +
>> +	ret = pagemap_scan_output(categories, p, start, &end);
>> +	if (start == end)
>> +		goto out_unlock;
>> +
>> +	if (~categories & PAGE_IS_WRITTEN)
>> +		goto out_unlock;
>> +
>> +	if (end != start + HPAGE_SIZE) {
>> +		/* Partial HugeTLB page WP isn't possible. */
>> +		pagemap_scan_backout_range(p, start, end, start);
>> +		ret = -EINVAL;
>
> Will this error be returned from ioctl? If the answer is yes, it looks
> wrong to me.
Sorry, we missed this in previous revisions. I'll return 0 here; walk_end
will then tell the user that we have not walked the entire range.

>
>> +		goto out_unlock;
>> +	}
>> +
>> +	make_uffd_wp_huge_pte(vma, start, ptep, pte);
>> +	flush_hugetlb_tlb_range(vma, start, end);
>> +
>> +out_unlock:
>> +	spin_unlock(ptl);
>> +	i_mmap_unlock_write(vma->vm_file->f_mapping);
>> +
>> +	return ret;
>> +}
>
> ....
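
To illustrate what returning 0 plus walk_end would mean for a caller, here is a rough userspace sketch. Everything in it is an assumption for illustration: scan_fn stands in for one ioctl call and returns the address where the walk stopped, and scan_range() and mock_scan() are hypothetical helpers, not part of the patch.

```c
#include <stdint.h>

/*
 * Hypothetical stand-in for one scan call over [start, end): returns the
 * address where the walk stopped. A real caller would issue the ioctl here
 * and read back walk_end.
 */
typedef uint64_t (*scan_fn)(uint64_t start, uint64_t end, void *ctx);

/*
 * Re-issue the scan from the reported walk end until the range is covered
 * or a call makes no progress. Returns the address actually reached.
 */
static uint64_t scan_range(scan_fn scan, void *ctx, uint64_t start, uint64_t end)
{
	while (start < end) {
		uint64_t walk_end = scan(start, end, ctx);

		/* No progress, e.g. the range ends inside a huge page. */
		if (walk_end <= start)
			break;
		start = walk_end;
	}
	return start;
}

/* Mock scanner for demonstration: walks at most *ctx bytes per call. */
static uint64_t mock_scan(uint64_t start, uint64_t end, void *ctx)
{
	uint64_t step = *(uint64_t *)ctx;

	return (end - start < step) ? end : start + step;
}
```

Whether retrying helps depends on why the walk stopped: a full output buffer can be drained and the scan resumed, whereas a range that ends inside a huge page will stop at the same address again, so the caller has to check for lack of progress.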
>
>> +static int pagemap_scan_get_args(struct pm_scan_arg *arg,
>> +				 unsigned long uarg)
>> +{
>> +	if (copy_from_user(arg, (void __user *)uarg, sizeof(*arg)))
>> +		return -EFAULT;
>> +
>> +	if (arg->size != sizeof(struct pm_scan_arg))
>> +		return -EINVAL;
>> +
>> +	/* Validate requested features */
>> +	if (arg->flags & ~PM_SCAN_FLAGS)
>> +		return -EINVAL;
>> +	if ((arg->category_inverted | arg->category_mask |
>> +	     arg->category_anyof_mask | arg->return_mask) & ~PM_SCAN_CATEGORIES)
>> +		return -EINVAL;
>> +
>> +	arg->start = untagged_addr((unsigned long)arg->start);
>> +	arg->end = untagged_addr((unsigned long)arg->end);
>> +	arg->vec = untagged_addr((unsigned long)arg->vec);
>> +
>> +	/* Validate memory pointers */
>> +	if (!IS_ALIGNED(arg->start, PAGE_SIZE))
>> +		return -EINVAL;
>> +	if (!access_ok((void __user *)arg->start, arg->end - arg->start))
>> +		return -EFAULT;
>> +	if (!arg->vec && arg->vec_len)
>> +		return -EFAULT;
>
> It looks more like EINVAL.
Updated for the next revision.

>
>> +	if (arg->vec && !access_ok((void __user *)arg->vec,
>> +			      arg->vec_len * sizeof(struct page_region)))
>> +		return -EFAULT;
>> +
>> +	/* Fixup default values */
>> +	arg->end = ALIGN(arg->end, PAGE_SIZE);
>> +	if (!arg->max_pages)
>> +		arg->max_pages = ULONG_MAX;
>> +
>> +	return 0;
>> +}
>> +
>
> Thanks,
> Andrei

--
BR,
Muhammad Usama Anjum