Re: [PATCH v28 2/6] fs/proc/task_mmu: Implement IOCTL to get and optionally clear info about PTEs

On Tue, Aug 8, 2023 at 11:16 PM Muhammad Usama Anjum
<usama.anjum@xxxxxxxxxxxxx> wrote:
>
> This IOCTL, PAGEMAP_SCAN on pagemap file can be used to get and/or clear
> the info about page table entries. The following operations are supported
> in this ioctl:
> - Get the information if the pages have Async Write-Protection enabled
>   (``PAGE_IS_WPALLOWED``), have been written to (``PAGE_IS_WRITTEN``), file
>   mapped (``PAGE_IS_FILE``), present (``PAGE_IS_PRESENT``), swapped
>   (``PAGE_IS_SWAPPED``) or page has pfn zero (``PAGE_IS_PFNZERO``).
> - Find pages which have been written to and/or write-protect the pages
>   atomically (``PM_SCAN_WP_MATCHING + PM_SCAN_CHECK_WPASYNC``).
>   ``PM_SCAN_WP_MATCHING`` write-protects the matched pages.
>   ``PM_SCAN_CHECK_WPASYNC`` aborts the operation if
>   non-Async-Write-Protected pages are found. Get is automatically performed
>   if an output buffer is specified.
>
> This IOCTL can be extended to get information about more PTE bits. The
> entire address range passed by user [start, end) is scanned until either
> the user provided buffer is full or max_pages have been found.
>
> Reviewed-by: Andrei Vagin <avagin@xxxxxxxxx>
> Reviewed-by: Michał Mirosław <mirq-linux@xxxxxxxxxxxx>
> Signed-off-by: Michał Mirosław <mirq-linux@xxxxxxxxxxxx>
> Signed-off-by: Muhammad Usama Anjum <usama.anjum@xxxxxxxxxxxxx>
> ---
> Changes in v28:
> - Fix walk_end one last time after doing thorough testing
>
> Changes in v27:
> - Add PAGE_IS_HUGE
> - Iterate until temporary buffer is full to do less iterations
> - Don't check if PAGE_IS_FILE if no mask needs it as it is very
>   expensive to check per pte
> - bring is_interesting_page() outside pagemap_scan_output() to remove
>   the horrible return value check
> - Replace memcpy() with direct copy
> - rename end_addr to walk_end_addr in pagemap_scan_private
> - Abort walk if fatal_signal_pending()
>
> Changes in v26:
> Changes made by Usama:
> - Fix the wrong breaking of loop if page isn't interesting, skip instead
> - Untag the address and save them into struct
> - Round off the end address to next page
> - Correct the partial hugetlb page handling and returning the error
> - Rename PAGE_IS_WPASYNC to PAGE_IS_WPALLOWED
> - Return walk ending address in walk_end instead of returning in start
>   as there is potential of replacing the memory tag
>
> Changes by Michał:
> 1. the API:
>   a. return ranges as {begin, end} instead of {begin + len};
>   b. rename match "flags" to 'page categories' everywhere - this makes
>         it easier to differentiate the ioctl()s categorisation of pages
>         from struct page flags;
>   c. change {required + excluded} to {inverted + required}. This was
>         rejected before, but I'd like to illustrate the difference.
>         Old interface can be translated to the new by:
>                 categories_inverted = excluded_mask
>                 categories_mask = required_mask | excluded_mask
>                 categories_anyof_mask = anyof_mask
>         The new way allows filtering by: A & (B | !C)
>                 categories_inverted = C
>                 categories_mask = A
>                 categories_anyof_mask = B | C
>   e. allow no-op calls
> 2. the implementation:
>   a. gather the page-categorising and write-protecting code in one place;
>   b. optimization: add whole-vma skipping for WP usecase;
>   c. extracted output limiting code to pagemap_scan_output();
>   d. extracted range coalescing to pagemap_scan_push_range();
>   e. extracted THP entry handling to pagemap_scan_thp_entry();
>   f. added a shortcut for non-WP hugetlb scan; avoids conditional
>         locking;
>   g. extracted scan buffer handling code out of do_pagemap_scan();
>   h. rework output code to always try to write pending ranges; if EFAULT
>         is generated it always overwrites the original error code;
>         (the case of SIGKILL is needlessly trying to write the output
>         now, but this should be rare case and ignoring it makes the code
>         not needing a goto)
> 3. Change no-GET operation condition from `arg.return_mask == 0` to
>   `arg.vec == NULL`. This will allow issuing the ioctl with
>   return_mask == 0 to gather matching ranges when the exact category
>   is not interesting. (Anticipated for CRIU scanning a large sparse
>   anonymous mapping).
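
To make the {inverted + required} filtering above concrete, here is a
minimal userspace sketch. It assumes the matching rule is "flip the
inverted bits, then require every bit of category_mask and, if
category_anyof_mask is non-zero, at least one of its bits". The CAT_*
bits, struct scan_filter and the function names are hypothetical and
chosen for illustration only; they are not the UAPI definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical category bits, for illustration only (not the UAPI values). */
#define CAT_A (1ULL << 0)
#define CAT_B (1ULL << 1)
#define CAT_C (1ULL << 2)

/* Hypothetical stand-in for the three filter fields of the ioctl argument. */
struct scan_filter {
	uint64_t category_inverted;
	uint64_t category_mask;
	uint64_t category_anyof_mask;
};

/*
 * Assumed matching rule: flip the inverted bits, then require all bits in
 * category_mask and, if category_anyof_mask is set, at least one of its bits.
 */
static bool page_matches(uint64_t categories, const struct scan_filter *f)
{
	categories ^= f->category_inverted;
	if ((categories & f->category_mask) != f->category_mask)
		return false;
	if (f->category_anyof_mask && !(categories & f->category_anyof_mask))
		return false;
	return true;
}

/* Filter for A & (B | !C), built the way the changelog describes. */
static struct scan_filter filter_a_and_b_or_not_c(void)
{
	struct scan_filter f = {
		.category_inverted = CAT_C,
		.category_mask = CAT_A,
		.category_anyof_mask = CAT_B | CAT_C,
	};
	return f;
}

/* Compare the filter against the boolean formula for all 8 bit patterns. */
static void verify_filter_exhaustively(void)
{
	struct scan_filter f = filter_a_and_b_or_not_c();

	for (uint64_t cat = 0; cat < 8; cat++) {
		bool a = cat & CAT_A, b = cat & CAT_B, c = cat & CAT_C;

		assert(page_matches(cat, &f) == (a && (b || !c)));
	}
}
```

Under this assumed rule, inverting C and listing it in the anyof mask
turns "C is set" into "C is clear" for the anyof test, which is what
yields the (B | !C) term.
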
>
> Changes in v25:
> - Do proper filtering on hole as well (hole got missed earlier)
>
> Changes in v24:
> - Place WP markers in case of hole as well
>
> Changes in v23:
> - Set vec_buf_index to 0 only when vec_buf_index is set
> - Return -EFAULT instead of -EINVAL if vec is NULL
> - Correctly return the walk ending address to the page granularity
>
> Changes in v22:
> - Interface change to return walk ending address to user:
>   - Replace [start, start + len) with [start, end)
>   - Return the ending address of the address walk in start
>
> Changes in v21:
> - Abort walk instead of returning error if WP is to be performed on
>   partial hugetlb
> - Changed the data types of some variables in pagemap_scan_private to
>   long
>
> Changes in v20:
> - Correct PAGE_IS_FILE and add PAGE_IS_PFNZERO
>
> Changes in v19:
> - Interface changes such as renaming, return mask and WP can be used
>   with any flags specified in masks
> - Internal code changes
>
> Changes in v18:
> - Rebased on top of next-20230613
>   - ptep_get() updates
>   - remove pmd_trans_unstable() and add ACTION_AGAIN
> - Review updates (Micheal)
>
> Changes in v17:
> - Rebased on next-20230606
> - Made make_uffd_wp_*_pte() better and minor changes
>
> Changes in v16:
> - Fixed a corner case where kernel writes beyond user buffer by one
>   element
> - Bring back exclusive PM_SCAN_OP_WP
> - Cosmetic changes
>
> Changes in v15:
> - Build fix:
>   - Use generic tlb flush function in pagemap_scan_pmd_entry() instead of
>     using x86 specific flush function in do_pagemap_scan()
>   - Remove #ifdef from pagemap_scan_hugetlb_entry()
>   - Use mm instead of undefined vma->vm_mm
>
> Changes in v14:
> - Fix build error caused by #ifdef added at last minute in some configs
>
> Changes in v13:
> - Review updates
> - mmap_read_lock_killable() instead of mmap_read_lock()
> - Replace uffd_wp_range() with helpers which increases performance
>   drastically for OP_WP operations by reducing the number of tlb
>   flushing etc
> - Add MMU_NOTIFY_PROTECTION_VMA notification for the memory range
>
> Changes in v12:
> - Add hugetlb support to cover all memory types
> - Merge "userfaultfd: Define dummy uffd_wp_range()" with this patch
> - Review updates to the code
>
> Changes in v11:
> - Find written pages in a better way
> - Fix a corner case (thanks Paul)
> - Improve the code/comments
> - remove ENGAGE_WP + ! GET operation
> - shorten the commit message in favour of moving documentation to
>   pagemap.rst
>
> Changes in v10:
> - move changes in tools/include/uapi/linux/fs.h to separate patch
> - update commit message
>
> Change in v8:
> - Correct is_pte_uffd_wp()
> - Improve readability and error checks
> - Remove some un-needed code
>
> Changes in v7:
> - Rebase on top of latest next
> - Fix some corner cases
> - Base soft-dirty on the uffd wp async
> - Update the terminologies
> - Optimize the memory usage inside the ioctl
> ---
>  fs/proc/task_mmu.c      | 678 ++++++++++++++++++++++++++++++++++++++++
>  include/linux/hugetlb.h |   1 +
>  include/uapi/linux/fs.h |  59 ++++
>  mm/hugetlb.c            |   2 +-
>  4 files changed, 739 insertions(+), 1 deletion(-)
>
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index c1e6531cb02ae..0e219a44e97cd 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -19,6 +19,8 @@
>  #include <linux/shmem_fs.h>
>  #include <linux/uaccess.h>
>  #include <linux/pkeys.h>
> +#include <linux/minmax.h>
> +#include <linux/overflow.h>
>
>  #include <asm/elf.h>
>  #include <asm/tlb.h>
> @@ -1749,11 +1751,687 @@ static int pagemap_release(struct inode *inode, struct file *file)
>         return 0;
>  }
>
> +#define PM_SCAN_CATEGORIES     (PAGE_IS_WPALLOWED | PAGE_IS_WRITTEN |  \
> +                                PAGE_IS_FILE | PAGE_IS_PRESENT |       \
> +                                PAGE_IS_SWAPPED | PAGE_IS_PFNZERO |    \
> +                                PAGE_IS_HUGE)
> +#define PM_SCAN_FLAGS          (PM_SCAN_WP_MATCHING | PM_SCAN_CHECK_WPASYNC)
> +
> +#define MASKS_OF_INTEREST(a)   (a.category_inverted | a.category_mask | \
> +                                a.category_anyof_mask | a.return_mask)
> +
> +struct pagemap_scan_private {
> +       struct pm_scan_arg arg;
> +       unsigned long masks_of_interest, cur_vma_category;
> +       struct page_region *vec_buf, cur_buf;

I think we can remove cur_buf. IMHO, that would make the code a bit more
readable. Here is a quick PoC patch:
https://gist.github.com/avagin/2e465e7c362c515ec84d72a201a28de4

> +       unsigned long vec_buf_len, vec_buf_index, found_pages, walk_end_addr;
> +       struct page_region __user *vec_out;
> +};

...

> +#ifdef CONFIG_HUGETLB_PAGE
> +static int pagemap_scan_hugetlb_entry(pte_t *ptep, unsigned long hmask,
> +                                     unsigned long start, unsigned long end,
> +                                     struct mm_walk *walk)
> +{
> +       struct pagemap_scan_private *p = walk->private;
> +       struct vm_area_struct *vma = walk->vma;
> +       unsigned long categories;
> +       spinlock_t *ptl;
> +       int ret = 0;
> +       pte_t pte;
> +
> +       if (~p->arg.flags & PM_SCAN_WP_MATCHING) {
> +               /* Go the short route when not write-protecting pages. */
> +
> +               pte = huge_ptep_get(ptep);
> +               categories = p->cur_vma_category | pagemap_hugetlb_category(pte);
> +
> +               if (!pagemap_scan_is_interesting_page(categories, p))
> +                       return 0;
> +
> +               return pagemap_scan_output(categories, p, start, &end);
> +       }
> +
> +       i_mmap_lock_write(vma->vm_file->f_mapping);
> +       ptl = huge_pte_lock(hstate_vma(vma), vma->vm_mm, ptep);
> +
> +       pte = huge_ptep_get(ptep);
> +       categories = p->cur_vma_category | pagemap_hugetlb_category(pte);
> +
> +       if (!pagemap_scan_is_interesting_page(categories, p))
> +               goto out_unlock;
> +
> +       ret = pagemap_scan_output(categories, p, start, &end);
> +       if (start == end)
> +               goto out_unlock;
> +
> +       if (~categories & PAGE_IS_WRITTEN)
> +               goto out_unlock;
> +
> +       if (end != start + HPAGE_SIZE) {
> +               /* Partial HugeTLB page WP isn't possible. */
> +               pagemap_scan_backout_range(p, start, end, start);
> +               ret = -EINVAL;

Will this error be returned from the ioctl? If the answer is yes, it looks
wrong to me.
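
One possible alternative, modelled in userspace: stop the walk at the
partial huge page and report where it stopped, rather than failing the
whole call. This is only a sketch of the idea, not the patch's code.
WALK_STOP, struct scan_state and both functions are hypothetical names.

```c
#include <assert.h>

/* Hypothetical sentinel: "stop walking, but this is not a failure". */
#define WALK_STOP 1

/* Hypothetical stand-in for the scan's private state. */
struct scan_state {
	unsigned long walk_end;	/* first address that was not processed */
};

/* Models only the partial-page decision of the hugetlb callback. */
static int hugetlb_entry(struct scan_state *p, unsigned long start,
			 unsigned long end, unsigned long hpage_size)
{
	assert(p && start < end);

	if (end != start + hpage_size) {
		/* Partial huge page: cannot be write-protected, so stop here. */
		p->walk_end = start;
		return WALK_STOP;	/* instead of -EINVAL */
	}

	p->walk_end = end;
	return 0;
}

/* Models the caller: the sentinel becomes success, real errors pass through. */
static int scan_result(int walk_ret)
{
	return walk_ret == WALK_STOP ? 0 : walk_ret;
}
```

With this shape the caller would still get the ranges collected so far
and could see from the reported end address that the scan stopped early.
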

> +               goto out_unlock;
> +       }
> +
> +       make_uffd_wp_huge_pte(vma, start, ptep, pte);
> +       flush_hugetlb_tlb_range(vma, start, end);
> +
> +out_unlock:
> +       spin_unlock(ptl);
> +       i_mmap_unlock_write(vma->vm_file->f_mapping);
> +
> +       return ret;
> +}

....

> +static int pagemap_scan_get_args(struct pm_scan_arg *arg,
> +                                unsigned long uarg)
> +{
> +       if (copy_from_user(arg, (void __user *)uarg, sizeof(*arg)))
> +               return -EFAULT;
> +
> +       if (arg->size != sizeof(struct pm_scan_arg))
> +               return -EINVAL;
> +
> +       /* Validate requested features */
> +       if (arg->flags & ~PM_SCAN_FLAGS)
> +               return -EINVAL;
> +       if ((arg->category_inverted | arg->category_mask |
> +            arg->category_anyof_mask | arg->return_mask) & ~PM_SCAN_CATEGORIES)
> +               return -EINVAL;
> +
> +       arg->start = untagged_addr((unsigned long)arg->start);
> +       arg->end = untagged_addr((unsigned long)arg->end);
> +       arg->vec = untagged_addr((unsigned long)arg->vec);
> +
> +       /* Validate memory pointers */
> +       if (!IS_ALIGNED(arg->start, PAGE_SIZE))
> +               return -EINVAL;
> +       if (!access_ok((void __user *)arg->start, arg->end - arg->start))
> +               return -EFAULT;
> +       if (!arg->vec && arg->vec_len)
> +               return -EFAULT;

This looks more like -EINVAL than -EFAULT to me.
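
Roughly what I have in mind, as a standalone userspace sketch. struct
vec_args and check_vec_args() are hypothetical stand-ins for the two
pm_scan_arg fields and this one check. The access_ok() test on a
non-NULL vec is not modelled here and would keep returning -EFAULT.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for the two pm_scan_arg fields involved. */
struct vec_args {
	uint64_t vec;		/* user pointer to the output array, 0 if absent */
	uint64_t vec_len;	/* number of elements in that array */
};

static int check_vec_args(const struct vec_args *arg)
{
	assert(arg);

	/* A length without a buffer is an inconsistent request, not a bad pointer. */
	if (!arg->vec && arg->vec_len)
		return -EINVAL;

	return 0;
}
```

A NULL vec with vec_len == 0 stays valid in this sketch, which matches
the no-GET mode described in the changelog.
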

> +       if (arg->vec && !access_ok((void __user *)arg->vec,
> +                             arg->vec_len * sizeof(struct page_region)))
> +               return -EFAULT;
> +
> +       /* Fixup default values */
> +       arg->end = ALIGN(arg->end, PAGE_SIZE);
> +       if (!arg->max_pages)
> +               arg->max_pages = ULONG_MAX;
> +
> +       return 0;
> +}
> +

Thanks,
Andrei



