Re: [PATCH v28 2/6] fs/proc/task_mmu: Implement IOCTL to get and optionally clear info about PTEs

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 8/11/23 12:07 AM, Andrei Vagin wrote:
> On Tue, Aug 8, 2023 at 11:16 PM Muhammad Usama Anjum
> <usama.anjum@xxxxxxxxxxxxx> wrote:
>>
>> This IOCTL, PAGEMAP_SCAN on pagemap file can be used to get and/or clear
>> the info about page table entries. The following operations are supported
>> in this ioctl:
>> - Get the information if the pages have Async Write-Protection enabled
>>   (``PAGE_IS_WPALLOWED``), have been written to (``PAGE_IS_WRITTEN``), file
>>   mapped (``PAGE_IS_FILE``), present (``PAGE_IS_PRESENT``), swapped
>>   (``PAGE_IS_SWAPPED``) or page has pfn zero (``PAGE_IS_PFNZERO``).
>> - Find pages which have been written to and/or write protect
>>   (atomic ``PM_SCAN_WP_MATCHING + PM_SCAN_CHECK_WPASYNC``) the pages
>>   atomically. The (``PM_SCAN_WP_MATCHING``) is used to WP the matched
>>   pages. The (``PM_SCAN_CHECK_WPASYNC``) aborts the operation if
>>   non-Async-Write-Protected pages are found. Get is automatically performed
>>   if output buffer is specified.
>>
>> This IOCTL can be extended to get information about more PTE bits. The
>> entire address range passed by user [start, end) is scanned until either
>> the user provided buffer is full or max_pages have been found.
>>
>> Reviewed-by: Andrei Vagin <avagin@xxxxxxxxx>
>> Reviewed-by: Michał Mirosław <mirq-linux@xxxxxxxxxxxx>
>> Signed-off-by: Michał Mirosław <mirq-linux@xxxxxxxxxxxx>
>> Signed-off-by: Muhammad Usama Anjum <usama.anjum@xxxxxxxxxxxxx>
>> ---
>> Changes in v28:
>> - Fix walk_end one last time after doing through testing
>>
>> Changes in v27:
>> - Add PAGE_IS_HUGE
>> - Iterate until temporary buffer is full to do less iterations
>> - Don't check if PAGE_IS_FILE if no mask needs it as it is very
>>   expensive to check per pte
>> - bring is_interesting_page() outside pagemap_scan_output() to remove
>>   the horrible return value check
>> - Replace memcpy() with direct copy
>> - rename end_addr to walk_end_addr in pagemap_scan_private
>> - Abort walk if fatal_signal_pending()
>>
>> Changes in v26:
>> Changes made by Usama:
>> - Fix the wrong breaking of loop if page isn't interesting, skip intsead
>> - Untag the address and save them into struct
>> - Round off the end address to next page
>> - Correct the partial hugetlb page handling and returning the error
>> - Rename PAGE_IS_WPASYNC to PAGE_IS_WPALLOWED
>> - Return walk ending address in walk_end instead of returning in start
>>   as there is potential of replacing the memory tag
>>
>> Changes by Michał:
>> 1. the API:
>>   a. return ranges as {begin, end} instead of {begin + len};
>>   b. rename match "flags" to 'page categories' everywhere - this makes
>>         it easier to differentiate the ioctl()s categorisation of pages
>>         from struct page flags;
>>   c. change {required + excluded} to {inverted + required}. This was
>>         rejected before, but I'd like to illustrate the difference.
>>         Old interface can be translated to the new by:
>>                 categories_inverted = excluded_mask
>>                 categories_mask = required_mask | excluded_mask
>>                 categories_anyof_mask = anyof_mask
>>         The new way allows filtering by: A & (B | !C)
>>                 categories_inverted = C
>>                 categories_mask = A
>>                 categories_anyof_mask = B | C
>>   e. allow no-op calls
>> 2. the implementation:
>>   a. gather the page-categorising and write-protecting code in one place;
>>   b. optimization: add whole-vma skipping for WP usecase;
>>   c. extracted output limiting code to pagemap_scan_output();
>>   d. extracted range coalescing to pagemap_scan_push_range();
>>   e. extracted THP entry handling to pagemap_scan_thp_entry();
>>   f. added a shortcut for non-WP hugetlb scan; avoids conditional
>>         locking;
>>   g. extracted scan buffer handling code out of do_pagemap_scan();
>>   h. rework output code to always try to write pending ranges; if EFAULT
>>         is generated it always overwrites the original error code;
>>         (the case of SIGKILL is needlessly trying to write the output
>>         now, but this should be rare case and ignoring it makes the code
>>         not needing a goto)
>> 3.Change no-GET operation condition from `arg.return_mask == 0` to
>>   `arg.vec == NULL`. This will allow issuing the ioctl with
>>   return_mask == 0 to gather matching ranges when the exact category
>>   is not interesting. (Anticipated for CRIU scanning a large sparse
>>   anonymous mapping).
>>
>> Changes in v25:
>> - Do proper filtering on hole as well (hole got missed earlier)
>>
>> Changes in v24:
>> - Place WP markers in case of hole as well
>>
>> Changes in v23:
>> - Set vec_buf_index to 0 only when vec_buf_index is set
>> - Return -EFAULT instead of -EINVAL if vec is NULL
>> - Correctly return the walk ending address to the page granularity
>>
>> Changes in v22:
>> - Interface change to return walk ending address to user:
>>   - Replace [start start + len) with [start, end)
>>   - Return the ending address of the address walk in start
>>
>> Changes in v21:
>> - Abort walk instead of returning error if WP is to be performed on
>>   partial hugetlb
>> - Changed the data types of some variables in pagemap_scan_private to
>>   long
>>
>> Changes in v20:
>> - Correct PAGE_IS_FILE and add PAGE_IS_PFNZERO
>>
>> Changes in v19:
>> - Interface changes such as renaming, return mask and WP can be used
>>   with any flags specified in masks
>> - Internal code changes
>>
>> Changes in v18:
>> - Rebased on top of next-20230613
>>   - ptep_get() updates
>>   - remove pmd_trans_unstable() and add ACTION_AGAIN
>> - Review updates (Micheal)
>>
>> Changes in v17:
>> - Rebased on next-20230606
>> - Made make_uffd_wp_*_pte() better and minor changes
>>
>> Changes in v16:
>> - Fixed a corner case where kernel writes beyond user buffer by one
>>   element
>> - Bring back exclusive PM_SCAN_OP_WP
>> - Cosmetic changes
>>
>> Changes in v15:
>> - Build fix:
>>   - Use generic tlb flush function in pagemap_scan_pmd_entry() instead of
>>     using x86 specific flush function in do_pagemap_scan()
>>   - Remove #ifdef from pagemap_scan_hugetlb_entry()
>>   - Use mm instead of undefined vma->vm_mm
>>
>> Changes in v14:
>> - Fix build error caused by #ifdef added at last minute in some configs
>>
>> Changes in v13:
>> - Review updates
>> - mmap_read_lock_killable() instead of mmap_read_lock()
>> - Replace uffd_wp_range() with helpers which increases performance
>>   drastically for OP_WP operations by reducing the number of tlb
>>   flushing etc
>> - Add MMU_NOTIFY_PROTECTION_VMA notification for the memory range
>>
>> Changes in v12:
>> - Add hugetlb support to cover all memory types
>> - Merge "userfaultfd: Define dummy uffd_wp_range()" with this patch
>> - Review updates to the code
>>
>> Changes in v11:
>> - Find written pages in a better way
>> - Fix a corner case (thanks Paul)
>> - Improve the code/comments
>> - remove ENGAGE_WP + ! GET operation
>> - shorten the commit message in favour of moving documentation to
>>   pagemap.rst
>>
>> Changes in v10:
>> - move changes in tools/include/uapi/linux/fs.h to separate patch
>> - update commit message
>>
>> Change in v8:
>> - Correct is_pte_uffd_wp()
>> - Improve readability and error checks
>> - Remove some un-needed code
>>
>> Changes in v7:
>> - Rebase on top of latest next
>> - Fix some corner cases
>> - Base soft-dirty on the uffd wp async
>> - Update the terminologies
>> - Optimize the memory usage inside the ioctl
>> ---
>>  fs/proc/task_mmu.c      | 678 ++++++++++++++++++++++++++++++++++++++++
>>  include/linux/hugetlb.h |   1 +
>>  include/uapi/linux/fs.h |  59 ++++
>>  mm/hugetlb.c            |   2 +-
>>  4 files changed, 739 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
>> index c1e6531cb02ae..0e219a44e97cd 100644
>> --- a/fs/proc/task_mmu.c
>> +++ b/fs/proc/task_mmu.c
>> @@ -19,6 +19,8 @@
>>  #include <linux/shmem_fs.h>
>>  #include <linux/uaccess.h>
>>  #include <linux/pkeys.h>
>> +#include <linux/minmax.h>
>> +#include <linux/overflow.h>
>>
>>  #include <asm/elf.h>
>>  #include <asm/tlb.h>
>> @@ -1749,11 +1751,687 @@ static int pagemap_release(struct inode *inode, struct file *file)
>>         return 0;
>>  }
>>
>> +#define PM_SCAN_CATEGORIES     (PAGE_IS_WPALLOWED | PAGE_IS_WRITTEN |  \
>> +                                PAGE_IS_FILE | PAGE_IS_PRESENT |       \
>> +                                PAGE_IS_SWAPPED | PAGE_IS_PFNZERO |    \
>> +                                PAGE_IS_HUGE)
>> +#define PM_SCAN_FLAGS          (PM_SCAN_WP_MATCHING | PM_SCAN_CHECK_WPASYNC)
>> +
>> +#define MASKS_OF_INTEREST(a)   (a.category_inverted | a.category_mask | \
>> +                                a.category_anyof_mask | a.return_mask)
>> +
>> +struct pagemap_scan_private {
>> +       struct pm_scan_arg arg;
>> +       unsigned long masks_of_interest, cur_vma_category;
>> +       struct page_region *vec_buf, cur_buf;
> 
> I think we can remove cur_buf. Imho, it makes code a bit more readable.
> Here is a quick poc patch:
> https://gist.github.com/avagin/2e465e7c362c515ec84d72a201a28de4
I thought ohhh how can this be removed initially. But considering that we
have moved to walking full range until temporary buffer is full, removing
cur_buf is possible. You have proved with your POC as well. Thank you for
doing it. I've updated it after testing and simplified it further.

> 
>> +       unsigned long vec_buf_len, vec_buf_index, found_pages, walk_end_addr;
>> +       struct page_region __user *vec_out;
>> +};
> 
> ...
> 
>> +#ifdef CONFIG_HUGETLB_PAGE
>> +static int pagemap_scan_hugetlb_entry(pte_t *ptep, unsigned long hmask,
>> +                                     unsigned long start, unsigned long end,
>> +                                     struct mm_walk *walk)
>> +{
>> +       struct pagemap_scan_private *p = walk->private;
>> +       struct vm_area_struct *vma = walk->vma;
>> +       unsigned long categories;
>> +       spinlock_t *ptl;
>> +       int ret = 0;
>> +       pte_t pte;
>> +
>> +       if (~p->arg.flags & PM_SCAN_WP_MATCHING) {
>> +               /* Go the short route when not write-protecting pages. */
>> +
>> +               pte = huge_ptep_get(ptep);
>> +               categories = p->cur_vma_category | pagemap_hugetlb_category(pte);
>> +
>> +               if (!pagemap_scan_is_interesting_page(categories, p))
>> +                       return 0;
>> +
>> +               return pagemap_scan_output(categories, p, start, &end);
>> +       }
>> +
>> +       i_mmap_lock_write(vma->vm_file->f_mapping);
>> +       ptl = huge_pte_lock(hstate_vma(vma), vma->vm_mm, ptep);
>> +
>> +       pte = huge_ptep_get(ptep);
>> +       categories = p->cur_vma_category | pagemap_hugetlb_category(pte);
>> +
>> +       if (!pagemap_scan_is_interesting_page(categories, p))
>> +               goto out_unlock;
>> +
>> +       ret = pagemap_scan_output(categories, p, start, &end);
>> +       if (start == end)
>> +               goto out_unlock;
>> +
>> +       if (~categories & PAGE_IS_WRITTEN)
>> +               goto out_unlock;
>> +
>> +       if (end != start + HPAGE_SIZE) {
>> +               /* Partial HugeTLB page WP isn't possible. */
>> +               pagemap_scan_backout_range(p, start, end, start);
>> +               ret = -EINVAL;
> 
> Will this error be returned from ioctl? If the answer is yet, it looks
> wrong to me.
Sorry, we missed it in previous revisions. I'll return 0 here and walk_end
will indicate to user that we have not walked the entire range.

> 
>> +               goto out_unlock;
>> +       }
>> +
>> +       make_uffd_wp_huge_pte(vma, start, ptep, pte);
>> +       flush_hugetlb_tlb_range(vma, start, end);
>> +
>> +out_unlock:
>> +       spin_unlock(ptl);
>> +       i_mmap_unlock_write(vma->vm_file->f_mapping);
>> +
>> +       return ret;
>> +}
> 
> ....
> 
>> +static int pagemap_scan_get_args(struct pm_scan_arg *arg,
>> +                                unsigned long uarg)
>> +{
>> +       if (copy_from_user(arg, (void __user *)uarg, sizeof(*arg)))
>> +               return -EFAULT;
>> +
>> +       if (arg->size != sizeof(struct pm_scan_arg))
>> +               return -EINVAL;
>> +
>> +       /* Validate requested features */
>> +       if (arg->flags & ~PM_SCAN_FLAGS)
>> +               return -EINVAL;
>> +       if ((arg->category_inverted | arg->category_mask |
>> +            arg->category_anyof_mask | arg->return_mask) & ~PM_SCAN_CATEGORIES)
>> +               return -EINVAL;
>> +
>> +       arg->start = untagged_addr((unsigned long)arg->start);
>> +       arg->end = untagged_addr((unsigned long)arg->end);
>> +       arg->vec = untagged_addr((unsigned long)arg->vec);
>> +
>> +       /* Validate memory pointers */
>> +       if (!IS_ALIGNED(arg->start, PAGE_SIZE))
>> +               return -EINVAL;
>> +       if (!access_ok((void __user *)arg->start, arg->end - arg->start))
>> +               return -EFAULT;
>> +       if (!arg->vec && arg->vec_len)
>> +               return -EFAULT;
> 
> It looks more like EINVAL.
Updated for next revision.

> 
>> +       if (arg->vec && !access_ok((void __user *)arg->vec,
>> +                             arg->vec_len * sizeof(struct page_region)))
>> +               return -EFAULT;
>> +
>> +       /* Fixup default values */
>> +       arg->end = ALIGN(arg->end, PAGE_SIZE);
>> +       if (!arg->max_pages)
>> +               arg->max_pages = ULONG_MAX;
>> +
>> +       return 0;
>> +}
>> +
> 
> Thanks,
> Andrei

-- 
BR,
Muhammad Usama Anjum



[Index of Archives]     [Linux Ext4 Filesystem]     [Union Filesystem]     [Filesystem Testing]     [Ceph Users]     [Ecryptfs]     [NTFS 3]     [AutoFS]     [Kernel Newbies]     [Share Photos]     [Security]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux Cachefs]     [Reiser Filesystem]     [Linux RAID]     [NTFS 3]     [Samba]     [Device Mapper]     [CEPH Development]

  Powered by Linux