On Tue, 13 Jun 2023 at 12:29, Muhammad Usama Anjum <usama.anjum@xxxxxxxxxxxxx> wrote: > > This IOCTL, PAGEMAP_SCAN on pagemap file can be used to get and/or clear > the info about page table entries. The following operations are supported > in this ioctl: > - Get the information if the pages have been written-to (PAGE_IS_WRITTEN), > file mapped (PAGE_IS_FILE), present (PAGE_IS_PRESENT) or swapped > (PAGE_IS_SWAPPED). > - Find pages which have been written-to and/or write protect the pages > (atomic PM_SCAN_OP_GET + PM_SCAN_OP_WP) > > This IOCTL can be extended to get information about more PTE bits. [...] > --- a/fs/proc/task_mmu.c > +++ b/fs/proc/task_mmu.c [...] > +static int pagemap_scan_output(bool wt, bool file, bool pres, bool swap, > + struct pagemap_scan_private *p, > + unsigned long addr, unsigned int n_pages) > +{ > + unsigned long bitmap = PM_SCAN_BITMAP(wt, file, pres, swap); > + struct page_region *cur_buf = &p->cur_buf; Maybe we can go a step further and say here `cur_buf = &p->vec_buf[p->vec_buf_index];` and remove `p->cur_buf` entirely? > + > + if (!n_pages) > + return -EINVAL; > + > + if ((p->required_mask & bitmap) != p->required_mask) > + return 0; > + if (p->anyof_mask && !(p->anyof_mask & bitmap)) > + return 0; > + if (p->excluded_mask & bitmap) > + return 0; > + > + bitmap &= p->return_mask; > + if (!bitmap) > + return 0; > + > + if (cur_buf->bitmap == bitmap && > + cur_buf->start + cur_buf->len * PAGE_SIZE == addr) { > + cur_buf->len += n_pages; > + p->found_pages += n_pages; > + } else { > + if (cur_buf->len && p->vec_buf_index >= p->vec_buf_len) > + return -ENOMEM; Shouldn't this be -ENOSPC? -ENOMEM usually signifies that the kernel ran out of memory when allocating, not that there is no space in a user-provided buffer. BTW, the check could be inside the if() below for easier reading and less redundancy. > + if (cur_buf->len) { > + memcpy(&p->vec_buf[p->vec_buf_index], cur_buf, > + sizeof(*p->vec_buf)); > + p->vec_buf_index++; > + } > + > + cur_buf->start = addr; > + cur_buf->len = n_pages; > + cur_buf->bitmap = bitmap; > + p->found_pages += n_pages; > + } > + > + if (p->max_pages && (p->found_pages == p->max_pages)) Since `p->found_pages > 0` holds here, the first check is redundant. Nit: the parentheses around == are not needed. > + return PM_SCAN_FOUND_MAX_PAGES; > + > + return 0; > +} [...] > +static inline unsigned long pagemap_scan_check_page_written(struct pagemap_scan_private *p) > +{ > + return ((p->required_mask | p->anyof_mask | p->excluded_mask) & > + PAGE_IS_WRITTEN) ? PM_SCAN_OP_WRITE : 0; > +} Please inline at the single callsite. For flags name: PM_REQUIRE_WRITE_ACCESS? Or Is it intended to be checked only if doing WP (as the current name suggests) and so it would be redundant as WP currently requires `p->required_mask = PAGE_IS_WRITTEN`? > +static int pagemap_scan_args_valid(struct pm_scan_arg *arg, unsigned long start, > + struct page_region __user *vec) > +{ > + /* Detect illegal size, flags, len and masks */ > + if (arg->size != sizeof(struct pm_scan_arg)) > + return -EINVAL; > + if (arg->flags & ~PM_SCAN_OPS) > + return -EINVAL; > + if (!arg->len) > + return -EINVAL; > + if ((arg->required_mask | arg->anyof_mask | arg->excluded_mask | > + arg->return_mask) & ~PM_SCAN_BITS_ALL) > + return -EINVAL; > + if (!arg->required_mask && !arg->anyof_mask && > + !arg->excluded_mask) > + return -EINVAL; > + if (!arg->return_mask) > + return -EINVAL; I just noticed that `!arg->return_mask == !IS_PM_SCAN_GET()`. So the OP_GET is redundant and can be removed from the `flags` if checking `return_mask` instead. That way there won't be a "flags-encoded ops" thing as it would be a 'scan' with optional 'write-protect'. > + > + /* Validate memory range */ > + if (!IS_ALIGNED(start, PAGE_SIZE)) > + return -EINVAL; > + if (!access_ok((void __user *)start, arg->len)) > + return -EFAULT; > + > + if (IS_PM_SCAN_GET(arg->flags)) { > + if (!arg->vec) > + return -EINVAL; access_ok() -> -EFAULT (an early fail-check) (the vec_len should be checked first then, failing with -EINVAL if 0) > + if (arg->vec_len == 0) > + return -EINVAL; > + } > + > + if (IS_PM_SCAN_WP(arg->flags)) { > + if (!IS_PM_SCAN_GET(arg->flags) && arg->max_pages) > + return -EINVAL; > + > + if ((arg->required_mask | arg->anyof_mask | arg->excluded_mask) & > + ~PAGE_IS_WRITTEN) Is `excluded_mask = PAGE_IS_WRITTEN` intended to be allowed? It would make WP do nothing, unless the required/anyof/excluded masks are not supposed to limit WP? > + return -EINVAL; > + } If `flags == 0` (and `return_mask == 0` in case my earlier comment is applied) it should fail with -EINVAL. [...] > --- a/include/uapi/linux/fs.h > +++ b/include/uapi/linux/fs.h > +/* > + * struct page_region - Page region with bitmap flags > + * @start: Start of the region > + * @len: Length of the region in pages > + * bitmap: Bits sets for the region '@' is missing for the third field. BTW, maybe we can call it something like `flags` or `category` (something that hints at the meaning of the value instead of its data representation). > +/* > + * struct pm_scan_arg - Pagemap ioctl argument > + * @size: Size of the structure > + * @flags: Flags for the IOCTL > + * @start: Starting address of the region > + * @len: Length of the region (All the pages in this length are included) Maybe `scan_start`, `scan_len` - so that there is a better distinction from the structure's `size` field? > + * @vec: Address of page_region struct array for output > + * @vec_len: Length of the page_region struct array > + * @max_pages: Optional max return pages > + * @required_mask: Required mask - All of these bits have to be set in the PTE > + * @anyof_mask: Any mask - Any of these bits are set in the PTE > + * @excluded_mask: Exclude mask - None of these bits are set in the PTE > + * @return_mask: Bits that are to be reported in page_region > + */ I skipped most of the page walk implementation as maybe the comments above could make it simpler. Reading this patch and the documentation I still feel confused about how the filtering/limiting parameters should affect GET, WP and WP+GET. Should they limit the pages walked (and WP-ed)? Or only the GET's output? How about GET+WP case? Best Regards Michał Mirosław