Re: [PATCH v12 2/5] fs/proc/task_mmu: Implement IOCTL to get and optionally clear info about PTEs

Michał Mirosław <emmir@xxxxxxxxxx> · Tue, 11 Apr 2023 11:29:44 +0200

On Fri, 7 Apr 2023 at 13:11, Muhammad Usama Anjum
<usama.anjum@xxxxxxxxxxxxx> wrote:
> On 4/7/23 3:14 PM, Michał Mirosław wrote:
> > On Fri, 7 Apr 2023 at 12:04, Muhammad Usama Anjum
> > <usama.anjum@xxxxxxxxxxxxx> wrote:
> >> On 4/7/23 12:34 PM, Michał Mirosław wrote:
> >>> On Thu, 6 Apr 2023 at 23:04, Muhammad Usama Anjum
> >>> <usama.anjum@xxxxxxxxxxxxx> wrote:
> >>>> On 4/7/23 1:00 AM, Michał Mirosław wrote:
> >>>>> On Thu, 6 Apr 2023 at 19:58, Muhammad Usama Anjum
> >>>>> <usama.anjum@xxxxxxxxxxxxx> wrote:
> > [...]
> >>>>>>>> +       /*
> >>>>>>>> +        * Allocate smaller buffer to get output from inside the page walk
> >>>>>>>> +        * functions and walk page range in PAGEMAP_WALK_SIZE size chunks. As
> >>>>>>>> +        * we want to return output to user in compact form where no two
> >>>>>>>> +        * consecutive regions should be continuous and have the same flags.
> >>>>>>>> +        * So store the latest element in p.cur between different walks and
> >>>>>>>> +        * store the p.cur at the end of the walk to the user buffer.
> >>>>>>>> +        */
> >>>>>>>> +       p.vec = kmalloc_array(p.vec_len, sizeof(struct page_region),
> >>>>>>>> +                             GFP_KERNEL);
> >>>>>>>> +       if (!p.vec)
> >>>>>>>> +               return -ENOMEM;
> >>>>>>>> +
> >>>>>>>> +       walk_start = walk_end = start;
> >>>>>>>> +       while (walk_end < end && !ret) {
> >>>>>>>
> >>>>>>> The loop will stop if a previous iteration returned ENOSPC (and the
> >>>>>>> error will be lost) - is it intended?
> >>>>>> It is intentional. -ENOSPC means that the user buffer is full even though
> >>>>>> there was more memory to walk over. We don't treat this error. So when
> >>>>>> buffer gets full, we stop walking over further as user buffer has gotten
> >>>>>> full and return as success.
> >>>>>
> >>>>> Thanks. What's the difference between -ENOSPC and
> >>>>> PM_SCAN_FOUND_MAX_PAGES? They seem to result in the same effect (code
> >>>>> flow).
> >>>> -ENOSPC --> user buffer has been filled completely
> >>>> PM_SCAN_FOUND_MAX_PAGES --> max_pages have been found, user buffer may
> >>>>                             still have more space
> >>>
> >>> What is the difference in code behaviour when those two cases are
> >>> compared? (I'd expect none.)
> >> There is difference:
> >> We add data to user buffer. If it succeeds with return code 0, we engage
> >> the WP. If it succeeds with PM_SCAN_FOUND_MAX_PAGES, we still engage the
> >> WP. But if we get -ENOSPC, we don't perform engage as the data wasn't added
> >> to the user buffer.
> >
> > Thanks! I see it now. I see a few more corner cases here:
> > 1. If we did engage WP but fail to copy the vector we return -EFAULT
> > but the WP is already engaged. I'm not sure this is something worth
> > guarding against, but documenting that would be helpful I think.
> Sure.
>
> > 2. If uffd_wp_range() fails, but we have already processed pages
> > earlier, we should treat the error like ENOSPC and back out the failed
> > range (the earier changes would be lost otherwise).
> Backing out is easier to do for hugepages. But for normal pages, we'll have
> to write some code to find where the current data was added (in cur or in
> vec) and back out from that. I'll have to write some more code to avoid the
> side-effects as well.

If I read the code correctly, the last page should always be in `cur`
and on failure only a single page is needed to be backed out. Did I
miss something?

> But aren't we going over-engineering here? Error occurred and we are trying
> to keep the previously generated correct data and returning successfully
> still to the user? I don't think we should do this. An error is error. We
> should return the error simply even if the memory flags would get lost. We
> don't know what caused the error in uffd_wp_range(). Under normal
> situation, we there shouldn't have had error.

In this case it means that on (intermittent) allocation error we get
inconsistent or non-deterministic results. I wouldn't want to be the
one debugging this later - I'd prefer either the syscall be
"exception-safe" (give consistent and predictable output) or kill the
process.

Best Regards
Michał Mirosław