On 6/15/23 7:52 PM, Michał Mirosław wrote:
> On Thu, 15 Jun 2023 at 15:58, Muhammad Usama Anjum
> <usama.anjum@xxxxxxxxxxxxx> wrote:
>> I'll send next revision now.
>> On 6/14/23 11:00 PM, Michał Mirosław wrote:
>>> (A quick reply to answer open questions in case they help the next version.)
>>>
>>> On Wed, 14 Jun 2023 at 19:10, Muhammad Usama Anjum
>>> <usama.anjum@xxxxxxxxxxxxx> wrote:
>>>> On 6/14/23 8:14 PM, Michał Mirosław wrote:
>>>>> On Wed, 14 Jun 2023 at 15:46, Muhammad Usama Anjum
>>>>> <usama.anjum@xxxxxxxxxxxxx> wrote:
>>>>>>
>>>>>> On 6/14/23 3:36 AM, Michał Mirosław wrote:
>>>>>>> On Tue, 13 Jun 2023 at 12:29, Muhammad Usama Anjum
>>>>>>> <usama.anjum@xxxxxxxxxxxxx> wrote:
>>> [...]
>>>>>>>> +	if (cur_buf->bitmap == bitmap &&
>>>>>>>> +	    cur_buf->start + cur_buf->len * PAGE_SIZE == addr) {
>>>>>>>> +		cur_buf->len += n_pages;
>>>>>>>> +		p->found_pages += n_pages;
>>>>>>>> +	} else {
>>>>>>>> +		if (cur_buf->len && p->vec_buf_index >= p->vec_buf_len)
>>>>>>>> +			return -ENOMEM;
>>>>>>>
>>>>>>> Shouldn't this be -ENOSPC? -ENOMEM usually signifies that the kernel
>>>>>>> ran out of memory when allocating, not that there is no space in a
>>>>>>> user-provided buffer.
>>>>>> There are 3 kinds of return values here:
>>>>>> * PM_SCAN_FOUND_MAX_PAGES (1) ---> max_pages have been found. Abort the
>>>>>>   page walk from next entry
>>>>>> * 0 ---> continue the page walk
>>>>>> * -ENOMEM --> Abort the page walk from current entry, user buffer is full
>>>>>>   which is not error, but only a stop signal. This -ENOMEM is just a
>>>>>>   differentiator from (1). This -ENOMEM is for internal use and isn't
>>>>>>   returned to user.
>>>>>
>>>>> But why ENOSPC is not good here? It was used before, I think.
>>>> -ENOSPC is being returned in form of true error from
>>>> pagemap_scan_hugetlb_entry(). So I had to remove -ENOSPC from here as it
>>>> wasn't true error here, it was only a way to abort the walk immediately.
>>>> I'm liking the following return code from here now:
>>>>
>>>> #define PM_SCAN_BUFFER_FULL (-256)
>>>
>>> I guess this will be reworked anyway, but I'd prefer this didn't need
>>> custom errors etc. If we agree to decoupling the selection and GET
>>> output, it could be:
>>>
>>> bool is_interesting_page(p, flags); // this one does the
>>> required/anyof/excluded match
>>> size_t output_range(p, start, len, flags); // this one fills the
>>> output vector and returns how many pages were fit
>>>
>>> In this setup, `is_interesting_page() && (n_out = output_range()) <
>>> n_pages` means this is the final range, no more will fit. And if
>>> `n_out == 0` then no pages fit and no WP is needed (no other special
>>> cases).
>> Right now, pagemap_scan_output() performs the work of both of these two
>> functions. The matching part can be broken out into is_interesting_page()
>> and we can leave the remaining part as it is.
>>
>> Saying that n_out < n_pages tells us the buffer is full covers one case.
>> But there is also the case where the maximum number of pages has been
>> found and the walk needs to be aborted.
>
> This case is exactly what `n_out < n_pages` will cover (if scan_output
> uses max_pages properly to limit n_out).
> Isn't it that when the buffer is full we want to abort the scan always
> (with WP if `n_out > 0`)?
Wouldn't it be duplication of the condition if we check that the buffer is
full inside pagemap_scan_output() and again just outside it? Inside
pagemap_scan_output() we check if we have space before putting data inside
it. I'm using this same condition to indicate that the buffer is full.
>
>>>>>>> For flags name: PM_REQUIRE_WRITE_ACCESS?
>>>>>>> Or Is it intended to be checked only if doing WP (as the current name
>>>>>>> suggests) and so it would be redundant as WP currently requires
>>>>>>> `p->required_mask = PAGE_IS_WRITTEN`?
>>>>>> This is intended to indicate whether userfaultfd is needed.
>>>>>> If PAGE_IS_WRITTEN is mentioned in any of the masks, we need to check if
>>>>>> userfaultfd has been initialized for this memory. I'll rename to
>>>>>> PM_SCAN_REQUIRE_UFFD.
>>>>>
>>>>> Why do we need that check? Wouldn't `is_written = false` work for vmas
>>>>> not registered via uffd?
>>>> UFFD_FEATURE_WP_ASYNC and UNPOPULATED need to be set on the memory region
>>>> for it to report correct written values on the memory region. Without UFFD
>>>> WP ASYNC and UNPOPULATED defined on the memory, we consider UFFD_WP state
>>>> undefined. If user hasn't initialized memory with UFFD, he has no right to
>>>> set is_written = false.
>>>
>>> How about calculating `is_written = is_uffd_registered() &&
>>> is_uffd_wp()`? This would enable a user to apply GET+WP for the whole
>>> address space of a process regardless of whether all of it is
>>> registered.
>> I wouldn't want to check if uffd is registered again and again. This is why
>> we are doing it only once every walk in pagemap_scan_test_walk().
>
> There is no need to do the checks repeatedly. If I understand the code
> correctly, uffd registration is per-vma, so it can be communicated
> from test_walk to entry/hole callbacks via a field in
> pagemap_scan_private.
>
>>>>> While here, I wonder if we really need to fail the call if there are
>>>>> unknown bits in those masks set: if this bit set is expanded with
>>>>> another category of flags, a newer userspace run on an older kernel would
>>>>> get EINVAL even if "treat unknown as 0" is what it requires.
>>>>> There is no simple way in the API to discover what bits the kernel
>>>>> supports. We could allow a no-op (no WP nor GET) call to help with
>>>>> that and then rejecting unknown bits would make sense.
>>>> I've not seen any examples of this. But I've seen examples of returning
>>>> error if kernel doesn't support a feature. Each new feature comes with a
>>>> kernel version; kernels from that version onward support the feature.
>>>> If user is trying to use advanced feature which isn't present in a
>>>> kernel, we should return error and not proceed to confuse the
>>>> user/kernel. In fact if we look at userfaultfd_api(), we return error
>>>> immediately if feature has some bit set which kernel doesn't support.
>>>
>>> I think we should have a way of detecting the supported flags if we
>>> don't want a forward compatibility policy for flags here. Maybe it
>>> would be enough to allow all the no-op combinations for this purpose?
>> Again I don't think UFFD is doing anything like this.
>
> If it's cheap and easy to provide a user with a way to detect the
> supported features - why not do it?
I'm sorry. But it would bring up something new and iterations will be
needed to just play around. I like the UFFD way.
>
> Best Regards
> Michał Mirosław

-- 
BR,
Muhammad Usama Anjum
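
For reference, the decoupled is_interesting_page()/output_range() split
proposed in the thread could look roughly like the userspace sketch below.
This is only an illustration under stated assumptions, not code from the
patch: struct scan_private, struct page_region, SKETCH_PAGE_SIZE and
scan_range() are hypothetical names, the parameter types are guessed, and
"max_pages == 0 means no limit" is an assumption made for the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096UL	/* assumption: stand-in for PAGE_SIZE */

/* Hypothetical output entry: a run of pages sharing the same flags. */
struct page_region {
	uint64_t start;
	uint64_t len;		/* in pages */
	uint64_t bitmap;
};

/* Hypothetical, simplified stand-in for pagemap_scan_private. */
struct scan_private {
	uint64_t required_mask, anyof_mask, excluded_mask;
	struct page_region *vec;
	size_t vec_len;		/* capacity of vec */
	size_t vec_index;	/* entries already used */
	size_t max_pages;	/* 0 = no limit (assumption) */
	size_t found_pages;
};

/* Selection only: the required/anyof/excluded match. */
static bool is_interesting_page(const struct scan_private *p, uint64_t flags)
{
	if ((flags & p->required_mask) != p->required_mask)
		return false;
	if (p->anyof_mask && !(flags & p->anyof_mask))
		return false;
	return !(flags & p->excluded_mask);
}

/* Output only: fills the vector and returns how many pages were fit. */
static size_t output_range(struct scan_private *p, uint64_t start,
			   size_t n_pages, uint64_t flags)
{
	struct page_region *last;

	assert(p && p->vec);

	/* Clamp to the remaining max_pages budget. */
	if (p->max_pages && n_pages > p->max_pages - p->found_pages)
		n_pages = p->max_pages - p->found_pages;
	if (n_pages == 0)
		return 0;

	last = p->vec_index ? &p->vec[p->vec_index - 1] : NULL;
	if (last && last->bitmap == flags &&
	    last->start + last->len * SKETCH_PAGE_SIZE == start) {
		last->len += n_pages;		/* extend the previous run */
	} else {
		if (p->vec_index >= p->vec_len)
			return 0;		/* no room for a new entry */
		p->vec[p->vec_index++] =
			(struct page_region){ start, n_pages, flags };
	}
	p->found_pages += n_pages;
	return n_pages;
}

/*
 * Hypothetical caller: returns true when this was the final range.
 * *n_out tells the caller how many pages (if any) would need WP.
 */
static bool scan_range(struct scan_private *p, uint64_t start,
		       size_t n_pages, uint64_t flags, size_t *n_out)
{
	*n_out = 0;
	if (!is_interesting_page(p, flags))
		return false;
	*n_out = output_range(p, start, n_pages, flags);
	return *n_out < n_pages;
}
```

In this sketch a full vector and an exhausted max_pages budget both show up
as `n_out < n_pages`, so the caller needs no custom error code to tell
"stop" apart from "continue"; `n_out == 0` simply means nothing was output
and nothing needs WP.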