On Thu, 15 Jun 2023 at 15:58, Muhammad Usama Anjum
<usama.anjum@xxxxxxxxxxxxx> wrote:
> I'll send next revision now.
> On 6/14/23 11:00 PM, Michał Mirosław wrote:
> > (A quick reply to answer open questions in case they help the next version.)
> >
> > On Wed, 14 Jun 2023 at 19:10, Muhammad Usama Anjum
> > <usama.anjum@xxxxxxxxxxxxx> wrote:
> >> On 6/14/23 8:14 PM, Michał Mirosław wrote:
> >>> On Wed, 14 Jun 2023 at 15:46, Muhammad Usama Anjum
> >>> <usama.anjum@xxxxxxxxxxxxx> wrote:
> >>>>
> >>>> On 6/14/23 3:36 AM, Michał Mirosław wrote:
> >>>>> On Tue, 13 Jun 2023 at 12:29, Muhammad Usama Anjum
> >>>>> <usama.anjum@xxxxxxxxxxxxx> wrote:
> > [...]
> >>>>>> +       if (cur_buf->bitmap == bitmap &&
> >>>>>> +           cur_buf->start + cur_buf->len * PAGE_SIZE == addr) {
> >>>>>> +               cur_buf->len += n_pages;
> >>>>>> +               p->found_pages += n_pages;
> >>>>>> +       } else {
> >>>>>> +               if (cur_buf->len && p->vec_buf_index >= p->vec_buf_len)
> >>>>>> +                       return -ENOMEM;
> >>>>>
> >>>>> Shouldn't this be -ENOSPC? -ENOMEM usually signifies that the kernel
> >>>>> ran out of memory when allocating, not that there is no space in a
> >>>>> user-provided buffer.
> >>>> There are 3 kinds of return values here:
> >>>> * PM_SCAN_FOUND_MAX_PAGES (1) ---> max_pages have been found. Abort the
> >>>>   page walk from next entry
> >>>> * 0 ---> continue the page walk
> >>>> * -ENOMEM --> Abort the page walk from current entry, user buffer is full
> >>>>   which is not error, but only a stop signal. This -ENOMEM is just
> >>>>   differentiator from (1). This -ENOMEM is for internal use and isn't
> >>>>   returned to user.
> >>>
> >>> But why ENOSPC is not good here? It was used before, I think.
> >> -ENOSPC is being returned in form of true error from
> >> pagemap_scan_hugetlb_entry(). So I'd to remove -ENOSPC from here as it
> >> wasn't true error here, it was only a way to abort the walk immediately.
> >> I'm liking the following return code from here now:
> >>
> >> #define PM_SCAN_BUFFER_FULL (-256)
> >
> > I guess this will be reworked anyway, but I'd prefer this didn't need
> > custom errors etc. If we agree to decoupling the selection and GET
> > output, it could be:
> >
> > bool is_interesting_page(p, flags); // this one does the required/anyof/excluded match
> > size_t output_range(p, start, len, flags); // this one fills the output vector and returns how many pages were fit
> >
> > In this setup, `is_interesting_page() && (n_out = output_range()) <
> > n_pages` means this is the final range, no more will fit. And if
> > `n_out == 0` then no pages fit and no WP is needed (no other special
> > cases).
> Right now, pagemap_scan_output() performs the work of both of these two
> functions. The part can be broken into is_interesting_pages() and we can
> leave the remaining part as it is.
>
> Saying that n_out < n_pages tells us the buffer is full covers one case.
> But there is case of maximum pages have been found and walk needs to be
> aborted.

This case is exactly what `n_out < n_pages` will cover (if scan_output
uses max_pages properly to limit n_out).
Isn't it that when the buffer is full we want to abort the scan always
(with WP if `n_out > 0`)?

> >>>>> For flags name: PM_REQUIRE_WRITE_ACCESS?
> >>>>> Or is it intended to be checked only if doing WP (as the current name
> >>>>> suggests) and so it would be redundant as WP currently requires
> >>>>> `p->required_mask = PAGE_IS_WRITTEN`?
> >>>> This is intended to indicate that if userfaultfd is needed. If
> >>>> PAGE_IS_WRITTEN is mentioned in any of mask, we need to check if
> >>>> userfaultfd has been initialized for this memory. I'll rename to
> >>>> PM_SCAN_REQUIRE_UFFD.
> >>>
> >>> Why do we need that check? Wouldn't `is_written = false` work for vmas
> >>> not registered via uffd?
> >> UFFD_FEATURE_WP_ASYNC and UNPOPULATED needs to be set on the memory region
> >> for it to report correct written values on the memory region. Without UFFD
> >> WP ASYNC and UNPOPULATED defined on the memory, we consider UFFD_WP state
> >> undefined. If user hasn't initialized memory with UFFD, he has no right to
> >> set is_written = false.
> >
> > How about calculating `is_written = is_uffd_registered() &&
> > is_uffd_wp()`? This would enable a user to apply GET+WP for the whole
> > address space of a process regardless of whether all of it is
> > registered.
> I wouldn't want to check if uffd is registered again and again. This is why
> we are doing it only once every walk in pagemap_scan_test_walk().

There is no need to do the checks repeatedly. If I understand the code
correctly, uffd registration is per-vma, so it can be communicated
from test_walk to entry/hole callbacks via a field in
pagemap_scan_private.

> >>> While here, I wonder if we really need to fail the call if there are
> >>> unknown bits in those masks set: if this bit set is expanded with
> >>> another category flags, a newer userspace run on older kernel would
> >>> get EINVAL even if the "treat unknown as 0" be what it requires.
> >>> There is no simple way in the API to discover what bits the kernel
> >>> supports. We could allow a no-op (no WP nor GET) call to help with
> >>> that and then rejecting unknown bits would make sense.
> >> I've not seen any examples of this. But I've seen examples of returning
> >> error if kernel doesn't support a feature. Each new feature comes with a
> >> kernel version, greater than this version support this feature. If user is
> >> trying to use advanced feature which isn't present in a kernel, we should
> >> return error and not proceed to confuse the user/kernel. In fact if we look
> >> at userfaultfd_api(), we return error immediately if feature has some bit
> >> set which kernel doesn't support.
> >
> > I think we should have a way of detecting the supported flags if we
> > don't want a forward compatibility policy for flags here. Maybe it
> > would be enough to allow all the no-op combinations for this purpose?
> Again I don't think UFFD is doing anything like this.

If it's cheap and easy to provide a user with a way to detect the
supported features - why not do it?

Best Regards
Michał Mirosław
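
For illustration only, the decoupled selection/output idea discussed above
could look roughly like the userspace sketch below. It is not taken from the
patch: `struct scan_private`, its field names, `SKETCH_PAGE_SIZE` and
`scan_range()` are hypothetical, and only `is_interesting_page()` and
`output_range()` follow the names proposed in this thread. The sketch assumes
that `output_range()` clamps to max_pages itself, so `n_out < n_pages` is the
single stop condition and no custom error code is needed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096ULL	/* assumption: fixed 4 KiB pages */

struct page_region {
	uint64_t start;		/* first address of the range */
	uint64_t len;		/* length in pages */
	uint64_t bitmap;	/* category flags shared by the range */
};

/* Hypothetical stand-in for pagemap_scan_private. */
struct scan_private {
	uint64_t required_mask, anyof_mask, excluded_mask;
	struct page_region *vec;	/* output vector */
	size_t vec_len;			/* capacity of vec */
	size_t vec_used;		/* regions filled so far */
	size_t found_pages, max_pages;
};

/* Selection only: the required/anyof/excluded match. */
static bool is_interesting_page(const struct scan_private *p, uint64_t flags)
{
	if ((flags & p->required_mask) != p->required_mask)
		return false;
	if (p->anyof_mask && !(flags & p->anyof_mask))
		return false;
	return !(flags & p->excluded_mask);
}

/* Output only: returns how many of n_pages actually fit. */
static size_t output_range(struct scan_private *p, uint64_t start,
			   size_t n_pages, uint64_t flags)
{
	size_t room = p->max_pages - p->found_pages;
	size_t n_out = n_pages < room ? n_pages : room;
	struct page_region *cur;

	assert(p->vec != NULL);
	if (n_out == 0)
		return 0;

	cur = p->vec_used ? &p->vec[p->vec_used - 1] : NULL;
	if (cur && cur->bitmap == flags &&
	    cur->start + cur->len * SKETCH_PAGE_SIZE == start) {
		cur->len += n_out;	/* extend the contiguous region */
	} else {
		if (p->vec_used >= p->vec_len)
			return 0;	/* user buffer is full */
		p->vec[p->vec_used++] =
			(struct page_region){ start, n_out, flags };
	}
	p->found_pages += n_out;
	return n_out;
}

/* Returns true when this was the final range and the walk should stop. */
static bool scan_range(struct scan_private *p, uint64_t start,
		       size_t n_pages, uint64_t flags, size_t *n_out)
{
	*n_out = 0;
	if (!is_interesting_page(p, flags))
		return false;
	*n_out = output_range(p, start, n_pages, flags);
	return *n_out < n_pages;
}
```

In this sketch a caller would write-protect exactly `*n_out` pages when WP is
requested, so `*n_out == 0` needs no further special-casing.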