On Jun 23, 2022, at 4:49 PM, Peter Xu <peterx@xxxxxxxxxx> wrote: > On Thu, Jun 23, 2022 at 11:35:00PM +0000, Nadav Amit wrote: >> On Jun 23, 2022, at 4:24 PM, Peter Xu <peterx@xxxxxxxxxx> wrote: >> >>> ⚠ External Email >>> >>> On Wed, Jun 22, 2022 at 11:50:35AM -0700, Nadav Amit wrote: >>>> From: Nadav Amit <namit@xxxxxxxxxx> >>>> >>>> Using a PTE on x86 with cleared access-bit (aka young-bit) >>>> takes ~600 cycles more than when the access bit is set. At the same >>>> time, setting the access-bit for memory that is not used (e.g., >>>> prefetched) can introduce greater overheads, as the prefetched memory is >>>> reclaimed later than it should be. >>>> >>>> Userfaultfd currently does not set the access-bit (excluding the >>>> huge-pages case). Arguably, it is best to let the user control whether >>>> the access bit should be set or not. The expected use is to request >>>> userfaultfd to set the access-bit when the copy/wp operation is done to >>>> resolve a page-fault, and not to set the access-bit when the memory is >>>> prefetched. >>>> >>>> Introduce UFFDIO_[op]_ACCESS_LIKELY to enable userspace to request the >>>> young bit to be set. >>>> >>>> Cc: Mike Kravetz <mike.kravetz@xxxxxxxxxx> >>>> Cc: Hugh Dickins <hughd@xxxxxxxxxx> >>>> Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> >>>> Cc: Axel Rasmussen <axelrasmussen@xxxxxxxxxx> >>>> Cc: Peter Xu <peterx@xxxxxxxxxx> >>>> Cc: David Hildenbrand <david@xxxxxxxxxx> >>>> Cc: Mike Rapoport <rppt@xxxxxxxxxxxxx> >>>> Signed-off-by: Nadav Amit <namit@xxxxxxxxxx> >>> >>> Hmm.. is the hugetlb code overlooked (for both of the hints), or maybe I >>> missed it? Do we need to cover them too? Thanks, >> >> I do not know what the value of not setting the PTE’s access/dirty when it >> comes to performance. The pages won’t be swapped out, just as you wrote in >> your comment in hugetlb_mcopy_atomic_pte(): >> >> /* >> * Always mark UFFDIO_COPY page dirty; note that this may not be >> * extremely important for hugetlbfs for now since swapping is not >> * supported, but we should still be clear in that this page cannot be >> * thrown away at will, even if write bit not set. >> */ >> _dst_pte = huge_pte_mkdirty(_dst_pte); >> _dst_pte = pte_mkyoung(_dst_pte); >> >> If you want for consistency/robustness not to set dirty on read-only >> entries, that’s something that I can do. > > We have more than one options here I guess. > > We could have the hints not available for hugetlbfs, but then we'll both > need to add special document for hugetlbfs (when you write the man-page, > let's say) and also it'll be probably better to fail the ioctls with hints > when applying to hugetlb vmas. > > Leaving them silent to hugetlbfs is another option, it just sounds weird to > me without any kind of documentation so I like this one least. > > Or we could make them work for hugetlbfs too. Not saying that swap will > work some day (but it still could?!) it just seems all consistent as you > said. > > So I'd prefer the last one, but only if you agree. My take is that hints are hints. Following David’s (or was it yours?) feedback, I fixed the description to indicate that this is merely a hint and removed all references to dirty/access bits. The kernel therefore can ignore the hint when it wants to or use it in any other way. I fully agree that this gives the kernel the ability to change the behavior as needed. Note that for write-protected 4KB zero-page (where we share the zero-page) we always set the access-bit, regardless of the hint, because it makes sense: the zero-page is not swappable and therefore the access-bit is set. I think that the lesser user-facing documentation there is on how the feature is *exactly* used by the kernel - is better from an API point of view. So I see no reason to fail or be forced not to set a page as young, just because a hint was *not* provided. This would even be a regression in the behavior. The hint is actually always respected right now, it is just that even if you do not provide the hint, the access/dirty is set. The only consistency I think worth thinking about is with the dirty-bit, and I can add it if you want. Note that the access-bit (in x86) might be set speculatively in contrast to the dirty-bit is only set atomically with a real access. That’s the reason I think it may make sense not to set the dirty without a hint. Is that acceptable? Access-bit always set, dirty-bit according to hint?