On Mon, Aug 2, 2021 at 5:21 AM Alejandro Colomar (man-pages) <alx.manpages@xxxxxxxxx> wrote: > > Hi Mike, Axel, > > On 8/2/21 1:21 PM, Mike Rapoport wrote: > > (added man-pages maintainers) > > Thanks! If I'm not CCed, I may not notice the email, depending on the > traffic of the lists, and the amount of time I have ;) > > > > > On Tue, Jul 27, 2021 at 09:32:34AM -0700, Axel Rasmussen wrote: > >> Any remaining issues with this patch? I just realized today it was > >> never merged. 5.13 (which contains this new feature) was released some > >> weeks ago. > > Please see some minor formatting issues I commented below. > Other than that, it looks good to me. > > Thanks, > > Alex > > >> > >> On Fri, Jun 4, 2021 at 12:56 PM Axel Rasmussen <axelrasmussen@xxxxxxxxxx> wrote: > >>> > >>> Userfaultfd minor fault mode is supported starting from Linux 5.13. > >>> > >>> This commit adds a description of the new mode, as well as the new ioctl > >>> used to resolve such faults. The two go hand-in-hand: one can't resolve > >>> a minor fault without continue, and continue can't be used to resolve > >>> any other kind of fault. > >>> > >>> This patch covers just the hugetlbfs implementation (in 5.13). Support > >>> for shmem is forthcoming, but as it has not yet made it into a kernel > >>> release candidate, it will be added in a future commit. > >>> > >>> Signed-off-by: Axel Rasmussen <axelrasmussen@xxxxxxxxxx> > >>> --- > >>> man2/ioctl_userfaultfd.2 | 125 ++++++++++++++++++++++++++++++++++++--- > >>> man2/userfaultfd.2 | 79 ++++++++++++++++++++----- > >>> 2 files changed, 182 insertions(+), 22 deletions(-) > >>> > >>> diff --git a/man2/ioctl_userfaultfd.2 b/man2/ioctl_userfaultfd.2 > >>> index 504f61d4b..7b990c24a 100644 > >>> --- a/man2/ioctl_userfaultfd.2 > >>> +++ b/man2/ioctl_userfaultfd.2 > >>> @@ -214,6 +214,10 @@ memory accesses to the regions registered with userfaultfd. > >>> If this feature bit is set, > >>> .I uffd_msg.pagefault.feat.ptid > >>> will be set to the faulted thread ID for each page-fault message. > >>> +.TP > >>> +.BR UFFD_FEATURE_MINOR_HUGETLBFS " (since Linux 5.13)" > >>> +If this feature bit is set, the kernel supports registering userfaultfd ranges > >>> +in minor mode on hugetlbfs-backed memory areas. > > See the folowing extract from man-pages(7): > > Use semantic newlines > In the source of a manual page, new sentences should be > started on new lines, and long sentences should be split > into lines at clause breaks (commas, semicolons, colons, > and so on). This convention, sometimes known as "semantic > newlines", makes it easier to see the effect of patches, > which often operate at the level of individual sentences or > sentence clauses. > > A trick to check if some text is correct at first glance, is that the > following regex should rarely match: > [,;:.] \+\w > > Multi-sentence parenthetical expressions should also go on separate > lines normally. > > I'd for example break the above text into: > > [ > If this feature bit is set, > the kernel supports registering userfaultfd ranges > in minor mode on hugetlbfs-backed memory areas > ] > > Note the break after the comma, and another break at a sensible point > (you already did that one correctly in this example, but some below don't). > > > >>> .PP > >>> The returned > >>> .I ioctls > >>> @@ -240,6 +244,11 @@ operation is supported. > >>> The > >>> .B UFFDIO_WRITEPROTECT > >>> operation is supported. > >>> +.TP > >>> +.B 1 << _UFFDIO_CONTINUE > >>> +The > >>> +.B UFFDIO_CONTINUE > >>> +operation is supported. > >>> .PP > >>> This > >>> .BR ioctl (2) > >>> @@ -278,14 +287,8 @@ by the current kernel version. > >>> (Since Linux 4.3.) > >>> Register a memory address range with the userfaultfd object. > >>> The pages in the range must be "compatible". > >>> -.PP > >>> -Up to Linux kernel 4.11, > >>> -only private anonymous ranges are compatible for registering with > >>> -.BR UFFDIO_REGISTER . > >>> -.PP > >>> -Since Linux 4.11, > >>> -hugetlbfs and shared memory ranges are also compatible with > >>> -.BR UFFDIO_REGISTER . > >>> +Please refer to the list of register modes below for the compatible memory > >>> +backends for each mode. > > Regarding semantic newlines mentioned above: > > Here for example, a more sensible point to break the line would be just > after (or maybe before, up to you) the first "for". > > >>> .PP > >>> The > >>> .I argp > >>> @@ -324,9 +327,16 @@ the specified range: > >>> .TP > >>> .B UFFDIO_REGISTER_MODE_MISSING > >>> Track page faults on missing pages. > >>> +Since Linux 4.3, only private anonymous ranges are compatible. > >>> +Since Linux 4.11, hugetlbfs and shared memory ranges are also compatible. > >>> .TP > >>> .B UFFDIO_REGISTER_MODE_WP > >>> Track page faults on write-protected pages. > >>> +Since Linux 5.7, only private anonymous ranges are compatible. > >>> +.TP > >>> +.B UFFDIO_REGISTER_MODE_MINOR > >>> +Track minor page faults. > >>> +Since Linux 5.13, only hugetlbfs ranges are compatible. > >>> .PP > >>> If the operation is successful, the kernel modifies the > >>> .I ioctls > >>> @@ -735,6 +745,105 @@ or not registered with userfaultfd write-protect mode. > >>> .TP > >>> .B EFAULT > >>> Encountered a generic fault during processing. > >>> +.\" > >>> +.SS UFFDIO_CONTINUE > >>> +(Since Linux 5.13.) > >>> +Resolve a minor page fault by installing page table entries for existing pages > >>> +in the page cache. > >>> +.PP > >>> +The > >>> +.I argp > >>> +argument is a pointer to a > >>> +.I uffdio_continue > >>> +structure as shown below: > >>> +.PP > >>> +.in +4n > >>> +.EX > >>> +struct uffdio_continue { > >>> + struct uffdio_range range; /* Range to install PTEs for and continue */ > >>> + __u64 mode; /* Flags controlling the behavior of continue */ > >>> + __s64 mapped; /* Number of bytes mapped, or negated error */ > >>> +}; > >>> +.EE > >>> +.in > >>> +.PP > >>> +The following value may be bitwise ORed in > >>> +.IR mode > >>> +to change the behavior of the > >>> +.B UFFDIO_CONTINUE > >>> +operation: > >>> +.TP > >>> +.B UFFDIO_CONTINUE_MODE_DONTWAKE > >>> +Do not wake up the thread that waits for page-fault resolution. > >>> +.PP > >>> +The > >>> +.I mapped > >>> +field is used by the kernel to return the number of bytes > >>> +that were actually mapped, or an error in the same manner as > >>> +.BR UFFDIO_COPY . > >>> +If the value returned in the > >>> +.I mapped > >>> +field doesn't match the value that was specified in > >>> +.IR range.len , > >>> +the operation fails with the error > >>> +.BR EAGAIN . > >>> +The > >>> +.I mapped > >>> +field is output-only; > >>> +it is not read by the > >>> +.B UFFDIO_CONTINUE > >>> +operation. > >>> +.PP > >>> +This > >>> +.BR ioctl (2) > >>> +operation returns 0 on success. > >>> +In this case, the entire area was mapped. > >>> +On error, \-1 is returned and > >>> +.I errno > >>> +is set to indicate the error. > >>> +Possible errors include: > >>> +.TP > >>> +.B EAGAIN > >>> +The number of bytes mapped (i.e., the value returned in the > >>> +.I mapped > >>> +field) does not equal the value that was specified in the > >>> +.I range.len > >>> +field. > >>> +.TP > >>> +.B EINVAL > >>> +Either > >>> +.I range.start > >>> +or > >>> +.I range.len > >>> +was not a multiple of the system page size; or > >>> +.I range.len > >>> +was zero; or the range specified was invalid. > >>> +.TP > >>> +.B EINVAL > >>> +An invalid bit was specified in the > >>> +.IR mode > >>> +field. > >>> +.TP > >>> +.B EEXIST > >>> +One or more pages were already mapped in the given range. > >>> +.TP > >>> +.B ENOENT > >>> +The faulting process has changed its virtual memory layout simultaneously with > >>> +an outstanding > >>> +.B UFFDIO_CONTINUE > >>> +operation. > >>> +.TP > >>> +.B ENOMEM > >>> +Allocating memory needed to setup the page table mappings failed. > >>> +.TP > >>> +.B EFAULT > >>> +No existing page could be found in the page cache for the given range. > >>> +.TP > >>> +.BR ESRCH > >>> +The faulting process has exited at the time of a > >>> +.B UFFDIO_CONTINUE > >>> +operation. > >>> +.\" > >>> .SH RETURN VALUE > >>> See descriptions of the individual operations, above. > >>> .SH ERRORS > >>> diff --git a/man2/userfaultfd.2 b/man2/userfaultfd.2 > >>> index 593c189d8..07f53c6ff 100644 > >>> --- a/man2/userfaultfd.2 > >>> +++ b/man2/userfaultfd.2 > >>> @@ -78,7 +78,7 @@ all memory ranges that were registered with the object are unregistered > >>> and unread events are flushed. > >>> .\" > >>> .PP > >>> -Userfaultfd supports two modes of registration: > >>> +Userfaultfd supports three modes of registration: > >>> .TP > >>> .BR UFFDIO_REGISTER_MODE_MISSING " (since 4.10)" > >>> When registered with > >>> @@ -92,6 +92,18 @@ or an > >>> .B UFFDIO_ZEROPAGE > >>> ioctl. > >>> .TP > >>> +.BR UFFDIO_REGISTER_MODE_MINOR " (since 5.13)" > >>> +When registered with > >>> +.B UFFDIO_REGISTER_MODE_MINOR > >>> +mode, user-space will receive a page-fault notification > > s/user-space/user space/ > > See the following extract from man-pages(7): > > Preferred terms > The following table lists some preferred terms to use in > man pages, mainly to ensure consistency across pages. > > Term Avoid using Notes > ───────────────────────────────────────────────────────────── > [...] > user space userspace > > However, when user space is used as an adjective, per the usual English > rules, we write "user-space". Example: "a user-space program". 100% agreed that "user space" is more correct, but this man page already has many instances of "user-space" in it. I'd suggest we either fix all of them, or just follow the existing convention within this page. How about, leaving this as-is for this patch, to keep the diff tidy, and I can send a follow-up patch to fix all the instances of this in this page? > > >>> +when a minor page fault occurs. > >>> +That is, when a backing page is in the page cache, but > >>> +page table entries don't yet exist. > >>> +The faulted thread will be stopped from execution until the page fault is > >>> +resolved from user-space by an > >>> +.B UFFDIO_CONTINUE > >>> +ioctl. > >>> +.TP > >>> .BR UFFDIO_REGISTER_MODE_WP " (since 5.7)" > >>> When registered with > >>> .B UFFDIO_REGISTER_MODE_WP > >>> @@ -212,9 +224,10 @@ a page fault occurring in the requested memory range, and satisfying > >>> the mode defined at the registration time, will be forwarded by the kernel to > >>> the user-space application. > >>> The application can then use the > >>> -.B UFFDIO_COPY > >>> +.B UFFDIO_COPY , > >>> +.B UFFDIO_ZEROPAGE , > >>> or > >>> -.B UFFDIO_ZEROPAGE > >>> +.B UFFDIO_CONTINUE > >>> .BR ioctl (2) > >>> operations to resolve the page fault. > >>> .PP > >>> @@ -318,6 +331,43 @@ should have the flag > >>> cleared upon the faulted page or range. > >>> .PP > >>> Write-protect mode supports only private anonymous memory. > >>> +.\" > >>> +.SS Userfaultfd minor fault mode (since 5.13) > >>> +Since Linux 5.13, userfaultfd supports minor fault mode. > >>> +In this mode, fault messages are produced not for major faults (where the > >>> +page was missing), but rather for minor faults, where a page exists in the page > >>> +cache, but the page table entries are not yet present. > >>> +The user needs to first check availability of this feature using > >>> +.B UFFDIO_API > >>> +ioctl against the feature bit > >>> +.B UFFD_FEATURE_MINOR_HUGETLBFS > >>> +before using this feature. > >>> +.PP > >>> +To register with userfaultfd minor fault mode, the user needs to initiate the > >>> +.B UFFDIO_REGISTER > >>> +ioctl with mode > >>> +.B UFFD_REGISTER_MODE_MINOR > >>> +set. > >>> +.PP > >>> +When a minor fault occurs, user-space will receive a page-fault notification > >>> +whose > >>> +.I uffd_msg.pagefault.flags > >>> +will have the > >>> +.B UFFD_PAGEFAULT_FLAG_MINOR > >>> +flag set. > >>> +.PP > >>> +To resolve a minor page fault, the handler should decide whether or not the > >>> +existing page contents need to be modified first. > >>> +If so, this should be done in-place via a second, non-userfaultfd-registered > >>> +mapping to the same backing page (e.g., by mapping the hugetlbfs file twice). > >>> +Once the page is considered "up to date", the fault can be resolved by > >>> +initiating an > >>> +.B UFFDIO_CONTINUE > >>> +ioctl, which installs the page table entries and (by default) wakes up the > >>> +faulting thread(s). > >>> +.PP > >>> +Minor fault mode supports only hugetlbfs-backed memory. > >>> +.\" > >>> .SS Reading from the userfaultfd structure > >>> Each > >>> .BR read (2) > >>> @@ -456,19 +506,20 @@ For > >>> the following flag may appear: > >>> .RS > >>> .TP > >>> -.B UFFD_PAGEFAULT_FLAG_WRITE > >>> -If the address is in a range that was registered with the > >>> -.B UFFDIO_REGISTER_MODE_MISSING > >>> -flag (see > >>> -.BR ioctl_userfaultfd (2)) > >>> -and this flag is set, this a write fault; > >>> -otherwise it is a read fault. > >>> +.B UFFD_PAGEFAULT_FLAG_WP > >>> +If this flag is set, then the fault was a write-protect fault. > >>> .TP > >>> +.B UFFD_PAGEFAULT_FLAG_MINOR > >>> +If this flag is set, then the fault was a minor fault. > >>> +.TP > >>> +.B UFFD_PAGEFAULT_FLAG_WRITE > >>> +If this flag is set, then the fault was a write fault. > >>> +.HP > > See the following extract from groff_man(7): > > Deprecated features > Use of the following is discouraged. > > [...] > > .HP [indent] > Set up a paragraph with a hanging left indentation. > The indent argument, if present, is handled as with > .TP. > > Use of this presentation‐level macro is deprecated. > While it is universally portable to legacy Unix sys‐ > tems, a hanging indentation cannot be expressed nat‐ > urally under HTML, and many HTML‐based manual view‐ > ers simply interpret it as a starter for a normal > paragraph. Thus, any information or distinction you > tried to express with the indentation may be lost. > > I'd just use .PP here, I think. > > > >>> +If neither > >>> .B UFFD_PAGEFAULT_FLAG_WP > >>> -If the address is in a range that was registered with the > >>> -.B UFFDIO_REGISTER_MODE_WP > >>> -flag, when this bit is set, it means it is a write-protect fault. > >>> -Otherwise it is a page-missing fault. > >>> +nor > >>> +.B UFFD_PAGEFAULT_FLAG_MINOR > >>> +are set, then the fault was a missing fault. > >>> .RE > >>> .TP > >>> .I pagefault.feat.pid > >>> -- > >>> 2.32.0.rc1.229.g3e70b5a671-goog > >>> > >> > > > > > -- > Alejandro Colomar > Linux man-pages comaintainer; https://www.kernel.org/doc/man-pages/ > http://www.alejandro-colomar.es/