> On Jan 4, 2023, at 2:52 PM, Peter Xu <peterx@xxxxxxxxxx> wrote: > > Before this patch, when there's any pgtable allocation issues happened > during change_protection(), the error will be ignored from the syscall. > For shmem, there will be an error dumped into the host dmesg. Two issues > with that: > > (1) Doing a trace dump when allocation fails is not anything close to > grace.. > > (2) The user should be notified with any kind of such error, so the user > can trap it and decide what to do next, either by retrying, or stop > the process properly, or anything else. > > For userfault users, this will change the API of UFFDIO_WRITEPROTECT when > pgtable allocation failure happened. It should not normally break anyone, > though. If it breaks, then in good ways. > > One man-page update will be on the way to introduce the new -ENOMEM for > UFFDIO_WRITEPROTECT. Not marking stable so we keep the old behavior on the > 5.19-till-now kernels. I understand that the current assumption is that change_protection() should fully succeed or fail, and I guess this is the current behavior. However, to be more “future-proof” perhaps this needs to be revisited. For instance, UFFDIO_WRITEPROTECT can benefit from the ability to (based on userspace request) prevent write-protection of pages that are pinned. This is necessary to allow userspace uffd monitor to avoid write-protection of O_DIRECT’d memory, for instance, that might change even if a uffd monitor considers it write-protected. In such a case, a “partial failure” is possible, since only part of the memory was write-protected. The uffd monitor should be allowed to continue execution, but it has to know the part of the memory that was successfully write-protected. To support “partial failure”, the kernel should return to UFFDIO_WRITEPROTECT-users the number of pages/bytes that were not successfully write-protected, unless no memory was successfully write-protected. (Unlike NUMA, pages that were skipped should be accounted as “successfully write-protected"). I am only raising this subject to avoid multiple API changes.