> On Oct 12, 2021, at 4:14 PM, Peter Xu <peterx@xxxxxxxxxx> wrote:
>
> On Wed, Sep 29, 2021 at 11:31:25AM -0700, Nadav Amit wrote:
>>
>>
>>> On Sep 29, 2021, at 12:52 AM, Michal Hocko <mhocko@xxxxxxxx> wrote:
>>>
>>> On Mon 27-09-21 12:12:46, Nadav Amit wrote:
>>>>
>>>>> On Sep 27, 2021, at 5:16 AM, Michal Hocko <mhocko@xxxxxxxx> wrote:
>>>>>
>>>>> On Mon 27-09-21 05:00:11, Nadav Amit wrote:
>>>>> [...]
>>>>>> The manager is notified on memory regions that it should monitor
>>>>>> (through PTRACE/LD_PRELOAD/explicit-API). It then monitors these regions
>>>>>> using the remote-userfaultfd that you saw on the second thread. When it wants
>>>>>> to reclaim (anonymous) memory, it:
>>>>>>
>>>>>> 1. Uses UFFD-WP to protect that memory (and for this matter I got a vectored
>>>>>> UFFD-WP to do so efficiently, a patch which I did not send yet).
>>>>>> 2. Calls process_vm_readv() to read that memory of that process.
>>>>>> 3. Write it back to “swap”.
>>>>>> 4. Calls process_madvise(MADV_DONTNEED) to zap it.
>>>>>
>>>>> Why cannot you use MADV_PAGEOUT/MADV_COLD for this usecase?
>>>>
>>>> Providing hints to the kernel takes you so far to a certain extent.
>>>> The kernel does not want to (for a good reason) to be completely
>>>> configurable when it comes to reclaim and prefetch policies. Doing
>>>> so from userspace allows you to be fully configurable.
>>>
>>> I am sorry but I do not follow. Your scenario is describing a user
>>> space driven reclaim. Something that MADV_{COLD,PAGEOUT} have been
>>> designed for. What are you missing in the existing functionality?
>>
>> Using MADV_COLD/MADV_PAGEOUT does not allow userspace to control
>> many aspects of paging out memory:
>>
>> 1. Writeback: writeback ahead of time, dynamic clustering, etc.
>> 2. Batching (regardless, MADV_PAGEOUT does pretty bad batching job
>> on non-contiguous memory).
>> 3. No guarantee the page is actually reclaimed (e.g., writeback)
>> and the time it takes place.
>> 4. I/O stack for swapping - you must use kernel I/O stack (FUSE
>> as non-performant as it is cannot be used for swap AFAIK).
>> 5. Other operations (e.g., locking, working set tracking) that
>> might not be necessary or interfere.
>>
>> In addition, the use of MADV_COLD/MADV_PAGEOUT prevents the use
>> of userfaultfd to trap page-faults and react accordingly, so you
>> are also prevented from:
>>
>> 6. Having your own custom prefetching policy in response to #PF.
>>
>> There are additional use-cases I can try to formalize in which
>> MADV_COLD/MADV_PAGEOUT is insufficient. But the main difference
>> is pretty clear, I think: one is a hint that only applied to
>> page reclamation. The other enables the direct control of
>> userspace over (almost) all aspects of paging.
>>
>> As I suggested before, if it is preferred, this can be a UFFD
>> IOCTL instead of process_madvise() behavior, thereby lowering
>> the risk of a misuse.
>
> (Sorry to join so late..)
>
> Yeah I'm wondering whether that could add one extra layer of security. But as
> you mentioned, we've already have process_vm_writev(), then it's indeed not
> strong reason to reject process_madvise(DONTNEED) too, it seems.
>
> Not sure whether you're aware of the umap project from LLNL:
>
> https://github.com/LLNL/umap
>
> From what I can tell, that's really doing very similar thing as what you
> proposed here, but it's just a local version of things. IOW in umap the
> DONTNEED can be done locally with madvise() already in the umap maintained
> threads. That close the need to introduce the new process_madvise() interface
> and it's definitely safer as it's per-mm and per-task.
>
> I think you mentioned above that the tracee program will need to cooperate in
> this case, I'm wondering whether some solution like umap would be fine too as
> that also requires cooperation of the tracee program, it's just that the
> cooperation may be slightly more than your solution but frankly I think that's
> still trivial and before I understand the details of your solution I can't
> really tell..
>
> E.g. for a program to use umap, I think it needs to replace mmap() to umap()
> where we want the buffers to be managed by umap library rather than the kernel,
> then link against the umap library should work. If the remote solution you're
> proposing requires similar (or even more complicated) cooperation, then it'll
> be controversial whether that can be done per-mm just like how umap designed
> and used. So IMHO it'll be great to share more details on those parts if umap
> cannot satisfy the current need - IMHO it satisfies all the features you
> described on fully customized pageout and page faulting in, it's just done in a
> single mm.

Thanks for your feedback, Peter.

I am familiar with umap, perhaps not enough, but I am aware.

From my experience, the existing interfaces are not sufficient if you are
looking for a high-performance (low-overhead) solution for multiple processes.
The cooperation issues that I mentioned were raised preemptively, to avoid
unnecessary discussion, but I believe they can be resolved (I have just
deferred handling them).

Specifically for performance, several new kernel features are needed, for
instance, support for io_uring with async operations, a vectored
UFFDIO_WRITEPROTECT(V) which batches TLB flushes across VMAs, and a vectored
madvise(). Even if we talk in the context of a single mm, I cannot see umap
being performant for low-latency devices without those facilities.

Anyhow, I take your feedback and I will resend the patch for enabling
MADV_DONTNEED with other patches once I am done.
As for the TLB batching itself, I think it has independent value, but I am
not going to argue about it now if there is pushback against it.
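
For readers following the thread, here is a minimal sketch of the local
(single-mm) variant Peter describes, where the zap is done with plain
madvise(MADV_DONTNEED) in the owning process. It is not the remote scheme
from the patch series: there is no userfaultfd write-protection step (the
demo is single-threaded, so nothing can race with the copy), and a temporary
file stands in for "swap". The helpers page_out(), page_in() and
reclaim_roundtrip() are hypothetical names made up for this illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: copy the range to the "swap" file, then zap it. */
static int page_out(int swap_fd, void *addr, size_t len, off_t off)
{
    /* Analogue of steps 2-3: read the memory and write it to backing store. */
    if (pwrite(swap_fd, addr, len, off) != (ssize_t)len)
        return -1;
    /* Analogue of step 4: drop the private anonymous pages. */
    return madvise(addr, len, MADV_DONTNEED);
}

/* Hypothetical helper: restore the range from the "swap" file. */
static int page_in(int swap_fd, void *addr, size_t len, off_t off)
{
    return pread(swap_fd, addr, len, off) == (ssize_t)len ? 0 : -1;
}

/* Returns 0 if the page-out/page-in round trip behaves as expected. */
static int reclaim_roundtrip(void)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);
    size_t len = (size_t)page;
    int rc = 0;

    unsigned char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return 1;

    FILE *swap = tmpfile();          /* stand-in for a swap device */
    if (swap == NULL) {
        munmap(buf, len);
        return 2;
    }
    int fd = fileno(swap);

    memset(buf, 0xAB, len);          /* give the page recognizable contents */

    if (page_out(fd, buf, len, 0) != 0)
        rc = 3;
    else if (buf[0] != 0 || buf[len - 1] != 0)
        rc = 4;                      /* zapped private anon memory reads as zero */
    else if (page_in(fd, buf, len, 0) != 0)
        rc = 5;
    else if (buf[0] != 0xAB || buf[len - 1] != 0xAB)
        rc = 6;                      /* contents must match what was saved */

    fclose(swap);
    munmap(buf, len);
    return rc;
}
```

In a real pager the page-in would be triggered from a userfaultfd fault
handler rather than called directly, and the ranges would be batched; the
sketch only shows the save, zap, restore ordering.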