Re: [RFC PATCH 0/8] mm/madvise: support process_madvise(MADV_DONTNEED)

Nadav Amit <nadav.amit@xxxxxxxxx> · Wed, 13 Oct 2021 08:47:11 -0700

> On Oct 12, 2021, at 4:14 PM, Peter Xu <peterx@xxxxxxxxxx> wrote:
> 
> On Wed, Sep 29, 2021 at 11:31:25AM -0700, Nadav Amit wrote:
>> 
>> 
>>> On Sep 29, 2021, at 12:52 AM, Michal Hocko <mhocko@xxxxxxxx> wrote:
>>> 
>>> On Mon 27-09-21 12:12:46, Nadav Amit wrote:
>>>> 
>>>>> On Sep 27, 2021, at 5:16 AM, Michal Hocko <mhocko@xxxxxxxx> wrote:
>>>>> 
>>>>> On Mon 27-09-21 05:00:11, Nadav Amit wrote:
>>>>> [...]
>>>>>> The manager is notified on memory regions that it should monitor
>>>>>> (through PTRACE/LD_PRELOAD/explicit-API). It then monitors these regions
>>>>>> using the remote-userfaultfd that you saw on the second thread. When it wants
>>>>>> to reclaim (anonymous) memory, it:
>>>>>> 
>>>>>> 1. Uses UFFD-WP to protect that memory (and for this matter I got a vectored
>>>>>> UFFD-WP to do so efficiently, a patch which I did not send yet).
>>>>>> 2. Calls process_vm_readv() to read that memory of that process.
>>>>>> 3. Write it back to “swap”.
>>>>>> 4. Calls process_madvise(MADV_DONTNEED) to zap it.
>>>>> 
>>>>> Why cannot you use MADV_PAGEOUT/MADV_COLD for this usecase?
>>>> 
>>>> Providing hints to the kernel takes you so far to a certain extent.
>>>> The kernel does not want to (for a good reason) to be completely
>>>> configurable when it comes to reclaim and prefetch policies. Doing
>>>> so from userspace allows you to be fully configurable.
>>> 
>>> I am sorry but I do not follow. Your scenario is describing a user
>>> space driven reclaim. Something that MADV_{COLD,PAGEOUT} have been
>>> designed for. What are you missing in the existing functionality?
>> 
>> Using MADV_COLD/MADV_PAGEOUT does not allow userspace to control
>> many aspects of paging out memory:
>> 
>> 1. Writeback: writeback ahead of time, dynamic clustering, etc.
>> 2. Batching (regardless, MADV_PAGEOUT does pretty bad batching job
>>  on non-contiguous memory).
>> 3. No guarantee the page is actually reclaimed (e.g., writeback)
>>  and the time it takes place.
>> 4. I/O stack for swapping - you must use kernel I/O stack (FUSE
>>  as non-performant as it is cannot be used for swap AFAIK).
>> 5. Other operations (e.g., locking, working set tracking) that
>>  might not be necessary or interfere.
>> 
>> In addition, the use of MADV_COLD/MADV_PAGEOUT prevents the use
>> of userfaultfd to trap page-faults and react accordingly, so you
>> are also prevented from:
>> 
>> 6. Having your own custom prefetching policy in response to #PF.
>> 
>> There are additional use-cases I can try to formalize in which
>> MADV_COLD/MADV_PAGEOUT is insufficient. But the main difference
>> is pretty clear, I think: one is a hint that only applied to
>> page reclamation. The other enables the direct control of
>> userspace over (almost) all aspects of paging.
>> 
>> As I suggested before, if it is preferred, this can be a UFFD
>> IOCTL instead of process_madvise() behavior, thereby lowering
>> the risk of a misuse.
> 
> (Sorry to join so late..)
> 
> Yeah I'm wondering whether that could add one extra layer of security.  But as
> you mentioned, we've already have process_vm_writev(), then it's indeed not
> strong reason to reject process_madvise(DONTNEED) too, it seems.
> 
> Not sure whether you're aware of the umap project from LLNL:
> 
> https://github.com/LLNL/umap
> 
> From what I can tell, that's really doing very similar thing as what you
> proposed here, but it's just a local version of things.  IOW in umap the
> DONTNEED can be done locally with madvise() already in the umap maintained
> threads.  That close the need to introduce the new process_madvise() interface
> and it's definitely safer as it's per-mm and per-task.
> 
> I think you mentioned above that the tracee program will need to cooperate in
> this case, I'm wondering whether some solution like umap would be fine too as
> that also requires cooperation of the tracee program, it's just that the
> cooperation may be slightly more than your solution but frankly I think that's
> still trivial and before I understand the details of your solution I can't
> really tell..
> 
> E.g. for a program to use umap, I think it needs to replace mmap() to umap()
> where we want the buffers to be managed by umap library rather than the kernel,
> then link against the umap library should work.  If the remote solution you're
> proposing requires similar (or even more complicated) cooperation, then it'll
> be controversial whether that can be done per-mm just like how umap designed
> and used.  So IMHO it'll be great to share more details on those parts if umap
> cannot satisfy the current need - IMHO it satisfies all the features you
> described on fully customized pageout and page faulting in, it's just done in a
> single mm.

Thanks for you feedback, Peter.

I am familiar with umap, perhaps not enough, but I am aware.

From my experience, the existing interfaces are not sufficient if you look
for high performance (low overhead) solution for multiple processes. The
level of cooperation that I mentioned is something that I mentioned
preemptively to avoid unnecessary discussion, but I believe they can be
resolved (I have just deferred handling them).

Specifically for performance, several new kernel features are needed, for
instance, support for iouring with async operations, a vectored
UFFDIO_WRITEPROTECT(V) which batches TLB flushes across VMAs and a
vectored madvise(). Even if we talk on the context of a single mm, I
cannot see umap being performant for low latency devices without those
facilities.

Anyhow, I take your feedback and I will resend the patch for enabling
MADV_DONTNEED with other patches once I am done. As for the TLB batching
itself, I think it has an independent value - but I am not going to
argue about it now if there is a pushback against it.

Attachment:
signature.asc

Description: Message signed with OpenPGP