Re: [RFC PATCH 00/16] mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE

On 05.03.25 20:26, Lorenzo Stoakes wrote:
On Wed, Mar 05, 2025 at 08:19:41PM +0100, David Hildenbrand wrote:
On 05.03.25 19:56, Matthew Wilcox wrote:
On Wed, Mar 05, 2025 at 10:15:55AM -0800, SeongJae Park wrote:
For MADV_DONTNEED[_LOCKED] or MADV_FREE madvise requests, tlb flushes
can happen for each vma of the given address ranges.  Because such tlb
flushes are for address ranges of the same process, doing those in a batch
is more efficient while still being safe.  Modify the madvise() and
process_madvise() entry level code paths to do such batched tlb flushes,
while the internal unmap logic only gathers the tlb entries to
flush.

Do real applications actually do madvise requests that span multiple
VMAs?  It just seems weird to me.  Like, each vma comes from a separate
call to mmap [1], so why would it make sense for an application to
call madvise() across a VMA boundary?
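
For concreteness, a minimal userspace sketch (hypothetical, just to illustrate
the case in question) where a single madvise() call does cross a VMA boundary:
two halves of one anonymous mapping are given different protections so the
kernel keeps two separate VMAs, and one MADV_DONTNEED spans both.

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 2 * 1024 * 1024;

	/* Reserve 2 MiB, then change protection on the second half so the
	 * kernel keeps two separate VMAs instead of merging them. */
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	if (mprotect(p + len / 2, len / 2, PROT_READ)) {
		perror("mprotect");
		return 1;
	}

	/* One madvise() call spanning both VMAs. */
	if (madvise(p, len, MADV_DONTNEED))
		perror("madvise");
	else
		printf("MADV_DONTNEED across the VMA boundary succeeded\n");
	return 0;
}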

I had the same question. If this happens in an app, I would assume that a
single MADV_DONTNEED call would usually not span multiple VMAs, and if it
does, not across so many (or so often) that we would really care about it.

OTOH, optimizing tlb flushing when using a vectored MADV_DONTNEED version
would make more sense to me. I don't recall if process_madvise() allows for
that already, and if it does, is this series primarily tackling optimizing
that?
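
As a hedged sketch of what that vectored variant would look like from user
space (hypothetical demo; whether the kernel accepts MADV_DONTNEED through
process_madvise() at all depends on the kernel version, which is exactly the
open question here), several independent ranges go into one syscall:

#include <stdio.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
	size_t len = 1024 * 1024;
	char *a = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	char *b = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (a == MAP_FAILED || b == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	struct iovec iov[2] = {
		{ .iov_base = a, .iov_len = len },
		{ .iov_base = b, .iov_len = len },
	};
	int pidfd = syscall(__NR_pidfd_open, getpid(), 0);

	/* One syscall covering several independent ranges: a natural place
	 * to batch the TLB flushes across all of them.  Requires a kernel
	 * (and headers) new enough to permit this advice via process_madvise(). */
	if (syscall(__NR_process_madvise, pidfd, iov, 2, MADV_DONTNEED, 0) < 0)
		perror("process_madvise");
	return 0;
}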

Yeah it's weird, but people can get caught out by unexpected failures to merge
if they do fun stuff with mremap().

Then again mremap() itself _mandates_ that you only span a single VMA (or part
of one) :)

Maybe some garbage collection use cases that shuffle individual pages, and later free larger chunks using MADV_DONTNEED. Doesn't sound unlikely.


Can we talk about the _true_ horror show - that you can span multiple VMAs _with
gaps_ and it'll allow it, only to return -ENOMEM at the end?

In madvise_walk_vmas():

	for (;;) {
		...

		/* Gap before this VMA: remember it, but keep walking. */
		if (start < vma->vm_start) {
			unmapped_error = -ENOMEM;
			start = vma->vm_start;
			...
		}

		...

		/* A real failure from the operation returns immediately. */
		error = visit(vma, &prev, start, tmp, arg);
		if (error)
			return error;

		...
	}

	/* Otherwise the gap's -ENOMEM (if any) is what the caller sees. */
	return unmapped_error;

So, you have no idea if that -ENOMEM is due to a gap, or due to the
operation itself returning -ENOMEM?

I mean, can we just drop this? Does anybody in their right mind rely on
this? Or is it intentional, to somehow deal with a racing unmap?

But, no, we hold the mmap lock so that's not it.
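
For anyone who wants to see that ambiguity firsthand, a tiny hypothetical
userspace demo (not part of the series): punch a hole in a mapping and madvise
across it - the mapped pages are still processed, yet the call reports -ENOMEM.

#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	long pg = sysconf(_SC_PAGESIZE);
	char *p = mmap(NULL, 3 * pg, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return 1;

	/* Punch a hole in the middle so the range spans VMA, gap, VMA. */
	munmap(p + pg, pg);

	/* The mapped pages before and after the hole are still zapped,
	 * but the call returns -1 with errno == ENOMEM because of the gap,
	 * indistinguishable from the operation itself failing with -ENOMEM. */
	if (madvise(p, 3 * pg, MADV_DONTNEED))
		perror("madvise across a gap");
	return 0;
}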

Races could still happen if user space did this from separate threads. But then, nothing would prevent user space from doing another mmap() and getting those pages zapped ... so that sounds unlikely.


Yeah OK so can we drop this madness? :) or am I missing some very important
detail about why we allow this?

I stumbled over that myself a while ago. It's well documented behavior in the man page :(

At that point I stopped caring, because apparently somebody else cared enough to document that clearly in the man page :)


I guess we _have_ to leave spanning multiple VMAs in, because plausibly
there are users of that out there?

Spanning multiple VMAs can probably happen fairly easily. At least in QEMU I know some sane ways to trigger it on guest memory. But these are all corner cases, so nothing relevant for performance.


--
Cheers,

David / dhildenb




