Re: folio_mmapped

David Hildenbrand <david@xxxxxxxxxx> · Fri, 1 Mar 2024 12:16:54 +0100

I don't think that we can assume that only a single VMA covers a page.

But of course, no rmap walk is always better.

We've been thinking some more about how to handle the case where the
host userspace has a mapping of a page that later becomes private.

One idea is to refuse to run the guest (i.e., exit vcpu_run() back to
the host with a meaningful exit reason) until the host unmaps that
page, and check the refcount of the page as you mentioned earlier.
This is essentially what the RFC I sent does (minus the bugs :) ).

The other idea is to use the rmap walk as you suggested to zap that
page. If the host tries to access that page again, it would get a
SIGBUS on the fault. This has the advantage that, as you'd mentioned,
the host doesn't need to constantly mmap() and munmap() pages. It
could potentially be optimised further as suggested if we have a
cooperating VMM that would issue a MADV_DONTNEED or something like
that, but that's just an optimisation and we would still need to have
the option of the rmap walk. However, I was wondering how practical
this idea would be if more than a single VMA covers a page?
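
For reference, a minimal userspace sketch of what such a cooperating
VMM could do: drop its page table entries for a shared region with
MADV_DONTNEED instead of calling munmap(). This is an illustration
only. It uses a plain memfd as a stand-in for guest memory, and the
function name and the "shared-region" label are made up. On a shared
file mapping the data stays in the page cache, so a later access
simply faults the page back in.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 0 if the shared data is still visible
 * after MADV_DONTNEED, 1 if it is not, and -1 on setup failure.
 */
int drop_mapping_without_munmap(void)
{
	long page_size = sysconf(_SC_PAGESIZE);
	char *mem;
	int fd, ret;

	assert(page_size > 0);

	/* A plain memfd stands in for the shared guest memory here. */
	fd = memfd_create("shared-region", 0);
	if (fd < 0)
		return -1;
	if (ftruncate(fd, page_size) < 0) {
		close(fd);
		return -1;
	}

	mem = mmap(NULL, page_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (mem == MAP_FAILED) {
		close(fd);
		return -1;
	}
	strcpy(mem, "shared data");

	/* Zap our page table entries; the VMA itself stays in place. */
	ret = madvise(mem, page_size, MADV_DONTNEED);

	/* The next access re-faults the page from the page cache. */
	if (ret == 0)
		ret = strcmp(mem, "shared data") == 0 ? 0 : 1;

	munmap(mem, page_size);
	close(fd);
	return ret;
}
```

The VMA survives the madvise() call, which is why the kernel would
still need the rmap walk as a fallback for a VMM that does not
cooperate.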

Agree with all your points here. I changed Gunyah's implementation to do
the unmap instead of erroring out. I didn't observe a significant
performance difference. However, doing unmap might be a little faster
because we can check folio_mapped() before doing the rmap walk. When
erroring out at mmap() level, we always have to do the walk.

Right. On the mmap() level you won't really have to walk page tables, as
the munmap() already zapped the page and removed the "problematic" VMA.

Likely, you really want to avoid repeatedly calling mmap()+munmap() just 
to access shared memory; but that's just my best guess about your user 
space app :)

Also, there's the question of what to do if the page is gupped? In
this case I think the only thing we can do is refuse to run the guest
until the gup (and all references) are released, which also brings us
back to the way things (kind of) are...

If there are gup users who don't do FOLL_PIN, I think we either need to
fix them or live with the possibility here? We don't have a reliable
refcount to tell when a folio is safe to unmap: it might be that another
vCPU is trying to get the same page, has incremented the refcount, and
is waiting for the folio_lock.

Likely there could be a way to detect that when only the vCPUs are your 
concern? But yes, it's nasty.

(has to be handled in either case :()

Disallowing any FOLL_GET|FOLL_PIN could work. Not sure how some 
core-kernel FOLL_GET users would react to that, though.

See vma_is_secretmem() and folio_is_secretmem() in mm/gup.c, where we 
disallow any FOLL_GET|FOLL_PIN of secretmem pages.

We'd need a way to teach core-mm similarly about guest_memfd, which 
might end up rather tricky, but not impossible :)

This problem exists whether we block the
mmap() or do SIGBUS.

There is work on doing more conversion to FOLL_PIN, but some cases are 
harder to convert. Most of O_DIRECT should be using it nowadays, but 
some other known use cases don't.

The simplest and readily-available example is still vmsplice(). I don't 
think it was fixed yet to use FOLL_PIN.

Use vmsplice() to pin the page in the pipe (read-only). Unmap the VMA. 
You can read the page any time later by reading from the pipe.
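
A minimal userspace sketch of that sequence, in case it helps. It uses
an ordinary anonymous page rather than guest memory, and the function
name and the "secret" string are made up for illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 0 if the page content is still readable
 * from the pipe after the VMA is gone, 1 if it is not, and -1 on setup
 * failure.
 */
int vmsplice_keeps_page(void)
{
	long page_size = sysconf(_SC_PAGESIZE);
	char buf[16] = { 0 };
	struct iovec iov;
	int fds[2];
	char *mem;
	int ret = -1;

	assert(page_size > 0);

	mem = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (mem == MAP_FAILED)
		return -1;
	strcpy(mem, "secret");

	if (pipe(fds) < 0) {
		munmap(mem, page_size);
		return -1;
	}

	/* Hand the page to the pipe; the pipe now references it. */
	iov.iov_base = mem;
	iov.iov_len = page_size;
	if (vmsplice(fds[1], &iov, 1, 0) != (ssize_t)page_size) {
		munmap(mem, page_size);
		goto out;
	}

	/* Unmap the VMA: no mapping of the page is left in this process. */
	munmap(mem, page_size);

	/* The content is still reachable through the pipe. */
	if (read(fds[0], buf, sizeof(buf) - 1) > 0)
		ret = strcmp(buf, "secret") == 0 ? 0 : 1;
out:
	close(fds[0]);
	close(fds[1]);
	return ret;
}
```

After the munmap() there is no mapping left for an rmap walk to find,
yet the page is still held and readable via the pipe.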

So I wouldn't bet on all relevant cases being gone in the near future.

--
Cheers,

David / dhildenb