It looks quite hacky (well, what did I expect from an RFC :) ) you can no
longer distinguish actually poisoned pages from "temporarily poisoned"
pages. FOLL_ALLOW_POISONED sounds especially nasty and dangerous - "I want
to read/write a poisoned page, trust me, I know what I am doing".
Storing the state for each individual page initially sounded like the right
thing to do, but I wonder if we couldn't handle this on a per-VMA level. You
can just remember the handful of shared ranges internally like you do right
now AFAIU.
per-VMA would not fly for file-backed (e.g. tmpfs) memory. We may need to
combine PG_hwpoison with VMA flag. Maybe per-inode tracking would also be
required. Or per-memslot. I donno. Need more experiments.
Indeed.
Note, I use PG_hwpoison now, but if we find a show-stopper issue where we
would see confusion with a real poison, we can switch to new flags and
a new swap_type(). I have not seen a reason yet.
I think we'll want a dedicate mechanism to cleanly mark pages as
"protected". Finding a page flag you can use will be the problematic
part, but should not be impossible if we have a good reason to do so
(even if it means making the feature mutually exclusive with other
features).
From what I get, you want a way to
1. Unmap pages from the user space page tables.
Plain unmap would not work for some use-cases. Some CSPs want to
preallocate memory in a specific way. It's a way to provide a fine-grained
NUMA policy.
The existing mapping has to be converted.
2. Disallow re-faulting of the protected pages into the page tables. On user
space access, you want to deliver some signal (e.g., SIGBUS).
Note that userspace mapping is the only source of pfn's for VM's shadow
mapping. The fault should be allow, but lead to non-present PTE that still
encodes pfn.
Makes sense, but I guess that's the part still to be implemented (see
next comment).
3. Allow selected users to still grab the pages (esp. KVM to fault them into
the page tables).
As long as fault leads to non-present PTEs we are fine. Usespace still may
want to mlock() some of guest memory. There's no reason to prevent this.
I'm curious, even get_user_pages() will lead to a present PTE as is, no?
So that will need modifications I assume. (although I think it
fundamentally differs to the way get_user_pages() works - trigger a
fault first, then lookup the PTE in the page tables).
4. Allow access to currently shared specific pages from user space.
Right now, you achieve
1. Via try_to_unmap()
2. TestSetPageHWPoison
3. TBD (e.g., FOLL_ALLOW_POISONED)
4. ClearPageHWPoison()
If we could bounce all writes to shared pages through the kernel, things
could end up a little easier. Some very rough idea:
We could let user space setup VM memory as
mprotect(PROT_READ) (+ PROT_KERNEL_WRITE?), and after activating protected
memory (I assume via a KVM ioctl), make sure the VMAs cannot be set to
PROT_WRITE anymore. This would already properly unmap and deliver a SIGSEGV
when trying to write from user space.
You could then still access the pages, e.g., via FOLL_FORCE or a new fancy
flag that allows to write with VM_MAYWRITE|VM_DENYUSERWRITE. This would
allow an ioctl to write page content and to map the pages into NPTs.
As an extension, we could think about (re?)mapping some shared pages
read|write. The question is how to synchronize with user space.
I have no idea how expensive would be bouncing writes (and reads?) through
the kernel. Did you ever experiment with that/evaluate that?
It's going to be double bounce buffer: on the guest we force swiotlb to
make it go through shared region. I don't think it's a good idea.
So if it's already slow, do we really care? ;)
There are a number of way to share a memory. It's going to be decided by
the way we get these pages unmapped in the first place.
I agree that shared memory can be somewhat problematic and would require
tracking it per page.
--
Thanks,
David / dhildenb