On Monday 04 Apr 2022 at 15:04:17 (-0700), Andy Lutomirski wrote: > > > On Mon, Apr 4, 2022, at 10:06 AM, Sean Christopherson wrote: > > On Mon, Apr 04, 2022, Quentin Perret wrote: > >> On Friday 01 Apr 2022 at 12:56:50 (-0700), Andy Lutomirski wrote: > >> FWIW, there are a couple of reasons why I'd like to have in-place > >> conversions: > >> > >> - one goal of pKVM is to migrate some things away from the Arm > >> Trustzone environment (e.g. DRM and the likes) and into protected VMs > >> instead. This will give Linux a fighting chance to defend itself > >> against these things -- they currently have access to _all_ memory. > >> And transitioning pages between Linux and Trustzone (donations and > >> shares) is fast and non-destructive, so we really do not want pKVM to > >> regress by requiring the hypervisor to memcpy things; > > > > Is there actually a _need_ for the conversion to be non-destructive? > > E.g. I assume > > the "trusted" side of things will need to be reworked to run as a pKVM > > guest, at > > which point reworking its logic to understand that conversions are > > destructive and > > slow-ish doesn't seem too onerous. > > > >> - it can be very useful for protected VMs to do shared=>private > >> conversions. Think of a VM receiving some data from the host in a > >> shared buffer, and then it wants to operate on that buffer without > >> risking to leak confidential informations in a transient state. In > >> that case the most logical thing to do is to convert the buffer back > >> to private, do whatever needs to be done on that buffer (decrypting a > >> frame, ...), and then share it back with the host to consume it; > > > > If performance is a motivation, why would the guest want to do two > > conversions > > instead of just doing internal memcpy() to/from a private page? I > > would be quite > > surprised if multiple exits and TLB shootdowns is actually faster, > > especially at > > any kind of scale where zapping stage-2 PTEs will cause lock contention > > and IPIs. > > I don't know the numbers or all the details, but this is arm64, which is a rather better architecture than x86 in this regard. So maybe it's not so bad, at least in very simple cases, ignoring all implementation details. (But see below.) Also the systems in question tend to have fewer CPUs than some of the massive x86 systems out there. Yep. I can try and do some measurements if that's really necessary, but I'm really convinced the cost of the TLBI for the shared->private conversion is going to be significantly smaller than the cost of memcpy the buffer twice in the guest for us. To be fair, although the cost for the CPU update is going to be low, the cost for IOMMU updates _might_ be higher, but that very much depends on the hardware. On systems that use e.g. the Arm SMMU, the IOMMUs can use the CPU page-tables directly, and the iotlb invalidation is done on the back of the CPU invalidation. So, on systems with sane hardware the overhead is *really* quite small. Also, memcpy requires double the memory, it is pretty bad for power, and it causes memory traffic which can't be a good thing for things running concurrently. > If we actually wanted to support transitioning the same page between shared and private, though, we have a bit of an awkward situation. Private to shared is conceptually easy -- do some bookkeeping, reconstitute the direct map entry, and it's done. The other direction is a mess: all existing uses of the page need to be torn down. If the page has been recently used for DMA, this includes IOMMU entries. > > Quentin: let's ignore any API issues for now. Do you have a concept of how a nondestructive shared -> private transition could work well, even in principle? I had a high level idea for the workflow, but I haven't looked into the implementation details. The idea would be to allow KVM *or* userspace to take a reference to a page in the fd in an exclusive manner. KVM could take a reference on a page (which would be necessary before to donating it to a guest) using some kind of memfile_notifier as proposed in this series, and userspace could do the same some other way (mmap presumably?). In both cases, the operation might fail. I would imagine the boot and private->shared flow as follow: - the VMM uses fallocate on the private fd, and associates the <fd, offset, size> with a memslot; - the guest boots, and as part of that KVM takes references to all the pages that are donated to the guest. If userspace happens to have a mapping to a page, KVM will fail to take the reference, which would be fatal for the guest. - once the guest has booted, it issues a hypercall to share a page back with the host; - KVM is notified, and at that point it drops its reference to the page. It then exits to userspace to notify it of the share; - host userspace receives the share, and mmaps the shared page with MAP_FIXED to access it, which takes a reference on the fd-backed page. There are variations of that idea: e.g. allow userspace to mmap the entire private fd but w/o taking a reference on pages mapped with PROT_NONE. And then the VMM can use mprotect() in response to share/unshare requests. I think Marc liked that idea as it keeps the userspace API closer to normal KVM -- there actually is a straightforward gpa->hva relation. Not sure how much that would impact the implementation at this point. For the shared=>private conversion, this would be something like so: - the guest issues a hypercall to unshare a page; - the hypervisor forwards the request to the host; - the host kernel forwards the request to userspace; - userspace then munmap()s the shared page; - KVM then tries to take a reference to the page. If it succeeds, it re-enters the guest with a flag of some sort saying that the share succeeded, and the hypervisor will adjust pgtables accordingly. If KVM failed to take a reference, it flags this and the hypervisor will be responsible for communicating that back to the guest. This means the guest must handle failures (possibly fatal). (There are probably many ways in which we can optimize this, e.g. by having the host proactively munmap() pages it no longer needs so that the unshare hypercall from the guest doesn't need to exit all the way back to host userspace.) A nice side-effect of the above is that it allows userspace to dump a payload in the private fd before booting the guest. It just needs to mmap the fd, copy what it wants in there, munmap, and only then pass the fd to KVM which will be happy enough as long as there are no current references to the pages. Note: in a previous email I've said that Android doesn't need this (which is correct as our guest bootloader currently receives the payload over virtio) but this might change some day, and there might be other implementations as well, so it's a nice bonus if we can make this work. > The best I can come up with is a special type of shared page that is not GUP-able and maybe not even mmappable, having a clear option for transitions to fail, and generally preventing the nasty cases from happening in the first place. Right, that sounds reasonable to me. > Maybe there could be a special mode for the private memory fds in which specific pages are marked as "managed by this fd but actually shared". pread() and pwrite() would work on those pages, but not mmap(). (Or maybe mmap() but the resulting mappings would not permit GUP.) And transitioning them would be a special operation on the fd that is specific to pKVM and wouldn't work on TDX or SEV. Aha, didn't think of pread()/pwrite(). Very interesting. I'd need to check what our VMM actually does, but as an initial reaction it feels like this might require a pretty significant rework in userspace. Maybe it's a good thing? Dunno. Maybe more important, those shared pages are used for virtio communications, so the cost of issuing syscalls every time the VMM needs to access the shared page will need to be considered... Thanks, Quentin