On 05/09/20 01:17, Andy Lutomirski wrote:
> There's sev_pin_memory(), so QEMU must have at least some idea of
> which memory could potentially be encrypted.  Is it in fact the case
> that QEMU doesn't know that some SEV pinned memory might actually be
> used for DMA until the guest tries to do DMA on that memory?  If so,
> yuck.

Yes.  All the memory is pinned, and all the memory could potentially be
used for DMA (of garbage if it's encrypted).  And it's the same for
pretty much all protected VM extensions (SEV, POWER, s390, Intel TDX).

>> The primary VM and the enclave VM(s) would each get a different memory
>> access file descriptor.  QEMU would treat them no differently from any
>> other externally-provided memory backend, say hugetlbfs or memfd, so
>> yeah they would be mmap-ed to userspace and the host virtual address
>> passed as usual to KVM.
>
> Would the VM processes mmap() these descriptors, or would KVM learn
> how to handle that memory without it being mapped?

The idea is that the process mmaps them; QEMU would treat them just the
same as a hugetlbfs file descriptor, for example.

>> The manager can decide at any time to hide some memory from the parent
>> VM (in order to give it to an enclave).  This would actually be done on
>> request of the parent VM itself [...]  But QEMU is
>> untrusted, so the manager cannot rely on QEMU behaving well.  Hence the
>> privilege separation model that was implemented here.
>
> How does this work?  Is there a revoke mechanism, or does the parent
> just munmap() the memory itself?

The parent has ioctls to add and remove memory from the pidfd-mem, so
unmapping is just a matter of calling the ioctl that removes a range.

>> So what you are suggesting is that KVM manages its own address space
>> instead of host virtual addresses (and with no relationship to host
>> virtual addresses, it would be just a "cookie")?
>
> [...]  For this pidfd-mem scheme in particular, it might avoid the
> nasty corner case I mentioned.
> With pidfd-mem as in this patchset, I'm concerned about what happens
> when process A maps some process B memory, process B maps some of
> process A's memory, and there's a recursive mapping that results.  Or
> when a process maps its own memory, for that matter.
>
> Or memfd could get fancier with operations to split memfds, remove
> pages from memfds, etc.  Maybe that's overkill.

Doing it directly with memfd is certainly an option, especially since
MFD_HUGE_* exists.  Basically you'd have a system call to create a
secondary view of the memfd, and the syscall interface could still be
very similar to what is in this patch, in particular the control/access
pair.  This could probably also be used to implement Matthew Wilcox's
ideas.

I still believe that the pidfd-mem concept has merit as a
"capability-like" PTRACE_{PEEK,POKE}DATA replacement, but that would
not need any of the privilege separation or mmap support, only direct
read/write.  So there are two concepts mixed into one interface in this
patch, with two completely different use cases.  Merging them is
clever, but perhaps too clever.  I can say that since it was my
idea. :D

Thanks,

Paolo