Hi,
Sileby looks interesting! I had just written up the following idea which
seems similar but includes a mechanism for revoking mappings.

Alexander Graf recently brought up an idea that solves the following
problem:

When process A passes shared memory file descriptors to process B there
is no way for process A to revoke access or change page protection bits
after passing the fd.

I'll describe the idea (not sure if it's exactly what Alexander had in
mind).

Memory view driver
------------------
The memory view driver allows process A to control the page table entries
of an mmap in process B. It is a character device driver that process A
opens like this:

  int fd = open("/dev/memory-view", O_RDWR);

This returns a file descriptor to a new memory view.

Next process A sets the size of the memory view:

  ftruncate(fd, 16 * GiB);

The size determines how large the memory view will be. The size is a
virtual memory concept and does not consume resources (there is no
physical memory backing this).

Process A populates the memory view with ranges from file descriptors it
wishes to share.
The file descriptor can be a shared memory file descriptor:

  int memfd = memfd_create("guest-ram", 0);
  ftruncate(memfd, 32 * GiB);

  /* Map [8GiB, 10GiB) at 8GiB into the memory view */
  struct memview_map_fd_info map_fd_info = {
      .fd = memfd,
      .fd_offset = 8 * GiB,
      .size = 2 * GiB,
      .mem_offset = 8 * GiB,
      .flags = MEMVIEW_MAP_READ | MEMVIEW_MAP_WRITE,
  };
  ioctl(fd, MEMVIEW_MAP_FD, &map_fd_info);

It is also possible to populate the memory view from the page cache:

  int filefd = open("big-file.iso", O_RDONLY);

  /* Map [4GiB, 12GiB) at 0B into the memory view */
  struct memview_map_fd_info map_fd_info = {
      .fd = filefd,
      .fd_offset = 4 * GiB,
      .size = 8 * GiB,
      .mem_offset = 0,
      .flags = MEMVIEW_MAP_READ,
  };
  ioctl(fd, MEMVIEW_MAP_FD, &map_fd_info);

The memory view has now been populated like this:

  Range (GiB)   Fd              Permissions
  0-8           big-file.iso    read
  8-10          guest-ram       read+write
  10-16         <none>          <none>

Now process A gets the "view" file descriptor for this memory view. The
view file descriptor does not allow ioctls. It can be safely passed to
process B in the knowledge that process B can only mmap or close it:

  int viewfd = ioctl(fd, MEMVIEW_GET_VIEWFD);

  ...pass viewfd to process B...

Process B receives viewfd and mmaps it:

  void *ptr = mmap(NULL, 16 * GiB, PROT_READ | PROT_WRITE, MAP_SHARED,
                   viewfd, 0);

When process B accesses a page in the mmap region the memory view driver
resolves the page fault by checking if the page is mapped to an fd and
what its permissions are.

For example, accessing the page at 4GiB from the start of the memory view
is an access at 8GiB into big-file.iso. That's because 8GiB of
big-file.iso was mapped at 0 with fd_offset 4GiB.

To summarize, there is one vma in process B and the memory view driver
maps pages from the file descriptors added with ioctl(MEMVIEW_MAP_FD) by
process A.

Page protection bits are the AND of the mmap
PROT_READ/PROT_WRITE/PROT_EXEC flags with the memory view driver's
MEMVIEW_MAP_READ/MEMVIEW_MAP_WRITE/MEMVIEW_MAP_EXEC flags for the
mapping in question.
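To make the lookup rule concrete, here is a rough userspace sketch (not
driver code) of how a fault at a given memory view offset could be
resolved against the example layout above. Everything beyond the
MEMVIEW_MAP_* flag names and the memview_map_fd_info field names is made
up for illustration: struct memview_range, struct memview_fault,
memview_resolve(), the flag values, and the stand-in fd numbers. It also
assumes the mmap protection bits use the same encoding as the
MEMVIEW_MAP_* flags.

```c
#include <stddef.h>
#include <stdint.h>

#define GiB (1ULL << 30)

/* Hypothetical values: the proposal names these flags but not their encoding. */
#define MEMVIEW_MAP_READ  0x1u
#define MEMVIEW_MAP_WRITE 0x2u
#define MEMVIEW_MAP_EXEC  0x4u

/* One populated range of a memory view (fields mirror memview_map_fd_info). */
struct memview_range {
    int      fd;          /* backing fd; only an identifier in this sketch */
    uint64_t fd_offset;   /* start of the range within the backing fd */
    uint64_t size;        /* length of the range in bytes */
    uint64_t mem_offset;  /* start of the range within the memory view */
    unsigned flags;       /* MEMVIEW_MAP_* permissions granted by process A */
};

/* What a successful fault resolution yields. */
struct memview_fault {
    int      fd;          /* backing fd that holds the page */
    uint64_t fd_offset;   /* offset of the accessed byte within that fd */
    unsigned prot;        /* effective protection: vma prot AND range flags */
};

/* Stand-in fd numbers for the example layout (hypothetical). */
enum { FD_BIG_FILE = 100, FD_GUEST_RAM = 101 };

/* The example layout: 0-8 GiB big-file.iso (read), 8-10 GiB guest-ram (rw). */
const struct memview_range example_view[] = {
    { FD_BIG_FILE,  4 * GiB, 8 * GiB, 0,       MEMVIEW_MAP_READ },
    { FD_GUEST_RAM, 8 * GiB, 2 * GiB, 8 * GiB,
      MEMVIEW_MAP_READ | MEMVIEW_MAP_WRITE },
};

/*
 * Resolve an access at view_offset.
 * vma_prot is what process B asked for in mmap(); access is the kind of
 * access that faulted. Returns 0 and fills *out on success, or -1 when the
 * offset is unpopulated or the access is not permitted (the case where the
 * real driver would raise SIGSEGV/SIGBUS).
 */
int memview_resolve(const struct memview_range *ranges, size_t n,
                    uint64_t view_offset, unsigned vma_prot,
                    unsigned access, struct memview_fault *out)
{
    for (size_t i = 0; i < n; i++) {
        const struct memview_range *r = &ranges[i];

        /* Skip ranges that do not contain view_offset. */
        if (view_offset < r->mem_offset ||
            view_offset - r->mem_offset >= r->size)
            continue;

        /* Effective permissions are the AND of both sides. */
        unsigned prot = vma_prot & r->flags;
        if ((access & prot) != access)
            return -1;

        out->fd = r->fd;
        out->fd_offset = r->fd_offset + (view_offset - r->mem_offset);
        out->prot = prot;
        return 0;
    }
    return -1; /* nothing mapped at this offset */
}
```

With this table a read at 4GiB into the view resolves to offset 8GiB of
big-file.iso, a write to the same page is refused even though process B
mapped the view PROT_READ | PROT_WRITE, and any access in the unpopulated
10-16 GiB range fails.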
Does vmf_insert_mixed_prot() or a similar Linux API allow this?

Can the memory view driver map pages from fds without pinning the pages?

Process A can make further ioctl(MEMVIEW_MAP_FD) calls and also
ioctl(MEMVIEW_UNMAP_FD) calls to change the mappings. This requires
zapping the affected ptes in process B. When process B accesses those
pages again the fault handler will handle the page fault based on the
latest memory view layout.

If process B accesses a page with incorrect permissions, or one that has
not been configured by process A's ioctl calls, a SIGSEGV/SIGBUS signal
is raised.

When process B uses mprotect(2) and other virtual memory syscalls it is
unable to increase page permissions. Instead it can only reduce them
because the pte protection bits are the AND of the mmap flags and the
memory view driver's MEMVIEW_MAP_READ/MEMVIEW_MAP_WRITE/MEMVIEW_MAP_EXEC
flags.

Use cases
---------
How to use the memory view driver for vhost-user
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
vhost-user and other out-of-process device emulation interfaces need a
way for the VMM to enforce the IOMMU mappings on the device emulation
process.

Today the VMM passes all guest RAM fds to the device emulation process
and has no way of restricting access or revoking it later. With the
memory view driver the VMM will pass one or more memory view fds instead
of the actual guest RAM fds. This allows the VMM to invoke
ioctl(MEMVIEW_MAP_FD/MEMVIEW_UNMAP_FD) to enforce permissions or revoke
access.

How to use the memory view driver for virtio-fs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The virtiofsd vhost-user process creates a memory view for the device's
DAX Window and passes it to QEMU. QEMU installs it as a kvm.ko memory
region so that the guest directly accesses the memory view.

Now virtiofsd can map portions of files into the DAX Window without
coordinating with the QEMU process. This simplifies the virtio-fs code
and should also improve DAX map/unmap performance.

Stefan