On Tue, Oct 29, 2019 at 2:33 AM Mike Rapoport <rppt@xxxxxxxxxx> wrote:
>
> On Mon, Oct 28, 2019 at 02:44:23PM -0600, Andy Lutomirski wrote:
> >
> > > On Oct 27, 2019, at 4:17 AM, Mike Rapoport <rppt@xxxxxxxxxx> wrote:
> > >
> > > From: Mike Rapoport <rppt@xxxxxxxxxxxxx>
> > >
> > > Hi,
> > >
> > > The patch below aims to allow applications to create mappings that have
> > > pages visible only to the owning process. Such mappings could be used to
> > > store secrets so that these secrets are visible neither to other
> > > processes nor to the kernel.
> > >
> > > I've only tested the basic functionality; the changes should be verified
> > > against THP/migration/compaction. Yet, I'd appreciate early feedback.
> >
> > I’ve contemplated the concept a fair amount, and I think you should
> > consider a change to the API. In particular, rather than having it be a
> > MAP_ flag, make it a chardev. You can, at least at first, allow only
> > MAP_SHARED, and admins can decide who gets to use it. It might also play
> > better with the VM overall, and you won’t need a VM_ flag for it; you
> > can just wire up .fault to do the right thing.
>
> I think mmap()/mprotect()/madvise() are the natural APIs for such an
> interface.

Then you have a whole bunch of questions to answer. For example:

What happens if you mprotect() or similar when the mapping is already
in use in a way that's incompatible with MAP_EXCLUSIVE?

Is it actually reasonable to malloc() some memory and then make it
exclusive?

Are you permitted to map a file MAP_EXCLUSIVE? What does it mean?

What does MAP_PRIVATE | MAP_EXCLUSIVE do?

How does one pass exclusive memory via SCM_RIGHTS? (If it's a
memfd-like or chardev interface, it's trivial. mmap(), not so much.)

And finally, there's my personal giant pet peeve: a major use of this
will be for virtualization. I suspect that a lot of people would like
the majority of KVM guest memory to be unmapped from the host
pagetables. But people might also like for guest memory to be
unmapped in *QEMU's* pagetables, and mmap() is basically a worthless
interface for this. Getting fd-backed memory into a guest will take
some possibly major work in the kernel, but getting vma-backed memory
into a guest without mapping it in the host user address space seems
much, much worse.

> Switching to a chardev doesn't solve the major problem of direct
> map fragmentation and defeats the ability to use exclusive memory mappings
> with the existing allocators, while mprotect() and madvise() do not.
>

Will people really want to do malloc() and then remap it exclusive?
This sounds dubiously useful at best.