On Wed, Oct 30, 2019 at 02:28:21PM -0700, Andy Lutomirski wrote:
> On Wed, Oct 30, 2019 at 1:40 AM Mike Rapoport <rppt@xxxxxxxxxx> wrote:
> >
> > On Tue, Oct 29, 2019 at 10:00:55AM -0700, Andy Lutomirski wrote:
> > > On Tue, Oct 29, 2019 at 2:33 AM Mike Rapoport <rppt@xxxxxxxxxx> wrote:
> > > >
> > > > On Mon, Oct 28, 2019 at 02:44:23PM -0600, Andy Lutomirski wrote:
> > > > >
> > > > > > On Oct 27, 2019, at 4:17 AM, Mike Rapoport <rppt@xxxxxxxxxx> wrote:
> > > > > >
> > > > > > From: Mike Rapoport <rppt@xxxxxxxxxxxxx>
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > The patch below aims to allow applications to create mappings that
> > > > > > have pages visible only to the owning process. Such mappings could
> > > > > > be used to store secrets, so that these secrets are visible neither
> > > > > > to other processes nor to the kernel.
> > > > > >
> > > > > > I've only tested the basic functionality; the changes should be
> > > > > > verified against THP/migration/compaction. Yet, I'd appreciate
> > > > > > early feedback.
> > > > >
> > > > > I've contemplated the concept a fair amount, and I think you should
> > > > > consider a change to the API. In particular, rather than having it
> > > > > be a MAP_ flag, make it a chardev. You can, at least at first, allow
> > > > > only MAP_SHARED, and admins can decide who gets to use it. It might
> > > > > also play better with the VM overall, and you won't need a VM_ flag
> > > > > for it -- you can just wire up .fault to do the right thing.
> > > >
> > > > I think mmap()/mprotect()/madvise() are the natural APIs for such an
> > > > interface.
> > >
> > > Then you have a whole bunch of questions to answer. For example:
> > >
> > > What happens if you mprotect() or similar when the mapping is already
> > > in use in a way that's incompatible with MAP_EXCLUSIVE?
> >
> > Then we refuse to mprotect()? Like in any other case when vm_flags are
> > not compatible with the requested madvise()/mprotect() operation.
>
> I'm not talking about flags. I'm talking about the case where one
> thread (or RDMA or whatever) has get_user_pages()'d a mapping and
> another thread mprotect()s it MAP_EXCLUSIVE.
>
> > > Is it actually reasonable to malloc() some memory and then make it
> > > exclusive?
> > >
> > > Are you permitted to map a file MAP_EXCLUSIVE? What does it mean?
> >
> > I'd limit MAP_EXCLUSIVE to anonymous memory only.
> >
> > > What does MAP_PRIVATE | MAP_EXCLUSIVE do?
> >
> > My preference is to have only mmap(), and then the semantics are
> > clearer:
> >
> > MAP_PRIVATE | MAP_EXCLUSIVE creates a pre-populated region, marks it
> > locked and drops the pages in this region from the direct map.
> > The pages are returned on munmap().
> > Then there is no way to change an existing area to be exclusive or
> > vice versa.
>
> And what happens if you fork()? Limiting it to MAP_SHARED |
> MAP_EXCLUSIVE would avoid this particular nasty question.
>
> > > How does one pass exclusive memory via SCM_RIGHTS? (If it's a
> > > memfd-like or chardev interface, it's trivial. mmap(), not so much.)
> >
> > Why would passing such memory via SCM_RIGHTS be useful?
>
> Suppose I want to put a secret into exclusive memory and then send
> that secret to some other process. The obvious approach would be to
> SCM_RIGHTS an fd over, but you can't do that with MAP_EXCLUSIVE as
> you've defined it. In general, there are lots of use cases for memfd
> and other fd-backed memory.
>
> > > And finally, there's my personal giant pet peeve: a major use of this
> > > will be for virtualization. I suspect that a lot of people would like
> > > the majority of KVM guest memory to be unmapped from the host
> > > pagetables. But people might also like for guest memory to be
> > > unmapped in *QEMU's* pagetables, and mmap() is a basically worthless
> > > interface for this. Getting fd-backed memory into a guest will take
> > > some possibly major work in the kernel, but getting vma-backed memory
> > > into a guest without mapping it in the host user address space seems
> > > much, much worse.
> >
> > Well, in my view, MAP_EXCLUSIVE is intended to keep small secrets
> > rather than be used for the entire guest memory. I even considered
> > adding a limit on the mapping size, but then I decided that since
> > RLIMIT_MEMLOCK is enforced anyway there is no need for a new one.
> >
> > I agree that getting fd-backed memory into a guest would be less pain
> > than VMA-backed memory, but KVM can already use memory outside the
> > control of the kernel via /dev/map [1].
>
> That series doesn't address the problem I'm talking about at all. I'm
> saying that there is a legitimate use case where QEMU should *not*
> have a mapping of the memory. So QEMU would create some exclusive
> memory using /dev/exclusive_memory and would tell KVM to map it into
> the guest without mapping it into QEMU's address space at all.
>
> (In fact, the way that SEV currently works is *functionally* like
> this, except that there's a bogus incoherent mapping in the QEMU
> process that is a giant can of worms.)
>
> IMO a major benefit of a chardev approach is that you don't need a new
> VM_ flag and you don't need to worry about wiring it up everywhere in
> the core mm code.

Ok, at last I'm starting to see your and Christoph's point.

Just to reiterate: we can use fd-backed memory via a /dev/exclusive_memory
chardev (or some other name we'll pick after long bikeshedding), and then
the .mmap method of this character device can do interesting things with
the backing physical memory.

Since the memory is not VMA-mapped, we do not have to find all the places
in the core mm code that might require a check of a VM_ flag to ensure
there are no clashes with the exclusive memory.
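Just so we are talking about the same thing, here is roughly how I'd
expect userspace to use such a chardev. This is only a sketch: the device
name is the one from your example, nothing is implemented yet, and none
of the semantics are set in stone:

	/*
	 * Hypothetical usage sketch. open() would create a new backing
	 * object, and .mmap/.fault of the chardev would allocate the
	 * pages, lock them and drop them from the direct map. Only
	 * MAP_SHARED allowed, as you suggested.
	 */
	#include <fcntl.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <unistd.h>

	int main(void)
	{
		int fd = open("/dev/exclusive_memory", O_RDWR | O_CLOEXEC);
		if (fd < 0)
			return 1;

		char *secret = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
				    MAP_SHARED, fd, 0);
		if (secret == MAP_FAILED)
			return 1;

		strcpy(secret, "...");	/* store the secret */

		/* ... use it ... */

		munmap(secret, 4096);	/* pages return to the direct map */
		close(fd);
		return 0;
	}

Whether the size comes from the mmap() length, as here, or from an
ftruncate() like memfd does it, is one of the things to bikeshed.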
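And since the secret area then lives behind a plain file descriptor, your
SCM_RIGHTS case works out of the box with the usual ancillary data dance,
nothing special on our side. Sketching the sender for completeness:

	#include <string.h>
	#include <sys/socket.h>
	#include <sys/uio.h>

	/* Send the exclusive-memory fd over a connected unix socket. */
	static int send_secret_fd(int sock, int fd)
	{
		char dummy = 0;
		struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
		char cbuf[CMSG_SPACE(sizeof(int))] = { 0 };
		struct msghdr msg = {
			.msg_iov = &iov,
			.msg_iovlen = 1,
			.msg_control = cbuf,
			.msg_controllen = sizeof(cbuf),
		};
		struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

		cmsg->cmsg_level = SOL_SOCKET;
		cmsg->cmsg_type = SCM_RIGHTS;
		cmsg->cmsg_len = CMSG_LEN(sizeof(int));
		memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

		return sendmsg(sock, &msg, 0) < 0 ? -1 : 0;
	}

The receiver does the matching recvmsg() and gets its own fd to mmap().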
Still, whatever we do with the mapping properties of this memory, we need
a solution for the splitting of the huge pages in the direct map, but
this is an orthogonal problem in a way.

--
Sincerely yours,
Mike.