On Thu, May 23, 2019 at 08:38:17AM -0700, Andy Lutomirski wrote: > On Thu, May 23, 2019 at 7:17 AM Sean Christopherson > <sean.j.christopherson@xxxxxxxxx> wrote: > > > > On Thu, May 23, 2019 at 01:26:28PM +0300, Jarkko Sakkinen wrote: > > > On Wed, May 22, 2019 at 07:35:17PM -0700, Sean Christopherson wrote: > > > > But actually, there's no need to disallow mmap() after ECREATE since the > > > > LSM checks also apply to mmap(), e.g. FILE__EXECUTE would be needed to > > > > mmap() any enclave pages PROT_EXEC. I guess my past self thought mmap() > > > > bypassed LSM checks? The real problem is that mmap()'ng an existing > > > > enclave would require FILE__WRITE and FILE__EXECUTE, which puts us back > > > > at square one. > > > > > > I'm lost with the constraints we want to set. > > > > As is today, SELinux policies would require enclave loaders to have > > FILE__WRITE and FILE__EXECUTE permissions on /dev/sgx/enclave. Presumably > > other LSMs have similar requirements. Requiring all processes to have > > FILE__{WRITE,EXECUTE} permissions means the permissions don't add much > > value, e.g. they can't be used to distinguish between an enclave that is > > being loaded from an unmodified file and an enclave that is being > > generated on the fly, e.g. Graphene. > > > > Looking back at Andy's mail, he was talking about requiring FILE__EXECUTE > > to run an enclave, so perhaps it's only FILE__WRITE that we're trying to > > special case. > > > > I thought about this some more, and I have a new proposal that helps > address the ELRANGE alignment issue and the permission issue at the > cost of some extra verbosity. Maybe you all can poke holes in it :) > The basic idea is to make everything more explicit from a user's > perspective. Here's how it works: > > Opening /dev/sgx/enclave gives an enclave_fd that, by design, doesn't > give EXECUTE or WRITE. mmap() on the enclave_fd only works if you > pass PROT_NONE and gives the correct alignment. The resulting VMA > cannot be mprotected or mremapped. It can't be mmapped at all until I assume you're thinking of clearing all VM_MAY* flags in sgx_mmap()? > after ECREATE because the alignment isn't known before that. I don't follow. The alignment is known because userspace knows the size of its enclave. The initial unknown is the address, but that becomes known once the initial mmap() completes. > Associated with the enclave are a bunch (up to 7) "enclave segment I assume 7 = R, W, X, RW, RX, WX and RWX? > inodes". These are anon_inodes that are created automagically. An > enclave segment is a group of pages, not necessary contiguous, with an > upper bound on the memory permissions. Each enclave page belongs to a > segment. When you do EADD, you tell the driver what segment you're > adding to. [0] This means that EADD gets an extra argument that is a > permission mask for the page -- in addition to the initial SECINFO, > you also pass to EADD something to the effect of "I promise never to > map this with permissions greater than RX". > > Then we just need some way to mmap a region from an enclave segment. > This could be done by having a way to get an fd for an enclave segment > or it could be done by having a new ioctl SGX_IOC_MAP_SEGMENT. User > code would use this operation to replace, MAP_FIXED-style, ranges from > the big PROT_NONE mapping with the relevant pages from the enclave > segment. The resulting vma would only have VM_MAYWRITE if the segment > is W, only have VM_MAYEXEC if the segment is X, and only have > VM_MAYREAD if the segment is R. Depending on implementation details, > the VMAs might need to restrict mremap() to avoid mapping pages that > aren't part of the segment in question. If my above assumptions regarding VM_MAY* and the "7 segments" are correct, IIUC you're proposing that an LSM could have policies for each of the anon inodes, e.g. grant/deny RWX vs. RW vs RX. Am I in the ballpark? > It's plausible that this whole thing works without the magic segment > inodes under the hood, but figuring that out would need a careful look > at how all the core mm bits and LSM bits work together. > > To get all the LSM stuff to work, SELinux will need some way to > automatically assign an appropriate label to the segment inodes. I > assume that such a mechanism already exists and gets used for things > like sockets, but I haven't actually confirmed this. I (obviously) don't fully understand your proposal, but I don't think we want to hook inodes, e.g. AppArmor doesn't implement inode_permission() but does implement file_mprotect() and mmap_file(), which feel like the natural hooks for this sort of thing. I also think it's overkill, e.g. AppArmor doesn't have a concept of EXECMOD, EXECMEM, EXECHEAP, etc.., so I don't think we need to go beyond detecting W+X scenarios. Starting with your original idea of tracking "safe to execute" and Cedric's of propagating the permissions from the source VMA, but tweaked with your new idea of clearing VM_MAY* and a custom MAP_FIXED/mprotect(). Add SGX_IOC_MPROTECT (or SGX_IOC_MAP_REGION?) that works as follows: 1. Track VM_MAY{READ,WRITE,EXEC} flags for each enclave page. 2. SGX_IOC_ADD_REGION, i.e. EADD, initializes the VM_MAY* flags for each enclave page based on the source VMA. 3. sgx_mmap() only works with PROT_NONE, skips alignment stuff if MAP_FIXED, and clears VM_MAY{READ,WRITE,EXEC}. 4. mprotect() on /dev/sgx/enclave doesn't work because the VMA doesn't have any VM_MAY{READ,WRITE,EXEC} capabilities. 5. Deny mremap() post-ECREATE as the address and size of the enclave are fixed at ECREATE (in hardware). 6. SGX_IOC_MPROTECT works like normal mprotect(), except the VM_MAY* flags are pulled from the enclave pages, and its call to security_file_mprotect() is VM_READ|VM_EXEC by default. The LSM call sets VM_WRITE iff the enclave page has both VM_MAYWRITE and VM_MAYEXEC. The idea here is to require READ and EXECUTE to run an enclave, and only require WRITE on /dev/sgx/enclave when the enclave can execute modified memory. To support SGX2 down the road, which will want to convert a page to executable on the fly, we could add: 7. SGX_IOC_EXTEND_PERMISSIONS enables userspace to extend the VM_MAY* flags for an enclave page, e.g. to make a page executable. SGX_IOC_MPROTECT is still required to actually map the page. Notably, adding a RW page to the enclave, e.g. to grow its heap, doesn't require WRITE, whereas adding a RWX page, e.g. for dynamic loading, would require WRITE. This can only extend! E.g. userspace can't circumvent the WRITE requirement by clearing VM_MAYWRITE. Note, FILE__WRITE on /dev/sgx/enclave is essentially equivalent to FILE__EXECMOD. Using FILE__WRITE in this way means there are no changes to SELinux (triggering FILE__EXECMOD would be awkward), and AppArmor also picks up extra protections for enclaves. > [0] There needs to be some vaguely intelligent semantics if you EADD > the *same* address more than once. A simple solution would be to > disallow it if the segments don't match. I don't see any reason to allow duplicate EADD as it serves no purpose, e.g. doing so changes the enclave's measurement and that's it.