On Mon, Feb 07, 2022 at 04:45:28PM -0800, Reinette Chatre wrote:
> === Summary ===
>
> An SGX VMA can only be created if its permissions are the same or
> weaker than the Enclave Page Cache Map (EPCM) permissions. After VMA
> creation this same rule is again enforced by the page fault handler:
> faulted enclave pages are required to have equal or more relaxed
> EPCM permissions than the VMA permissions.
>
> On SGX1 systems the additional enforcement in the page fault handler
> is redundant, and on SGX2 systems it incorrectly prevents access.
> On SGX1 systems it is unnecessary to repeat the enforcement of the
> permission rule: the rule used during original VMA creation already
> ensures that any access attempt uses correct permissions.
> With SGX2 the EPCM permissions of a page can change after VMA
> creation, resulting in the VMA permissions potentially being more
> relaxed than the EPCM permissions and the page fault handler
> incorrectly blocking valid access attempts.
>
> Enable the VMA's pages to remain accessible while ensuring that
> the PTEs are installed to match the EPCM permissions, but not be
> more relaxed than the VMA permissions.
>
> === Full Changelog ===
>
> An SGX enclave is an area of memory where parts of an application
> can reside. First an enclave is created and loaded (from
> non-enclave memory) with the code and data of an application,
> then user space can map (mmap()) the enclave memory to
> be able to enter the enclave at its defined entry points for
> execution within it.
>
> The hardware maintains a secure structure, the Enclave Page Cache Map
> (EPCM), that tracks the contents of the enclave. Of interest here is
> its tracking of the enclave page permissions. When a page is loaded
> into the enclave, its permissions are specified and recorded in the
> EPCM. In parallel the kernel maintains permissions within the
> page table entries (PTEs), and the rule is that PTE permissions
> are not allowed to be more relaxed than the EPCM permissions.
>
> A new mapping (mmap()) of enclave memory can only succeed if the
> mapping has the same or weaker permissions than the permissions that
> were vetted during enclave creation. This is enforced by
> sgx_encl_may_map(), which is called on both the mmap() and mprotect()
> paths. This rule remains.
>
> One feature of SGX2 is to support the modification of EPCM permissions
> after enclave initialization. Enclave pages may thus already be part
> of a VMA at the time their EPCM permissions are changed, resulting
> in the VMA's permissions potentially being more relaxed than the EPCM
> permissions.
>
> Allow permissions of existing VMAs to be more relaxed than EPCM
> permissions in preparation for dynamic EPCM permission changes
> made possible in SGX2. New VMAs that attempt to have more relaxed
> permissions than EPCM permissions continue to be unsupported.
>
> Reasons why permissions of existing VMAs are allowed to be more relaxed
> than EPCM permissions, instead of dynamically changing VMA permissions
> when EPCM permissions change, are:
> 1) Changing VMA permissions involves splitting VMAs, which is an
>    operation that can fail. Additionally, changing EPCM permissions of
>    a range of pages could also fail on any of the pages involved.
>    Handling these error cases causes problems. For example, if an
>    EPCM permission change fails and the VMA has already been split,
>    then it is not possible to undo the VMA split nor possible to
>    undo the EPCM permission changes that did succeed before the
>    failure.
> 2) The kernel has little insight into the user space where EPCM
>    permissions are controlled from. For example, an RW page may
>    be made RO just before it is made RX, and splitting the VMAs
>    while they may soon change again is unnecessary.
>
> Remove the extra permission check called on a page fault
> (vm_operations_struct->fault) or during debugging
> (vm_operations_struct->access) when loading the enclave page from swap
> that ensures that the VMA permissions are not more relaxed than the
> EPCM permissions. Since a VMA could only exist if it passed the
> original permission checks during mmap(), and a VMA may indeed
> have more relaxed permissions than the EPCM permissions, this extra
> permission check is no longer appropriate.
>
> With the permission check removed, ensure that PTEs do
> not blindly inherit the VMA permissions but instead reflect the
> permissions that the VMA and EPCM agree on. PTEs for writable pages
> (from VMA and enclave perspective) are installed with the writable
> bit set, reducing the need for this additional flow to the permission
> mismatch cases handled next.
>
> Signed-off-by: Reinette Chatre <reinette.chatre@xxxxxxxxx>
> ---
> Changes since V1:
> - Reword commit message (Jarkko).
> - Use "relax" instead of "exceed" when referring to permissions (Dave).
> - Add snippet to Documentation/x86/sgx.rst that highlights the
>   relationship between VMA, EPCM, and PTE permissions on SGX
>   systems (Andy).
>
>  Documentation/x86/sgx.rst      | 10 ++++++++++
>  arch/x86/kernel/cpu/sgx/encl.c | 38 ++++++++++++++++++++------------------
>  2 files changed, 30 insertions(+), 18 deletions(-)
>
> diff --git a/Documentation/x86/sgx.rst b/Documentation/x86/sgx.rst
> index 89ff924b1480..5659932728a5 100644
> --- a/Documentation/x86/sgx.rst
> +++ b/Documentation/x86/sgx.rst
> @@ -99,6 +99,16 @@ The relationships between the different permission masks are:
>  * PTEs are installed to match the EPCM permissions, but not be more
>    relaxed than the VMA permissions.
>  
> +On systems supporting SGX2, EPCM permissions may change while the
> +enclave page belongs to a VMA, without impacting the VMA permissions.
> +This means that an existing VMA may appear to allow access to an
> +enclave page that its EPCM permissions do not allow. For example, an
> +enclave page with RW EPCM permissions may be mapped by an RW VMA and
> +subsequently changed to have read-only EPCM permissions. The kernel
> +continues to maintain correct access to the enclave page through the
> +PTE, which ensures that only access allowed by both the VMA
> +and EPCM permissions is permitted.
> +
>  Application interface
>  =====================
>  
> diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
> index 48afe96ae0f0..b6105d9e7c46 100644
> --- a/arch/x86/kernel/cpu/sgx/encl.c
> +++ b/arch/x86/kernel/cpu/sgx/encl.c
> @@ -91,10 +91,8 @@ static struct sgx_epc_page *sgx_encl_eldu(struct sgx_encl_page *encl_page,
>  }
>  
>  static struct sgx_encl_page *sgx_encl_load_page(struct sgx_encl *encl,
> -						unsigned long addr,
> -						unsigned long vm_flags)
> +						unsigned long addr)
>  {
> -	unsigned long vm_prot_bits = vm_flags & (VM_READ | VM_WRITE | VM_EXEC);
>  	struct sgx_epc_page *epc_page;
>  	struct sgx_encl_page *entry;
>  
> @@ -102,14 +100,6 @@ static struct sgx_encl_page *sgx_encl_load_page(struct sgx_encl *encl,
>  	if (!entry)
>  		return ERR_PTR(-EFAULT);
>  
> -	/*
> -	 * Verify that the faulted page has equal or higher build time
> -	 * permissions than the VMA permissions (i.e. the subset of {VM_READ,
> -	 * VM_WRITE, VM_EXECUTE} in vma->vm_flags).
> -	 */
> -	if ((entry->vm_max_prot_bits & vm_prot_bits) != vm_prot_bits)
> -		return ERR_PTR(-EFAULT);
> -
>  	/* Entry successfully located. */
>  	if (entry->epc_page) {
>  		if (entry->desc & SGX_ENCL_PAGE_BEING_RECLAIMED)
>  
> @@ -138,7 +128,9 @@ static vm_fault_t sgx_vma_fault(struct vm_fault *vmf)
>  {
>  	unsigned long addr = (unsigned long)vmf->address;
>  	struct vm_area_struct *vma = vmf->vma;
> +	unsigned long page_prot_bits;
>  	struct sgx_encl_page *entry;
> +	unsigned long vm_prot_bits;
>  	unsigned long phys_addr;
>  	struct sgx_encl *encl;
>  	vm_fault_t ret;
>  
> @@ -155,7 +147,7 @@ static vm_fault_t sgx_vma_fault(struct vm_fault *vmf)
>  
>  	mutex_lock(&encl->lock);
>  
> -	entry = sgx_encl_load_page(encl, addr, vma->vm_flags);
> +	entry = sgx_encl_load_page(encl, addr);
>  	if (IS_ERR(entry)) {
>  		mutex_unlock(&encl->lock);
>  
> @@ -167,7 +159,19 @@ static vm_fault_t sgx_vma_fault(struct vm_fault *vmf)
>  
>  	phys_addr = sgx_get_epc_phys_addr(entry->epc_page);
>  
> -	ret = vmf_insert_pfn(vma, addr, PFN_DOWN(phys_addr));
> +	/*
> +	 * Insert a PTE matching the EPCM page permissions, ensured to
> +	 * not exceed the VMA permissions.
> +	 */
> +	vm_prot_bits = vma->vm_flags & (VM_READ | VM_WRITE | VM_EXEC);
> +	page_prot_bits = entry->vm_max_prot_bits & vm_prot_bits;
> +	/*
> +	 * Add VM_SHARED so that the PTE is made writable right away if
> +	 * the VMA and EPCM are writable (no COW in SGX).
> +	 */
> +	page_prot_bits |= (vma->vm_flags & VM_SHARED);
> +	ret = vmf_insert_pfn_prot(vma, addr, PFN_DOWN(phys_addr),
> +				  vm_get_page_prot(page_prot_bits));
>  	if (ret != VM_FAULT_NOPAGE) {
>  		mutex_unlock(&encl->lock);
>  
> @@ -295,15 +299,14 @@ static int sgx_encl_debug_write(struct sgx_encl *encl, struct sgx_encl_page *pag
>   * Load an enclave page to EPC if required, and take encl->lock.
>   */
>  static struct sgx_encl_page *sgx_encl_reserve_page(struct sgx_encl *encl,
> -						   unsigned long addr,
> -						   unsigned long vm_flags)
> +						   unsigned long addr)
>  {
>  	struct sgx_encl_page *entry;
>  
>  	for ( ; ; ) {
>  		mutex_lock(&encl->lock);
>  
> -		entry = sgx_encl_load_page(encl, addr, vm_flags);
> +		entry = sgx_encl_load_page(encl, addr);
>  		if (PTR_ERR(entry) != -EBUSY)
>  			break;
>  
> @@ -339,8 +342,7 @@ static int sgx_vma_access(struct vm_area_struct *vma, unsigned long addr,
>  		return -EFAULT;
>  
>  	for (i = 0; i < len; i += cnt) {
> -		entry = sgx_encl_reserve_page(encl, (addr + i) & PAGE_MASK,
> -					      vma->vm_flags);
> +		entry = sgx_encl_reserve_page(encl, (addr + i) & PAGE_MASK);
>  		if (IS_ERR(entry)) {
>  			ret = PTR_ERR(entry);
>  			break;
> -- 
> 2.25.1

If you unconditionally set vm_max_prot_bits to RWX for dynamically
created pages, you would not need to do this.

These patches could then be safely dropped:

- [PATCH V2 06/32] x86/sgx: Support VMA permissions more relaxed than enclave permissions
- [PATCH V2 08/32] x86/sgx: Add sgx_encl_page->vm_run_prot_bits for dynamic permission changes
- [PATCH V2 15/32] x86/sgx: Support relaxing of enclave page permissions

And that would also keep full ABI compatibility, without exceptions to
the existing mainline code.

BR, Jarkko
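For reference, the PTE permission derivation the patch performs at fault
time reduces to a bitwise AND of the VMA and EPCM permission masks. The
following is a minimal standalone C sketch of that rule only; pte_prot()
and the PROT_* constants are illustrative stand-ins, not the kernel's
vm_flags machinery:

#include <stdio.h>

#define PROT_R 0x1u
#define PROT_W 0x2u
#define PROT_X 0x4u

/* PTE permissions: never more relaxed than either the VMA or the EPCM. */
static unsigned int pte_prot(unsigned int vma_prot, unsigned int epcm_prot)
{
	return vma_prot & epcm_prot;
}

int main(void)
{
	/* An RW VMA over a page whose EPCM permissions were reduced to R. */
	unsigned int pte = pte_prot(PROT_R | PROT_W, PROT_R);

	printf("R=%d W=%d X=%d\n",
	       !!(pte & PROT_R), !!(pte & PROT_W), !!(pte & PROT_X));
	return 0;
}

In this example the derived PTE keeps R and drops W, which matches what
the quoted patch installs via vmf_insert_pfn_prot().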
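And a rough sketch of the alternative suggested above, i.e. recording
maximal permissions up front for dynamically added pages. This is
hypothetical: only the vm_max_prot_bits field name and the VM_* flag
values correspond to the mainline driver, while the stub struct and the
init function are invented for illustration:

/* Values match the kernel's VM_READ/VM_WRITE/VM_EXEC flags. */
#define VM_READ  0x1UL
#define VM_WRITE 0x2UL
#define VM_EXEC  0x4UL

/* Stand-in for the relevant part of struct sgx_encl_page. */
struct encl_page_stub {
	unsigned long vm_max_prot_bits;
};

/*
 * Hypothetical init path for a dynamically added (SGX2) page: record
 * RWX up front so that later EPCM permission changes never make the
 * value that mmap()/mprotect() are vetted against stale.
 */
static void init_dynamic_page(struct encl_page_stub *page)
{
	page->vm_max_prot_bits = VM_READ | VM_WRITE | VM_EXEC;
}

int main(void)
{
	struct encl_page_stub page;

	init_dynamic_page(&page);
	return page.vm_max_prot_bits == (VM_READ | VM_WRITE | VM_EXEC) ? 0 : 1;
}

As I read the suggestion, user space could then map dynamically added
pages with any permissions, leaving actual enforcement to the hardware
EPCM checks, which is the basis for dropping the three patches listed
above.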